[Omaha.pm] 30m hack - survey log de-duper
Given a log like this:
2005-05-19 11:31:09|CRPTWR|D2|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||
2005-05-19 11:31:57|CRPTWR|D6|2|1|1|4|3|3|3|3|4|3|4|4|4|4|4|3|1|1|1|4|3|4|1|4|4|3|2|1|4||
2005-05-19 11:32:31|CRPTWR|D3|4|4|3|4|4|4|3|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||
2005-05-19 11:33:01|CRPTWR|D10|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||
from an online survey, detect and remove any rows created when someone double-clicked the submit button. If a row is identical to an earlier one (ignoring the datetime stamp) and was logged within 10s of it, it's a dupe. Remove it.
Solution:
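#!/usr/bin/perl
use strict;
use warnings;
use Date::Calc qw( Delta_DHMS );
my %keys;  # row minus its timestamp -> most recent full line with that content
open (IN, "<", "aos.log") or die "Can't read aos.log: $!";
open (OUTGOOD, ">", "aos.log.good") or die "Can't write aos.log.good: $!";
open (OUTDUPE, ">", "aos.log.dupes") or die "Can't write aos.log.dupes: $!";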
while (<IN>) {
    chomp;
    my $line = $_;
    # The key is the row minus its leading datetime stamp.
    my $key = $line;
    $key =~ s/.*?\|//;
    if ($keys{$key}) {
        # Same answers seen before; it is a dupe only if logged within 10s.
        if (seconds_elapsed($keys{$key}, $line) > 10) {
            # More than 10s elapsed: not a dupe.
            print OUTGOOD "$line\n";
        } else {
            # This is a dupe.
            print OUTDUPE "$line\n";
        }
    } else {
        print OUTGOOD "$line\n";
    }
    # Remember the latest line seen for this key.
    $keys{$key} = $line;
}
close IN;
close OUTGOOD;
close OUTDUPE;
# Seconds elapsed between the datetime stamps of two log lines.
sub seconds_elapsed {
    my ($str1, $str2) = @_;
    # Keep only the leading "YYYY-MM-DD HH:MM:SS" stamp.
    $str1 =~ s/\|.*//;
    $str2 =~ s/\|.*//;
    # Each split yields (year, month, day, hour, min, sec).
    my @delta = Delta_DHMS((split /\D/, $str1), (split /\D/, $str2));
    my $ret =
        $delta[0] * 24 * 60 * 60 +  # Days    -> seconds
        $delta[1] * 60 * 60 +       # Hours   -> seconds
        $delta[2] * 60 +            # Minutes -> seconds
        $delta[3];                  # Seconds
    # print "$str1 -> $str2 = $ret seconds elapsed\n";
    return $ret;
}
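If Date::Calc isn't installed, the same check can probably be done with the core Time::Local module. What follows is only a rough sketch, not the script above: epoch_seconds and split_dupes are names made up for illustration, the stamps are treated as UTC (an assumption; only the difference between two stamps matters), and the rows in the usage lines are invented, not taken from the real log.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Convert a "YYYY-MM-DD HH:MM:SS" stamp to epoch seconds.
# Assumption: stamps are treated as UTC; only differences are used.
sub epoch_seconds {
    my ($stamp) = @_;
    my ($y, $mon, $d, $h, $min, $s) = split /\D/, $stamp;
    return timegm($s, $min, $h, $d, $mon - 1, $y);
}

# Hypothetical helper: split log lines into (\@good, \@dupes).
# A line is a dupe if the same row (minus its stamp) was last seen
# no more than $window seconds earlier.
sub split_dupes {
    my ($window, @lines) = @_;
    my (%last_seen, @good, @dupes);
    for my $line (@lines) {
        # Split at the first pipe only: stamp, then the rest of the row.
        my ($stamp, $key) = split /\|/, $line, 2;
        my $now = epoch_seconds($stamp);
        if (exists $last_seen{$key} && $now - $last_seen{$key} <= $window) {
            push @dupes, $line;
        } else {
            push @good, $line;
        }
        # Remember the latest time this row was seen.
        $last_seen{$key} = $now;
    }
    return (\@good, \@dupes);
}

# Usage with invented rows: the second is a double-click 3s after the first.
my ($good, $dupes) = split_dupes(10,
    "2005-05-19 11:31:09|CRPTWR|D2|4|4||",
    "2005-05-19 11:31:12|CRPTWR|D2|4|4||",
    "2005-05-19 11:31:57|CRPTWR|D2|4|4||",
);
print scalar(@$good), " good, ", scalar(@$dupes), " dupes\n";
```

Working in epoch seconds turns the elapsed-time check into a single subtraction, and keeping the lines in memory makes the rule easy to try out before pointing it at the real files.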
Cheers,
j