
[Omaha.pm] 30m hack - survey log de-duper



Given a log like this:

2005-05-19 11:31:09|CRPTWR|D2|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||
2005-05-19 11:31:57|CRPTWR|D6|2|1|1|4|3|3|3|3|4|3|4|4|4|4|4|3|1|1|1|4|3|4|1|4|4|3|2|1|4||
2005-05-19 11:32:31|CRPTWR|D3|4|4|3|4|4|4|3|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||
2005-05-19 11:33:01|CRPTWR|D10|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4|4||

from an online survey, detect and remove any rows created when someone double-clicked the submit button and logged the same response twice. If a row is identical to an earlier row (ignoring the datetime stamp) and was logged within 10s of it, it's a dupe. Remove it.

Solution:

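#!/usr/bin/perl

use strict;
use warnings;
use Date::Calc qw( Delta_DHMS );

open (IN, "aos.log") or die "Can't read aos.log: $!";
open (OUTGOOD, ">aos.log.good") or die "Can't write aos.log.good: $!";
open (OUTDUPE, ">aos.log.dupes") or die "Can't write aos.log.dupes: $!";

# Last line seen for each row body (the row minus its timestamp)
my %keys;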

while (<IN>) {
#   print;
   chomp;
   my $line = $_;
   my $key = $line;
   $key =~ s/.*?\|//;   # drop the leading datetime field so only the answers are compared
   if ($keys{$key}) {
      #print "$keys{$key}\n";
      #print "$line\n";
      #print "   dupe!?";
      if (seconds_elapsed($keys{$key}, $line) > 10) {
         #print " no. More than 10s elapsed.\n";
         print OUTGOOD "$line\n";
      } else {
         #print " YES! This is a dupe!!\n";
         print OUTDUPE "$line\n";
      }
   } else {
      print OUTGOOD "$line\n";
   }
   $keys{$key} = $line;
}
close IN;
close OUTGOOD;
close OUTDUPE;

sub seconds_elapsed {
   my ($str1, $str2) = @_;
   $str1 =~ s/\|.*//;
   $str2 =~ s/\|.*//;
   my @delta = Delta_DHMS((split /\D/, $str1), (split /\D/, $str2));
   my $ret =
      $delta[0] * 24 * 60 * 60 +    # Days -> seconds
      $delta[1] * 60 * 60 +         # Hours -> seconds
      $delta[2] * 60 +              # Minutes -> seconds
      $delta[3];                    # Seconds
   # print "$str1 -> $str2 = $ret seconds elapsed\n";
   return $ret;
}
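
If Date::Calc isn't installed, the same elapsed-seconds figure can probably be had from the core Time::Local module instead. This is only a sketch, not what the script above does: epoch_of and seconds_elapsed_core are made-up names, and it assumes every line starts with a well-formed "YYYY-MM-DD HH:MM:SS" stamp and that treating the stamps as UTC is acceptable.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw( timegm );

# Hypothetical helper: turn a log line's leading timestamp into epoch seconds.
sub epoch_of {
   my ($line) = @_;
   my ($y, $mo, $d, $h, $mi, $s) =
      $line =~ /^(\d{4})-(\d\d)-(\d\d) (\d\d):(\d\d):(\d\d)/
      or die "Unparseable timestamp: $line\n";
   # timegm() takes a 0-based month; using UTC sidesteps DST jumps
   return timegm($s, $mi, $h, $d, $mo - 1, $y);
}

# Hypothetical stand-in for seconds_elapsed(): later stamp minus earlier stamp.
sub seconds_elapsed_core {
   my ($earlier, $later) = @_;
   return epoch_of($later) - epoch_of($earlier);
}

print seconds_elapsed_core(
   "2005-05-19 11:31:09|CRPTWR|D2|4||",
   "2005-05-19 11:31:57|CRPTWR|D2|4||"
), "\n";   # prints 48
```

Subtracting two epoch values also removes the need to multiply out days, hours and minutes by hand.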

Cheers,

j