Re: [Omaha.pm] regex preference
2010/7/21 Jay Hannah <jhannah@omnihotels.com>
> if ( $in_string =~ /<HotelCode>$code/mgi || $in_string =~ /<mfResort>$code/mgi ) {
>
> I prefer
>
> if ( $in_string =~ /<(HotelCode|mfResort)>$code/mgi ) {
From a readability standpoint, I 100% agree with you. I think the
combined regular expression is a lot more readable. The one note I
would add, which bit me recently, regards performance. If this is a
performance sensitive bit of code, and that regex will be going
through a lot of data, it's probably worth benchmarking it.
Background: A little while ago, I was writing a bit of Perl to do some
basic log processing. When putting together some regular
expressions, I assumed that it would be faster to do a single regex
with alternation than two separate checks, very similar to
the code above. That assumption turned out to be very wrong.
A little assistance from the Benchmark module (thank you "Effective
Perl Programming, 2nd Ed." for kicking me in the ass to use that), and
I found out that combining them was significantly slower. Here's the
code and results of my tests:
#!/usr/bin/perl
# vim: ts=3 sw=3 et sm ai smd sc bg=dark
#######################################################################
# Small script to benchmark regular expressions. Expects test text as
# standard input. Runs the test 100 times. The xN_ prefix is to
# force test result ordering (sorts alphabetically by default).
#######################################################################
use strict;
use warnings;
use Benchmark qw(timethese);
my @data = <>;
my $host = "lab13";
print "Testing against " . scalar @data . " lines.\n";
timethese(
   $ARGV[0] || 100,
   {
      x1_control => sub {
         foreach (@data) {
            if (1) {
               next;
            };
         }
      },
      x2_mgi_separate => sub {
         foreach (@data) {
            my $foo = ( m/$host.*sudo/mgi || m/$host.*ssh/mgi );
         }
      },
      x3_separate => sub {
         foreach (@data) {
            if ( my ($foo) = ( m/$host.*sudo/g || m/$host.*ssh/g ) ) {
               next;
            };
         }
      },
      x4_mgi_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/mgi ) {
               next;
            };
         }
      },
      x5_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/g ) {
               next;
            };
         }
      },
      x6_mgi_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/mgi ) {
               next;
            };
         }
      },
      x7_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/g ) {
               next;
            };
         }
      },
   }
);
topher@nexus:~/perl/foo$ ./regex-benchmark.pl /tmp/regex-benchmark.data
Testing against 16789 lines.
Benchmark: timing 100 iterations of x1_control, x2_mgi_separate, x3_separate, x4_mgi_combined, x5_combined, x6_mgi_combined_capture, x7_combined_capture...
x1_control: 0 wallclock secs ( 0.32 usr + 0.00 sys = 0.32 CPU) @ 312.50/s (n=100)
(warning: too few iterations for a reliable count)
x2_mgi_separate: 7 wallclock secs ( 6.41 usr + 0.00 sys = 6.41 CPU) @ 15.60/s (n=100)
x3_separate: 2 wallclock secs ( 2.20 usr + 0.01 sys = 2.21 CPU) @ 45.25/s (n=100)
x4_mgi_combined: 15 wallclock secs (14.87 usr + 0.04 sys = 14.91 CPU) @ 6.71/s (n=100)
x5_combined: 3 wallclock secs ( 3.25 usr + 0.00 sys = 3.25 CPU) @ 30.77/s (n=100)
x6_mgi_combined_capture: 18 wallclock secs (17.66 usr + 0.00 sys = 17.66 CPU) @ 5.66/s (n=100)
x7_combined_capture: 3 wallclock secs ( 3.48 usr + 0.00 sys = 3.48 CPU) @ 28.74/s (n=100)
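As you can see, for this case and with this data, using separate regex
checks is noticeably faster than a combined regex with alternation:
about 2.3 times as fast for the case-insensitive versions (6.41 vs.
14.91 CPU seconds) and about 1.5 times as fast for the case-sensitive
ones (2.21 vs. 3.25). The opposite of what I had expected.
Case-insensitive searches are also significantly slower, by roughly
three to five times here.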
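For reading ratios like these directly, the Benchmark module's cmpthese
function prints a comparison table instead of raw timings. What follows
is only a minimal sketch of the separate-versus-combined comparison. The
log lines are synthetic and hypothetical (not the original test file),
the sub names count_separate and count_combined are made up for
illustration, and the /g, /m and /i flags are left out on purpose, so
the numbers it prints will not match the ones above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: synthetic log lines, NOT the original test file.
my $host = "lab13";
my @data;
for my $n ( 1 .. 3000 ) {
   if    ( $n % 3 == 0 ) { push @data, "Jul 21 10:00:00 $host sudo: command number $n\n" }
   elsif ( $n % 3 == 1 ) { push @data, "Jul 21 10:00:00 $host sshd: session number $n\n" }
   else                  { push @data, "Jul 21 10:00:00 other cron: job number $n\n" }
}

# Compile the patterns once so only matching is timed.
my $sudo     = qr/$host.*sudo/;
my $ssh      = qr/$host.*ssh/;
my $combined = qr/$host.*(?:sudo|ssh)/;

# Count matching lines with two separate checks.
sub count_separate {
   my $count = 0;
   for my $line (@data) {
      $count++ if $line =~ $sudo || $line =~ $ssh;
   }
   return $count;
}

# Count matching lines with one alternation.
sub count_combined {
   my $count = 0;
   for my $line (@data) {
      $count++ if $line =~ $combined;
   }
   return $count;
}

# Make sure both variants select the same lines before timing them.
die "variants disagree\n" unless count_separate() == count_combined();

# A negative count means "run each sub for at least that many CPU seconds".
cmpthese( -1, {
   separate => \&count_separate,
   combined => \&count_combined,
} );
```

Which variant wins, and by how much, depends on the Perl version and
the data, so treat the table as a measurement of your own setup rather
than a general rule.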
After this, I discovered that I'm not as smart as I thought I was
with my assumptions about optimizing regular expressions. Now all
regular expressions that are going to be chewing on large data sets
get tested with a few alternatives to make sure I'm not screwing up
performance by being clever. Even little things can have a big
impact, especially with big data files.
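One such little thing to check in any harness that reuses the same
strings across runs: a match with the /g flag in scalar context
remembers where it stopped (see pos) on the string itself, and the next
/g match against that string continues from there. The sketch below
only demonstrates that behaviour on a single made-up line. It makes no
claim about how much, if at all, this influenced the timings above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical log line, for illustration only.
my $line = "lab13 sudo: session opened";

my ( @with_g, @without_g );

# Scalar-context /g: a successful match sets pos($line), so the next
# attempt starts after the previous match and fails, which resets pos.
for ( 1 .. 4 ) {
   push @with_g, ( $line =~ /lab13.*sudo/g ? 1 : 0 );
}

# Without /g every attempt starts at the beginning of the string.
for ( 1 .. 4 ) {
   push @without_g, ( $line =~ /lab13.*sudo/ ? 1 : 0 );
}

print "with /g:    @with_g\n";       # prints "with /g:    1 0 1 0"
print "without /g: @without_g\n";    # prints "without /g: 1 1 1 1"
```

If a test only needs a yes/no answer per line, dropping /g keeps every
pass over the data equivalent to the first one.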
> j
--
Christopher