Re: [Omaha.pm] regex preference

To: "Perl Mongers of Omaha, Nebraska USA" <omaha-pm@pm.org>
Subject: Re: [Omaha.pm] regex preference
From: Christopher Cashell <topher-pm@zyp.org>
Date: Thu, 5 Aug 2010 16:59:50 -0500

2010/7/21 Jay Hannah <jhannah@omnihotels.com>
>    if ( $in_string =~ /<HotelCode>$code/mgi || $in_string =~ /<mfResort>$code/mgi ) {
>
> I prefer
>
>    if ( $in_string =~ /<(HotelCode|mfResort)>$code/mgi ) {

From a readability standpoint, I 100% agree with you.  I think the
combined regular expression is a lot more readable.  The one note I
would add, which bit me recently, regards performance.  If this is a
performance sensitive bit of code, and that regex will be going
through a lot of data, it's probably worth benchmarking it.

Background: A little while ago, I was writing a bit of Perl to do some
basic log processing.  When putting together some regular
expressions, I assumed that it would be faster to do a single regex
with alternation than two separate checks, very similar to
the code above.  It turned out my assumption was very wrong.

A little assistance from the Benchmark module (thank you "Effective
Perl Programming, 2nd Ed." for kicking me in the ass to use that), and
I found out that combining them was significantly slower.  Here's the
code and results of my tests:

#!/usr/bin/perl
# vim: ts=3 sw=3 et sm ai smd sc bg=dark
#######################################################################
# Small script to benchmark regular expressions.  Expects test text as
# standard input.  Runs the test 100 times.  The xN_ prefix is to force
# test result ordering (sorts alphabetically by default).
#######################################################################
use strict;
use warnings;
use Benchmark qw(timethese);

my @data = <>;
my $host = "lab13";

print "Testing against " . scalar @data . " lines.\n";

timethese(
   $ARGV[0] || 100,
   {
      x1_control => sub {
         foreach (@data) {
            if (1) {
               next;
            };
         }
      },

      x2_mgi_separate => sub {
         foreach (@data) {
            my $foo = ( m/$host.*sudo/mgi || m/$host.*ssh/mgi );
         }
      },

      x3_separate => sub {
         foreach (@data) {
            if ( my ($foo) = ( m/$host.*sudo/g || m/$host.*ssh/g ) ) {
               next;
            };
         }
      },

      x4_mgi_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/mgi ) {
               next;
            };
         }
      },

      x5_combined => sub {
         foreach (@data) {
            if ( m/$host.*(?:sudo|ssh)/g ) {
               next;
            };
         }
      },

      x6_mgi_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/mgi ) {
               next;
            };
         }
      },

      x7_combined_capture => sub {
         foreach (@data) {
            if ( m/$host.*(sudo|ssh)/g ) {
               next;
            };
         }
      },

   }
);


topher@nexus:~/perl/foo$ ./regex-benchmark.pl /tmp/regex-benchmark.data
Testing against 16789 lines.
Benchmark: timing 100 iterations of x1_control, x2_mgi_separate, x3_separate, x4_mgi_combined, x5_combined, x6_mgi_combined_capture, x7_combined_capture...

x1_control:  0 wallclock secs ( 0.32 usr +  0.00 sys =  0.32 CPU) @ 312.50/s (n=100)
            (warning: too few iterations for a reliable count)
x2_mgi_separate:  7 wallclock secs ( 6.41 usr +  0.00 sys =  6.41 CPU) @ 15.60/s (n=100)
x3_separate:  2 wallclock secs ( 2.20 usr +  0.01 sys =  2.21 CPU) @ 45.25/s (n=100)
x4_mgi_combined: 15 wallclock secs (14.87 usr +  0.04 sys = 14.91 CPU) @  6.71/s (n=100)
x5_combined:  3 wallclock secs ( 3.25 usr +  0.00 sys =  3.25 CPU) @ 30.77/s (n=100)
x6_mgi_combined_capture: 18 wallclock secs (17.66 usr +  0.00 sys = 17.66 CPU) @  5.66/s (n=100)
x7_combined_capture:  3 wallclock secs ( 3.48 usr +  0.00 sys =  3.48 CPU) @ 28.74/s (n=100)

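As you can see, for this case and with this data, using separate regex
checks beats a combined regex with alternation: about 2.3 times as
fast with /mgi (6.41 vs. 14.91 CPU seconds) and about 1.5 times as
fast without (2.21 vs. 3.25).  The opposite of what I had expected.
The case-insensitive /mgi variants are also significantly slower,
taking roughly three to five times as long as their plain /g
counterparts.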
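
For anyone who wants to check the arithmetic, the ratios can be
recomputed from the CPU seconds in the output above.  This is only a
small sketch: the numbers are copied from that run, while the %cpu
hash and the ratio() helper are names made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# CPU seconds copied from the benchmark output above (100 iterations each).
my %cpu = (
    x2_mgi_separate         => 6.41,
    x3_separate             => 2.21,
    x4_mgi_combined         => 14.91,
    x5_combined             => 3.25,
    x6_mgi_combined_capture => 17.66,
    x7_combined_capture     => 3.48,
);

# Hypothetical helper: how many times longer $slow took than $fast.
sub ratio {
    my ( $slow, $fast ) = @_;
    return $cpu{$slow} / $cpu{$fast};
}

# Combined alternation versus two separate checks.
printf "combined vs separate, /mgi: %.2fx\n",
    ratio( 'x4_mgi_combined', 'x2_mgi_separate' );
printf "combined vs separate, /g:   %.2fx\n",
    ratio( 'x5_combined', 'x3_separate' );

# /mgi versus plain /g for the same pattern shape.
printf "/mgi vs /g, separate:       %.2fx\n",
    ratio( 'x2_mgi_separate', 'x3_separate' );
printf "/mgi vs /g, combined:       %.2fx\n",
    ratio( 'x4_mgi_combined', 'x5_combined' );
```

These figures describe one run against one data file, so they are a
rough guide rather than a general rule.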

After this, I've discovered that I'm not as smart as I thought I was
with my assumptions about optimizing regular expressions.  Now all
regular expressions that are going to be chewing on large data sets
get tested with a few alternatives to make sure I'm not screwing up
performance by being clever.  Even little things can have a big
impact, especially with big data files.
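
A minimal sketch of that kind of check, using Benchmark's cmpthese
rather than the timethese call in the script above.  The sample log
lines, the %candidates hash and the pattern variable names are all
hypothetical and stand in for real data.  This version also compiles
the patterns once with qr// and leaves off the /g and /m flags, which
a plain yes/no match on a single line does not need; that is a choice
made for this example, not something taken from the original tests.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data; substitute real log lines for a meaningful result.
my @data = map {
      $_ % 3 == 0 ? "lab13 sudo: session opened $_\n"
    : $_ % 3 == 1 ? "lab13 sshd: accepted key $_\n"
    :               "lab42 cron: job $_ finished\n"
} 1 .. 3000;

my $host = 'lab13';

# Compile each pattern once, outside the timed code.
my $re_sudo     = qr/\Q$host\E.*sudo/;
my $re_ssh      = qr/\Q$host\E.*ssh/;
my $re_combined = qr/\Q$host\E.*(?:sudo|ssh)/;

# Each candidate returns the number of matching lines, so the
# alternatives can be checked against each other before being timed.
my %candidates = (
    separate => sub {
        my $n = grep { $_ =~ $re_sudo || $_ =~ $re_ssh } @data;
        return $n;
    },
    combined => sub {
        my $n = grep { $_ =~ $re_combined } @data;
        return $n;
    },
);

# Sanity check: a faster alternative is useless if it matches differently.
my %counts = map { $_ => $candidates{$_}->() } keys %candidates;
print "$_ matched $counts{$_} lines\n" for sort keys %counts;

# A negative count means "run each candidate for at least that many CPU seconds".
cmpthese( -1, \%candidates );
```

cmpthese prints a table of rates and relative percentages, which is
easier to compare at a glance than raw timings.  Which variant wins
may well differ with other data, patterns or Perl versions, so it is
worth running against a sample of the real input.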

> j

--
Christopher
