Separate regular expressions, or one more complex one?
April 03, 2009 [Performance, Python, Regular Expressions, Tech]

I have asked myself this question several times, so I thought it was about time I did a test and found an answer.
If the user of your program can supply you with a list of regular expressions to match against some text, should you combine those expressions into one big one, or treat them separately?
In my case I need an OR relationship, so combining them just means putting a pipe symbol between them.*
So: one expression made by ORing, or looping through several - which is better? There's only one way to find out:
import re, sys

line_with_match_foo = "This line contains foo."
line_with_match_baz = "This line contains baz."
line_without_match = "This line does not contain it."

re_strings = (
    "foo",
    "bar1",
    "bar2",
    "baz",
    "bar3",
    "bar4",
)

piped_re = re.compile( "|".join( re_strings ) )

separate_res = list( re.compile( r ) for r in re_strings )

NUM_ITERATIONS = 1000000

def piped( line ):
    for i in range( NUM_ITERATIONS ):
        if piped_re.search( line ):
            print "match!" # do something

def separate( line ):
    for i in range( NUM_ITERATIONS ):
        for s in separate_res:
            if s.search( line ):
                print "match!" # do something
                break # stop looping because we matched

arg = sys.argv[1]

if arg == "--piped-nomatch":
    piped( line_without_match )
elif arg == "--piped-match-begin":
    piped( line_with_match_foo )
elif arg == "--piped-match-middle":
    piped( line_with_match_baz )
elif arg == "--separate-nomatch":
    separate( line_without_match )
elif arg == "--separate-match-begin":
    separate( line_with_match_foo )
elif arg == "--separate-match-middle":
    separate( line_with_match_baz )
And here are the results:
$ time python re_timings.py --piped-nomatch > /dev/null

real    0m0.987s
user    0m0.943s
sys     0m0.032s

$ time python re_timings.py --separate-nomatch > /dev/null

real    0m3.695s
user    0m3.641s
sys     0m0.037s
So when no regular expression matches, the combined expression is about 3.7 times faster. That makes sense: the combined expression scans each line once, while the separate expressions have to scan it once per pattern.
$ time python re_timings.py --piped-match-middle > /dev/null

real    0m1.900s
user    0m1.858s
sys     0m0.033s

$ time python re_timings.py --separate-match-middle > /dev/null

real    0m3.543s
user    0m3.439s
sys     0m0.042s
And when an expression near the middle of the list matches, the combined expression is about 1.9 times faster.
$ time python re_timings.py --piped-match-begin > /dev/null

real    0m1.847s
user    0m1.797s
sys     0m0.035s

$ time python re_timings.py --separate-match-begin > /dev/null

real    0m1.649s
user    0m1.597s
sys     0m0.032s
But in the (presumably much rarer) case where all lines match the first expression in the list, the separate expressions are marginally faster: the loop stops after a single search, and that search uses a simpler pattern than the full alternation.
A clear win for combining the expressions, unless you think it's likely that most lines will match expressions early in the list.
Note also that if you combine the expressions, performance stays much the same wherever the matching expression sits in the list (whereas with separate expressions, list order matters a lot). So there is probably no need for you or your user to second-guess what order to put the expressions in, which makes life easier for everyone.
I would guess the results would be similar in other programming languages. I certainly found it to be similar in C# on .NET when I tried it a while ago.
By combining the expressions we ask the regular expression engine to do the heavy lifting for us, and it is specifically designed to be good at that job.
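As an aside, the same comparison can also be made in-process with the standard library's timeit module (Python 2.6 and later), instead of shelling out to the time command. A minimal sketch, reusing the patterns from above - this is not how the numbers in this post were produced:

import re
import timeit

re_strings = ( "foo", "bar1", "bar2", "baz", "bar3", "bar4" )
piped_re = re.compile( "|".join( re_strings ) )
separate_res = [ re.compile( r ) for r in re_strings ]

line = "This line does not contain it." # the no-match case

def piped():
    return piped_re.search( line )

def separate():
    for s in separate_res:
        if s.search( line ):
            return True
    return False

# timeit.timeit accepts any callable and returns total seconds
print( timeit.timeit( piped, number = 1000000 ) )
print( timeit.timeit( separate, number = 1000000 ) )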
Open questions:
- Have I made a mistake that makes these results invalid?
- * Can arbitrary regular expressions be ORed together simply by concatenating them with a pipe symbol in between? (There is a sketch towards an answer below this list.)
- Can we do something similar if the problem requires us to AND expressions? (Also sketched below.)
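For what it's worth, here are sketches towards the last two questions. These are starting points rather than definitive answers.

On ORing: the pipe has the lowest precedence of any regular expression operator, so joining complete expressions with | does preserve each one's internal structure. The main hazard I know of is group numbering: groups are renumbered across the combined pattern, and backreferences follow the numbers, so expressions that use them can silently stop matching:

import re

p1 = r"(\w)\1"  # a doubled character, e.g. "oo"
p2 = r"(\d)-\1" # a digit, a dash, then the same digit, e.g. "3-3"

combined = re.compile( "|".join( ( p1, p2 ) ) )

# Inside the combined pattern, the \1 that came from p2 now refers
# to p1's group, which is never set in p2's branch:
print( re.search( p2, "3-3" ) is None )   # False - p2 alone matches
print( combined.search( "3-3" ) is None ) # True - the combined one does not

On ANDing: one approach is zero-width lookaheads. Each pattern must match somewhere ahead, but consumes no input, so all of them are required at once. A sketch, with the caveat that . does not match newlines by default:

import re

patterns = ( "foo", "baz" )
and_re = re.compile( "".join( "(?=.*%s)" % p for p in patterns ) )

print( bool( and_re.search( "baz then foo" ) ) ) # True: both present
print( bool( and_re.search( "only foo" ) ) )     # False: no baz

Whether that beats a Python-level loop that requires every expression to match would need its own timing run.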