String search vs regexp search

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • Anand Pillai

    String search vs regexp search

    To search a word in a group of words, say a paragraph or a web page,
    would a string search or a regexp search be faster?

    The string search would of course be,

    if str.find(substr ) != -1:
    domything()

    And the regexp search assuming no case restriction would be,

    strre=re.compil e(substr, re.IGNORECASE)

    m=strre.search( str)
    if m:
    domything()

    I was about to do a test, then I thought someone here might have
    some data on this already.

    Thanks folks!

    -Anan
  • Duncan Booth

    #2
    Re: String search vs regexp search

    pythonguy@Hotpo p.com (Anand Pillai) wrote in
    news:84fc4588.0 310120655.ea95a f6@posting.goog le.com:
    [color=blue]
    > To search a word in a group of words, say a paragraph or a web page,
    > would a string search or a regexp search be faster?
    >
    > The string search would of course be,
    >
    > if str.find(substr ) != -1:
    > domything()
    >
    > And the regexp search assuming no case restriction would be,
    >
    > strre=re.compil e(substr, re.IGNORECASE)
    >
    > m=strre.search( str)
    > if m:
    > domything()
    >
    > I was about to do a test, then I thought someone here might have
    > some data on this already.
    >[/color]
    Yes. The answer is 'it all depends'.

    Things it depends on include:

    Your two bits of code do different things, one is case sensitive, one
    ignores case. Which did you need?

    How long is the string you are searching? How long is the substring?

    Is the substring the same every time, or are you always searching for
    different strings. Can the substring contain characters with special
    meanings for regular expressions?

    The regular expression code has a startup penalty since it has to compile
    the regular expression at least once, however the actual searching may be
    faster than the naive str.find. If the time spent doing the search is
    sufficiently long compared with the time doing the compile, the regular
    expression may win out.

    Bottom line: write the code so it is as clean and maintainable as possible.
    Only worry about optimising this if you have timed it and know that your
    searches are a bottleneck.

    --
    Duncan Booth duncan@rcp.co.u k
    int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
    "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

    Comment

    • Anand Pillai

      #3
      Re: String search vs regexp search

      Sorry for being too brief!

      I was talking about a function which 'counts' the number
      of occurences using string & regexp.

      I wrote the code for the regexp search as well as the function
      search and tested it on a rather large file (800 KB) for
      occurences of a certain word. I find that the string search
      is at least 2 times faster than the one with regexp, excluding
      the time for the regexp.compile( ) method. This is particularly
      noticeable when the file becomes quite large and the word is
      spread out.

      I also thought the regexp would beat string thumbs down and I
      am suprised at the result that it is the other way around.

      Here is the code. Note that I am using the 'count' methods that
      count the number of occurences rather than the 'find' methods.

      # Test to find out whether string search in a data
      # is faster than regexp search.

      # Results: String search is much faster when it comes
      # to many occurences of the sub string.

      import time

      def strsearch1(s, substr):

      t1 = time.time()
      print 'Count 1 =>', s.count(substr)
      t2 = time.time()
      print 'Searching using string, Time taken => ', t2 - t1

      def strsearch2(s, substr):

      import re

      r=re.compile(su bstr, re.IGNORECASE)
      t1 = time.time()
      print 'Count 2 =>', len(r.findall(s ))
      t2 = time.time()
      print 'Searching using regexp, Time taken => ', t2 - t1


      data=open("test .html", "r").read()
      strsearch1(data , "Miriam")
      strsearch2(data , "Miriam")

      # Output here...

      D:\Programming\ python>python strsearch.py
      Count 1 => 45
      Searching using string, Time taken => 0.0599999427795
      Count 2 => 45
      Searching using regexp, Time taken => 0.110000014305

      Test was done on a windows 98 machine using Python 2.3, running
      on 248 MB RAM, Intel 1.7 GHz chipset.

      I was thinking of using regexp searches in my code, but this convinces
      me to stick on to the good old string search.

      Thanks for the replies.

      -Anand

      Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message news:<Xns941360 C9B9445duncanrc pcouk@127.0.0.1 >...[color=blue]
      > pythonguy@Hotpo p.com (Anand Pillai) wrote in
      > news:84fc4588.0 310120655.ea95a f6@posting.goog le.com:
      >[color=green]
      > > To search a word in a group of words, say a paragraph or a web page,
      > > would a string search or a regexp search be faster?
      > >
      > > The string search would of course be,
      > >
      > > if str.find(substr ) != -1:
      > > domything()
      > >
      > > And the regexp search assuming no case restriction would be,
      > >
      > > strre=re.compil e(substr, re.IGNORECASE)
      > >
      > > m=strre.search( str)
      > > if m:
      > > domything()
      > >
      > > I was about to do a test, then I thought someone here might have
      > > some data on this already.
      > >[/color]
      > Yes. The answer is 'it all depends'.
      >
      > Things it depends on include:
      >
      > Your two bits of code do different things, one is case sensitive, one
      > ignores case. Which did you need?
      >
      > How long is the string you are searching? How long is the substring?
      >
      > Is the substring the same every time, or are you always searching for
      > different strings. Can the substring contain characters with special
      > meanings for regular expressions?
      >
      > The regular expression code has a startup penalty since it has to compile
      > the regular expression at least once, however the actual searching may be
      > faster than the naive str.find. If the time spent doing the search is
      > sufficiently long compared with the time doing the compile, the regular
      > expression may win out.
      >
      > Bottom line: write the code so it is as clean and maintainable as possible.
      > Only worry about optimising this if you have timed it and know that your
      > searches are a bottleneck.[/color]

      Comment

      • Jeremy Fincher

        #4
        Re: String search vs regexp search

        Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message news:<Xns941360 C9B9445duncanrc pcouk@127.0.0.1 >...[color=blue]
        > The regular expression code has a startup penalty since it has to compile
        > the regular expression at least once, however the actual searching may be
        > faster than the naive str.find. If the time spent doing the search is
        > sufficiently long compared with the time doing the compile, the regular
        > expression may win out.[/color]

        Both regular expression searching and string.find will do searching
        one character at a time; given that, it seems impossible to me that
        the hand-coded-in-C "naive" string.find could be slower than the
        machine-translated-coded-in-Python regular expression search.
        Compilation time only serves to further increase string.find's
        advantage.

        Jeremy

        Comment

        • Dennis Reinhardt

          #5
          Re: String search vs regexp search

          > I find that the string search[color=blue]
          > is at least 2 times faster than the one with regexp
          > r=re.compile(su bstr, re.IGNORECASE)[/color]

          Off the top, maybe the regex is taking twice the time because it is doing
          twice the work looking for both the lower case character and the upper case
          character. It does not seem a fair test because the string find is case
          sensitive.
          --

          Dennis Reinhardt

          DennisR@dair.co m http://www.spamai.com?ng_py


          Comment

          • Duncan Booth

            #6
            Re: String search vs regexp search

            tweedgeezer@hot mail.com (Jeremy Fincher) wrote in
            news:698f09f8.0 310132019.4bd91 8b2@posting.goo gle.com:
            [color=blue]
            > Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message
            > news:<Xns941360 C9B9445duncanrc pcouk@127.0.0.1 >...[color=green]
            >> The regular expression code has a startup penalty since it has to
            >> compile the regular expression at least once, however the actual
            >> searching may be faster than the naive str.find. If the time spent
            >> doing the search is sufficiently long compared with the time doing
            >> the compile, the regular expression may win out.[/color]
            >
            > Both regular expression searching and string.find will do searching
            > one character at a time; given that, it seems impossible to me that
            > the hand-coded-in-C "naive" string.find could be slower than the
            > machine-translated-coded-in-Python regular expression search.
            > Compilation time only serves to further increase string.find's
            > advantage.
            >[/color]
            I may have misremembered, but I thought there was a thread discussing this
            a little while back which claimed that the regular expression library
            looked for constant strings at the start of the regex, and if it found one
            used Boyer-Moore to do the search. If it does, then regular expressions
            searching for a constant string certainly ought to be much faster than a
            plain string.find (as the length of the searched string tends towards
            infinity).

            If it doesn't, then it should.

            --
            Duncan Booth duncan@rcp.co.u k
            int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
            "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

            Comment

            • Duncan Booth

              #7
              Re: String search vs regexp search

              Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in
              news:Xns9414603 182031duncanrcp couk@127.0.0.1:
              [color=blue][color=green]
              >> Both regular expression searching and string.find will do searching
              >> one character at a time; given that, it seems impossible to me that
              >> the hand-coded-in-C "naive" string.find could be slower than the
              >> machine-translated-coded-in-Python regular expression search.
              >> Compilation time only serves to further increase string.find's
              >> advantage.
              >>[/color]
              > I may have misremembered, but I thought there was a thread discussing
              > this a little while back which claimed that the regular expression
              > library looked for constant strings at the start of the regex, and if
              > it found one used Boyer-Moore to do the search. If it does, then
              > regular expressions searching for a constant string certainly ought to
              > be much faster than a plain string.find (as the length of the searched
              > string tends towards infinity).
              >
              > If it doesn't, then it should.[/color]

              Ok, found the code. Regular expression searches do indeed use a form of
              Boyer-Moore, but not if you are ignoring case. So by specifying
              re.IGNORECASE the OP got a double hit, not only does the code have to do
              case insensitive comparisons, but it also has to crawl along looking at
              every character in the search string instead of skipping most of them.

              --
              Duncan Booth duncan@rcp.co.u k
              int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
              "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

              Comment

              • Jeremy Fincher

                #8
                Re: String search vs regexp search

                Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message news:<Xns941462 CDB5003duncanrc pcouk@127.0.0.1 >...[color=blue]
                > Ok, found the code. Regular expression searches do indeed use a form of
                > Boyer-Moore, but not if you are ignoring case. So by specifying
                > re.IGNORECASE the OP got a double hit, not only does the code have to do
                > case insensitive comparisons, but it also has to crawl along looking at
                > every character in the search string instead of skipping most of them.[/color]

                That's cool! Where'd you find the code?

                Jeremy

                Comment

                • Alex Martelli

                  #9
                  Re: String search vs regexp search

                  Jeremy Fincher wrote:
                  [color=blue]
                  > Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message
                  > news:<Xns941462 CDB5003duncanrc pcouk@127.0.0.1 >...[color=green]
                  >> Ok, found the code. Regular expression searches do indeed use a form of
                  >> Boyer-Moore, but not if you are ignoring case. So by specifying
                  >> re.IGNORECASE the OP got a double hit, not only does the code have to do
                  >> case insensitive comparisons, but it also has to crawl along looking at
                  >> every character in the search string instead of skipping most of them.[/color]
                  >
                  > That's cool! Where'd you find the code?[/color]

                  Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or Modules/_sre.c in
                  a standard source distribution?


                  Alex

                  Comment

                  • Duncan Booth

                    #10
                    Re: String search vs regexp search

                    Alex Martelli <aleax@aleax.it > wrote in
                    news:pCVib.2827 14$R32.9272822@ news2.tin.it:
                    [color=blue]
                    > Jeremy Fincher wrote:
                    >[color=green]
                    >> Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message
                    >> news:<Xns941462 CDB5003duncanrc pcouk@127.0.0.1 >...[color=darkred]
                    >>> Ok, found the code. Regular expression searches do indeed use a form
                    >>> of Boyer-Moore, but not if you are ignoring case. So by specifying
                    >>> re.IGNORECASE the OP got a double hit, not only does the code have
                    >>> to do case insensitive comparisons, but it also has to crawl along
                    >>> looking at every character in the search string instead of skipping
                    >>> most of them.[/color]
                    >>
                    >> That's cool! Where'd you find the code?[/color]
                    >
                    > Hmmm, dist/src/Modules/_sre.c in the Python CVS tree? Or
                    > Modules/_sre.c in a standard source distribution?
                    >[/color]
                    Close, but it is actually a little bit complicated. _sre.c has the code
                    that does the search, including the skipping forward using an overlap
                    table, but the bit which checks for a literal prefix and builds the overlap
                    table is in lib/src_compile.py. So its in the Python code you have to look
                    to find that it ignores literal prefixes when ignoring case.

                    --
                    Duncan Booth duncan@rcp.co.u k
                    int month(char *p){return(1248 64/((p[0]+p[1]-p[2]&0x1f)+1)%12 )["\5\x8\3"
                    "\6\7\xb\1\x9\x a\2\0\4"];} // Who said my code was obscure?

                    Comment

                    • Anand Pillai

                      #11
                      Re: String search vs regexp search

                      Small point. If you notice the code, compilation time is
                      not included in the actual time calculated for regexp.

                      -Anand

                      tweedgeezer@hot mail.com (Jeremy Fincher) wrote in message news:<698f09f8. 0310132019.4bd9 18b2@posting.go ogle.com>...[color=blue]
                      > Duncan Booth <duncan@NOSPAMr cp.co.uk> wrote in message news:<Xns941360 C9B9445duncanrc pcouk@127.0.0.1 >...[color=green]
                      > > The regular expression code has a startup penalty since it has to compile
                      > > the regular expression at least once, however the actual searching may be
                      > > faster than the naive str.find. If the time spent doing the search is
                      > > sufficiently long compared with the time doing the compile, the regular
                      > > expression may win out.[/color]
                      >
                      > Both regular expression searching and string.find will do searching
                      > one character at a time; given that, it seems impossible to me that
                      > the hand-coded-in-C "naive" string.find could be slower than the
                      > machine-translated-coded-in-Python regular expression search.
                      > Compilation time only serves to further increase string.find's
                      > advantage.
                      >
                      > Jeremy[/color]

                      Comment

                      Working...