any chance regular expressions are cached?

  • mh@pixar.com

    any chance regular expressions are cached?

    I've got a bit of code in a function like this:

    s=re.sub(r'\n','\n'+spaces,s)
    s=re.sub(r'^',spaces,s)
    s=re.sub(r' *\n','\n',s)
    s=re.sub(r' *$','',s)
    s=re.sub(r'\n*$','',s)

    Is there any chance that these will be cached somewhere, and save
    me the trouble of having to declare some global re's if I don't
    want to have them recompiled on each function invocation?

    Many TIA!
    Mark

    --
    Mark Harrison
    Pixar Animation Studios
  • Tim Chase

    #2
    Re: any chance regular expressions are cached?

    > s=re.sub(r'\n','\n'+spaces,s)
    > s=re.sub(r'^',spaces,s)
    > s=re.sub(r' *\n','\n',s)
    > s=re.sub(r' *$','',s)
    > s=re.sub(r'\n*$','',s)
    >
    > Is there any chance that these will be cached somewhere, and save
    > me the trouble of having to declare some global re's if I don't
    > want to have them recompiled on each function invocation?

    >>> import this
    ....
    Explicit is better than implicit.
    ....


    Sounds like what you want is to use the compile() call to compile
    once, and then use the resulting objects:

    re1 = re.compile(r'\n')
    re2 = re.compile(r'^')
    ...
    s = re1.sub('\n' + spaces, s)
    s = re2.sub(spaces, s)
    ...


    The compile() should be done once (outside loops, possibly at a
    module level, as, in a way, they're constants) and then you can
    use the resulting object without the overhead of compiling.
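
A minimal sketch of the compile-once approach, assuming the five patterns from the original post. The constant names and the function name `indent_and_trim` are hypothetical, chosen only for illustration.

```python
import re

# Compiled once, at import time. The names are illustrative, not from the thread.
NEWLINE_RE = re.compile(r'\n')
START_RE = re.compile(r'^')
SPACES_BEFORE_NEWLINE_RE = re.compile(r' *\n')
TRAILING_SPACES_RE = re.compile(r' *$')
TRAILING_NEWLINES_RE = re.compile(r'\n*$')


def indent_and_trim(s, spaces):
    """Apply the five substitutions from the original post, in the same order."""
    s = NEWLINE_RE.sub('\n' + spaces, s)       # indent every line after the first
    s = START_RE.sub(spaces, s)                # indent the first line
    s = SPACES_BEFORE_NEWLINE_RE.sub('\n', s)  # drop spaces before each newline
    s = TRAILING_SPACES_RE.sub('', s)          # drop spaces at the very end
    s = TRAILING_NEWLINES_RE.sub('', s)        # drop newlines at the very end
    return s
```

Calling `indent_and_trim(text, '  ')` then reuses the same five pattern objects on every invocation, with no lookup by pattern string.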

    -tkc




    • Ryan Ginstrom

      #3
      RE: any chance regular expressions are cached?

      On Behalf Of Tim Chase
      > Sounds like what you want is to use the compile() call to
      > compile once, and then use the resulting objects:
      >
      > re1 = re.compile(r'\n')
      > re2 = re.compile(r'^')
      > ...
      > s = re1.sub('\n' + spaces, s)
      > s = re2.sub(spaces, s)

      Yes. And I would go a step further and suggest that regular expressions are
      best avoided in favor of simpler things when possible. That will make the
      code easier to debug, and probably faster.

      A couple of examples:
      >>> text = """spam spam spam
      spam spam


      spam

      spam"""
      >>> # normalize newlines
      >>> print("\n".join([line for line in text.splitlines() if line]))
      spam spam spam
      spam spam
      spam
      spam
      >>> # normalize whitespace
      >>> print(" ".join(text.split()))
      spam spam spam spam spam spam spam
      >>> # strip leading/trailing space
      >>> text = " spam "
      >>> print(text.lstrip())
      spam
      >>> print(text.rstrip())
       spam
      >>> print(text.strip())
      spam

      Regards,
      Ryan Ginstrom


      • Terry Reedy

        #4
        Re: any chance regular expressions are cached?


        <mh@pixar.com> wrote in message
        news:bu%Aj.5528$fX7.893@nlpi061.nbdc.sbc.com...
        | I've got a bit of code in a function like this:
        |
        | s=re.sub(r'\n','\n'+spaces,s)
        | s=re.sub(r'^',spaces,s)
        | s=re.sub(r' *\n','\n',s)
        | s=re.sub(r' *$','',s)
        | s=re.sub(r'\n*$','',s)
        |
        | Is there any chance that these will be cached somewhere, and save
        | me the trouble of having to declare some global re's if I don't
        | want to have them recompiled on each function invocation?

        The last time I looked, several versions ago, re did cache.
        Don't know if still true. Not part of spec, I don't think.

        tjr




        • Steven D'Aprano

          #5
          Re: any chance regular expressions are cached?

          On Mon, 10 Mar 2008 00:42:47 +0000, mh wrote:
          > I've got a bit of code in a function like this:
          >
          > s=re.sub(r'\n','\n'+spaces,s)
          > s=re.sub(r'^',spaces,s)
          > s=re.sub(r' *\n','\n',s)
          > s=re.sub(r' *$','',s)
          > s=re.sub(r'\n*$','',s)
          >
          > Is there any chance that these will be cached somewhere, and save me the
          > trouble of having to declare some global re's if I don't want to have
          > them recompiled on each function invocation?

          At the interactive interpreter, type "help(re)" [enter]. A page or two
          down, you will see:

          purge()
          Clear the regular expression cache


          and looking at the source code I see many calls to _compile() which
          starts off with:

          def _compile(*key):
              # internal: compile pattern
              cachekey = (type(key[0]),) + key
              p = _cache.get(cachekey)
              if p is not None:
                  return p

          So yes, the re module caches its regular expressions.
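
One way to observe the cache from the outside is sketched below. It assumes CPython, where compiling the same pattern twice returns the same cached object; that identity is an implementation detail and may differ elsewhere. The variable names are illustrative.

```python
import re

re.purge()                      # start from an empty cache

first = re.compile(r' *\n')
second = re.compile(r' *\n')    # same pattern and flags, so the cache is hit
same_object = first is second   # observed on CPython; not a documented guarantee

re.purge()                      # clear the regular expression cache
third = re.compile(r' *\n')     # compiled afresh after the purge
recompiled = third is not first

print(same_object, recompiled)
```

The module-level functions such as `re.sub()` go through the same cache, which is why the original code does not pay the full compilation cost on every call.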

          Having said that, at least four out of the five examples you give are
          good examples of when you SHOULDN'T use regexes.

          re.sub(r'\n','\n'+spaces,s)

          is better written as s.replace('\n', '\n'+spaces). Don't believe me?
          Check this out:

          >>> s = 'hello\nworld'
          >>> spaces = "   "
          >>> from timeit import Timer
          >>> Timer("re.sub('\\n', '\\n'+spaces, s)",
          ... "import re;from __main__ import s, spaces").timeit()
          7.4031901359558105
          >>> Timer("s.replace('\\n', '\\n'+spaces)",
          ... "import re;from __main__ import s, spaces").timeit()
          1.6208670139312744

          The regex is nearly five times slower than the simple string replacement.


          Similarly:

          re.sub(r'^',spaces,s)

          is better written as spaces+s, which is nearly eleven times faster.

          Also:

          re.sub(r' *$','',s)
          re.sub(r'\n*$', '',s)

          are just slow ways of writing s.rstrip(' ') and s.rstrip('\n').
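
A quick sanity check of these rewrites on a few sample strings, as a sketch. The samples are made up for illustration, and passing here shows agreement only on these inputs.

```python
import re

spaces = '   '
samples = ['hello\nworld', 'hello   ', 'hello\n\n', '']

for s in samples:
    # Substituting a literal newline is the same as str.replace.
    assert re.sub(r'\n', '\n' + spaces, s) == s.replace('\n', '\n' + spaces)
    # Without re.MULTILINE, '^' matches only at the start of the string.
    assert re.sub(r'^', spaces, s) == spaces + s

# Trailing spaces and trailing newlines, on inputs ending in one kind of character.
stripped_spaces = re.sub(r' *$', '', 'hello   ')
stripped_newlines = re.sub(r'\n*$', '', 'hello\n\n')

# A case where the regex and rstrip disagree: '$' also matches just
# before a final newline, so the spaces in front of it are removed.
regex_result = re.sub(r' *$', '', 'a  \n')
rstrip_result = 'a  \n'.rstrip(' ')
```

The last pair is worth noting: the regex gives `'a\n'` while `rstrip(' ')` leaves `'a  \n'` untouched, so the two are interchangeable only when the string does not end in a newline preceded by spaces.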



          --
          Steven


          • John Machin

            #6
            Re: any chance regular expressions are cached?

            On Mar 10, 11:42 am, m...@pixar.com wrote:
            > I've got a bit of code in a function like this:
            >
            > s=re.sub(r'\n','\n'+spaces,s)
            > s=re.sub(r'^',spaces,s)
            > s=re.sub(r' *\n','\n',s)
            > s=re.sub(r' *$','',s)
            > s=re.sub(r'\n*$','',s)
            >
            > Is there any chance that these will be cached somewhere, and save
            > me the trouble of having to declare some global re's if I don't
            > want to have them recompiled on each function invocation?

            Yes they will be cached. But do yourself a favour and check out the
            string methods.

            E.g.
            >>> import re
            >>> def opfunc(s, spaces):
            ...     s=re.sub(r'\n','\n'+spaces,s)
            ...     s=re.sub(r'^',spaces,s)
            ...     s=re.sub(r' *\n','\n',s)
            ...     s=re.sub(r' *$','',s)
            ...     s=re.sub(r'\n*$','',s)
            ...     return s
            ...
            >>> def myfunc(s, spaces):
            ...     return '\n'.join(spaces + x.rstrip() if x.rstrip() else '' for x in s.splitlines())
            ...
            >>> t1 = 'foo\nbar\nzot\n'
            >>> t2 = 'foo\nbar \nzot\n'
            >>> t3 = 'foo\n\nzot\n'
            >>> [opfunc(s, ' ') for s in (t1, t2, t3)]
            [' foo\n bar\n zot', ' foo\n bar\n zot', ' foo\n\n zot']
            >>> [myfunc(s, ' ') for s in (t1, t2, t3)]
            [' foo\n bar\n zot', ' foo\n bar\n zot', ' foo\n\n zot']
            >>>


            • mh@pixar.com

              #7
              Re: any chance regular expressions are cached?

              John Machin <sjmachin@lexicon.net> wrote:
              > Yes they will be cached.

              great.

              > But do yourself a favour and check out the
              > string methods.

              Nifty... thanks all!

              --
              Mark Harrison
              Pixar Animation Studios


              • John Machin

                #8
                Re: any chance regular expressions are cached?

                On Mar 10, 3:42 pm, John Machin <sjmac...@lexicon.net> wrote rather
                baroquely:
                > >>> def myfunc(s, spaces):
                > ...     return '\n'.join(spaces + x.rstrip() if x.rstrip() else '' for x in s.splitlines())

                Better:
                ...     return '\n'.join((spaces + x).rstrip() for x in s.splitlines())
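
As a sketch, the shorter version can be checked against the original five substitutions on the same kind of test strings used earlier in the thread. The two-space indent and the exact samples are assumptions for illustration, and this is a spot check on three inputs, not a proof of equivalence.

```python
import re


def opfunc(s, spaces):
    # The five substitutions from the original post.
    s = re.sub(r'\n', '\n' + spaces, s)
    s = re.sub(r'^', spaces, s)
    s = re.sub(r' *\n', '\n', s)
    s = re.sub(r' *$', '', s)
    s = re.sub(r'\n*$', '', s)
    return s


def myfunc(s, spaces):
    # The shorter string-method version: indent each line, then
    # strip trailing whitespace (which also empties blank lines).
    return '\n'.join((spaces + x).rstrip() for x in s.splitlines())


samples = ['foo\nbar\nzot\n', 'foo\nbar  \nzot\n', 'foo\n\nzot\n']
results = [(opfunc(s, '  '), myfunc(s, '  ')) for s in samples]
agree = all(a == b for a, b in results)
print(agree)
```

Note that `rstrip()` with no argument removes any trailing whitespace, including tabs, whereas the regexes above remove only spaces and newlines, so the two functions can differ on other inputs.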


                • Arnaud Delobelle

                  #9
                  Re: any chance regular expressions are cached?

                  On Mar 10, 3:39 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
                  [...]
                  > Having said that, at least four out of the five examples you give are
                  > good examples of when you SHOULDN'T use regexes.
                  >
                  > re.sub(r'\n','\n'+spaces,s)
                  >
                  > is better written as s.replace('\n', '\n'+spaces). Don't believe me?
                  > Check this out:
                  >
                  > >>> s = 'hello\nworld'
                  > >>> spaces = "   "
                  > >>> from timeit import Timer
                  > >>> Timer("re.sub('\\n', '\\n'+spaces, s)",
                  > ... "import re;from __main__ import s, spaces").timeit()
                  > 7.4031901359558105
                  > >>> Timer("s.replace('\\n', '\\n'+spaces)",
                  > ... "import re;from __main__ import s, spaces").timeit()
                  > 1.6208670139312744
                  >
                  > The regex is nearly five times slower than the simple string replacement.

                  I agree that the second version is better, but most of the time in the
                  first one is spent compiling the regexp, so the comparison is not
                  really fair:

                  >>> s = 'hello\nworld'
                  >>> spaces = " "
                  >>> import re
                  >>> r = re.compile('\\n')
                  >>> from timeit import Timer
                  >>> Timer("r.sub('\\n'+spaces, s)", "from __main__ import r,spaces,s").timeit()
                  1.7726190090179443
                  >>> Timer("s.replace('\\n', '\\n'+spaces)", "from __main__ import s, spaces").timeit()
                  0.76739501953125
                  >>> Timer("re.sub('\\n', '\\n'+spaces, s)", "from __main__ import re, s, spaces").timeit()
                  4.3669700622558594
                  >>>

                  Regexps are still more than twice as slow.
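
For anyone wanting to repeat the comparison, here is a rough sketch that times the three variants side by side. The function name `time_variants` and the iteration count are arbitrary choices, and the numbers will vary by machine and Python version, so none are shown.

```python
import re
import timeit

s = 'hello\nworld'
spaces = '   '
pattern = re.compile(r'\n')  # precompiled once, outside the timed code


def time_variants(number=100_000):
    """Return the seconds taken by `number` calls of each variant."""
    return {
        're.sub': timeit.timeit(
            lambda: re.sub(r'\n', '\n' + spaces, s), number=number),
        'compiled.sub': timeit.timeit(
            lambda: pattern.sub('\n' + spaces, s), number=number),
        'str.replace': timeit.timeit(
            lambda: s.replace('\n', '\n' + spaces), number=number),
    }


timings = time_variants()
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name:>12}: {seconds:.3f} s')
```

Each variant is wrapped in a lambda, so all three pay the same call overhead and the differences between them are what matter, not the absolute figures.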

                  --
                  Arnaud
