catastrophic regexp, help!

  • cirfu

    catastrophic regexp, help!

    pat = re.compile("(\w* *)*")
    This matches all sentences.
    If fed the string "are you crazy? i am" it will return "are you
    crazy".

    I want to find, in a big string, a sentence containing Zlatan
    Ibrahimovic and some other text,
    i.e. return the first sentence containing the name Zlatan Ibrahimovic.


    patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
    should do this according to Regex Coach, but it seems to send my
    computer to 100% CPU and the process cannot be closed.

  • Maric Michaud

    #2
    Re: catastrophic regexp, help!

    On Wednesday 11 June 2008 06:20:14, cirfu wrote:
    > pat = re.compile("(\w* *)*")
    > This matches all sentences.
    > If fed the string "are you crazy? i am" it will return "are you
    > crazy".
    >
    > I want to find, in a big string, a sentence containing Zlatan
    > Ibrahimovic and some other text,
    > i.e. return the first sentence containing the name Zlatan Ibrahimovic.
    >
    > patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
    > should do this according to Regex Coach, but it seems to send my
    > computer to 100% CPU and the process cannot be closed.

    Regexps of this kind are quite often harmful. The pattern is perfectly valid
    and, given enough time, it will return, but it checks far too many
    possibilities to be practical.

    Read it sequentially to make sense of it: for each sequence of word + space,
    trying the longest first, does the string 'zlatan' follow?

    "this is zlatan example."
    compare with 'this is zlatan example', 'z'=='.', false
    compare with 'this is zlatan ', 'z'=='e', false
    compare with 'this is zlatan', 'z'==' ', false
    compare with 'this is ', "zlatan"=="zlatan", true
    compare with 'this is', 'z'==' ', false
    compare with 'this ', 'z'=='i', false
    compare with 'this', 'z'==' ', false
    ...

    ouch !

    The simpler your regexes are, the better; two short regexes are better
    than one big one, etc.
    Don't do premature optimization (especially with regexps).

    In [161]: s="""pat = re.compile("(\w* *)*")
    this matches all sentences.
    if fed the string "are you crazy? i am" it will return "are you
    crazy".
    i want to find a in a big string a sentence containing Zlatan
    Ibrahimovic and some other text.
    ie return the first sentence containing the name Zlatan Ibrahimovic.
    patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
    should do this according to regexcoach but it seems to send my
    computer into 100%CPU-power and not closable.
    """

    In [172]: list(e[0] for e in re.findall("((\w+\s*)+)", s, re.M) if
    re.findall('zlatan\s+ibrahimovic', e[0], re.I))
    Out[172]:
    ['i want to find a in a big string a sentence containing Zlatan\nIbrahimovic
    and some other text',
    'ie return the first sentence containing the name Zlatan Ibrahimovic',
    'zlatan ibrahimovic ']



    --
    _____________

    Maric Michaud


    • Maric Michaud

      #3
      Re: catastrophic regexp, help!

      On Wednesday 11 June 2008 09:08:53, Maric Michaud wrote:
      > "this is zlatan example."
      > compare with 'this is zlatan example', 'z'=='.', false
      > compare with 'this is zlatan ', 'z'=='e', false
      > compare with 'this is zlatan', 'z'==' ', false
      > compare with 'this is ', "zlatan"=="zlatan", true

      Ah, no! It stops here, but it would have continued over the entire string,
      down to the empty string, if the text did not contain zlatan at all.

      > compare with 'this is', 'z'==' ', false
      > compare with 'this ', 'z'=='i', false
      > compare with 'this', 'z'==' ', false


      --
      _____________

      Maric Michaud


      • Chris

        #4
        Re: catastrophic regexp, help!

        On Jun 11, 6:20 am, cirfu <circularf...@yahoo.se> wrote:
        > pat = re.compile("(\w* *)*")
        > This matches all sentences.
        > If fed the string "are you crazy? i am" it will return "are you
        > crazy".
        >
        > I want to find, in a big string, a sentence containing Zlatan
        > Ibrahimovic and some other text,
        > i.e. return the first sentence containing the name Zlatan Ibrahimovic.
        >
        > patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")
        > should do this according to Regex Coach, but it seems to send my
        > computer to 100% CPU and the process cannot be closed.

        Maybe something like this would be of use...

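        def sentence_locator(s, sub):
            # Case-insensitive count of the occurrences of sub in s.
            cnt = s.upper().count(sub.upper())
            if not cnt:
                return None
            tmp = []
            idx = -1
            while cnt:
                # Position of the next occurrence of sub.
                idx = s.upper().find(sub.upper(), (idx+1))
                a = -1
                while True:
                    # Walk forward over the full stops that precede idx.
                    b = s.find('.', (a+1), idx)
                    if b == -1:
                        # a is the last full stop before idx; find the closing one.
                        b = s.find('.', idx)
                        if b == -1:
                            tmp.append(s[a+1:])
                            break
                        tmp.append(s[a+1:b+1])
                        break
                    a = b
                cnt -= 1
            return tmp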


        • TheSaint

          #5
          Re: catastrophic regexp, help!

          On 12:20, Wednesday 11 June 2008, cirfu wrote:
          > patzln = re.compile("(\w* *)* zlatan ibrahimovic (\w* *)*")

          I think you shouldn't put anything around the phrase you want to find.

          patzln = re.compile(r'.*(zlatan ibrahimovic){1,1}.*')

          This should do it for you, unless you are searching at a particular position.

          On the other hand, I'd like to understand how I can substitute a variable
          inside a pattern.

          If I do:
          import os, re
          EOL = os.linesep

          re_EOL = re.compile(r'[?P<EOL>\s+2\t]')

          for line in open('myfile', 'r').readlines():
              print re_EOL.sub('', line)

          will it remove tabs, spaces and end-of-line?
          It does, but not the EOL :(
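
On substituting a variable into a pattern, here is a minimal sketch, assuming the goal is to strip spaces, tabs and line endings. `re.escape` and plain string concatenation are standard Python; the names `re_blank` and `strip_blanks` are made up for illustration. Note also that `print` appends its own newline, which may be why the EOL seemed to survive.

```python
import os
import re

EOL = os.linesep  # platform line ending, e.g. "\n" or "\r\n"

# Build the pattern from the variable. re.escape protects any character in
# EOL that would otherwise be special inside a regular expression.
re_blank = re.compile(re.escape(EOL) + r"|[ \t\n]")


def strip_blanks(line):
    """Remove spaces, tabs and line endings from one line (hypothetical helper)."""
    return re_blank.sub("", line)
```

Writing the result with `sys.stdout.write(strip_blanks(line))` instead of `print` avoids adding a fresh newline after each line.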

          --
          Mailsweeper Home : http://it.geocities.com/call_me_not_now/index.html


          • cirfu

            #6
            Re: catastrophic regexp, help!

            On 11 June, 17:04, TheSaint <fc14301...@icqmail.com> wrote:
            > patzln = re.compile(r'.*(zlatan ibrahimovic){1,1}.*')
            >
            > This should do it for you, unless you are searching at a particular position.


            it returns all the sentences. i just want the one containing zlatan
            ibrahimovic.


            • cirfu

              #7
              Re: catastrophic regexp, help!

              On 11 June, 10:25, Chris <cwi...@gmail.com> wrote:
              > Maybe something like this would be of use...
              >
              > def sentence_locator(s, sub):
              > [...]

              Yes, but it seems very unpythonic :)
              There must be a simpler way that isn't slow as hell.
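
One simpler shape for the same idea, as a sketch only: find the name with plain string methods, then cut at the nearest full stops on either side. The function name `first_sentence_with` is hypothetical, and the sketch assumes sentences are delimited by '.' only, as in Chris's function.

```python
def first_sentence_with(text, name):
    """Return the first '.'-delimited sentence containing name, or None.

    Hypothetical helper. Matching is case-insensitive; the index arithmetic
    assumes lower() does not change the string length (true for ASCII text).
    """
    idx = text.lower().find(name.lower())
    if idx == -1:
        return None
    # rfind returns -1 when there is no earlier full stop, so start becomes 0.
    start = text.rfind(".", 0, idx) + 1
    end = text.find(".", idx)
    stop = len(text) if end == -1 else end + 1
    return text[start:stop].strip()
```

Usage: `first_sentence_with(big_string, "Zlatan Ibrahimovic")`. Each step is a single linear scan, so there is no backtracking to blow up.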


              • alfasub000@gmail.com

                #8
                Re: catastrophic regexp, help!

                On Jun 11, 11:07 pm, cirfu <circularf...@yahoo.se> wrote:
                > Yes, but it seems very unpythonic :)
                > There must be a simpler way that isn't slow as hell.

                Why wouldn't you use character classes instead of groups? i.e.:

                pat = re.compile(r'([ \w]*Zlatan Ibrahimovic[ \w]*)')
                sentence = pat.search(text).groups()

                As has been mentioned earlier, certain evil combinations of regular
                expressions and groups will cause Python's regular expression engine
                to go (righteously) crazy, as they require the internal state machine
                to branch out exponentially.
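
A small timing sketch of that difference, under stated assumptions: `evil` is the start of the pattern from the original post, `safe` is a character-class variant in the spirit of the suggestion above, and the input is a run of letters that cannot match. The names `evil`, `safe` and `time_search` are made up for illustration, and the sizes are kept small so the script finishes quickly; actual timings depend on the machine.

```python
import re
import time

# Nested quantifiers: a run of letters can be split between iterations of
# the group in many different ways, and all of them are tried on failure.
evil = re.compile(r"(\w* *)* zlatan")
# Character class with a single quantifier: only one way to consume the run.
safe = re.compile(r"[ \w]* zlatan")


def time_search(pattern, text):
    """Return (match, seconds) for one pattern.search call."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start


for n in (10, 14, 18):
    text = "a" * n + "!"  # contains no ' zlatan', so the search must fail
    _, t_evil = time_search(evil, text)
    _, t_safe = time_search(safe, text)
    print(f"n={n:2d}  nested group: {t_evil:.4f}s  character class: {t_safe:.6f}s")
```

The nested-group time should grow much faster than the input length, which is the behaviour that pins the CPU on a large string.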
