using re module to find " but not " alone ... is this a BUG in re?

  • anton

    using re module to find " but not " alone ... is this a BUG in re?

    Hi,

    I want to replace all occurrences of " by \" in a string.

    But I want to leave all occurrences of \" as they are.

    The following should happen:

    this I want " while I dont want this \"

    should be transformed to:

    this I want \" while I dont want this \"

    and NOT:

    this I want \" while I dont want this \\"

    I even tried the (?<=...) construction, but there I get an unbalanced parenthesis
    error.

    It seems that re is not able to do the job due to parsing/compiling problems
    for this sort of string.


    Have you any idea??

    Anton


    Example: --------------------

    import re

    re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

    Traceback (most recent call last):
      File "<interactive input>", line 1, in <module>
      File "C:\Python25\lib\re.py", line 175, in findall
        return _compile(pattern, flags).findall(string)
      File "C:\Python25\lib\re.py", line 241, in _compile
        raise error, v # invalid expression
    error: unexpected end of regular expression

  • John Machin

    #2
    Re: using re module to find " but not " alone ... is this a BUG in re?

    On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
    > Hi,
    >
    > I want to replace all occurrences of " by \" in a string.
    >
    > But I want to leave all occurrences of \" as they are.
    >
    > The following should happen:
    >
    > this I want " while I dont want this \"
    >
    > should be transformed to:
    >
    > this I want \" while I dont want this \"
    >
    > and NOT:
    >
    > this I want \" while I dont want this \\"
    >
    > I tried even the (?<=...) construction but here I get an unbalanced parenthesis
    > error.

    Sounds like a deficit of backslashes causing re to regard \) as plain
    text and not the magic closing parenthesis in (?<=...) -- and don't
    you want (?<!...) ?

    > It seems that re is not able to do the job due to parsing/compiling problems
    > for this sort of strings.

    Nothing is ever as it seems.

    > Have you any idea??

    For a start, *ALWAYS* use a raw string for an re pattern -- halves the
    backslash pollution!

    > re.findall("[^\\]\"","this I want \" while I dont want this \\\" ")

    and if you have " in the pattern, use '...' to enclose the pattern so
    that you don't have to use \"

    > Traceback (most recent call last):
    >   File "<interactive input>", line 1, in <module>
    >   File "C:\Python25\lib\re.py", line 175, in findall
    >     return _compile(pattern, flags).findall(string)
    >   File "C:\Python25\lib\re.py", line 241, in _compile
    >     raise error, v # invalid expression
    > error: unexpected end of regular expression

    As expected.
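
A minimal sketch of why the error is expected, assuming Python 3 (the thread itself uses Python 2.5, where the error message differs). The ordinary string literal collapses each `\\` to one backslash before re sees it, so re receives `[^\]"`, in which `\]` is an escaped bracket and the character class never closes. The names `pattern`, `fixed`, `text`, and `matches` are illustrative only.

```python
import re

# Ordinary literal: re receives the 5 characters  [^\]"
pattern = "[^\\]\""
try:
    re.compile(pattern)
    compiled_ok = True
except re.error:
    # The class opened by "[" is never closed, because "\]" is a literal "]".
    compiled_ok = False

# Raw literal in single quotes: re receives  [^\\]"
# i.e. "any character except a backslash", followed by a double quote.
fixed = r'[^\\]"'
text = r'this I want " while I dont want this \" '
matches = re.findall(fixed, text)
```

Note that this character-class form also consumes the character before the quote and cannot match a quote at the very start of the string, which is one reason a lookbehind such as (?<!\\) is the better fit for a substitution.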

    What you want is:

    >>> import re
    >>> text = r'frob this " avoid this \", OK?'
    >>> text
    'frob this " avoid this \\", OK?'
    >>> re.sub(r'(?<!\\)"', r'\"', text)
    'frob this \\" avoid this \\", OK?'

    HTH,
    John


    • Duncan Booth

      #3
      Re: using re module to find " but not " alone ... is this a BUG in re?

      John Machin <sjmachin@lexicon.net> wrote:
      > What you want is:
      >
      > >>> import re
      > >>> text = r'frob this " avoid this \", OK?'
      > >>> text
      > 'frob this " avoid this \\", OK?'
      > >>> re.sub(r'(?<!\\)"', r'\"', text)
      > 'frob this \\" avoid this \\", OK?'

      Or you can do it without using regular expressions at all. Just replace
      them all and then fix up the result:

      >>> text = r'frob this " avoid this \", OK?'
      >>> text.replace('"', r'\"').replace(r'\\"', r'\"')
      'frob this \\" avoid this \\", OK?'


      --
      Duncan Booth http://kupuguy.blogspot.com


      • Peter Otten

        #4
        Re: using re module to find " but not " alone ... is this a BUG in re?

        anton wrote:
        > I want to replace all occurrences of " by \" in a string.
        >
        > But I want to leave all occurrences of \" as they are.
        >
        > The following should happen:
        >
        > this I want " while I dont want this \"
        >
        > should be transformed to:
        >
        > this I want \" while I dont want this \"
        >
        > and NOT:
        >
        > this I want \" while I dont want this \\"
        >
        > I tried even the (?<=...) construction but here I get an unbalanced
        > parenthesis error.
        >
        > It seems that re is not able to do the job due to parsing/compiling
        > problems for this sort of strings.
        >
        > Have you any idea??

        The problem is underspecified. Should r'\\"' become r'\\\"' or remain
        unchanged? If the backslash is supposed to escape the following character,
        including another backslash -- that can't be done with regular expressions
        alone:

        # John's proposal:
        >>> print re.sub(r'(?<!\\)"', r'\"', 'no " one \\", two \\\\"')
        no \" one \", two \\"


        One possible fix:
        >>> parts = re.compile("(\\\\.)").split('no " one \\", two \\\\"')
        >>> parts[::2] = [p.replace('"', '\\"') for p in parts[::2]]
        >>> print "".join(parts)
        no \" one \", two \\\"

        Peter
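
A minimal Python 3 sketch of the split-based fix above, wrapped in a function so the two behaviours can be compared side by side. The names `ESCAPE_PAIR`, `escape_unescaped_quotes`, `sample`, `lookbehind`, and `split_based` are illustrative and not from the thread. The capturing group makes split() return the escape pairs at the odd indexes, so only the text between them is touched.

```python
import re

# Backslash followed by any one character; the capturing group makes
# split() keep these pairs in the result list.
ESCAPE_PAIR = re.compile(r'(\\.)')

def escape_unescaped_quotes(s):
    parts = ESCAPE_PAIR.split(s)
    # Even indexes: text between escape pairs. Odd indexes: the pairs themselves.
    parts[::2] = [p.replace('"', '\\"') for p in parts[::2]]
    return ''.join(parts)

sample = 'no " one \\", two \\\\"'

# Lookbehind version: the last quote follows a backslash, so it is left alone.
lookbehind = re.sub(r'(?<!\\)"', r'\"', sample)

# Split version: the two backslashes are consumed as one pair, so the
# final quote counts as unescaped and gets its own backslash.
split_based = escape_unescaped_quotes(sample)
```

Which of the two results is right depends on the answer to the question above: whether a backslash may itself be escaped in the input.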


        • anton

          #5
          Re: using re module to find

          John Machin <sjmachin <at> lexicon.net> writes:
          > On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
          > > Hi,
          > >
          > > I want to replace all occurrences of " by \" in a string.
          > >
          > > But I want to leave all occurrences of \" as they are.
          > >
          > > The following should happen:
          > >
          > > this I want " while I dont want this \"
          > > .... cut text off
          >
          > What you want is:
          >
          > >>> import re
          > >>> text = r'frob this " avoid this \", OK?'
          > >>> text
          > 'frob this " avoid this \\", OK?'
          > >>> re.sub(r'(?<!\\)"', r'\"', text)
          > 'frob this \\" avoid this \\", OK?'
          >
          > HTH,
          > John


          First.. thanks John.

          The whole problem is discussed in A.M. Kuchling's introductory tutorial
          on using regular expressions in Python with the re module,
          in the section "The Backslash Plague".

          Unfortunately this is *NOT* mentioned in the standard
          python documentation of the re module.

          Another thing which will always remain strange to me is that,
          even though the Python documentation on raw strings says:

          "Specifically, a raw string cannot end in a single backslash"

          s=r"\\" # works fine
          s=r"\" # does not work (as stated)

          But both END IN A SINGLE BACKSLASH!

          The main thing which is hard to understand is:

          If a raw string is a string which ignores backslashes,
          then it should ignore them in all circumstances,

          or where could be the problem here (python parser somewhere??).

          Bye

          Anton



          • John Machin

            #6
            Re: using re module to find

            On Jun 13, 6:23 pm, anton <anto...@gmx.de> wrote:
            > John Machin <sjmachin <at> lexicon.net> writes:
            > > On Jun 12, 7:11 pm, anton <anto...@gmx.de> wrote:
            > > > Hi,
            > > >
            > > > I want to replace all occurrences of " by \" in a string.
            > > >
            > > > But I want to leave all occurrences of \" as they are.
            > > >
            > > > The following should happen:
            > > >
            > > > this I want " while I dont want this \"
            > > >
            > > > ... cut text off
            > >
            > > What you want is:
            > >
            > > >>> import re
            > > >>> text = r'frob this " avoid this \", OK?'
            > > >>> text
            > > 'frob this " avoid this \\", OK?'
            > > >>> re.sub(r'(?<!\\)"', r'\"', text)
            > > 'frob this \\" avoid this \\", OK?'
            >
            > First.. thanks John.
            >
            > The whole problem is discussed in A.M. Kuchling's introductory tutorial
            > on using regular expressions in Python with the re module,
            > in the section "The Backslash Plague".
            >
            > Unfortunately this is *NOT* mentioned in the standard
            > python documentation of the re module.

            Yes, and there's more to driving a car in heavy traffic than you will
            find in the manufacturer's manual.

            > Another thing which will always remain strange to me, is that
            > even if in the python doc of raw string:
            >
            > its written:
            > "Specifically, a raw string cannot end in a single backslash"
            >
            > s=r"\\" # works fine
            > s=r"\" # works not (as stated)
            >
            > But both ENDS IN A SINGLE BACKSLASH !

            Apply the interpretation that the first case ends in a double
            backslash, and move on.

            > The main thing which is hard to understand is:
            >
            > If a raw string is a string which ignores backslashes,
            > then it should ignore them in all circumstances,

            Nobody defines a raw string to be a "string that ignores backslashes",
            so your premise is invalid.

            > or where could be the problem here (python parser somewhere??).

            Why r"\" is not a valid string token has been done to death IIRC at
            least twice in this newsgroup ...

            Cheers,
            John
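
A small sketch, assuming Python 3, of what the raw-string literals in question actually contain. It illustrates the point that a raw string does not ignore backslashes: it keeps them, and the tokenizer still treats backslash + quote as a pair when looking for the end of the literal. The variable names are illustrative only.

```python
# r"\\" is two characters: both backslashes are kept.
two = r"\\"

# r"\"" is also two characters: a backslash followed by a double quote.
# The backslash stops the quote from ending the literal, yet stays in the value.
pair = r"\""

# The source text  s = r"\"  is built with chr(92) so that this example
# itself does not need a literal ending in a single backslash.
bad_source = 's = r"' + chr(92) + '"'
try:
    compile(bad_source, "<example>", "exec")
    raw_single_ok = True
except SyntaxError:
    # The closing quote was consumed as part of the backslash + quote pair,
    # so the literal never terminates.
    raw_single_ok = False

# An ordinary (non-raw) literal is the usual way to get one backslash.
single = "\\"
```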


            • Paul McGuire

              #7
              Re: using re module to find " but not " alone ... is this a BUG in re?

              On Jun 12, 4:11 am, anton <anto...@gmx.de> wrote:
              > Hi,
              >
              > I want to replace all occurrences of " by \" in a string.
              >
              > But I want to leave all occurrences of \" as they are.
              >
              > The following should happen:
              >
              >   this I want " while I dont want this \"
              >
              > should be transformed to:
              >
              >   this I want \" while I dont want this \"
              >
              > and NOT:
              >
              >   this I want \" while I dont want this \\"

              A pyparsing version is not as terse as an re, and certainly not as
              fast, but it is easy enough to read. Here is my first brute-force
              approach to your problem:

              from pyparsing import Literal, replaceWith

              escQuote = Literal(r'\"')
              unescQuote = Literal(r'"')
              unescQuote.setParseAction(replaceWith(r'\"'))

              test1 = r'this I want " while I dont want this \"'
              test2 = r'frob this " avoid this \", OK?'

              for test in (test1, test2):
                  print (escQuote | unescQuote).transformString(test)

              And it prints out the desired:

              this I want \" while I dont want this \"
              frob this \" avoid this \", OK?

              This works by defining both of the patterns escQuote and unescQuote,
              and only defines a transforming parse action for the unescQuote. By
              listing escQuote first in the list of patterns to match, properly
              escaped quotes are skipped over.

              Then I looked at your problem slightly differently - why not find both
              '\"' and '"', and replace either one with '\"'? In some cases, I'm
              "replacing" '\"' with '\"', but so what? Here is the simplified
              transformer:

              from pyparsing import Optional, replaceWith

              quotes = Optional('\\') + '"'  # an optional single backslash, then a quote
              quotes.setParseAction(replaceWith(r'\"'))
              for test in (test1, test2):
                  print quotes.transformString(test)


              Again, this prints out the desired output.

              Now let's retrofit this altered logic back onto John Machin's
              solution:

              import re
              for test in (test1, test2):
                  print re.sub(r'\\?"', r'\"', test)


              Pretty short and sweet, and pretty readable for an re.
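
A minimal Python 3 sketch of the same optional-backslash substitution (the thread's own code is Python 2). The function name `escape_quotes` is illustrative. A replacement function is used here, rather than a template string, so the replacement text is not run through re's template escape processing. One useful property of this approach is that applying it twice gives the same result as applying it once.

```python
import re

def escape_quotes(s):
    # Match a double quote together with at most one preceding backslash,
    # and rewrite either form to backslash + quote.
    return re.sub(r'\\?"', lambda m: '\\"', s)

test1 = r'this I want " while I dont want this \"'
once = escape_quotes(test1)
twice = escape_quotes(once)
```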

              To address Peter Otten's question about what to do with an escaped
              backslash, I can't compose this with an re, but I can by adjusting the
              first pyparsing version to include an escaped backslash as a "match
              but don't do anything with it" expression, just like we did with
              escQuote:

              from pyparsing import Optional, Literal, replaceWith

              escQuote = Literal(r'\"')
              unescQuote = Literal(r'"')
              unescQuote.setParseAction(replaceWith(r'\"'))
              backslash = chr(92)
              escBackslash = Literal(backslash+backslash)

              test3 = r'no " one \", two \\"'
              for test in (test1, test2, test3):
                  print (escBackslash | escQuote |
                         unescQuote).transformString(test)

              Prints:
              this I want \" while I dont want this \"
              frob this \" avoid this \", OK?
              no \" one \", two \\\"

              At first I thought the last transform was an error, but on closer
              inspection, I see that the input line ends with an escaped backslash,
              followed by a lone '"', which must be replaced with '\"'. So in the
              transformed version we see '\\\"', the original escaped backslash,
              followed by the replacement '\"' string.

              Cheers,
              -- Paul
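
For readers without pyparsing, the "match it first, then leave it alone" idea can be approximated with the standard re module plus a replacement function. This is a sketch under assumptions, in Python 3, and not the code from the post: the alternation tries an escape pair (backslash + any character) before a bare quote, and the function (named `escape_bare_quotes` here for illustration) rewrites only the bare quote. Unlike the pyparsing version, it treats every backslash pair as an escape, not only `\\` and `\"`.

```python
import re

# Ordered alternatives: an escape pair is tried first, then a bare quote.
TOKEN = re.compile(r'\\.|"', re.DOTALL)

def escape_bare_quotes(s):
    def fix(m):
        tok = m.group(0)
        # Escape pairs pass through unchanged; a bare quote gets a backslash.
        return '\\"' if tok == '"' else tok
    return TOKEN.sub(fix, s)

test1 = r'this I want " while I dont want this \"'
test3 = r'no " one \", two \\"'
result1 = escape_bare_quotes(test1)
result3 = escape_bare_quotes(test3)
```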
