regex question

  • mathieu

    regex question

    I do not understand what is wrong with the following regex expression.
    I clearly mark that the separator in between group 3 and group 4
    should contain at least 2 whitespace characters, but group 3 is actually
    reading groups 3 + 4.

    Thanks
    -Mathieu

    import re

    line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
    Auto Window Width          SL   1 "
    patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
    -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
    $")
    m = patt.match(line)
    if m:
        print(m.group(3))
        print(m.group(4))
  • Wanja Chresta

    #2
    Re: regex question

    Hey Mathieu

    Due to word wrap I'm not sure what you want to do. What result do you
    expect? I get:
    >>> print(m.groups())
    ('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings Auto Window
    Width ', ' ', 'SL', '1')
    But only when I insert a space in the 3rd char group (I'm not sure if
    your original pattern has a space there or not). So the third group is:
    ([A-Za-z0-9./:_ -]+). If I do not insert the space, the pattern does not
    match the line.

    I also can't tell what the format of your line is. If it is like this:
    line = "...Siemens : Thorax/Multix FD Lab Settings Auto Window Width..."
    where "Auto Window Width" should be the 4th group, you have to mark the
    + in the 3rd group as non-greedy (done by appending a "?"):

    ([A-Za-z0-9./:_ -]+?)
    With that I get:
    >>> patt.match(line).groups()
    ('0021', 'xx0A', 'Siemens: Thorax/Multix FD Lab Settings', 'Auto Window
    Width ', 'SL', '1')
    Which is probably what you want. You can also add the non-greedy marker
    to the fourth group, to get rid of the trailing spaces.
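
To make the difference concrete, here is a minimal sketch that builds the pattern three ways: fully greedy, non-greedy in group 3 only, and non-greedy in groups 3 and 4. The `build` helper is hypothetical, and the runs of spaces in `line` are an assumption (the word wrap in this thread hides the real ones); all that matters is that at least two spaces separate "Settings" from "Auto".

```python
import re

# Assumed sample line: the exact number of spaces in each gap is a guess.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")


def build(group3_quant, group4_quant):
    """Hypothetical helper: same pattern, with selectable quantifiers."""
    return re.compile(
        r"^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+"
        r"([A-Za-z0-9./:_ -]" + group3_quant + r")\s\s+"
        r"([A-Za-z0-9 ()._,/#>-]" + group4_quant + r")\s+"
        r"([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*$"
    )


greedy = build("+", "+").match(line)     # group 3 swallows group 4's text
lazy3 = build("+?", "+").match(line)     # group 3 stops at the first 2-space gap
lazy34 = build("+?", "+?").match(line)   # group 4 also drops its trailing spaces

for m in (greedy, lazy3, lazy34):
    print(repr(m.group(3)), repr(m.group(4)))
```

The greedy version still matches, which is why the bug is easy to miss: the space inside the third character class lets group 3 run all the way to the last gap that still leaves room for the remaining fields.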

    HTH
    Wanja


    mathieu wrote:
    I clearly mark that the separator in between group 3 and group 4
    should contain at least 2 whitespace characters, but group 3 is actually
    reading groups 3 + 4.


    • bearophileHUGS@lycos.com

      #3
      Re: regex question

      mathieu, stop writing complex REs like obfuscated toys. Use the
      re.VERBOSE flag and split that RE into several commented and
      *indented* lines (indented just like Python code), using the
      indentation level to denote nesting. With that you may be able to
      solve the problem by yourself. If not, you can offer us something
      much more readable to fix.
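
As a rough sketch of that advice, the original (still greedy) pattern could be laid out like this. It assumes the third character class contains a space, which the word wrap in the first post leaves ambiguous, and the spacing of the sample `line` is likewise a guess. In verbose mode, whitespace and `#` inside a character class stay literal, so the classes do not need escaping.

```python
import re

# Assumed sample line: the exact number of spaces in each gap is a guess.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

patt = re.compile(r"""
    ^\s*
    \(
        ([0-9A-Z]+) ,           # group 1: first half of the tag
        ([0-9A-Zx]+)            # group 2: second half, may contain x
    \)
    \s+
    ([A-Za-z0-9./:_ -]+)        # group 3: first text field (greedy)
    \s\s+                       # separator: two or more whitespace chars
    ([A-Za-z0-9 ()._,/#>-]+)    # group 4: second text field (greedy)
    \s+
    ([A-Z][A-Z]_?O?W?)          # group 5: type code
    \s+
    ([0-9n-]+)                  # group 6: count
    \s*$
""", re.VERBOSE)

m = patt.match(line)
print(repr(m.group(3)))
print(repr(m.group(4)))
```

Laid out this way, the two `+` quantifiers on the text fields stand out as the places where a field can run past its separator.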

      Bye,
      bearophile


      • Paul McGuire

        #4
        Re: regex question

        On Feb 13, 6:53 am, mathieu <mathieu.malate...@gmail.com> wrote:
        > I do not understand what is wrong with the following regex expression.
        > I clearly mark that the separator in between group 3 and group 4
        > should contain at least 2 whitespace characters, but group 3 is actually
        > reading groups 3 + 4.
        >
        > Thanks
        > -Mathieu
        >
        > import re
        >
        > line = "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings
        > Auto Window Width          SL   1 "
        > patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
        > -]+)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
        > $")
        <snip>

        I love the smell of regex'es in the morning!

        For more legible posting (and general maintainability), try breaking
        up your quoted strings like this:

        line = \
            "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
            "Auto Window Width          SL   1 "

        patt = re.compile(
            "^\s*"
            "\("
            "([0-9A-Z]+),"
            "([0-9A-Zx]+)"
            "\)\s+"
            "([A-Za-z0-9./:_ -]+)\s\s+"
            "([A-Za-z0-9 ()._,/#>-]+)\s+"
            "([A-Z][A-Z]_?O?W?)\s+"
            "([0-9n-]+)\s*$")


        Of course, the problem is that you have a greedy match in the part of
        the regex that is supposed to stop between "Settings" and "Auto".
        Change patt to:

        patt = re.compile(
            "^\s*"
            "\("
            "([0-9A-Z]+),"
            "([0-9A-Zx]+)"
            "\)\s+"
            "([A-Za-z0-9./:_ -]+?)\s\s+"
            "([A-Za-z0-9 ()._,/#>-]+)\s+"
            "([A-Z][A-Z]_?O?W?)\s+"
            "([0-9n-]+)\s*$")

        or if you prefer:

        patt = re.compile("^\s*\(([0-9A-Z]+),([0-9A-Zx]+)\)\s+([A-Za-z0-9./:_
        -]+?)\s\s+([A-Za-z0-9 ()._,/#>-]+)\s+([A-Z][A-Z]_?O?W?)\s+([0-9n-]+)\s*
        $")

        It looks like you wrote this regex to process this specific input
        string - it has a fragile feel to it, as if you will have to go back
        and tweak it to handle other data that might come along, such as

        (xx42,xx0A) Honeywell: Inverse Flitznoid (Kelvin)
        80 SL 1
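
A quick way to see the fragility is to check that made-up line against the character classes one field at a time. This is only a sketch: the field boundaries of the Honeywell line are my assumption, and the spacing of the full line is guessed as well.

```python
import re

# Non-greedy pattern from above, written with raw strings.
patt = re.compile(
    r"^\s*"
    r"\("
    r"([0-9A-Z]+),"
    r"([0-9A-Zx]+)"
    r"\)\s+"
    r"([A-Za-z0-9./:_ -]+?)\s\s+"
    r"([A-Za-z0-9 ()._,/#>-]+)\s+"
    r"([A-Z][A-Z]_?O?W?)\s+"
    r"([0-9n-]+)\s*$")

# Assumed spacing; only the field contents come from the example line.
honeywell = ("      (xx42,xx0A)   Honeywell: Inverse Flitznoid (Kelvin)  "
             "80          SL   1 ")

# The whole line is rejected...
print(patt.match(honeywell))

# ...and checking fields individually shows two separate reasons why.
first_ok = re.fullmatch(r"[0-9A-Z]+", "xx42") is not None
desc_ok = re.fullmatch(r"[A-Za-z0-9./:_ -]+",
                       "Honeywell: Inverse Flitznoid (Kelvin)") is not None
print(first_ok, desc_ok)
```

The first group has no `x` in its class, and the third has no parentheses, so either field alone is enough to make the match fail.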


        Just out of curiosity, I wondered what a pyparsing version of this
        would look like. See below:

        from pyparsing import Word,hexnums,delimitedList,printables,\
            White,Regex,nums

        line = \
            "      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  " \
            "Auto Window Width          SL   1 "

        # define fields
        hexint = Word(hexnums+"x")
        text = delimitedList(Word(printables),
                             delim=White(" ",exact=1), combine=True)
        type_label = Regex("[A-Z][A-Z]_?O?W?")
        int_label = Word(nums+"n-")

        # define line structure - give each field a name
        line_defn = "(" + hexint("x") + "," + hexint("y") + ")" + \
            text("desc") + text("window") + type_label("type") + \
            int_label("int")

        line_parts = line_defn.parseString(line)
        print(line_parts.dump())
        print(line_parts.desc)

        Prints:
        ['(', '0021', ',', 'xx0A', ')', 'Siemens: Thorax/Multix FD Lab
        Settings', 'Auto Window Width', 'SL', '1']
        - desc: Siemens: Thorax/Multix FD Lab Settings
        - int: 1
        - type: SL
        - window: Auto Window Width
        - x: 0021
        - y: xx0A
        Siemens: Thorax/Multix FD Lab Settings

        I was just guessing on the field names, but you can see where they are
        defined and change them to the appropriate values.
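
For comparison, the same idea of naming each field can be sketched with the standard `re` module alone, using named groups. The names are the same guesses as above, the text fields are made non-greedy, and the spacing of `line` is an assumption.

```python
import re

# Assumed sample line: the exact number of spaces in each gap is a guess.
line = ("      (0021,xx0A)   Siemens: Thorax/Multix FD Lab Settings  "
        "Auto Window Width          SL   1 ")

# Field names are illustrative guesses, not taken from any specification.
patt = re.compile(
    r"^\s*"
    r"\((?P<x>[0-9A-Z]+),(?P<y>[0-9A-Zx]+)\)\s+"
    r"(?P<desc>[A-Za-z0-9./:_ -]+?)\s\s+"
    r"(?P<window>[A-Za-z0-9 ()._,/#>-]+?)\s+"
    r"(?P<type>[A-Z][A-Z]_?O?W?)\s+"
    r"(?P<int>[0-9n-]+)\s*$")

m = patt.match(line)
fields = m.groupdict()   # maps each field name to its matched text
print(fields["desc"])
print(fields["window"])
```

This keeps the regex's fragility, but `m.group("desc")` is easier to maintain than `m.group(3)` if fields are ever added or reordered.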

        -- Paul
