problem with regular expression?



    I'm trying to scan a (binary) file for a string matching a particular
    pattern, and am getting unexpected results. I don't know if this is a bug
    or just my own misunderstanding of regular expressions.

    The string I'm searching for is a "versioned file name" of the form:
    "AMS_epXXXx.flt", where 'XXX' is 1 to 3 numerals, the 'x' is lower case 'a-z',
    and the '_' and 'ep' are each optional. In other words, the following are
    examples that match:

    AMSep12a.flt
    ams_ep101b.flt
    ams_123z.flt
    ams12z.flt

    The regular expression pattern I'm using is:

    prefix = 'ams'
    pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)
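
As a quick sanity check, here is a minimal sketch (assuming Python 3) that runs the pattern above against the four example names and collects the three groups for each. The `names` and `results` variables are illustrative only.

```python
import re

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

# The four example names from the description above.
names = ['AMSep12a.flt', 'ams_ep101b.flt', 'ams_123z.flt', 'ams12z.flt']

# Map each name to its (underscore, ep, version) groups.
# A group that did not take part in the match is reported as None.
results = {name: pat.search(name).groups() for name in names}

for name in names:
    print(name, results[name])
```

On isolated names like these, each optional group is either the matched text or `None`, which is the behavior the conditional processing relies on.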


    I'm using the parenthesized groups to conditionally process the match, i.e.,
    if there is no '_' or 'ep' in the name, I still want the match but handle
    it differently. In my pattern above, group 1 is the (_) group, group 2 is
    the (ep) group, and group 3 is the "version string" group.
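
The conditional handling described above might look roughly like the following sketch (assuming Python 3). `describe` and its style labels are hypothetical names chosen for illustration; the real processing for each case is not shown in the post.

```python
import re

prefix = 'ams'
pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

def describe(name):
    """Hypothetical helper: report which optional parts a file name contains."""
    m = pat.search(name)
    if m is None:
        return None
    underscore, ep, version = m.groups()
    # Branch on which optional groups took part in the match.
    if underscore is None and ep is None:
        style = 'bare'
    elif ep is None:
        style = 'underscore only'
    elif underscore is None:
        style = 'ep only'
    else:
        style = 'underscore and ep'
    return style, version

print(describe('ams12z.flt'))
```

This only works as intended if a group that did not participate in the match is reliably `None`, which is exactly what the test case below calls into question.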

    The problem I'm having is that, for the following string of bytes (hex data
    from a file I'm scanning), the search returns a '_' in match group 1 even
    though that '_' lies outside the filename that is correctly matched.

    Here's a code snippet to illustrate:


    #============================================================================
    import binascii, re

    prefix = 'ams'
    #...
    pat = re.compile(prefix + r'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

    #...scan file...

    #-------------------------
    # bytes in problem string (note that this section is arbitrary and not part
    # of the actual problem; it's just my attempt at converting the output of
    # a hexdump file utility into a Python string so as to illustrate the problem
    # in a self-contained test case):

    # problem data in file:
    #
    # 000a 0004 0002 0020 414d 535f 6a75 6c00  ....... AMS_jul.
    # 0000 0000 0000 0000 0000 0000 0000 0000  ................
    # 0000 0000 000a 0004 003f 00d8 414d 5365  .........?..AMSe
    # 7031 3031 692e 666c 7400 0000 0000 0000  p101i.flt.......

    # one string literal per hexdump row above (64 bytes in total)
    bytes = ('000a000400020020414d535f6a756c00'
             '00000000000000000000000000000000'
             '00000000000a0004003f00d8414d5365'
             '70313031692e666c7400000000000000')

    ascii = binascii.a2b_hex(bytes)
    #-------------------------

    m = pat.search(ascii)
    print m.groups()
    print m.span(0), m.span(1), m.span(2), m.span(3)

    #output: ('_', 'ep', '101i')
    # (44, 57) (11, 12) (47, 49) (49, 53)
    #
    # Note that the '_' reported at position 11 in "AMS_jul" is outside the
    # range of the "real" matched string "AMSep101i.flt" at positions (44, 57)!
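
One defensive workaround, sketched here under the assumption of Python 3 (where the scanned data is `bytes`, so the pattern is compiled from `bytes` as well), is to ignore any group whose span falls outside the span of the overall match. `groups_inside_match` is a hypothetical helper name, not part of the `re` module. This does not explain why the stray '_' is reported; it only filters it out, whether or not a given interpreter reports it.

```python
import binascii
import re

prefix = b'ams'
pat = re.compile(prefix + rb'(?:(_)?(ep)?([0-9]{1,3}[a-z])\.flt)', re.I)

# Same 64 bytes as the hexdump in the test case, one literal per row.
hex_data = (
    '000a000400020020414d535f6a756c00'
    '00000000000000000000000000000000'
    '00000000000a0004003f00d8414d5365'
    '70313031692e666c7400000000000000'
)
data = binascii.a2b_hex(hex_data)

def groups_inside_match(m):
    """Return m.groups(), with any group lying outside m.span(0) set to None."""
    start, end = m.span(0)
    cleaned = []
    for i in range(1, m.re.groups + 1):
        g_start, g_end = m.span(i)  # (-1, -1) if the group did not participate
        inside = g_start != -1 and start <= g_start and g_end <= end
        cleaned.append(m.group(i) if inside else None)
    return tuple(cleaned)

m = pat.search(data)
cleaned = groups_inside_match(m)
print(m.span(0), cleaned)
```

With this filter, a '_' found at position 11 would be discarded because it lies before the match start at 44, so the conditional handling sees group 1 as absent.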
