Improving my text processing script

  • pruebauno@latinmail.com

    Improving my text processing script

    I am sure there is a better way of writing this, but how?

    import re
    f=file('tlst')
    tlst=f.read().split('\n')
    f.close()
    f=file('plst')
    sep=re.compile('Identifier "(.*?)"')
    plst=[]
    for elem in f.read().split('Identifier'):
        content='Identifier'+elem
        match=sep.search(content)
        if match:
            plst.append((match.group(1),content))
    f.close()
    flst=[]
    for table in tlst:
        for prog,content in plst:
            if content.find(table)>0:
                flst.append('"%s","%s"'%(prog,table))
    flst.sort()
    for elem in flst:
        print elem



    What would be the best way of writing this program? BTW, I test find>0
    to guard against table=='' (an empty line), so that I do not include
    everything.

    tlst is of the form:

    tablename1
    tablename2

    ....

    plst is of the form:

    Identifier "Program1"
    Name "Random Stuff"
    Value "tablename2 "
    ....other random properties
    Name "More Random Stuff"
    Identifier "Program 2"
    Name "Yet more stuff"
    Value "tablename2 "
    ....


    I want to know which programs use the tables in tlst (and only those).

  • Paul McGuire

    #2
    Re: Improving my text processing script

    Even though you are using re's to try to look for specific substrings
    (which you sort of fake in by splitting on "Identifier ", and then
    prepending "Identifier " to every list element, so that the re will
    match...), this program has quite a few holes.

    What if the word "Identifier " is inside one of the quoted strings?
    What if the actual value is "tablename10"? This will match your
    "tablename1 " string search, but it is certainly not what you want.
    Did you know there are trailing blanks on your table names, which could
    prevent any program name from matching?
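
The pitfalls listed above can be reproduced in a few lines. This is a minimal Python 3 sketch, not code from the thread: the function names and sample strings are made up for illustration, and treating the part after the last dot as the table name (so that a qualified name such as dba.tablename1 still counts) is an assumption.

```python
import re

def substring_hits(tables, content):
    """Plain substring test, as in the original script (blank names skipped)."""
    return [t for t in tables if t and t in content]

def exact_hits(tables, content):
    """Compare each whole quoted Value against the stripped table names."""
    wanted = {t.strip() for t in tables if t.strip()}
    values = re.findall(r'Value "(.*?)"', content)
    # Assumption: an optional "schema." prefix is ignored when comparing.
    return [v.strip() for v in values if v.strip().split(".")[-1] in wanted]

content = 'Identifier "Program1"\nValue "tablename10"\n'

print(substring_hits(["tablename1 "], content))  # trailing blank: nothing matches
print(substring_hits(["tablename1"], content))   # matches, although the table is tablename10
print(exact_hits(["tablename1"], content))       # whole-value comparison rejects it
```

Comparing whole values trades the accidental matches of a substring search for a stricter rule, so it only fits if every table reference really is the complete quoted value.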

    So here is an alternative approach using, as many have probably
    predicted by now if they've spent any time on this list, the pyparsing
    module. You may ask, "isn't a parser overkill for this problem?" and
    the answer will likely be "probably", but in the case of pyparsing, I'd
    answer "probably, but it is so easy, and takes care of so much junk
    like dealing with quoted strings and intermixed data, so, who cares if
    it's overkill?"

    So here is the 20-line pyparsing solution. Insert it into your program
    after you have read in tlst, and read in the input data using something
    like data = file('plst').read(). (The first line strips the whitespace
    from the ends of your table names.)

    # tlst is the list of table names read earlier
    tlist = map(str.rstrip, tlst)

    from pyparsing import quotedString,LineStart,LineEnd,removeQuotes
    quotedString.setParseAction( removeQuotes )

    identLine = (LineStart() + "Identifier" + quotedString +
                 LineEnd()).setResultsName("identifier")
    tableLine = (LineStart() + "Value" + quotedString +
                 LineEnd()).setResultsName("tableref")

    interestingLines = ( identLine | tableLine )
    thisprog = ""
    for toks,start,end in interestingLines.scanString( data ):
        toktype = toks.getName()
        if toktype == 'identifier':
            thisprog = toks[1]
        elif toktype == 'tableref':
            thistable = toks[1]
            if thistable in tlist:
                print '"%s","%s"' % (thisprog, thistable)
            else:
                print "Not", thisprog, "contains wrong table (" + thistable + ")"

    This program will print out:
    "Program1","tablename2"
    "Program 2","tablename2"


    Download pyparsing at http://pyparsing.sourceforge.net.

    -- Paul


    • Miki Tebeka

      #3
      Re: Improving my text processing script

      Hello pruebauno,
      > import re
      > f=file('tlst')
      > tlst=f.read().split('\n')
      > f.close()
      tlst = open("tlst").readlines()

      > f=file('plst')
      > sep=re.compile('Identifier "(.*?)"')
      > plst=[]
      > for elem in f.read().split('Identifier'):
      >     content='Identifier'+elem
      >     match=sep.search(content)
      >     if match:
      >         plst.append((match.group(1),content))
      > f.close()
      Look at re.findall, I think it'll be easier.

      > flst=[]
      > for table in tlst:
      >     for prog,content in plst:
      >         if content.find(table)>0:
      if table in content:
      >             flst.append('"%s","%s"'%(prog,table))

      > flst.sort()
      > for elem in flst:
      >     print elem
      print "\n".join(sorted(flst))
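
Two details matter when these suggestions are combined: readlines() keeps the trailing newline on every name, and an empty string is "in" every other string, so blank lines have to be dropped explicitly once find(...)>0 is replaced by the in operator. A minimal Python 3 sketch under those observations follows; the function name cross_reference and the sample data are hypothetical, not from the thread.

```python
def cross_reference(tables, programs):
    """tables: raw lines from tlst; programs: list of (name, content) pairs."""
    names = [line.strip() for line in tables]   # drop newlines and stray blanks
    names = [name for name in names if name]    # '' is "in" every string
    rows = ['"%s","%s"' % (prog, table)
            for table in names
            for prog, content in programs
            if table in content]
    return "\n".join(sorted(rows))

programs = [
    ("Program1", 'Identifier "Program1"\nValue "tablename2"\n'),
    ("Program 2", 'Identifier "Program 2"\nValue "tablename2"\n'),
]
print(cross_reference(["tablename1\n", "tablename2\n", "\n"], programs))
```

The substring test is kept here on purpose, so the sketch behaves like the original script apart from the blank-line handling.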

      HTH.
      --
      ------------------------------------------------------------------------
      Miki Tebeka <miki.tebeka@zoran.com>

      The only difference between children and adults is the price of the toys



      • pruebauno@latinmail.com

        #4
        Re: Improving my text processing script

        Paul McGuire wrote:
        > match...), this program has quite a few holes.
        >
        > What if the word "Identifier " is inside one of the quoted strings?
        > What if the actual value is "tablename10"? This will match your
        > "tablename1 " string search, but it is certainly not what you want.
        > Did you know there are trailing blanks on your table names, which could
        > prevent any program name from matching?

        Good point. I did not think about that. I got lucky: none of the
        table names had trailing blanks (Google Groups seems to add those), the
        word Identifier is not used inside quoted strings anywhere, and I do
        not have tablename10. I do have "dba.tablename1", though, and that one
        has to match tablename1 (and magically did).
        >
        > So here is an alternative approach using, as many have probably
        > predicted by now if they've spent any time on this list, the pyparsing
        > module. You may ask, "isn't a parser overkill for this problem?" and

        You had to plug pyparsing! :-) Thanks for the info; I did not know
        something like pyparsing existed. Thanks for the code too, because
        looking at the module it was not totally obvious to me how to use it. I
        tried to run it, though, and it is not working for me. The following
        code runs but prints nothing at all:

        import pyparsing as prs

        f=file('tlst'); tlst=[ln.strip() for ln in f if ln]; f.close()
        f=file('plst'); plst=f.read() ; f.close()

        prs.quotedString.setParseAction(prs.removeQuotes)

        identLine=(prs.LineStart()
                   + 'Identifier'
                   + prs.quotedString
                   + prs.LineEnd()
                   ).setResultsName('prog')

        tableLine=(prs.LineStart()
                   + 'Value'
                   + prs.quotedString
                   + prs.LineEnd()
                   ).setResultsName('table')

        interestingLines=(identLine | tableLine)

        for toks,start,end in interestingLines.scanString(plst):
            print toks,start,end


        • pruebauno@latinmail.com

          #5
          Re: Improving my text processing script

          Miki Tebeka wrote:
          > Look at re.findall, I think it'll be easier.

          Minor changes aside, the interesting thing, as you pointed out, would be
          using re.findall. I could not figure out how to do that.
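
One possible way to apply re.findall here is to let a single pattern capture both the program name and everything up to the next Identifier, which replaces the split-and-prepend step. This is a hedged Python 3 sketch rather than the approach anyone in the thread confirmed; the names BLOCK and split_programs are made up, and the pattern still breaks if the text Identifier " appears inside a quoted string.

```python
import re

# Group 1: the program name. Group 2: the text up to the next
# 'Identifier "' or the end of the input (lookahead, so nothing is consumed).
BLOCK = re.compile(r'Identifier "(.*?)"(.*?)(?=Identifier "|\Z)', re.DOTALL)

def split_programs(text):
    """Return a list of (program_name, rest_of_block) tuples."""
    return BLOCK.findall(text)

sample = (
    'Identifier "Program1"\n'
    'Name "Random Stuff"\n'
    'Value "tablename2"\n'
    'Identifier "Program 2"\n'
    'Value "tablename2"\n'
)
for prog, content in split_programs(sample):
    print(prog, "tablename2" in content)
```

With two groups in the pattern, findall returns tuples, so the result can be used directly where the original script built plst.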


          • pruebauno@latinmail.com

            #6
            Re: Improving my text processing script

            pruebauno@latinmail.com wrote:
            > Paul McGuire wrote:
            > > match...), this program has quite a few holes.

            > tried to run it, though, and it is not working for me. The following
            > code runs but prints nothing at all:
            >
            > import pyparsing as prs
            >
            And this is the point where I have to post the real stuff, because your
            code works with the example I posted but not with the real thing. The
            identifier I am interested in is (if I understood the requirements
            correctly) the one after the "title with the stars".

            So here is the "real" data for plst, with some info replaced with z to
            protect privacy:

            *****************************************************************************


            Identifier "zzz0main"


            *****************************************************************************


            Identifier "zz501"


            Value "zzz_CLCL_zzzz,zzzzzz_ID"


            Name "zzzzz"


            Name "zzzzzz"


            *****************************************************************************


            Identifier "zzzz3main"


            *****************************************************************************


            Identifier "zzz505"


            Value "dba.zzz_CKPY_zzzz_SUM"


            Name "xxx_xxx_xxx_DT"


            ----------------------------------


            Value "zzz_zzzz_zzz_zzz"


            Name "zzz_zz_zzz "


            ----------------------------------


            Value "zzz_zzz_zzz_HIST"


            Name "zzz_zzz"


            ----------------------------------



            • Paul McGuire

              #7
              Re: Improving my text processing script

              Yes indeed, the real data often has surprising differences from the
              simulations! :)

              It turns out that pyparsing LineStart()'s are pretty fussy. Usually,
              pyparsing is very forgiving about whitespace between expressions, but
              it turns out that LineStart *must* be followed by the next expression,
              with no leading whitespace.

              Fortunately, your syntax is really quite forgiving, in that your
              key-value pairs appear to always be an unquoted word (for the key) and
              a quoted string (for the value). So you should be able to get this
              working just by dropping the LineStart()'s from your expressions, that
              is:

              identLine=('Identifier'
                         + prs.quotedString
                         + prs.LineEnd()
                         ).setResultsName('prog')


              tableLine=('Value'
                         + prs.quotedString
                         + prs.LineEnd()
                         ).setResultsName('table')

              See if that works any better for you.
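
For comparison, the same scan (remember the most recent Identifier, report each Value that names a wanted table) can be sketched with the standard library alone. This is a Python 3 illustration that swaps pyparsing for a per-line regular expression; it is not the pyparsing solution above, and the names LINE and table_usage, as well as the sample data, are hypothetical. The pattern allows leading whitespace, which is the case that tripped up LineStart().

```python
import re

# A key of interest, then a quoted value; leading/trailing whitespace allowed.
LINE = re.compile(r'^\s*(Identifier|Value)\s+"(.*?)"\s*$')

def table_usage(text, tables):
    """Return (program, table) pairs for Values that name a wanted table."""
    wanted = {t.strip() for t in tables if t.strip()}
    current = ""          # most recently seen Identifier
    found = []
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue      # star titles, dashes, Name lines, blanks
        key, value = m.groups()
        if key == "Identifier":
            current = value
        elif value.strip() in wanted:
            found.append((current, value.strip()))
    return found

sample = (
    '  ********\n\n'
    '  Identifier "zz501"\n\n'
    '  Value "tab_a"\n'
    '  Name "x"\n'
    '  --------\n'
    '  Value "tab_b"\n'
    '  Identifier "zz505"\n'
    '  Value "tab_a "\n'
)
print(table_usage(sample, ["tab_a\n", "\n"]))
```

Unlike the substring search in the original script, this compares whole values, so a qualified name such as dba.tablename1 would need extra handling.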

              -- Paul
