Multiline regex help

  • Yatima

    Multiline regex help

    Hey Folks,

    I've got some info in a bunch of files that kind of looks like so:

    Gibberish
    53
    MoreGarbage
    12
    RelevantInfo1
    10/10/04
    NothingImportant
    ThisDoesNotMatter
    44
    RelevantInfo2
    22
    BlahBlah
    343
    RelevantInfo3
    23
    Hubris
    Crap
    34

    and so on...

    Anyhow, these "fields" repeat several times in a given file (number of
    repetitions varies from file to file). The value on the line following the
    "RelevantInfo" lines is really what I'm after. Ideally, I would like to have
    something like so:

    RelevantInfo1 = 10/10/04   # The variable name isn't actually important;
    RelevantInfo3 = 23         # it's just there to illustrate what info I'm
                               # trying to snag.

    Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

    Collected from all of the files.

    So, there would be several of these "scores" per file and there are a bunch
    of files. Ultimately, I am interested in printing them out as a csv file but
    that should be relatively easy once they are trapped in my array of doom
    <cue evil laughter>.

    I've got a fairly ugly "solution" (I am using this term *very* loosely)
    using awk and his faithful companion sed, but I would prefer something in
    python.

    Thanks for your time.

    --
    McGowan's Madison Avenue Axiom:
    If an item is advertised as "under $50", you can bet it's not $19.95.
  • Kent Johnson

    #2
    Re: Multiline regex help

    Yatima wrote:
    > RelevantInfo1 = 10/10/04 # The variable name isn't actually important
    > RelevantInfo3 = 23 # it's just there to illustrate what info I'm
    > # trying to snag.

    Here is a way to create a list of [RelevantInfo, value] pairs:
    import cStringIO

    raw_data = '''Gibberish
    53
    MoreGarbage
    12
    RelevantInfo1
    10/10/04
    NothingImportant
    ThisDoesNotMatter
    44
    RelevantInfo2
    22
    BlahBlah
    343
    RelevantInfo3
    23
    Hubris
    Crap
    34'''
    raw_data = cStringIO.StringIO(raw_data)

    data = []
    for line in raw_data:
        if line.startswith('RelevantInfo'):
            key = line.strip()
            value = raw_data.next().strip()
            data.append([key, value])

    print data
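
For readers on Python 3, where `cStringIO` and the `.next()` method are gone, the same pairing idea can be sketched roughly as follows. This is an adaptation rather than Kent's original code; the function name `pairs_after_marker` and its `prefix` parameter are invented for illustration.

```python
import io


def pairs_after_marker(lines, prefix="RelevantInfo"):
    """Collect [marker, next_line] pairs from an iterable of lines."""
    it = iter(lines)
    data = []
    for line in it:
        if line.startswith(prefix):
            key = line.strip()
            # Pull the following line from the same iterator; the default
            # avoids StopIteration if a marker happens to be the last line.
            value = next(it, "").strip()
            data.append([key, value])
    return data


sample = io.StringIO(
    "Gibberish\n53\nRelevantInfo1\n10/10/04\nJunk\n44\n"
    "RelevantInfo2\n22\nRelevantInfo3\n23\nCrap\n34\n"
)
print(pairs_after_marker(sample))
# [['RelevantInfo1', '10/10/04'], ['RelevantInfo2', '22'], ['RelevantInfo3', '23']]
```

Any iterable of lines should work here, including an open file object.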

    > Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

    I'm not sure what you mean by this. Do you want to build a Score dictionary as well?

    Kent

    • Steven Bethard

      #3
      Re: Multiline regex help

      Yatima wrote:
      > Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2

      A possible solution, using the re module:

      py> s = """\
      ... Gibberish
      ... 53
      ... MoreGarbage
      ... 12
      ... RelevantInfo1
      ... 10/10/04
      ... NothingImportant
      ... ThisDoesNotMatter
      ... 44
      ... RelevantInfo2
      ... 22
      ... BlahBlah
      ... 343
      ... RelevantInfo3
      ... 23
      ... Hubris
      ... Crap
      ... 34
      ... """
      py> import re
      py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
      ...                    .*
      ...                    ^RelevantInfo2\n([^\n]*)
      ...                    .*
      ...                    ^RelevantInfo3\n([^\n]*)""",
      ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
      py> score = {}
      py> for info1, info2, info3 in m.findall(s):
      ...     score.setdefault(info1, {})[info3] = info2
      ...
      py> score
      {'10/10/04': {'23': '22'}}

      Note that I use DOTALL to allow .* to cross line boundaries, MULTILINE
      to have ^ apply at the start of each line, and VERBOSE to allow me to
      write the re in a more readable form.
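
To see what each of the three flags contributes on its own, here is a small Python 3 sketch. The `text` sample is a shortened, made-up stand-in for the real data, not something from the original post.

```python
import re

text = "junk\nRelevantInfo1\n10/10/04\nmore junk\nRelevantInfo2\n22\n"

# Without MULTILINE, ^ only matches at the very start of the string.
no_multiline = re.findall(r"^RelevantInfo\d", text)
# With MULTILINE, ^ matches at the start of every line.
with_multiline = re.findall(r"^RelevantInfo\d", text, re.MULTILINE)

# Without DOTALL, . stops at a newline, so the gap between markers
# cannot be bridged; with DOTALL, . also matches newlines.
no_dotall = re.search(r"Info1.*Info2", text)
with_dotall = re.search(r"Info1.*Info2", text, re.DOTALL)

# With VERBOSE, whitespace and # comments inside the pattern are ignored.
pattern = re.compile(
    r"""
    ^RelevantInfo1\n   # marker line
    ([^\n]*)           # capture the line that follows it
    """,
    re.MULTILINE | re.VERBOSE,
)

print(no_multiline)             # []
print(with_multiline)           # ['RelevantInfo1', 'RelevantInfo2']
print(no_dotall)                # None
print(with_dotall is not None)  # True
print(pattern.findall(text))    # ['10/10/04']
```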

      If I didn't get your dict update quite right, hopefully you can see how
      to fix it!

      HTH,

      STeVe


      • Yatima

        #4
        Re: Multiline regex help

        On Thu, 03 Mar 2005 09:54:02 -0700, Steven Bethard <steven.bethard@gmail.com> wrote:
        > A possible solution, using the re module:
        > [snip]
        > If I didn't get your dict update quite right, hopefully you can see how
        > to fix it!

        Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
        describing the problem. Is there any way to extract multiple scores from the
        same file and from multiple files? (I will probably use the "fileinput"
        module to deal with multiple files.) So, if I've got, say:

        Gibberish
        53
        MoreGarbage
        12
        RelevantInfo1
        10/10/04
        NothingImportant
        ThisDoesNotMatter
        44
        RelevantInfo2
        22
        BlahBlah
        343
        RelevantInfo3
        23
        Hubris
        Crap
        34

        SecondSetofGarbage
        2423
        YouGetThePicture
        342342
        RelevantInfo1
        10/10/04
        HoHum
        343
        MoreStuffNotNeeded
        232
        RelevantInfo2
        33
        RelevantInfo3
        44
        sdfsdf
        RelevantInfo1
        10/11/04
        InsertBoringFillerHere
        43234
        Stuff
        MoreStuff
        RelevantInfo2
        45
        ExcitingIsntIt
        324234
        RelevantInfo3
        60
        Lalala

        Sorry for the long and painful example input. Notice that the first two
        "RelevantInfo1" fields have the same info but that the RelevantInfo2 and
        RelevantInfo3 fields have different info. Also, there will be cases where
        RelevantInfo3 might be the same with a different RelevantInfo2. What I'm
        hoping for is something along the lines of being able to organize it like
        so (don't worry about the format of the output -- I'll deal with that
        later; "RelevantInfo" shortened to "Info" for readability):

                  Info1[0],                 Info1[1],                 Info1[2] ...
        Info3[0]  Info2[Info1[0],Info3[0]]  Info2[Info1[1],Info3[1]]  ...
        Info3[1]  Info2[Info1[0],Info3[1]]  ...
        Info3[2]  Info2[Info1[0],Info3[2]]  ...
        ...

        I don't really care if it's a list, dictionary, array etc.
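
One way such a grid could be written out, sketched in Python 3 with the standard `csv` module. It assumes the values have already been collected into a nested dict of the form `score[info1][info3] = info2`, as in Steven's reply above. The function name `scores_to_csv` and the sample values are hypothetical.

```python
import csv
import io


def scores_to_csv(score, out):
    """Write score[info1][info3] -> info2 as a grid, one column per info1."""
    dates = sorted(score)  # column headers (plain string sort)
    subjects = sorted({s for row in score.values() for s in row})
    writer = csv.writer(out)
    writer.writerow([""] + dates)
    for subject in subjects:
        # Missing (info1, info3) combinations become empty cells.
        writer.writerow([subject] + [score[d].get(subject, "") for d in dates])


score = {"10/10/04": {"23": "22", "44": "33"}, "10/11/04": {"60": "45"}}
buf = io.StringIO()
scores_to_csv(score, buf)
print(buf.getvalue())
```

Sorting the dates as strings is only a placeholder; real date ordering would need the values parsed first.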

        Thanks again for your help. The multiline option in the re module is very
        useful.

        Take care.

        --
        Clarke's Conclusion:
        Never let your sense of morals interfere with doing the right thing.


        • James Stroud

          #5
          Re: Multiline regex help

          Have a look at "martel", part of biopython. The world of bioinformatics is
          filled with files with structure like this.

          http://www.biopython.org/docs/api/pu...el-module.html

          James

          On Thursday 03 March 2005 12:03 pm, Yatima wrote:
          > Thanks! That was very helpful. Unfortunately, I wasn't completely clear
          > when describing the problem. Is there anyway to extract multiple scores
          > from the same file and from multiple files (I will probably use the
          > "fileinput" module to deal with multiple files).
          > [snip]

          --
          James Stroud, Ph.D.
          UCLA-DOE Institute for Genomics and Proteomics
          Box 951570
          Los Angeles, CA 90095


          • Yatima

            #6
            Re: Multiline regex help

            On Thu, 03 Mar 2005 07:14:50 -0500, Kent Johnson <kent37@tds.net> wrote:
            > Here is a way to create a list of [RelevantInfo, value] pairs:
            > [snip]

            Thank you. This isn't exactly what I'm looking for (I wasn't clear in
            describing the problem -- please see my reply to Steve for a, hopefully,
            better explanation) but it does give me a few ideas.

            >> Score[RelevantInfo1][RelevantInfo3] = 22 # The value from RelevantInfo2
            >
            > I'm not sure what you mean by this. Do you want to build a Score dictionary as well?

            Sure... Uhhh.. I think. Okay, what I want is some kind of awk-like
            associative array because the raw data files will have repeats for certain
            field values such that there would be, for example, multiple RelevantInfo2's
            and RelevantInfo3's for the same RelevantInfo1 (i.e. on the same date). To
            make matters more exciting, there will be multiple RelevantInfo1's (dates)
            for the same RelevantInfo3 (e.g. a subject ID). RelevantInfo2 will be the
            value for all unique combinations of RelevantInfo1 and RelevantInfo3. There
            will be multiple occurrences of these fields in the same file (original data
            sample was not very good for this reason) and multiple files as well. The
            interesting three fields will always be repeated in the same order although
            the amount of irrelevant data in between may vary. So:

            RelevantInfo1
            10/10/04
            <snipped crap>
            RelevantInfo2
            12
            <more snippage>
            RelevantInfo3
            43
            <more snippage>
            RelevantInfo1
            10/10/04 <- The same as the first occurrence of RelevantInfo1
            <snipped>
            RelevantInfo2
            22
            <snipped>
            RelevantInfo3
            25
            <snipped>
            RelevantInfo1
            10/11/04
            <snipped>
            RelevantInfo2
            34
            <snipped>
            RelevantInfo3
            28
            <snipped>
            RelevantInfo1
            10/12/04
            <snipped>
            RelevantInfo2
            98
            <snipped>
            RelevantInfo3
            25 <- The same as the second occurrence of RelevantInfo3
            ...

            Sorry for the long and tedious "data" example.

            There will be missing values for some combinations of RelevantInfo1 and
            RelevantInfo3 so hopefully that won't be an issue.

            Thanks again for your reply.

            Take care.

            --
            "I figured there was this holocaust, right, and the only ones left alive were
            Donna Reed, Ozzie and Harriet, and the Cleavers."
            -- Wil Wheaton explains why everyone in "Star Trek: The Next Generation"
            is so nice


            • James Stroud

              #7
              Re: Multiline regex help

              I found the original paper for Martel:



              On Thursday 03 March 2005 12:26 pm, James Stroud wrote:
              > Have a look at "martel", part of biopython. The world of bioinformatics is
              > filled with files with structure like this.
              >
              > http://www.biopython.org/docs/api/pu...el-module.html
              >
              > James



              • Steven Bethard

                #8
                Re: Multiline regex help

                Yatima wrote:
                > Thanks! That was very helpful. Unfortunately, I wasn't completely clear when
                > describing the problem. Is there anyway to extract multiple scores from the
                > same file and from multiple files

                I think if you use the non-greedy .*? instead of the greedy .*, you'll
                get this behavior. For example:

                py> s = """\
                ... Gibberish
                ... 53
                ... MoreGarbage
                [snip a whole bunch of stuff]
                ... RelevantInfo3
                ... 60
                ... Lalala
                ... """
                py> import re
                py> m = re.compile(r"""^RelevantInfo1\n([^\n]*)
                ...                    .*?
                ...                    ^RelevantInfo2\n([^\n]*)
                ...                    .*?
                ...                    ^RelevantInfo3\n([^\n]*)""",
                ...                re.DOTALL | re.MULTILINE | re.VERBOSE)
                py> score = {}
                py> for info1, info2, info3 in m.findall(s):
                ...     score.setdefault(info1, {})[info3] = info2
                ...
                py> score
                {'10/10/04': {'44': '33', '23': '22'}, '10/11/04': {'60': '45'}}
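
The difference between the greedy and non-greedy versions can be seen side by side in a short Python 3 sketch. The two-record `text` sample and the `build` helper are made up for illustration.

```python
import re

text = (
    "RelevantInfo1\n10/10/04\njunk\n"
    "RelevantInfo2\n22\njunk\n"
    "RelevantInfo3\n23\njunk\n"
    "RelevantInfo1\n10/11/04\njunk\n"
    "RelevantInfo2\n45\njunk\n"
    "RelevantInfo3\n60\n"
)

FLAGS = re.DOTALL | re.MULTILINE


def build(gap):
    """Build the three-marker pattern with the given filler between markers."""
    return re.compile(
        r"^RelevantInfo1\n([^\n]*)" + gap
        + r"^RelevantInfo2\n([^\n]*)" + gap
        + r"^RelevantInfo3\n([^\n]*)",
        FLAGS,
    )


# Greedy .* runs to the end and backtracks to the LAST markers, so the
# first date gets paired with values from the second record.
greedy = build(r".*").findall(text)
# Non-greedy .*? stops at the NEXT marker, giving one match per record.
lazy = build(r".*?").findall(text)

print(greedy)  # [('10/10/04', '45', '60')]
print(lazy)    # [('10/10/04', '22', '23'), ('10/11/04', '45', '60')]
```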

                If you might have multiple info2 values for the same (info1, info3)
                pair, you can try something like:

                py> score = {}
                py> for info1, info2, info3 in m.findall(s):
                ...     score.setdefault(info1, {}).setdefault(info3, []).append(info2)
                ...
                py> score
                {'10/10/04': {'44': ['33'], '23': ['22']}, '10/11/04': {'60': ['45']}}
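
As an alternative to chaining `setdefault`, the same dict -> dict -> list shape can be built with `collections.defaultdict`. This is a Python 3 sketch of a swapped-in technique, not what Steven posted; the `triples` list is invented sample data shaped like the output of `m.findall(s)`.

```python
from collections import defaultdict

# Hypothetical (info1, info2, info3) triples, shaped like m.findall(s) output.
triples = [
    ("10/10/04", "22", "23"),
    ("10/10/04", "33", "44"),
    ("10/10/04", "25", "23"),  # same (info1, info3) pair as the first triple
    ("10/11/04", "45", "60"),
]

# Outer dict creates inner dicts on demand; inner dicts create lists on demand.
score = defaultdict(lambda: defaultdict(list))
for info1, info2, info3 in triples:
    score[info1][info3].append(info2)

# Convert to plain dicts for display.
plain = {k: dict(v) for k, v in score.items()}
print(plain)
```

One trade-off is that merely looking up a missing key in a `defaultdict` creates it, so converting to plain dicts before handing the result on may be safer.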

                HTH,

                STeVe


                • Kent Johnson

                  #9
                  Re: Multiline regex help

                  Here is another attempt. I'm still not sure I understand what form you want the data in. I made a
                  dict -> dict -> list structure so if you look up e.g. scores['10/11/04']['60'] you get a list of all
                  the RelevantInfo2 values for RelevantInfo1='10/11/04' and RelevantInfo3='60'.

                  The parser is a simple-minded state machine that will misbehave if the input does not have entries
                  in the order RelevantInfo1, RelevantInfo2, RelevantInfo3 (with as many intervening lines as you like).

                  All three values are available when RelevantInfo3 is detected so you could do something else with them
                  if you want.

                  HTH
                  Kent

                  import cStringIO

                  raw_data = '''Gibberish
                  53
                  MoreGarbage
                  12
                  RelevantInfo1
                  10/10/04
                  NothingImportant
                  ThisDoesNotMatter
                  44
                  RelevantInfo2
                  22
                  BlahBlah
                  343
                  RelevantInfo3
                  23
                  Hubris
                  Crap
                  34

                  Gibberish
                  53
                  MoreGarbage
                  12
                  RelevantInfo1
                  10/10/04
                  NothingImportant
                  ThisDoesNotMatter
                  44
                  RelevantInfo2
                  22
                  BlahBlah
                  343
                  RelevantInfo3
                  23
                  Hubris
                  Crap
                  34

                  SecondSetofGarbage
                  2423
                  YouGetThePicture
                  342342
                  RelevantInfo1
                  10/10/04
                  HoHum
                  343
                  MoreStuffNotNeeded
                  232
                  RelevantInfo2
                  33
                  RelevantInfo3
                  44
                  sdfsdf
                  RelevantInfo1
                  10/11/04
                  InsertBoringFillerHere
                  43234
                  Stuff
                  MoreStuff
                  RelevantInfo2
                  45
                  ExcitingIsntIt
                  324234
                  RelevantInfo3
                  60
                  Lalala'''
                  raw_data = cStringIO.StringIO(raw_data)

                  scores = {}
                  info1 = info2 = info3 = None

                  for line in raw_data:
                      if line.startswith('RelevantInfo1'):
                          info1 = raw_data.next().strip()
                      elif line.startswith('RelevantInfo2'):
                          info2 = raw_data.next().strip()
                      elif line.startswith('RelevantInfo3'):
                          info3 = raw_data.next().strip()
                          scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
                          info1 = info2 = info3 = None

                  print scores
                  print scores['10/11/04']['60']
                  print scores['10/10/04']['23']

                  ## prints:
                  {'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
                  ['45']
                  ['22', '22']
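
A Python 3 variation on this state machine, written as a generator so the caller decides what to do with each triple. It is a sketch under a few assumptions that differ from Kent's code: marker lines are matched exactly rather than with `startswith`, and a record missing any of the three fields is dropped. The names `iter_records` and `collect` are invented.

```python
def iter_records(lines):
    """Yield (info1, info2, info3) each time a RelevantInfo3 value is read."""
    it = iter(lines)
    current = {}
    for line in it:
        name = line.strip()
        if name in ("RelevantInfo1", "RelevantInfo2", "RelevantInfo3"):
            # The value is the line right after the marker.
            current[name] = next(it, "").strip()
            if name == "RelevantInfo3":
                if len(current) == 3:
                    yield (current["RelevantInfo1"],
                           current["RelevantInfo2"],
                           current["RelevantInfo3"])
                current = {}  # start over, whether the record was complete or not


def collect(lines):
    """Build the dict -> dict -> list structure from an iterable of lines."""
    scores = {}
    for info1, info2, info3 in iter_records(lines):
        scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
    return scores


sample = """RelevantInfo1
10/10/04
junk
RelevantInfo2
22
RelevantInfo3
23
RelevantInfo1
10/10/04
RelevantInfo2
22
RelevantInfo3
23
RelevantInfo2
99
RelevantInfo3
77
""".splitlines()
print(collect(sample))  # {'10/10/04': {'23': ['22', '22']}}
```

Because `collect` accepts any iterable of lines, the object returned by `fileinput.input(...)` could presumably be passed in to cover several files in one pass, with the caveat that a record split across two files would be stitched together.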


                  • Yatima

                    #10
                    Re: Multiline regex help

                    On Thu, 03 Mar 2005 13:45:31 -0700, Steven Bethard <steven.bethard@gmail.com> wrote:
                    > I think if you use the non-greedy .*? instead of the greedy .*, you'll
                    > get this behavior. For example:
                    > [snip]

                    Perfect! Thank you so much. This is the behaviour I'm looking for. I will
                    fiddle around with this some more tonight but the rest should be okay.

                    Take care.

                    --
                    Of course power tools and alcohol don't mix. Everyone knows power
                    tools aren't soluble in alcohol...
                    -- Crazy Nigel


                    • Yatima

                      #11
                      Re: Multiline regex help

                      On Thu, 03 Mar 2005 16:25:39 -0500, Kent Johnson <kent37@tds.net> wrote:
                      > Here is another attempt. I'm still not sure I understand what form you want the data in.
                      [mass snippage]
                      > ## prints:
                      > {'10/10/04': {'44': ['33'], '23': ['22', '22']}, '10/11/04': {'60': ['45']}}
                      > ['45']
                      > ['22', '22']

                      Thank you so much. Your solution and Steve's both give me what I'm looking
                      for. I appreciate both of your incredibly quick replies!

                      Take care.

                      --
                      You worry too much about your job. Stop it. You are not paid enough to worry.


                      • Yatima

                        #12
                        Re: Multiline regex help

                        On Thu, 3 Mar 2005 12:26:37 -0800, James Stroud <jstroud@mbi.ucla.edu> wrote:
                        > Have a look at "martel", part of biopython. The world of bioinformatics is
                        > filled with files with structure like this.
                        >
                        > http://www.biopython.org/docs/api/pu...el-module.html
                        >
                        > James

                        Thanks for the link. Steve and Kent have provided me with nice solutions but
                        I will check this out anyway for future reference.

                        Take care.

                        --
                        You may easily play a joke on a man who likes to argue -- agree with him.
                        -- Ed Howe


                        • Steven Bethard

                          #13
                          Re: Multiline regex help

                          Kent Johnson wrote:
                          > for line in raw_data:
                          >     if line.startswith('RelevantInfo1'):
                          >         info1 = raw_data.next().strip()
                          >     elif line.startswith('RelevantInfo2'):
                          >         info2 = raw_data.next().strip()
                          >     elif line.startswith('RelevantInfo3'):
                          >         info3 = raw_data.next().strip()
                          >         scores.setdefault(info1, {}).setdefault(info3, []).append(info2)
                          >         info1 = info2 = info3 = None

                          Very pretty. =) I have to say, I hadn't ever used iterators this way
                          before, that is, calling their next method from within a for-loop. I
                          like it. =)

                          Thanks for opening my mind. ;)

                          STeVe


                          • Kent Johnson

                            #14
                            Re: Multiline regex help

                            Steven Bethard wrote:
                            > Very pretty. =) I have to say, I hadn't ever used iterators this way
                            > before, that is, calling their next method from within a for-loop. I
                            > like it. =)

                            I confess I have a nagging suspicion that someone who actually knows something about CPython
                            internals will tell me why it's a bad idea...but it sure is handy!
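
For what it's worth, the pattern can be tried out in isolation. The Python 3 sketch below uses a made-up list of items and shows the one edge case that seems worth guarding against, a marker on the very last line. It relies on the `for` loop and `next()` sharing a single iterator object; with a plain list instead of an iterator, `next()` would raise `TypeError`.

```python
lines = iter(["a", "MARK", "value", "b", "MARK"])

seen = []
for line in lines:
    if line == "MARK":
        # The for loop and next() share one iterator, so this consumes
        # the following item and the loop will not see it again.
        following = next(lines, None)  # None if MARK was the last item
        seen.append((line, following))
    else:
        seen.append((line,))

print(seen)
# [('a',), ('MARK', 'value'), ('b',), ('MARK', None)]
```

Without the default argument, the final `next(lines)` would raise `StopIteration` from inside the loop body.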
                            > Thanks for opening my mind. ;)

                            My pleasure :-)

                            Kent
