Regular expression help

  • nclbndk759@googlemail.com

    Regular expression help

    Hello,

    I am new to Python, with a background in scientific computing. I'm
    trying to write a script that will take a file with lines like

    c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647
    3pv=0

    extract the values of afrac and etot and plot them. I'm really
    struggling with getting the values of afrac and etot. So far I have
    come up with (small snippet of script just to get the energy, etot):

    import re

    def get_data_points(filename):
        file = open(filename, 'r')
        data_points = []
        while 1:
            line = file.readline()
            if not line: break
            energy = get_total_energy(line)
            data_points.append(energy)
        return data_points

    def get_total_energy(line):
        rawstr = r"""(?P<key>.*?)=(?P<value>.*?)\s"""
        p = re.compile(rawstr)
        return p.match(line, 5)

    What is being stored in energy is '<_sre.SRE_Match object at
    0x2a955e4ed0>', not '-11.020107'. Why? I've been struggling with
    regular expressions for two days now, with no luck. Could someone
    please put me out of my misery and give me a clue as to what's going
    on? Apologies if it's blindingly obvious or if this question has been
    asked and answered before.

    Thanks,

    Nicole
  • Russell Blau

    #2
    Re: Regular expression help

    <nclbndk759@googlemail.com> wrote in message
    news:187a9118-97fb-41fc-87f2-d24c4d7522fb@w1g2000prk.googlegroups.com...
    > I am new to Python, with a background in scientific computing. I'm
    > trying to write a script that will take a file with lines like
    >
    > c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647
    > 3pv=0
    >
    > extract the values of afrac and etot and plot them.
    > ....
    > What is being stored in energy is '<_sre.SRE_Match object at
    > 0x2a955e4ed0>', not '-11.020107'. Why?

    Because the match() method returns a match object, not the matched
    text, as documented in the Python documentation for the re module.
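
To make the difference concrete, here is a minimal sketch of pulling the text out of a match object. It is not the original poster's code: the pattern (anchored on the literal key `etot=`), the use of `search()` instead of `match(line, 5)`, and the return of `None` for lines without the field are all assumptions made for illustration.

```python
import re

# Assumed pattern: the literal key "etot=", then a run of non-space characters.
ETOT = re.compile(r"\betot=(?P<value>\S+)")

def get_total_energy(line):
    """Return etot as a float, or None if the line has no etot field."""
    m = ETOT.search(line)  # m is a match object (or None), not a string
    if m is None:
        return None
    return float(m.group("value"))  # group() extracts the captured text

line = "c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0"
print(get_total_energy(line))
```

Returning the match object itself, as the original function does, is what produces the `<_sre.SRE_Match object ...>` representation in the list.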


    But this looks like a problem where regular expressions are overkill.
    Assuming all your lines are formatted as in the example above (every value
    you are interested in contains an equals sign and is surrounded by spaces),
    you could do this:

    values = {}
    for expression in line.split(" "):
        if "=" in expression:
            name, val = expression.split("=")
            values[name] = val

    I'd wager that this will run a fair bit faster than any regex-based
    solution. Then you just use values['afrac'] and values['etot'] when you
    need them.

    And when you get to be a really hard-core Pythonista, you could write the
    whole routine above in one line, but this seems clearer. ;-)
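
As a rough sketch of how the loop above might feed a plot, the values can be converted to floats and collected per line. The function names `parse_line` and `get_data_points`, the use of `line.split()` (which also discards the trailing newline), and the choice to skip lines missing either key are assumptions for illustration.

```python
def parse_line(line):
    """Split one line into a {name: value-string} dict."""
    values = {}
    for expression in line.split():
        if "=" in expression:
            name, val = expression.split("=")
            values[name] = val
    return values

def get_data_points(lines):
    """Collect (afrac, etot) pairs as floats from an iterable of lines."""
    points = []
    for line in lines:
        values = parse_line(line)
        # Skip lines that do not carry both quantities.
        if "afrac" in values and "etot" in values:
            points.append((float(values["afrac"]), float(values["etot"])))
    return points

sample = [
    "c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0",
    "a line without any data",
]
print(get_data_points(sample))
```

Because `get_data_points` accepts any iterable of strings, an open file object can be passed to it directly.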

    Russ




    • Brad

      #3
      Re: Regular expression help

      nclbndk759@googlemail.com wrote:
      > Hello,
      >
      > I am new to Python, with a background in scientific computing. I'm
      > trying to write a script that will take a file with lines like
      >
      > c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647
      > 3pv=0
      >
      > extract the values of afrac and etot...

      Why not just split them out instead of using REs?

      fp = open("test.txt")
      lines = fp.readlines()
      fp.close()

      for line in lines:
          # Split the line on whitespace, then each token on "=".
          for pair in line.split():
              pair_split = pair.split("=")
              if len(pair_split) == 2:
                  print("%s is %s" % (pair_split[0], pair_split[1]))

      Results:

      IDLE 1.2.2 ==== No Subprocess ====
      >>>
      afrac is .7
      mmom is 0
      sev is -9.56646
      erep is 0
      etot is -11.020107
      emad is -3.597647
      3pv is 0
      >>>


      • Gerard flanagan

        #4
        Re: Regular expression help

        nclbndk759@googlemail.com wrote:
        > What is being stored in energy is '<_sre.SRE_Match object at
        > 0x2a955e4ed0>', not '-11.020107'. Why?


        1. Consider using the 'split' method on each line rather than regexes.
        2. In your code you are compiling the regex for every line in the file;
        you should lift it out of the 'get_total_energy' function so that the
        compilation is only done once.
        3. A Match object has a 'groups' method, which is what you need to
        retrieve the data.
        4. Also look at the findall method:

        import re

        data = ('c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 '
                'emad=-3.597647 3pv=0 ')

        # Each match yields a (name, value) pair of strings.
        rx = re.compile(r'(\w+)=(\S+)')

        data = dict(rx.findall(data))

        print(data)
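
Points 2 to 4 can be sketched together as follows. The module-level name `PAIR` and the `wanted` set are illustrative choices, not part of the original post.

```python
import re

# Compiled once at module level, so it is not rebuilt for every line.
PAIR = re.compile(r"(\w+)=(\S+)")

line = "c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0"

first = PAIR.search(line)   # match object for the first name=value pair
print(first.groups())       # tuple of all captured groups in that match
print(first.group(2))       # only the second group, i.e. the value text

# findall() returns every (name, value) pair; keep and convert the two we need.
wanted = {"afrac", "etot"}
values = {k: float(v) for k, v in PAIR.findall(line) if k in wanted}
print(values)
```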

        hth

        G.


        • Nick Dumas

          #5
          Re: Regular expression help


          I think you're over-complicating this. I'm assuming that you're going to
          do a line graph of some sort, and each new line of the file contains a
          new set of data.

          The problem you mentioned with your regex returning a match object
          rather than a string is because match() is a re function that returns
          match objects, not strings. re.findall() returns the captured strings
          directly (as a list), which is what you want here. That being said,
          here is working code to mine data from your file.

          Code:
          import re

          line = ('c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 '
                  'emad=-3.597647 3pv=0')

          # Note: To change the data grabbed from the line, you can change the
          # 'etot' to 'afrac' or 'emad' or anything that doesn't contain a regex
          # special character. The pattern requires a decimal point in the value.
          energypat = re.compile(r'\betot=(-?\d*?[.]\d*)')

          re.findall(energypat, line)  # returns a list of strings: ['-11.020107']

          This returns a list of strings, each of which is easy enough to convert
          to a float. After that, you can datapoints.append() to your heart's
          content. Good luck with your work.
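
If the key might contain a regex special character, or the value might lack a decimal point (as in `mmom=0`), one possible variation is sketched below. The helper name `extract`, the `NUMBER` pattern, and the lookbehind used to anchor the key at the start of a token are assumptions for illustration, not part of the post above.

```python
import re

# Assumed number format: optional sign, digits with an optional fraction
# (or a bare fraction such as ".7"), and an optional exponent.
NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?"

def extract(key, line):
    """Return every value of `key` on the line as a list of floats."""
    # re.escape() makes the key safe even if it contains special characters;
    # (?<!\S) requires the key to start a whitespace-separated token.
    pattern = re.compile(r"(?<!\S)" + re.escape(key) + "=(" + NUMBER + ")")
    return [float(s) for s in pattern.findall(line)]

line = "c afrac=.7 mmom=0 sev=-9.56646 erep=0 etot=-11.020107 emad=-3.597647 3pv=0"
print(extract("etot", line))
print(extract("mmom", line))
print(extract("afrac", line))
```

With the token anchor, asking for a key such as `mad` does not accidentally match inside `emad`.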



          • nclbndk759@googlemail.com

            #6
            Re: Regular expression help

            On Jul 18, 3:35 pm, Nick Dumas <drako...@gmail.com> wrote:
            > The problem you mentioned with your regex returning a match object
            > rather than a string is because you're simply using a re function that
            > doesn't return strings. re.findall() is what you want.
            > [...]

            Thanks guys :-)


            • Marc 'BlackJack' Rintsch

              #7
              Re: Regular expression help

              On Fri, 18 Jul 2008 10:04:29 -0400, Russell Blau wrote:
              > values = {}
              > for expression in line.split(" "):
              >     if "=" in expression:
              >         name, val = expression.split("=")
              >         values[name] = val
              > […]
              >
              > And when you get to be a really hard-core Pythonista, you could write
              > the whole routine above in one line, but this seems clearer. ;-)

              I know it's a matter of taste, but I think the one-liner is still clear
              (enough)::

              values = dict(s.split('=') for s in line.split() if '=' in s)
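
One caveat with the one-liner: if a token ever contains a second `=`, `s.split('=')` yields more than two items and `dict()` raises `ValueError`. A hedged variant using `str.partition` is sketched below; the function name `parse_line` and the `note=a=b` token are hypothetical, added only to show the edge case.

```python
def parse_line(line):
    """Map each name=value token on the line to its value string."""
    # partition() always returns (head, separator, tail); the separator is
    # empty when the token has no "=", so such tokens are skipped.
    return {name: value
            for name, sep, value in (s.partition("=") for s in line.split())
            if sep}

line = "c afrac=.7 etot=-11.020107 note=a=b"  # "note=a=b" is a made-up token
values = parse_line(line)
print(values["etot"], values["note"])
```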

              Ciao,
              Marc 'BlackJack' Rintsch
