Regular Expression Help, getting over the newline \n

  • BLaw
    New Member
    • May 2007
    • 3

    Regular Expression Help, getting over the newline \n

    Hello all,

    I am trying to parse an HTML file, but every time I hit a newline character my regex stops. How do I match past the newline and keep grabbing text until the next paragraph starts? When I try re.DOTALL it is too greedy and grabs the paragraph dividers as well. Thanks so much!
    Code:
    import re

    # Sample HTML text:
    text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'

    # My regex:
    results = open("results.txt","a")
    speechPattern = re.compile(r'''
    <p>&nbsp;&nbsp;&nbsp;
    (.*)
    ''', re.VERBOSE)
    test = speechPattern.findall(text)
    results.writelines(test)
    results.close()
    Thanks again!

    Law
    Last edited by bartonc; May 7 '07, 05:18 PM. Reason: added [code][/code] tags
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    If you must have a regex solution, this will not help:
    Code:
    >>> text = '<p>&nbsp;&nbsp;&nbsp;We operate forever. \nWe will become Representatives. \n<p>&nbsp;&nbsp;&nbsp; Any conference'
    >>> [s.strip() for s in text.replace('\n', '').split('<p>&nbsp;&nbsp;&nbsp;') if s != '']
    ['We operate forever. We will become Representatives.', 'Any conference']
    >>>


    • BLaw
      New Member
      • May 2007
      • 3

      #3
      A great example of my beginner's eyes not seeing a better way; thanks so much!


      • ghostdog74
        Recognized Expert Contributor
        • Apr 2006
        • 511

        #4
        To match across newlines over multiple lines, add re.DOTALL | re.M to your re.compile() statement, e.g.:
        re.compile("reg exp", re.DOTALL|re.M)
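For the "too greedy" part of the original question, here is a minimal sketch of one way to combine re.DOTALL with a non-greedy group and a lookahead, so each match stops just before the next paragraph divider. The names speech_pattern and paragraphs are made up for illustration, and the pattern assumes every paragraph starts with <p> followed by one or more &nbsp; entities, as in the sample text. It is not a general HTML parser.

```python
import re

text = ('<p>&nbsp;&nbsp;&nbsp;We operate forever. \n'
        'We will become Representatives. \n'
        '<p>&nbsp;&nbsp;&nbsp; Any conference')

# Assumption: paragraphs always begin with <p> plus at least one &nbsp;.
speech_pattern = re.compile(r'''
    <p>(?:&nbsp;)+      # paragraph divider followed by its indent
    (.*?)               # paragraph body, matched non-greedily
    (?=<p>|\Z)          # stop just before the next <p> or at end of text
    ''', re.VERBOSE | re.DOTALL)

# Collapse the embedded newlines and extra spaces inside each paragraph.
paragraphs = [' '.join(body.split()) for body in speech_pattern.findall(text)]
print(paragraphs)
```

With the sample text this should give the same two strings as the split-based answer in post #2. Note that re.M is left out here: the pattern has no ^ or $, so that flag would make no difference.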


        • bartonc
          Recognized Expert Expert
          • Sep 2006
          • 6478

          #5
          This is worth a second look, so I'm bumping the thread. I'm studying regular expressions at the moment, and it bugged me that I didn't know the answer. Then, out of the blue, while reading Mastering Regular Expressions, it came to me {Python has a DOTALL flag}. Thanks, GD, for your succinct expertise on this matter.
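As a quick illustration of what the DOTALL flag changes, and how it differs from re.M (re.MULTILINE), here is a small sketch. The sample string and variable names are invented for the demonstration.

```python
import re

sample = 'first line\nsecond line'

# Without flags, "." does not match a newline, so the match ends at line one.
plain = re.search(r'first.*', sample).group()

# With re.DOTALL, "." also matches the newline and runs to the end of the text.
dotall = re.search(r'first.*', sample, re.DOTALL).group()

# re.MULTILINE only changes "^" and "$": they now match at every line boundary.
line_starts = re.findall(r'^\w+', sample, re.MULTILINE)
no_flag_starts = re.findall(r'^\w+', sample)

print(repr(plain), repr(dotall), line_starts, no_flag_starts)
```

So DOTALL is the flag that lets "." cross a newline, while re.M only matters when the pattern uses ^ or $.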


          • ghostdog74
            Recognized Expert Contributor
            • Apr 2006
            • 511

            #6
            Hey bc, no prob... :) Yup, that book is good.
