Regex Help

  • Support Desk

    Regex Help

    Anybody know of a good regex to parse html links from html code? The one I
    am currently using seems to be cutting off the last letter of some links,
    and returning links like



    or http://somesite.ph

    the code I am using is


    import re
    import urllib

    regex = r'<a href=["|\']([^"|\']+)["|\']>'

    page_text = urllib.urlopen('http://somesite.com')
    page_text = page_text.read()

    links = re.findall(regex, page_text, re.IGNORECASE)



  • Miki

    #2
    Re: Regex Help

    Hello,
    > Anybody know of a good regex to parse html links from html code?
    BeautifulSoup is *the* library to handle HTML

    from BeautifulSoup import BeautifulSoup
    from urllib import urlopen

    soup = BeautifulSoup(urlopen("http://python.org/"))
    for a in soup("a"):
        print a["href"]

    HTH,
    --
    Miki <miki.tebeka@gmail.com>
    If it won't be simple, it simply won't be. [Hire me, source code]
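For readers without BeautifulSoup installed, the same parser-based idea can be sketched with the standard library's `html.parser` module. This is a minimal sketch assuming Python 3; the names `LinkCollector` and `extract_links` and the sample markup are made up for illustration and are not part of the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <A HREF=...>
        # is handled the same way as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return every href found on an <a> tag, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Invented sample markup: extra attributes, mixed case, single quotes.
sample = (
    '<p><a class="x" href="http://somesite.com/page">one</a> '
    "<A HREF='/two'>two</A></p>"
)
print(extract_links(sample))
```

Because the parser handles attribute order, quoting style, and case on its own, the extraction does not depend on the exact layout of the tag.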


    • Lawrence D'Oliveiro

      #3
      Re: Regex Help

      In message <mailman.1369.1222101506.3487.python-list@python.org>, Support
      Desk wrote:
      > Anybody know of a good regex to parse html links from html code? The one I
      > am currently using seems to be cutting off the last letter of some links,
      > and returning links like
      >
      > or http://somesite.ph
      >
      > the code I am using is
      >
      > regex = r'<a href=["|\']([^"|\']+)["|\']>'

      Can you post some example HTML sequences that this regexp is not handling
      correctly?
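To make the question concrete, here is a small sketch that runs the posted pattern against a few anchor tags. The sample strings are invented for illustration and are not taken from the original poster's page; Python 3 is assumed.

```python
import re

# The pattern exactly as posted in the thread.
regex = r'<a href=["|\']([^"|\']+)["|\']>'

# Invented sample tags, not from the poster's page.
samples = [
    '<a href="http://somesite.com/a">',            # double quotes: matched
    "<a href='http://somesite.com/b'>",            # single quotes: matched
    '<a href="http://somesite.com/c" class="x">',  # attribute after href: skipped
    '<a class="x" href="http://somesite.com/d">',  # attribute before href: skipped
    '<a href="http://somesite.com/e|f">',          # "|" in the URL: skipped
]

matches = [re.findall(regex, s, re.IGNORECASE) for s in samples]
for tag, found in zip(samples, matches):
    print(tag, "->", found)
```

In these cases the pattern either captures the whole URL or skips the tag entirely; none of them loses a trailing character. The `|` inside the brackets is a literal character rather than alternation, so `["\']` would express the intended quote match more directly.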


      • Support Desk

        #4
        RE: Regex Help


        Thanks for the reply, I found out the problem was occurring later on in the
        script. The regexp works well.

        -----Original Message-----
        From: Lawrence D'Oliveiro [mailto:ldo@geek-central.gen.new_zealand]
        Sent: Tuesday, September 23, 2008 6:51 PM
        To: python-list@python.org
        Subject: Re: Regex Help

        In message <mailman.1369.1222101506.3487.python-list@python.org>, Support
        Desk wrote:
        > Anybody know of a good regex to parse html links from html code? The one I
        > am currently using seems to be cutting off the last letter of some links,
        > and returning links like
        >
        > or http://somesite.ph
        >
        > the code I am using is
        >
        > regex = r'<a href=["|\']([^"|\']+)["|\']>'

        Can you post some example HTML sequences that this regexp is not handling
        correctly?



        • Lawrence D'Oliveiro

          #5
          RE: Regex Help

          In message <mailman.1450.1222266191.3487.python-list@python.org>, Support
          Desk wrote:
          > Thanks for the reply ...

          A: The vulture doesn't get Frequent Poster miles.
          Q: What's the difference between a top-poster and a vulture?
