More regex help

  • Support Desk

    More regex help

    I am working on a Python web crawler that will extract all links from an
    HTML page and add them to a queue. The problem I am having is building
    absolute links from relative links, as there are so many different types of
    relative links. If I just append the relative links to the current URL, some
    websites will send the crawler into a never-ending loop.

    What I am looking for is a regexp that will extract the root URL from any
    URL string I pass to it, such as:

    'http://example.com/stuff/stuff/morestuff/index.html'

    Regexp = 'http://example.com'

    'http://anotherexample.com/stuff/index.php'

    Regexp = 'http://anotherexample.com/'

    'http://example.com/stuff/stuff/'

    Regexp = 'http://example.com'
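
The root URL asked for above can be sketched in Python 3 in two ways: with the standard library's `urllib.parse.urlsplit`, or with a regexp as the question requests. The helper names `root_url` and `root_url_re` are hypothetical, chosen only for this illustration, and both variants return the root without a trailing slash. This is a minimal sketch that assumes absolute `http` or `https` URLs as input.

```python
import re
from urllib.parse import urlsplit

# Hypothetical helper: build the root from the parsed scheme and host.
def root_url(url):
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"

# Hypothetical helper: the same idea as a regexp. It captures the scheme
# plus everything up to the first '/', '?' or '#' after the host.
ROOT_RE = re.compile(r'^(https?://[^/?#]+)')

def root_url_re(url):
    match = ROOT_RE.match(url)
    return match.group(1) if match else None

for example in (
    'http://example.com/stuff/stuff/morestuff/index.html',
    'http://anotherexample.com/stuff/index.php',
    'http://example.com/stuff/stuff/',
):
    print(root_url(example), root_url_re(example))
```

The parsing variant is probably the safer choice, since it does not depend on a hand-written pattern covering every URL form.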





  • Kirk Strauser

    #2
    Re: More regex help

    At 2008-09-24T16:25:02Z, "Support Desk" <support.desk.ipg@gmail.com> writes:
    > I am working on a Python web crawler that will extract all links from an
    > HTML page and add them to a queue. The problem I am having is building
    > absolute links from relative links, as there are so many different types of
    > relative links. If I just append the relative links to the current URL, some
    > websites will send the crawler into a never-ending loop.

    >>> import urllib
    >>> urllib.basejoin('http://www.example.com/path/to/deep/page',
    ...                 '/foo')
    'http://www.example.com/foo'
    >>> urllib.basejoin('http://www.example.com/path/to/deep/page',
    ...                 'http://slashdot.org/foo')
    'http://slashdot.org/foo'
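
For readers on Python 3, where `urllib.basejoin` is no longer available, `urllib.parse.urljoin` resolves relative links in the same way. The sketch below is an assumption-laden illustration rather than part of the original reply: the helper name `absolutize` is hypothetical, and the `seen` set with fragment stripping is one possible way to keep a crawler from queueing the same page repeatedly.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical helper: resolve each href against the page it was found on,
# drop any '#fragment', and skip URLs that were already produced.
def absolutize(page_url, hrefs):
    seen = set()
    result = []
    for href in hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result

base = 'http://www.example.com/path/to/deep/page'
links = ['/foo', 'http://slashdot.org/foo', 'bar.html', '../up', '/foo#top']
for link in absolutize(base, links):
    print(link)
```

In a real crawler the `seen` set would presumably persist across pages rather than being rebuilt on every call.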

    --
    Kirk Strauser
    The Day Companies


    • Support Desk

      #3
      RE: More regex help

      Kirk,

      That's exactly what I needed. Thx!


      -----Original Message-----
      From: Kirk Strauser [mailto:kirk@athena.daycos.com]
      Sent: Wednesday, September 24, 2008 11:42 AM
      To: python-list@python.org
      Subject: Re: More regex help

      At 2008-09-24T16:25:02Z, "Support Desk" <support.desk.ipg@gmail.com> writes:
      > I am working on a python webcrawler, that will extract all links from an
      > html page, and add them to a queue, The problem I am having is building
      > absolute links from relative links, as there are so many different types
      > of relative links. If I just append the relative links to the current url,
      > some websites will send it into a never-ending loop.

      >>> import urllib
      >>> urllib.basejoin('http://www.example.com/path/to/deep/page',
      ...                 '/foo')
      'http://www.example.com/foo'
      >>> urllib.basejoin('http://www.example.com/path/to/deep/page',
      ...                 'http://slashdot.org/foo')
      'http://slashdot.org/foo'

      --
      Kirk Strauser
      The Day Companies

