More regex help

**Kirk Strauser** · Sep 24 '08, 04:55 PM

Re: More regex help

At 2008-09-24T16:25:02Z, "Support Desk" <support.desk.ipg@gmail.com> writes:

I am working on a Python web crawler that will extract all links from an
HTML page and add them to a queue. The problem I am having is building
absolute links from relative links, as there are so many different types of
relative links. If I just append the relative links to the current URL, some
websites will send it into a never-ending loop.

>>> import urllib
>>> urllib.basejoin('http://www.example.com/path/to/deep/page', '/foo')
'http://www.example.com/foo'

>>> urllib.basejoin('http://www.example.com/path/to/deep/page', 'http://slashdot.org/foo')
'http://slashdot.org/foo'

--
Kirk Strauser
The Day Companies
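
For anyone reading this on Python 3: the same joining logic is available as `urllib.parse.urljoin`. Below is a minimal sketch of how it might fit into a crawler queue. The helper names `resolve_links` and `enqueue_new`, the sample `hrefs` list, and the idea of tracking a `seen` set are illustrative assumptions, not something from the posts above. The `seen` set is one simple way to avoid re-queuing pages you have already found, which is a common cause of crawl loops.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def resolve_links(page_url, hrefs):
    """Turn href values found on page_url into absolute URLs (hypothetical helper)."""
    resolved = []
    for href in hrefs:
        # urljoin handles '/rooted', 'relative', '../parent', and full URLs.
        absolute = urljoin(page_url, href)
        # Drop any #fragment so the same page is not treated as a new URL.
        absolute, _fragment = urldefrag(absolute)
        resolved.append(absolute)
    return resolved


def enqueue_new(queue, seen, urls):
    """Append only URLs that have not been queued before (hypothetical helper)."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)


page = "http://www.example.com/path/to/deep/page"
# Sample hrefs, made up for illustration.
hrefs = ["/foo", "bar", "../up", "http://slashdot.org/foo", "/foo#top"]

queue, seen = deque(), {page}
enqueue_new(queue, seen, resolve_links(page, hrefs))
print(list(queue))
```

Here `/foo#top` resolves to the same URL as `/foo` once the fragment is stripped, so it is queued only once.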

**Support Desk** · Sep 24 '08, 07:25 PM

RE: More regex help

Kirk,

That's exactly what I needed. Thx!

-----Original Message-----
From: Kirk Strauser [mailto:kirk@athena.daycos.com]
Sent: Wednesday, September 24, 2008 11:42 AM
To: python-list@python.org
Subject: Re: More regex help

At 2008-09-24T16:25:02Z, "Support Desk" <support.desk.ipg@gmail.com> writes:

I am working on a Python web crawler that will extract all links from an
HTML page and add them to a queue. The problem I am having is building
absolute links from relative links, as there are so many different types of
relative links. If I just append the relative links to the current URL, some
websites will send it into a never-ending loop.

>>> import urllib
>>> urllib.basejoin('http://www.example.com/path/to/deep/page', '/foo')
'http://www.example.com/foo'

>>> urllib.basejoin('http://www.example.com/path/to/deep/page', 'http://slashdot.org/foo')
'http://slashdot.org/foo'

--
Kirk Strauser
The Day Companies
