scraping question

**bvdet** · Oct 2 '07, 11:55 PM

Originally posted by Patrick C

hey everyone, here's probably an easy question but i'm new to this...

i'm doing a web scrape and the line i want in the source code looks like this:

Code:

<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>

The number i want to get out is 10.58. However, that number will change so i can't just search for it. Any ideas how i should go about this?

I was hoping there might be a way to look for "<td align=right>" then say get the next line.

any thoughts.

thanks

This should work, but you probably need a way to terminate the for loop. What kind of data follows?
[code=Python]
import re

patt = re.compile(r'(Return on Average Equity)')
fn = 'test.txt'

f = open(fn)

# skip to 'Return on Average Equity'
s = f.next()
while not patt.search(s):
    s = f.next()

returnList = []
for line in f:
    if '<td align=right>' in line:
        returnList.append(f.next().strip())

print returnList
[/code]

>>> ['16.08%', '10.58%', '11.71%']
>>>
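
One way the loop could be terminated, as a rough sketch only: stop at the next plain `<td>` cell, on the assumption that it starts the next row's label. This version is written for Python 3 (so `next(lines)` instead of `f.next()`) and reads from an in-memory string rather than test.txt. The function name `values_after_label` and the extra "Some Other Row" lines in the sample are made up for illustration; the real page may be laid out differently.

```python
import io

# Sample from the question, plus a hypothetical following row to show
# where the loop stops.
SAMPLE = """<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
<td>Some Other Row</td>
<td align=right>
99.99%
</td>
"""


def values_after_label(lines, label, marker="<td align=right>"):
    """Return the line following each marker, starting after `label`.

    Stops at the next plain <td> cell, which is ASSUMED to begin a new row.
    """
    lines = iter(lines)
    # Skip ahead to the label; give up quietly if it never appears.
    for line in lines:
        if label in line:
            break
    else:
        return []

    values = []
    for line in lines:
        stripped = line.strip()
        if stripped == marker:
            # The value sits on the line right after the marker.
            values.append(next(lines, "").strip())
        elif stripped.startswith("<td>"):
            # Assumed end of the row: the next label cell.
            break
    return values


values = values_after_label(io.StringIO(SAMPLE), "Return on Average Equity")
print(values)
print(values[1])
```

Indexing the result with `[1]` picks out the middle column, which is the 10.58% figure the question asks for.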

**Patrick C** · Oct 3 '07, 02:47 PM

When I try your method I get an error that is like this...

Code:

>>> s = f.next()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
StopIteration
>>>

Also, if i plan to do this a few hundred/thousand times...would I need to create a few hundred/thousand text.txt files before hand?

Thanks

**bvdet** · Oct 3 '07, 04:37 PM

Originally posted by Patrick C

When I try your method I get an error that is like this...

Code:

>>> s = f.next()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
StopIteration
>>>

Also, if i plan to do this a few hundred/thousand times...would I need to create a few hundred/thousand text.txt files before hand?

Thanks

If you are scraping a website, you may be using the urllib module. Your code may look something like this:
[code=Python]
import urllib
f = urllib.urlopen('http://www.bvdetailing.com')
[/code]
'f' is a file-like object that you can iterate over. The file method next() is similar to readline(), except that next() raises StopIteration when the end of the file is reached, whereas readline() returns an empty string. That is the error you are seeing: nothing was left to read when you called f.next(). Example:
[code=Python]
>>> import urllib
>>> f = urllib.urlopen('http://www.bvdetailing.com')
>>> f.next()
'<html><head><meta http-equiv="Content-Type" content="text/html; charset=win.... .........
>>> s = f.read()
>>> f.next()
Traceback (most recent call last):
  File "<interactive input>", line 1, in ?
  File "C:\Python23\lib\socket.py", line 405, in next
    raise StopIteration
StopIteration
>>>
[/code]
There is no need to create a disk file. Consider using an HTML parser to get the data you need.
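
For what the HTML parser suggestion might look like, here is a minimal sketch using the standard library's `html.parser` module. It is written for Python 3, unlike the Python 2 snippets above. The class name `RowValueParser` is invented for this example, and it assumes that label cells are plain `<td>` tags and value cells are `<td align=right>`, as in the sample from the question. The page text is fed in as a string, so nothing has to be saved to disk.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
PAGE = """<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
"""


class RowValueParser(HTMLParser):
    """Collect the right-aligned cells that follow a given label cell."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.cell_kind = None     # "label", "value", or None outside a <td>
        self.collecting = False   # True while inside the matching row
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            # ASSUMPTION: value cells carry align=right, label cells do not.
            is_value = dict(attrs).get("align") == "right"
            self.cell_kind = "value" if is_value else "label"

    def handle_endtag(self, tag):
        if tag == "td":
            self.cell_kind = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self.cell_kind is None:
            return
        if self.cell_kind == "label":
            # Start collecting at our label, stop at any other label.
            self.collecting = (text == self.label)
        elif self.collecting:
            self.values.append(text)


parser = RowValueParser("Return on Average Equity")
parser.feed(PAGE)
parser.close()  # flush any text still buffered by the parser
print(parser.values)
```

In Python 3 the page text could come from `urllib.request.urlopen(url).read()` (decoded to a string) instead of the `PAGE` constant, so the same parser can be reused for each of the few hundred pages.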
