scraping question

  • Patrick C
    New Member
    • Apr 2007
    • 54

    scraping question

    hey everyone, here's probably an easy question but i'm new to this...

    i'm doing a web scrape and the lines i want in the source code look like this:

    Code:
    <td>Return on Average Equity</td>
    <td align=right>
    16.08%
    </td>
    <td align=right>
    10.58%
    </td>
    <td align=right>
    11.71%
    </td>
    The number i want to get out is 10.58. However, that number will change so i can't just search for it. Any ideas how i should go about this?

    I was hoping there might be a way to look for "<td align=right>" and then grab the next line.

    any thoughts?

    thanks
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    This should work, but you probably need a way to terminate the for loop. What kind of data follows?
    [code=Python]
    import re

    patt = re.compile(r'(Return on Average Equity)')
    fn = 'test.txt'

    f = open(fn)

    # skip to 'Return on Average Equity'
    s = f.next()
    while not patt.search(s):
        s = f.next()

    # collect the line that follows each '<td align=right>' tag
    returnList = []
    for line in f:
        if '<td align=right>' in line:
            returnList.append(f.next().strip())

    print returnList
    [/code]

    >>> ['16.08%', '10.58%', '11.71%']
    >>>
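For anyone trying this on Python 3, here is a minimal sketch of the same line-scanning idea. It is not the code above verbatim: it uses the built-in next() with a default so that running off the end of the input does not raise StopIteration, and it stops at the first '</tr>', on the assumption that the row of percentages ends there (the original snippet does not show what follows). The function name scrape_row and the sample text are made up for illustration.

```python
import io


def scrape_row(lines, label="Return on Average Equity", marker="<td align=right>"):
    """Return the values that follow each marker line after the label line.

    `lines` is any iterable of text lines (an open file, io.StringIO, ...).
    Assumption: the row of interest ends at the first '</tr>'.
    """
    it = iter(lines)

    # Skip ahead to the line containing the label; give up quietly if absent.
    for line in it:
        if label in line:
            break
    else:
        return []

    values = []
    for line in it:
        if "</tr>" in line:  # assumed end of the row
            break
        if marker in line:
            # next() with a default avoids StopIteration at end of input
            value = next(it, None)
            if value is None:
                break
            values.append(value.strip())
    return values


# Sample input modelled on the snippet in the first post; the second row
# is invented to show that the scan stops at '</tr>'.
SAMPLE = """<tr>
<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
</tr>
<tr>
<td>Some Other Ratio</td>
<td align=right>
99.99%
</td>
</tr>
"""

values = scrape_row(io.StringIO(SAMPLE))
print(values)     # ['16.08%', '10.58%', '11.71%']
print(values[1])  # 10.58%
```

Picking the second entry by position (values[1]) works however the number changes, as long as the column order on the page stays the same.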


    • Patrick C
      New Member
      • Apr 2007
      • 54

      #3
      When I try your method I get an error like this:
      Code:
      >>> s = f.next()
      Traceback (most recent call last):
        File "<interactive input>", line 1, in <module>
      StopIteration
      >>>
      Also, if I plan to do this a few hundred or a few thousand times, would I need to create that many test.txt files beforehand?

      Thanks


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        If you are scraping a website, you may be using the urllib module. Your code may look something like this:
        [code=Python]
        f = urllib.urlopen('http://www.bvdetailing.com')
        [/code]
        'f' is a file-like object on which you can iterate. The file method next() is similar to readline(), except that next() raises StopIteration when the end of the file is reached, whereas readline() returns an empty string. Example:
        [code=Python]
        >>> import urllib
        >>> f = urllib.urlopen('http://www.bvdetailing.com')
        >>> f.next()
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=win.... .........
        >>> s = f.read()
        >>> f.next()
        Traceback (most recent call last):
          File "<interactive input>", line 1, in ?
          File "C:\Python23\lib\socket.py", line 405, in next
            raise StopIteration
        StopIteration
        >>>
        [/code]
        There is no need to create a disk file. Consider using an HTML parser to get the data you need.
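The HTML-parser suggestion can be sketched with the standard library's html.parser module (Python 3). This is only one possible approach: the class name RowValueParser, the helper row_values and the sample markup are illustrative assumptions, and the sketch assumes the label sits in its own cell with the values in the following cells of the same <tr>. In real use the page text would come straight from something like urllib.request.urlopen(url).read() decoded to a string, so nothing has to be written to disk.

```python
from html.parser import HTMLParser


class RowValueParser(HTMLParser):
    """Collect the text of the <td> cells that follow a labelled cell in one row."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.in_cell = False   # currently inside a <td>?
        self.found = False     # label cell already seen?
        self.done = False      # labelled row finished, ignore the rest
        self.cell_text = []    # text fragments of the current cell
        self.values = []       # cell texts collected after the label

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_cell:
            self.in_cell = False
            text = "".join(self.cell_text).strip()
            if self.done:
                return
            if self.found:
                self.values.append(text)
            elif text == self.label:
                self.found = True
        elif tag == "tr" and self.found:
            self.done = True   # stop collecting at the end of the labelled row


def row_values(html, label="Return on Average Equity"):
    """Parse an HTML string and return the cell texts after the label cell."""
    parser = RowValueParser(label)
    parser.feed(html)
    parser.close()
    return parser.values


# Sample markup modelled on the first post; the surrounding table and the
# second row are invented for the demonstration.
SAMPLE_HTML = """<table>
<tr>
<td>Return on Average Equity</td>
<td align=right>
16.08%
</td>
<td align=right>
10.58%
</td>
<td align=right>
11.71%
</td>
</tr>
<tr>
<td>Some Other Ratio</td>
<td align=right>
99.99%
</td>
</tr>
</table>
"""

values = row_values(SAMPLE_HTML)
second = float(values[1].rstrip("%"))
print(values)  # ['16.08%', '10.58%', '11.71%']
print(second)  # 10.58
```

Because row_values takes a plain string, the same function can be called in a loop over a few hundred or a few thousand downloaded pages without any intermediate files.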
