RE: python screen scraping/parsing

  • bruce

    RE: python screen scraping/parsing

    Hi Paul...

    Thanks for the reply. Came to the same conclusion a few minutes before I saw
    your email.

    Another question:

    tr=d.xpath(foo)

    gets me an array of nodes.

    is there a way for me to then iterate through the node tr[x] to see if a
    child node exists???

    "d" is a document object, while "tr" would be a node object?, or would i
    convert the "tr[x]" to a string, and then feed that into the
    libxml2dom.parseString()...


    thanks



    -----Original Message-----
    From: python-list-bounces+bedouglas=earthlink.net@python.org
    [mailto:python-list-bounces+bedouglas=earthlink.net@python.org] On Behalf
    Of Paul Boddie
    Sent: Friday, June 13, 2008 12:49 PM
    To: python-list@python.org
    Subject: Re: python screen scraping/parsing


    On 13 Jun, 20:10, "bruce" <bedoug...@earthlink.net> wrote:
    > url = "http://www.pricegrabber.com/rating_summary.php/page=1"
    [...]
    > tr = "/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]/tbody/tr[4]"
    >
    > tr_ = d.xpath(tr)
    [...]
    > my issue appears to be related to the last "tbody", or tbody/tr[4]...
    >
    > if i leave off the tbody, i can display data, as the tr_ is an array with
    > data...

    Yes, I can confirm this.

    > with the "tbody" it appears that the tr_ array is not defined, or it has no
    > data... however, i can use the DOM tool with firefox to observe the fact
    > that the "tbody" is there...

    Yes, but the DOM tool in Firefox probably inserts virtual nodes for
    its own purposes. Remember that it has to do a lot of other stuff like
    implement CSS rendering and DOM event models.

    You can confirm that there really is no tbody by printing the result
    of this...

    d.xpath("/html/body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]")[0].toString()

    This should fetch the second table in a single element list and then
    obviously give you the only element of that list. You'll see that the
    raw HTML doesn't have any tbody tags at all.
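
The same check can be sketched without fetching the page. This is only an illustration, not the code from the thread: it uses the standard library's xml.etree.ElementTree (whose findall supports a limited subset of XPath) in place of libxml2dom, and the RAW markup is a made-up, well-formed stand-in for the real page. It shows a path containing tbody matching nothing when the source has no tbody tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page. Note: no <tbody> anywhere.
RAW = """<html><body>
<div id="pgSiteContainer"><div id="pgPageContent">
<table><tr><td>first table</td></tr></table>
<table>
<tr><td>row 1</td></tr>
<tr><td>row 2</td></tr>
</table>
</div></div>
</body></html>"""

root = ET.fromstring(RAW)  # root is the <html> element

# Path to the second table, relative to <html> (ElementTree paths are relative).
base = "./body/div[@id='pgSiteContainer']/div[@id='pgPageContent']/table[2]"

with_tbody = root.findall(base + "/tbody/tr")   # what the browser's DOM view suggests
without_tbody = root.findall(base + "/tr")      # what the raw markup actually contains

# Serialize the table to inspect the markup the parser really saw.
table_markup = ET.tostring(root.findall(base)[0], encoding="unicode")

print(len(with_tbody), len(without_tbody))
print("<tbody" in table_markup)
```

With this stand-in markup the tbody query returns an empty list while the query without tbody returns both rows, which mirrors the behaviour described above.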

    Paul
    --


  • Paul Boddie

    #2
    Re: python screen scraping/parsing

    On 13 Jun, 23:09, "bruce" <bedoug...@earthlink.net> wrote:
    > Thanks for the reply. Came to the same conclusion a few minutes before I saw
    > your email.
    >
    > Another question:
    >
    > tr=d.xpath(foo)
    >
    > gets me an array of nodes.
    >
    > is there a way for me to then iterate through the node tr[x] to see if a
    > child node exists???

    You can always use the DOM or perform another XPath query:

    for node in tr[x].childNodes:
        <do something with node>

    for node in tr[x].xpath(some_other_query_inside_tr):
        <do something with node>

    > "d" is a document object, while "tr" would be a node object?, or would i
    > convert the "tr[x]" to a string, and then feed that into the
    > libxml2dom.parseString()...

    There's no need to parse anything again: just use the methods on the
    object that tr[x] produces, including the xpath method, of course.
    Remember that the document object is just a special node object, so
    most of the methods are available on both. If in doubt, run your
    program using Python's -i option and then inspect the objects at the
    interactive prompt.
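
As a rough sketch of checking for a child node through the DOM, the example below uses the standard library's xml.dom.minidom rather than libxml2dom. That is a substitution: minidom offers the same DOM-style childNodes attribute but has no xpath method, so only the DOM half of the advice is shown. The markup and the helper names child_elements and has_child are invented for illustration.

```python
from xml.dom import minidom

# Hypothetical row markup; in the thread this node would come from d.xpath(...).
doc = minidom.parseString(
    "<table><tr><td>name</td><td><a href='#'>link</a></td></tr></table>"
)
row = doc.getElementsByTagName("tr")[0]


def child_elements(node):
    """Return the direct child elements of a DOM node, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]


def has_child(node, tag_name):
    """Return True if node has a direct child element named tag_name."""
    return any(c.tagName == tag_name for c in child_elements(node))


cells = child_elements(row)
print([c.tagName for c in cells])

# <td> is a direct child of the row; <a> is only a grandchild.
print(has_child(row, "td"), has_child(row, "a"))

# The document and an element are both nodes, so the same calls work on each.
print(doc.hasChildNodes(), row.hasChildNodes())
```

Saving this as a script and running it with python -i leaves doc and row defined at the interactive prompt, where dir(row) lists the methods available on the node.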

    Paul
