HTMLDocument and Xpath

  • swilson@acs.on.ca

    HTMLDocument and Xpath

    Hi, I want to use XPath to scrape info from a website using PyXML, but I
    keep getting no results.

    For example, in the following, I want to return the text "element1", but
    I can't get XPath to return anything at all. What's wrong with this
    code?

    --------------------
    from xml.dom.ext.reader import HtmlLib
    from xml.xpath import Evaluate

    reader = HtmlLib.Reader()
    doc_node = reader.fromString("""
    <html>
    <head>
    <title>Python Programming Language</title>
    </head>
    <body>
    <table><tr><td> element1</td></tr></table>
    </body>
    </html>
    """)

    test = Evaluate('td', doc_node.documentElement)
    print "test =", test
    ------------

    All I get is an empty list for output.

    Thx in advance

    Shawn

  • Alan Kennedy

    #2
    Re: HTMLDocument and Xpath

    [swilson@acs.on.ca]
    > Hi, I want to use XPath to scrape info from a website using PyXML, but I
    > keep getting no results.
    >
    > For example, in the following, I want to return the text "element1", but
    > I can't get XPath to return anything at all. What's wrong with this
    > code?

    Your XPath expression is wrong.

    > test = Evaluate('td', doc_node.documentElement)

    Try one of the following alternatives, all of which should work.

    test = Evaluate('//td', doc_node.documentElement)
    test = Evaluate('/html/body/table/tr/td', doc_node.documentElement)
    test = Evaluate('/html/body/table/tr/td[1]', doc_node.documentElement)

    HTH,

    Alan.
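
The difference between the original 'td' path and the alternatives listed in this reply can be sketched without PyXML. This is only an illustration: it swaps in the standard library's xml.etree.ElementTree (an assumption, not the PyXML Evaluate call used in the thread, and it supports only a subset of XPath). A bare name such as td matches only direct children of the context node, while .//td searches every descendant.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question, parsed with ElementTree instead of PyXML.
root = ET.fromstring(
    "<html><head><title>Python Programming Language</title></head>"
    "<body><table><tr><td> element1</td></tr></table></body></html>"
)

# 'td' is relative to the context node (<html>) and looks only at its
# direct children, which are <head> and <body>.
direct_children = root.findall("td")

# './/td' walks all descendants of the context node.
descendants = root.findall(".//td")

print("direct children named td:", direct_children)
print("descendant td text:", [td.text.strip() for td in descendants])
```

Here the tag names in the tree are lower case, so the descendant search succeeds; whether it succeeds in PyXML depends on how the HTML reader stores element names.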


    • swilson@acs.on.ca

      #3
      Re: HTMLDocument and Xpath


      Alan Kennedy wrote:
      > [swilson@acs.on.ca]
      > > Hi, I want to use XPath to scrape info from a website using PyXML, but I
      > > keep getting no results.
      > >
      > > For example, in the following, I want to return the text "element1", but
      > > I can't get XPath to return anything at all. What's wrong with this
      > > code?
      >
      > Your XPath expression is wrong.
      >
      > > test = Evaluate('td', doc_node.documentElement)
      >
      > Try one of the following alternatives, all of which should work.
      >
      > test = Evaluate('//td', doc_node.documentElement)
      > test = Evaluate('/html/body/table/tr/td', doc_node.documentElement)
      > test = Evaluate('/html/body/table/tr/td[1]', doc_node.documentElement)
      >
      > HTH,
      >
      > Alan.

      I tried all of those and in every case, test returns "[]". Does
      Evaluate only work with XML documents?

      Shawn


      • swilson@acs.on.ca

        #4
        Re: HTMLDocument and Xpath

        Got the answer. There's a bug in XPath, I think: the HTML parser
        converts all the tags (but not the attributes) to uppercase. XPath
        definitely does not like my first string, but these work fine:

        test = Evaluate('//TD', doc_node.documentElement)
        test = Evaluate('/HTML/BODY/TABLE/TR/TD', doc_node.documentElement)
        test = Evaluate('/HTML/BODY/TABLE/TR/TD[1]', doc_node.documentElement)

        Shawn
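
The result reported here is consistent with XPath name tests being case-sensitive: if the parsed tree holds upper-case element names, a lower-case path matches nothing. The sketch below illustrates that under stated assumptions. It uses the standard library's xml.etree.ElementTree as a stand-in for PyXML, the upper-case markup simulates what the HTML reader is reported to produce, and find_ignoring_case is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# Upper-case tags stand in for the tree the HTML reader is said to build.
upper_root = ET.fromstring(
    "<HTML><BODY><TABLE><TR><TD> element1</TD></TR></TABLE></BODY></HTML>"
)

# Name tests are case-sensitive, so the lower-case path finds nothing.
lower_hits = upper_root.findall(".//td")
upper_hits = upper_root.findall(".//TD")


def find_ignoring_case(root, name):
    """Hypothetical helper: collect elements whose tag matches name, ignoring case."""
    wanted = name.lower()
    return [el for el in root.iter() if el.tag.lower() == wanted]


print("lower-case path:", lower_hits)
print("upper-case path:", [el.text.strip() for el in upper_hits])
print("case-insensitive:", [el.text.strip() for el in find_ignoring_case(upper_root, "td")])
```

Walking the tree and comparing lower-cased tag names avoids depending on which case a given parser chooses, at the cost of giving up path expressions.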
