DOM with HTML

  • Alessio Pace

    DOM with HTML

    Hi, I need to build a sort of DOM from an HTML page that is declared as XHTML
    but unfortunately is *not* valid XHTML. If I try to parse it with
    xml.dom.minidom I get an error from expat (as I expected), so I was told to
    try it this way, with a "forgiving" HTML parser:

    from xml.dom.ext.reader import HtmlLib
    reader = HtmlLib.Reader()
    dom = reader.fromUri(url)  # 'url' is the address of the web page
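For reference, the expat failure described above can be reproduced with the standard library alone. This is only a sketch for a current Python 3 (not the 2.2 of the original post); the sample markup and the helper name `try_strict_parse` are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical page: declared as XHTML, but <br> and <p> are never closed.
BROKEN_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><p>first line<br>second line</body>
</html>"""


def try_strict_parse(markup):
    """Return a minidom Document, or None if expat rejects the markup."""
    try:
        return minidom.parseString(markup)
    except ExpatError as exc:
        # expat is a strict XML parser, so any unclosed tag ends up here.
        print("expat refused the page:", exc)
        return None


if __name__ == "__main__":
    try_strict_parse(BROKEN_XHTML)
```

Well-formed input comes back as a normal Document, so the same call can serve as a quick check of whether a page needs a forgiving parser at all.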

    FIRST ISSUE:
    It seems to me, from reading the source code in
    $MY_PYTHON_INSTALLATION_DIR/site-packages/_xmlplus/dom/ext/reader/ ,
    that these are 4DOM APIs. From what I know of Python distributions, that
    makes them extra packages, or not? I would like to use *only* libs that
    ship with the python2.2 suite, nothing extra.
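One way to stay within the standard library might be to drive a minidom tree from the tolerant HTML tokenizer that ships with Python. The sketch below assumes a current Python 3 and its `html.parser` module rather than Python 2.2. `MiniDomBuilder`, `parse_html` and the synthetic `root` wrapper element are hypothetical names chosen for this example, and the recovery rules (void elements never stay open, stray end tags are ignored, comments are dropped) are deliberately simplistic.

```python
from html.parser import HTMLParser
from xml.dom import minidom

# Elements that never have content in HTML, so no end tag is expected.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class MiniDomBuilder(HTMLParser):
    """Hypothetical helper: turns HTMLParser events into a minidom tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.document = minidom.Document()
        # Synthetic wrapper, so stray text or several top-level elements
        # still fit under a single document element.
        root = self.document.createElement("root")
        self.document.appendChild(root)
        self._stack = [root]  # currently open elements, innermost last

    def handle_starttag(self, tag, attrs):
        element = self.document.createElement(tag)
        for name, value in attrs:
            element.setAttribute(name, value if value is not None else "")
        self._stack[-1].appendChild(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name (and everything
        # opened inside it); ignore end tags that match nothing.
        for index in range(len(self._stack) - 1, 0, -1):
            if self._stack[index].nodeName == tag:
                del self._stack[index:]
                break

    def handle_data(self, data):
        self._stack[-1].appendChild(self.document.createTextNode(data))


def parse_html(markup):
    """Parse possibly broken HTML and return a minidom Document."""
    builder = MiniDomBuilder()
    builder.feed(markup)
    builder.close()
    return builder.document


if __name__ == "__main__":
    dom = parse_html("<body><p>first line<br>second line</body>")
    print(dom.documentElement.toxml())
```

Because the result is an ordinary minidom tree, `.toxml()` is available on every node, which would also sidestep the second issue below.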

    SECOND ISSUE:
    If the above libs were included in Python (so that I could keep using
    them), how do I print a string representation of a (sub)tree of the DOM? I
    tried .toxml() as in the XML tutorial, but that method does not exist
    for the FtNode objects that are involved there. Any ideas?
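When a node class lacks `.toxml()`, a subtree can still be written out by walking it with the core DOM attributes only (`nodeType`, `nodeName`, `attributes`, `childNodes`, `data`). The sketch below is demonstrated on minidom under a current Python 3; that it would behave the same on 4DOM's FtNode objects is an assumption, not something verified here. `serialize` is a hypothetical helper name.

```python
from xml.dom import Node, minidom
from xml.sax.saxutils import escape, quoteattr


def serialize(node):
    """Return markup for `node` and its descendants via core DOM attributes."""
    if node.nodeType == Node.TEXT_NODE:
        return escape(node.data)
    if node.nodeType == Node.COMMENT_NODE:
        return "<!--%s-->" % node.data
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = node.attributes
        attr_text = "".join(
            " %s=%s" % (attrs.item(i).name, quoteattr(attrs.item(i).value))
            for i in range(attrs.length)
        )
        children = "".join(serialize(child) for child in node.childNodes)
        if children:
            return "<%s%s>%s</%s>" % (
                node.nodeName, attr_text, children, node.nodeName)
        return "<%s%s/>" % (node.nodeName, attr_text)
    # Documents, fragments and other node types: only descend into children.
    return "".join(serialize(child) for child in node.childNodes)


if __name__ == "__main__":
    doc = minidom.parseString('<a x="1"><b>t &amp; u</b><c/></a>')
    # Print just the <b> subtree.
    print(serialize(doc.getElementsByTagName("b")[0]))
```

Node types without children that are not handled explicitly (for example processing instructions) simply produce no output in this sketch.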

    Thanks so much to anyone who can help me.

    --
    bye
    Alessio Pace
  • F. GEIGER

    #2
    Re: DOM with HTML

    > Hi, I need to get a sort of DOM from an HTML page that is declared as XHTML
    > but unfortunately is *not* xhtml valid.. If I try to parse it with

    I use mx.Tidy in such cases, with great success.

    Cheers
    Franz

