Re: parsing javascript

  • Philip Semanchuk

    Re: parsing javascript


    On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
    > I have to parse web pages and fetch URLs. My problem is that many of
    > the URLs I need to parse are loaded dynamically by a JavaScript
    > function (onload()). How can I fetch those links from Python?
    > Thanks in advance.

    Selvam,
    You can try to find them yourself using string parsing, but that's
    difficult. The closer you want to get to "perfect" at finding URLs
    expressed in JS, the closer you'll get to rewriting a JS interpreter.
    For instance, this is not so hard to understand:
    "http://example.com/"
    but this is:
    "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
    the_domain_variable)
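
    To make the difference concrete, here is a minimal sketch of the
    string-parsing approach, written for current Python 3. The regular
    expression and the name find_literal_urls are illustrative assumptions,
    not part of any library. It finds quoted URL literals only, so for the
    second case it returns the placeholder URL rather than the one the
    browser would actually build.

```python
import re

# Heuristic: a quoted string literal that starts with http:// or https://.
# This is pattern matching on text, not a JavaScript parser.
URL_LITERAL = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def find_literal_urls(js_source):
    """Return the URL-looking string literals in js_source, in order."""
    return URL_LITERAL.findall(js_source)


easy = 'window.location = "http://example.com/";'
hard = ('var u = "http://ZZZ_DOMAIN_ZZZ/index.html"'
        '.replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable);')

print(find_literal_urls(easy))
# The placeholder comes back unchanged: the regex cannot evaluate .replace()
print(find_literal_urls(hard))
```

    Getting the real URL out of the second snippet means evaluating the
    replace() call and knowing the value of the_domain_variable, which is
    exactly the job of a JS interpreter.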

    This is a long-standing problem for any program that parses Web pages.
    You either have to embed a JS interpreter in your application or just
    ignore the JavaScript. Most Web parsing robots take the latter route.

    Good luck
    Philip
  • lkcl

    #2
    Re: parsing javascript

    On Oct 12, 2:28 pm, Philip Semanchuk <phi...@semanchuk.com> wrote:
    > On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
    >
    > > I have to parse web pages and fetch URLs. My problem is that many of
    > > the URLs I need to parse are loaded dynamically by a JavaScript
    > > function (onload()). How can I fetch those links from Python?
    > > Thanks in advance.
    >
    > Selvam,
    > You can try to find them yourself using string parsing, but that's
    > difficult. The closer you want to get to "perfect" at finding URLs
    > expressed in JS, the closer you'll get to rewriting a JS interpreter.
    > For instance, this is not so hard to understand:
    > "http://example.com/"
    > but this is:
    > "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
    > the_domain_variable)
    >
    > This is a long-standing problem for any program that parses Web pages.

    yep :)

    > You either have to embed a JS interpreter in your application or

    yep.

    there are several.

    pyv8 is the newest addition: http://advogato.org/article/985.html

    it's a python wrapper around google's v8 javascript execution
    library.

    then there's pykhtml: http://paul.giannaros.org/pykhtml/

    it's a python wrapper around KHTML, providing very convenient access
    to KDE's HTML capabilities: what pykhtml does is "pretends" that the
    GUI part of KDE doesn't exist, so you can run your program as a
    command-line shell; it will execute the javascript, which you will
    have to wait a bit for of course; then you can walk the DOM tree
    (using pykhtml bindings) using pykhtml.DOM.getElementById() and
    getElementsByTagName("a") etc. etc. looking for the URLs.
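
    the "look for the URLs" step can be sketched without any browser
    bindings, assuming you already have the page's HTML as text after the
    javascript has run. this is NOT the pykhtml API: it uses only the
    python 3 standard library, and LinkCollector and extract_links are
    made-up names for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html_text, base_url):
    """Return absolute URLs for all <a href=...> found in html_text."""
    collector = LinkCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.links


rendered = ('<html><body><a href="/index.html">home</a> '
            '<a href="http://example.com/x">x</a></body></html>')
print(extract_links(rendered, "http://example.com/"))
```

    with real DOM bindings you would ask the DOM for the "a" elements
    directly instead of re-parsing text, but the loop over href values is
    the same idea.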

    there's even an AJAX example included which does 1-second polling of
    the DOM model, waiting for a spell-checking web site to deliver the
    answer.

    then there's webkit, with the new glib bindings:


    which are then followed up by python bindings to _those_ bindings:


    this will also allow you to execute arbitrary javascript - again, it's
    similar to KHTML and in fact webkit really _is_ the KDE KHTML code
    (JavaScriptCore, KJS etc) but forked, improved, etc. etc.

    unfortunately, the glib bindings are tied - at three key and strategic
    locations - to gtk at the moment. it would take _very_ little work
    to "un"tie them [pay me and i'll do the work], but until then you would
    need to create a blank gtk window - just like is done with pykhtml,
    behind the scenes.

    it would be a very simple task to create a "dummy" - console-based -
    port of webkit, providing an array of callbacks which you must hand to
    the library. at the moment, the design of webkit is not particularly
    good in this respect: there are three ports, gtk, wx and qt, which are
    heavily tied in to webkit. it would be a _far_ better design to be
    passing in a struct containing function callbacks (rather a lot of
    them - about eighty!) and then what you could do is have a "console"-
    based port of webkit, which would do the job you needed.
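
    a rough picture of the "struct of callbacks" idea, in python rather
    than webkit's C++. everything here is hypothetical: PortCallbacks,
    Engine and the three hooks are invented for illustration and are far
    fewer than the eighty or so a real port would need. the point is only
    that the engine talks to whatever callbacks it is handed, so a
    console "port" is just a different set of functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PortCallbacks:
    """Hypothetical bundle of hooks an engine calls instead of a GUI toolkit."""
    request_repaint: Callable[[], None]
    show_alert: Callable[[str], None]
    load_finished: Callable[[str], None]


class Engine:
    """Stand-in for a rendering engine; it only knows the callbacks it got."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def load(self, url):
        # A real engine would fetch, parse and run scripts here.
        self.callbacks.request_repaint()
        self.callbacks.load_finished(url)


log = []
console_port = PortCallbacks(
    request_repaint=lambda: None,  # nothing to draw on a console
    show_alert=lambda message: log.append(("alert", message)),
    load_finished=lambda url: log.append(("loaded", url)),
)

Engine(console_port).load("http://example.com/")
print(log)
```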

    alternatively, if you don't mind wrapping a binary application with
    e.g. Popen3 then look at the webkit DumpRenderTree application, paying
    particular attention to using the --html option. you won't have any
    control over how long the javascript is executed for. after an
    arbitrary and small period of time, DumpRenderTree _stops_ executing
    the javascript and prints out the HTML DOM model (in a non-html-layout
    fashion - it's used for debugging and testing purposes but will
    suffice for your purposes).
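
    wrapping a binary and reading what it prints can be sketched like
    this in current python 3, using subprocess rather than Popen3. the
    helper name run_and_capture is made up, and the child process below is
    a stand-in so the sketch runs anywhere: the real DumpRenderTree command
    line is not shown here, so check its own help output for the exact
    arguments before swapping it in.

```python
import subprocess
import sys


def run_and_capture(command, timeout_seconds=30):
    """Run command (a list of arguments) and return its stdout as text.

    Raises subprocess.TimeoutExpired if the child runs too long and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout_seconds,
        check=True,
    )
    return completed.stdout


# Stand-in child process that prints one link, in place of the real binary.
stand_in = [
    sys.executable,
    "-c",
    "print('<a href=\"http://example.com/\">x</a>')",
]
output = run_and_capture(stand_in)
print(output)
```

    the timeout here only bounds how long you wait for the child; it does
    not give you any control over how long the child itself keeps running
    the javascript.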

    so, as it stands, pywebkitgtk is _no worse_ than pykhtml, but with a
    little bit of tweaking, the "gtk" could be removed from "pywebkitgtk"
    and you'd end up with... ohh... call it "pywebkitglib" ... which would
    be much better as a stand-alone library, for your purposes.



    then there's also "spidermonkey", which is mozilla's javascript
    engine. i haven't investigated this option: haven't had a need to.

    then there's also PyXPCOMExt, which is embedding python into mozilla,
    and from there you have PyDOM, which allows you access to the DOM
    model of the mozilla "thing". so, if you don't mind embedding your
    application into XULRunner, you've got a home for executing your app
    and obtaining the urls, post-javascript-execution.

    the neat thing about PyXPCOMExt is that you have complete and full
    access to python - so your app can make external TCP and UDP sockets,
    you can embed an entire _server_ in the damn thing if you want (you
    could embed... python-twisted if you wanted!) you can access the
    filesystem - anything. absolutely anything. reason: the _entire_
    python suite is embedded into the browser. every single bit of it.


    that's about all i've been able to find, so far. there might be more
    options out there. not that there aren't enough already :)

    all of them will allow you complete and full access to execution of
    javascript, including AJAX execution. which is why you'll need to do
    that "polling" trick in many instances.
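
    the polling trick itself is independent of which engine you pick. a
    minimal sketch in python 3, where poll_until and fake_probe are made-up
    names: probe stands for whatever function looks at the DOM and returns
    the links once they exist, or None while the page is still working.

```python
import time


def poll_until(probe, interval_seconds=1.0, timeout_seconds=30.0):
    """Call probe() repeatedly until it returns something other than None.

    Returns that value, or None if timeout_seconds pass first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval_seconds)


# Stand-in for a DOM whose links only appear on the third look.
state = {"looks": 0}


def fake_probe():
    state["looks"] += 1
    if state["looks"] < 3:
        return None
    return ["http://example.com/"]


links = poll_until(fake_probe, interval_seconds=0.01, timeout_seconds=1.0)
print(links)
```

    the timeout matters: a page whose AJAX call never completes would
    otherwise keep you polling forever.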

    l.
