LWP questions

  • Richard Bell

    LWP questions


    I'm returning to Perl and Linux after many years away and while I
    know/knew way back when about Perl and Unix I'm new to this world
    today.

    I'm considering using LWP as the heart of a Web application and have a
    number of questions.

    It appears to me that the get method returns ONLY the content of the
    single object referenced by the URL. Is this correct? To what
    degree, if any, does LWP's get deal with scripts on the page that may
    be involved in building the page content?

    In the end, I need to get a page in much the same way a browser does
    and then examine it, looking at the text on the page (as it would be
    rendered by IE or Mozilla) for a bunch of stuff. I also need to
    examine the HTML of the page as actually displayed, again for a
    bunch of stuff. On XP (no flames please, surely Perl programmers can
    forgive an attachment to the ugly real world) the IE object model
    exposes two properties, innerText and innerHTML. innerText is a
    linearized version of the text as displayed on the page AFTER all
    scripts have executed. innerHTML seems to be the HTML that would
    create the page AFTER all scripts have executed. It is this kind of
    structure that I need. Can LWP help me here? What is the basic
    attack? Are there any examples in the Perl world?

    Thanks for any help/clues.

    R
  • Roel van der Steen

    #2
    Re: LWP questions

    On Tue, 16 Mar 2004 at 18:01 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
    > I'm considering using LWP as the heart of a Web application and have a
    > number of questions.

    LWP does not render the page, nor does it execute (client-side)
    scripts, nor does it provide you with a DOM. However, you can
    get the HTML using LWP and parse that with any of the available
    HTML parsers (e.g., HTML-TreeBuilder).


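    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;
    use LWP::Simple;

    my $cachefile = 'mirrored.htm';

    # mirror() returns the HTTP status; 304 (not modified) is fine here.
    my $status = mirror('http://cpan.org', $cachefile);
    die "Could not fetch page: HTTP status $status\n" if is_error($status);

    my $tree = HTML::TreeBuilder->new_from_file($cachefile);

    # Print the text of the first <table> element, if the page has one.
    my $table = $tree->look_down('_tag', 'table');
    print $table->as_text if $table;

    $tree->delete;    # free the parse tree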


    • Richard Bell

      #3
      Re: LWP questions

      Thanks Roel, that was very helpful.

      For my application, I need something that will do all such things as
      might happen in a real browser that would create user visible content
      on the screen. For many of the pages I'll be working with that
      includes various client side scripts and includes. While LWP gets
      part of the way, it doesn't seem to go as far as this project needs.

      As I mentioned, I'm newly returned to Unix/Linux and Perl. Is there
      something that might be more appropriate? I have some previous
      experience with IE COM automation under XP. Can I play the same sort of
      game (or hopefully a simpler one) under Linux? What do I use for an
      engine? Can I get by with wget (it seems to do a good job of
      mirroring)? Will I need to work with Mozilla?

      I'd appreciate any advice.

      Thanks again.

      R

      On 17 Mar 2004 00:55:24 GMT, Roel van der Steen <roel-perl@st2x.net>
      wrote:

      >On Tue, 16 Mar 2004 at 18:01 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
      >> I'm considering using LWP as the heart of a Web application and have a
      >> number of questions.
      >
      >LWP does not render the page, nor does it execute (client-side)
      >scripts, nor does it provide you with a DOM. However, you can
      >get the HTML using LWP and parse that with any of the available
      >HTML parsers (e.g., HTML-TreeBuilder).
      >
      >
      >#!/usr/bin/perl
      >use strict;
      >use warnings;
      >use HTML::TreeBuilder;
      >use LWP::Simple;
      >
      >my $cachefile = 'mirrored.htm';
      >
      ># mirror() returns the HTTP status; 304 (not modified) is fine here.
      >my $status = mirror('http://cpan.org', $cachefile);
      >die "Could not fetch page: HTTP status $status\n" if is_error($status);
      >
      >my $tree = HTML::TreeBuilder->new_from_file($cachefile);
      >
      ># Print the text of the first <table> element, if the page has one.
      >my $table = $tree->look_down('_tag', 'table');
      >print $table->as_text if $table;
      >
      >$tree->delete;    # free the parse tree


      • Roel van der Steen

        #4
        Re: LWP questions

        (Top-posting reordered.)

        On Wed, 17 Mar 2004 at 01:50 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
        > On 17 Mar 2004 00:55:24 GMT, Roel van der Steen <roel-perl@st2x.net>
        > wrote:
        >
        >>On Tue, 16 Mar 2004 at 18:01 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
        >>> I'm considering using LWP as the heart of a Web application and have a
        >>> number of questions.
        >>
        >>LWP does not render the page, nor does it execute (client-side)
        >>scripts, nor does it provide you with a DOM.
        >
        > For many of the pages I'll be working with that
        > includes various client side scripts and includes.
        >
        Maybe HTML::Display is more in the direction you want. Or WWW::Mechanize.
        Did you already have a look at http://cpan.org ?
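
For readers following up on the WWW::Mechanize suggestion, here is a minimal sketch of fetching a page and listing its title and links. It is an illustration under stated assumptions, not something posted in the thread: the sample page, the temporary file, and the variable names are hypothetical, and with no command-line argument the script reads that local sample so it runs offline. Note that WWW::Mechanize, like LWP, does not execute JavaScript.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Hypothetical sample page, used only when no URL is given on the command line.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} <<'HTML';
<html><head><title>Sample page</title></head>
<body><p>See <a href="http://cpan.org/">CPAN</a>.</p></body></html>
HTML
close $fh or die "Cannot close $path: $!";

# Use the URL from the command line, or fall back to the local sample file.
my $url = shift(@ARGV) // URI::file->new_abs($path)->as_string;

# autocheck makes get() die on a failed request instead of failing silently.
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);

# Collect the page title and the absolute URL of every link on the page.
my $title = $mech->title // '(no title)';
my @links = map { $_->url_abs->as_string } $mech->links;

print "Title: $title\n";
print "Link:  $_\n" for @links;
```

Passing a real address as the first argument (for example http://cpan.org) fetches that page instead of the sample file.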


        • Richard Bell

          #5
          Re: LWP questions

          On 17 Mar 2004 03:12:38 GMT, Roel van der Steen <roel-perl@st2x.net>
          wrote:

          >(Top-posting reordered.)
          >
          >On Wed, 17 Mar 2004 at 01:50 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
          >> On 17 Mar 2004 00:55:24 GMT, Roel van der Steen <roel-perl@st2x.net>
          >> wrote:
          >>
          >>>On Tue, 16 Mar 2004 at 18:01 GMT, Richard Bell <rbell01824@earthlink.net> wrote:
          >>>> I'm considering using LWP as the heart of a Web application and have a
          >>>> number of questions.
          >>>
          >>>LWP does not render the page, nor does it execute (client-side)
          >>>scripts, nor does it provide you with a DOM.
          >>
          >> For many of the pages I'll be working with that
          >> includes various client side scripts and includes.
          >>
          >Maybe HTML::Display is more in the direction you want. Or WWW::Mechanize.

          Thanks, I'll look into HTML::Display and WWW::Mechanize. I picked up
          the O'Reilly books and am also checking the web on these packages, but
          the learning curve right now is a bit stiff, particularly when I'm not
          really sure where to look or what to look at. Thanks for your help.

          >Did you already have a look at http://cpan.org ?

          I have checked cpan. Lots of apparently good stuff there, but again
          I'm faced with not knowing what is really appropriate for my needs.

          I've thought about trying to automate Mozilla and accessing its DOM
          object to get at what I want. Do you have any reflections on that
          attack?

          Thanks again for the new clues.

          R


          • Joe Smith

            #6
            Re: LWP questions

            Richard Bell wrote:

            > Thanks Roel, that was very helpful.
            >
            > For my application, I need something that will do all such things as
            > might happen in a real browser that would create user visible content
            > on the screen. For many of the pages I'll be working with that
            > includes various client side scripts and includes. While LWP gets
            > part of the way, it doesn't seem to go as far as this project needs.

            When LWP requests a page from a server, it is no different than any
            other browser's request, in that the server will process server-side
            includes.

            If the HTML returned contains JavaScript, it is up to you to provide
            a JavaScript interpreter. I've seen many JavaScript functions that
            do things like ask the graphical browser they are running in for the
            size (in pixels) of the currently active window so that they can
            decide on the layout of the text they will be writing to the
            document window. Other JavaScript uses include reading or
            modifying the text being displayed in a field of a form. (Think of
            <input type="text" name="clock" value="12:45:00 pm">.)

            In other words, to handle a full range of client-side scripts,
            you will have to re-invent a very large wheel: a complete browser
            with graphical display and GUI widgets.

            LWP is good at getting the raw HTML from the server. Postprocessing
            the HTML on the client side before, during, and after rendering is
            an entirely different kettle of fish.

            I certainly would not want to emulate the quirks (features, bugs) of
            IE 6 vs IE 5 vs Netscape vs Mozilla vs Opera.
            -Joe
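
To make the point about unexecuted scripts concrete, here is a small sketch that parses a page whose second paragraph would only exist after a browser ran its script. This is an illustration, not code from the thread: the HTML string is a hypothetical page, and the check relies on HTML::Element's as_text leaving out the contents of script elements. The script-generated sentence is therefore not expected to show up in the extracted text, because nothing here ever runs the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page: the second paragraph exists only after a browser
# has executed the document.write() call.
my $html = <<'HTML';
<html><body>
<p>Static text from the server.</p>
<script type="text/javascript">
document.write("<p>Text added by script.</p>");
</script>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# as_text collects the text of the parsed elements and skips <script>
# contents; the script itself is never run by the parser.
my $text = $tree->as_text;

my $has_static = $text =~ /Static text from the server/ ? 'yes' : 'no';
my $has_script = $text =~ /Text added by script/        ? 'yes' : 'no';

print "Static text found:           $has_static\n";
print "Script-generated text found: $has_script\n";

$tree->delete;    # free the parse tree
```

Getting the second paragraph would need something that actually executes the script, which is the large wheel described above.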



            • Richard Bell

              #7
              Re: LWP questions


              No one ever said it would be easy.

              I'm now looking into automating Mozilla (let it do the heavy lifting),
              possibly from perl, possibly using the Mozilla application
              environment. Any ideas where I can get clues/examples/insight into
              the issues from the perl side? I've got the O'Reilly book for the app
              environment so I'm reasonably armed there.

              Richard

              On Sat, 20 Mar 2004 22:33:08 GMT, Joe Smith <Joe.Smith@inwap.com>
              wrote:

              >Richard Bell wrote:
              >
              >> Thanks Roel, that was very helpful.
              >>
              >> For my application, I need something that will do all such things as
              >> might happen in a real browser that would create user visible content
              >> on the screen. For many of the pages I'll be working with that
              >> includes various client side scripts and includes. While LWP gets
              >> part of the way, it doesn't seem to go as far as this project needs.
              >
              >When LWP requests a page from a server, it is no different than any
              >other browser's request, in that the server will process server-side
              >includes.
              >
              >If the HTML returned contains JavaScript, it is up to you to provide
              >a JavaScript interpreter. I've seen many JavaScript functions that
              >do things like ask the graphical browser they are running in for the
              >size (in pixels) of the currently active window so that they can
              >decide on the layout of the text they will be writing to the
              >document window. Other JavaScript uses include reading or
              >modifying the text being displayed in a field of a form. (Think of
              ><input type="text" name="clock" value="12:45:00 pm">.)
              >
              >In other words, to handle a full range of client-side scripts,
              >you will have to re-invent a very large wheel: a complete browser
              >with graphical display and GUI widgets.
              >
              >LWP is good at getting the raw HTML from the server. Postprocessing
              >the HTML on the client side before, during, and after rendering is
              >an entirely different kettle of fish.
              >
              >I certainly would not want to emulate the quirks (features, bugs) of
              >IE 6 vs IE 5 vs Netscape vs Mozilla vs Opera.
              > -Joe
