Translating Foreign HTML Code

  • alnoir
    New Member
    • Apr 2007
    • 23

    I'm working on this script that grabs a web page from a foreign site, searches it for specific information, and grabs web pages from links on the original page. Once I had it working, I tried it out on the foreign site. However, the information I got back was nonsensical. I'm guessing the code I get back from the web page is written in that foreign language, but when I [view]->[page source] of the same page, it looks like normal html code.

    Does anyone know what is happening here or how to fix it?

    Here is my script:
    Code:
    use strict;
    use WWW::Mechanize;
    
    my $mech = WWW::Mechanize->new();
    my $page = $mech->get('http://russia.ru');
    
    print $page->content;

    Here is what the page looks like when I view the source code from my browser:


    Here is what html code is returned after the script is run:


    I hope I've provided adequate information. Thank you for all of your help.
  • eWish
    Recognized Expert Contributor
    • Jul 2007
    • 973

    #2
    There is nothing wrong with the code you posted. Have you tried to check other sites? The Russian site you are trying to view is mostly flash content. That is likely your problem.

    --Kevin

    • alnoir
      New Member
      • Apr 2007
      • 23

      #3
      Thanks for your input!

      I'm developing this script for web sites from many different countries. The first one I tried it on was another Russian page; I simply provided russia.ru as an example. I want to be able to search the retrieved content for different strings, but I don't know how I can do that if the content isn't normal html.
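
Searching pages in other languages usually comes down to character encoding rather than the HTML itself. Below is a minimal sketch, assuming the page bytes are UTF-8 (a real page may declare a different charset, such as windows-1251). The sample markup and the search terms are made up for illustration and do not come from any site mentioned in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: UTF-8 bytes as they might arrive from a server.
# The heading spells the Russian word for "news".
my $bytes = "<html><body><h1>"
          . "\xD0\x9D\xD0\xBE\xD0\xB2\xD0\xBE\xD1\x81\xD1\x82\xD0\xB8"
          . "</h1></body></html>";

# Turn the bytes into Perl characters; UTF-8 is an assumption here.
my $text = decode( 'UTF-8', $bytes );

# Search terms written as code points so this file stays plain ASCII.
my %terms = (
    news    => "\x{41D}\x{43E}\x{432}\x{43E}\x{441}\x{442}\x{438}",
    weather => "\x{41F}\x{43E}\x{433}\x{43E}\x{434}\x{430}",
);

# Record and report which terms occur in the decoded text.
my %found;
for my $name ( sort keys %terms ) {
    $found{$name} = index( $text, $terms{$name} ) >= 0 ? 1 : 0;
    print "$name: ", ( $found{$name} ? "found" : "not found" ), "\n";
}
```

Comparing decoded characters against decoded characters is the point of the sketch: matching a character string against undecoded bytes would fail even when the word is present on the page.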

      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        html is only written in english as far as I know.

        • alnoir
          New Member
          • Apr 2007
          • 23

          #5
          I think I misdiagnosed the problem. I believe now that what I'm getting back from these web pages is raw php content, because the forums (english or foreign) I tried were all written in php.

          This is what I get back:


          It doesn't seem to be formatted or even recognizable code; however, that's what it is. Does this look familiar to anyone? Can perl interpret this so that information can be extracted?

          Thank you everyone.

          • KevinADC
            Recognized Expert Specialist
            • Jan 2007
            • 4092

            #6
            Looks like some kind of binary code. Could be flash or something similar. I have no idea if perl can translate that into something useful.

            • numberwhun
              Recognized Expert Moderator Specialist
              • May 2007
              • 3467

              #7
              Originally posted by KevinADC
              Looks like some kind of binary code. Could be flash or something similar. I have no idea if perl can translate that into something useful.
              Agreed! There is no way to get the raw PHP code. One nice thing about PHP is that you cannot just "get" the raw code, because the server runs it and sends only the resulting HTML output to the user.

              I also agree that this looks more like binary output than anything else, and I don't think there is any way for Perl to help you with that. I think eWish had it right in his earlier post: because this page seems to be flash driven, you are getting binary data instead of the html. Why don't you pick a site that is not flash based and try to grab it?

              Regards,

              Jeff

              • alnoir
                New Member
                • Apr 2007
                • 23

                #8
                With help from a friend, I finally got the script to work. Instead of WWW::Mechanize, the module I had been using, I tried LWP and it worked great. Thank you to everyone who took the time to help me with the problems I was encountering. I appreciate the help.
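
The thread does not include the final script or say why the switch helped. One plausible explanation, which is an assumption and not something confirmed here, is that the server sent a gzip-compressed body. On an `HTTP::Response` object, `content` returns the body bytes as received, while `decoded_content` undoes any `Content-Encoding` and the charset encoding. The sketch below builds such a response locally to show the difference; the sample HTML is made up and no network access is involved.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical stand-in for a page body; not taken from the thread.
my $html = "<html><body><p>Hello</p></body></html>";

# Compress it the way a server may when a client accepts gzip.
gzip( \$html => \my $gzipped ) or die "gzip failed: $GzipError";

# Build the response by hand so the example runs offline.
my $response = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'     => 'text/html; charset=UTF-8',
        'Content-Encoding' => 'gzip',
    ],
    $gzipped,
);

my $raw     = $response->content;          # bytes as sent, still compressed
my $decoded = $response->decoded_content;  # decompressed and charset-decoded

print "raw body equals original HTML:     ",
    ( $raw eq $html ? "yes" : "no" ), "\n";     # prints "no"
print "decoded body equals original HTML: ",
    ( $decoded eq $html ? "yes" : "no" ), "\n"; # prints "yes"
```

If that was indeed the cause, the original script could likely have been kept by printing `$page->decoded_content` instead of `$page->content`, or by using `$mech->content`, which the WWW::Mechanize documentation describes as normally equivalent to the decoded content of the last response.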
