how to webcrawl disconnected components

  • anklos
    New Member
    • Sep 2008
    • 30

    how to webcrawl disconnected components

    Hi~!

    I am crawling the web starting from a few URLs (taken from a closed set of HTML pages), but I can't manage to reach the disconnected components. Could anyone give me some advice? Thanks for any suggestions!
  • KevinADC
    Recognized Expert Specialist
    • Jan 2007
    • 4092

    #2
    No idea what a disconnected component is.


    • anklos
      New Member
      • Sep 2008
      • 30

      #3
      Most web pages link to other pages and are themselves linked from other pages. But there exist some pages that contain no links at all and that no other page links to. So it's hard to find them within a given closed set of HTML pages.


      • KevinADC
        Recognized Expert Specialist
        • Jan 2007
        • 4092

        #4
        Can you give an example?


        • anklos
          New Member
          • Sep 2008
          • 30

          #5
          For instance, take http://xxxx//00.html through http://xxxx//99.html, 100 HTML pages in total. Links connect most of them, but no page links to http://xxxx//45.html or http://xxxx//78.html, and neither of those two contains any outgoing links. So we can start crawling from http://xxxx//00.html and keep crawling, but we can never reach those two disconnected pages.


          • numberwhun
            Recognized Expert Moderator Specialist
            • May 2007
            • 3467

            #6
            If I follow what you are saying correctly, the only way to get to these pages is to type in their address directly, because there are no links to them.

            If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?

            Regards,

            Jeff
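
To illustrate the point above, here is a minimal sketch in Python (not Perl, and not code from this thread) of a link-following crawl over a toy in-memory link graph. The page names and the `links` dictionary are hypothetical stand-ins for a real site; the point is only that a breadth-first crawl from the seed never sees pages with no inbound links.

```python
from collections import deque

def reachable(links, seeds):
    """Return the set of pages a crawler reaches from the seeds by following links."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical toy site: 00 -> 01 -> 02 -> 00, while 45 and 78
# have no links in or out (the "disconnected" pages).
links = {
    "00.html": ["01.html"],
    "01.html": ["02.html"],
    "02.html": ["00.html"],
    "45.html": [],
    "78.html": [],
}

found = reachable(links, ["00.html"])
missed = sorted(set(links) - found)
print(missed)  # the pages the crawl never discovers
```

The crawl only learns that `45.html` and `78.html` were missed because the full page list is known here in advance; a real crawler has no such list unless it is given one.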


            • anklos
              New Member
              • Sep 2008
              • 30

              #7
              Yeah, it puzzles me. If my program is going to request addresses directly, it needs some hints first. I mean, there are up to a billion possible pages (000000000.html through 999999999.html, though not all of them exist) in my closed set, and testing whether every address exists is far too time-consuming. If you picture the web as a bow-tie structure, what I have to find are those tendrils.
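
A rough back-of-the-envelope estimate of that cost, sketched in Python. The request rate of 100 per second is purely an assumption for illustration and does not come from this thread; the candidate count follows from the nine-digit file names.

```python
candidates = 10 ** 9         # 000000000.html through 999999999.html
requests_per_second = 100    # assumed rate, not a measured figure

seconds = candidates / requests_per_second
days = seconds / 86_400      # seconds in a day

print(f"{days:.1f} days")
```

At that assumed rate, a full sweep takes on the order of months, which is why a hint that narrows the candidate set matters.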


              • KevinADC
                Recognized Expert Specialist
                • Jan 2007
                • 4092

                #8
                Personally, I don't see how your problem is even related to Perl. Perl certainly can't find links to web pages when no such links exist. Your problem is beyond the scope of Perl.


                • anklos
                  New Member
                  • Sep 2008
                  • 30

                  #9

                  Ummm... since I'm new to using Perl to solve this problem, I'm unsure about its scope.


                  • Icecrack
                    Recognized Expert New Member
                    • Sep 2008
                    • 174

                    #10
                    Your best bet is to generate random names for the HTML files and request each one, checking for a 404 or any other error. If no error comes back, save the URL for review. Note: this method will take a lot of processing power.

                    *But it will solve your problem.*
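
One way the probing idea above might look, as a hedged sketch in Python rather than Perl. The function names `http_status` and `probe`, the `already_crawled` set, and the fake status function in the demo are all hypothetical; the sketch treats only HTTP 200 as "no error" and skips URLs the link-following crawl has already visited. The demo at the bottom runs offline against a stand-in for the server.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Send a HEAD request; return the HTTP status code, or None on a network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g. 404 for a page that does not exist
    except (urllib.error.URLError, OSError):
        return None              # DNS failure, timeout, refused connection, ...

def probe(candidates, already_crawled, status_of=http_status):
    """Return the not-yet-crawled candidates whose request comes back with status 200."""
    hits = []
    for url in candidates:
        if url in already_crawled:
            continue             # the normal crawl already found this page
        if status_of(url) == 200:
            hits.append(url)     # save for review
    return hits

# Offline demo with a fake server: only 45.html and 78.html "exist".
existing = {"http://xxxx//45.html", "http://xxxx//78.html"}

def fake_status(url):
    return 200 if url in existing else 404

candidates = [f"http://xxxx//{n:02d}.html" for n in range(100)]
crawled = {"http://xxxx//00.html"}
hits = probe(candidates, crawled, status_of=fake_status)
print(hits)
```

Passing the status function as a parameter keeps the probing loop testable without network access; in real use the default `http_status` would be called once per candidate, so the cost still grows with the size of the candidate set.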
