how to webcrawl disconnected components

**KevinADC** · Sep 9 '08, 05:57 PM

No idea what a disconnected component is.

**anklos** · Sep 10 '08, 02:56 AM

Most web pages link to other pages and are linked to by other pages in turn. But some pages contain no links at all and have no links pointing to them either, so they are hard to find within a given closed set of HTML pages.

**KevinADC** · Sep 10 '08, 05:43 AM

Can you give an example?

**anklos** · Sep 10 '08, 11:57 AM

For instance, take http://xxxx//00.html through http://xxxx//99.html, 100 pages in total, most of which are connected by links. But nothing links to http://xxxx//45.html or http://xxxx//78.html, and neither of them contains any outgoing links. So we can start crawling from http://xxxx//00.html and keep crawling, but we can never reach those two disconnected pages.

**numberwhun** · Sep 10 '08, 12:58 PM

If I follow you correctly, the only way to get to these pages is to type in their addresses directly, because there are no links to them.

If there are no links and you are navigating by crawling, then there is no way for your script to know those pages are there, unless they have a link on another page. That's just the way it is. Unless you specifically tell your script that it needs to go to those pages outside of the links it is following, how will it ever know they are there?
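The point above can be illustrated with a small sketch. It is written in Python for brevity and uses a hypothetical in-memory link graph (the `links` dictionary and its page names are made up for illustration, not taken from a real site). A crawler that only follows links visits exactly the pages reachable from its starting page, so anything outside that set stays invisible to it.

```python
from collections import deque

def reachable(links, start):
    """Return the set of pages reachable from `start` by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

# Hypothetical link graph: page name -> pages it links to.
links = {
    "00.html": ["01.html", "02.html"],
    "01.html": ["00.html"],
    "02.html": ["01.html"],
    "45.html": [],  # no links in, no links out
    "78.html": [],  # no links in, no links out
}

found = reachable(links, "00.html")
missed = sorted(set(links) - found)
print(missed)  # ['45.html', '78.html']
```

In this toy example we can list the missed pages only because the full graph is known up front. A real crawler never sees the entries for `45.html` and `78.html` at all.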

Regards,

Jeff

**anklos** · Sep 10 '08, 03:02 PM

Yeah, it puzzles me. If my program is going to request addresses directly, it needs some hints first. I mean, there are millions of HTML pages (000000000.html through 999999999.html, though not all of them exist) in my given closed set, so testing every address for existence is far too time-consuming. If you picture the web as a bow-tie structure, what I have to find are the tendrils.

**KevinADC** · Sep 10 '08, 03:55 PM

Personally, I don't see how your problem is even related to Perl. Perl certainly can't find links to web pages that don't exist. Your problem is beyond the scope of Perl.

**anklos** · Sep 11 '08, 02:53 AM

Ummm... since I'm new to using Perl for this kind of problem, I'm unsure about the scope of Perl.

**Icecrack** · Sep 15 '08, 06:18 AM

Your best bet is to generate candidate names for the HTML files at random and request each one, checking for a 404 or any other error. If no error comes back, save the address for review. Note: this method will take a lot of processing power.

*But it will solve your problem.*
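A rough sketch of that probing idea follows, in Python rather than Perl. It assumes the pages follow the zero-padded numeric naming from the example earlier in the thread. The names `http_status`, `find_unlinked`, and the `example.invalid` base URL are hypothetical, introduced only for illustration. It walks a range of numbers in order rather than picking names at random, skips addresses an ordinary crawl already found, and keeps those that answer with status 200.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=10):
    """Send a HEAD request; return the HTTP status code, or None on network failure."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a page that does not exist
    except (urllib.error.URLError, OSError):
        return None

def find_unlinked(base_url, numbers, width, already_crawled, status_of=http_status):
    """Probe candidate addresses and return those that exist but were never crawled."""
    hits = []
    for n in numbers:
        url = f"{base_url}/{n:0{width}d}.html"
        if url in already_crawled:
            continue  # the link-following crawl already reached this page
        if status_of(url) == 200:
            hits.append(url)
    return hits

# Offline demonstration with a stand-in for the network call.
existing = {"http://example.invalid/45.html", "http://example.invalid/78.html"}

def fake_status(url):
    return 200 if url in existing else 404

crawled = {f"http://example.invalid/{n:02d}.html" for n in range(100)} - existing
print(find_unlinked("http://example.invalid", range(100), 2, crawled, status_of=fake_status))
# ['http://example.invalid/45.html', 'http://example.invalid/78.html']
```

Against a real server you would leave `status_of` at its default. The cost grows with the size of the name space, so a nine-digit range means up to a billion requests, which is the processing cost mentioned above.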
