screen scraping

  • clint@pidlubny.com

    screen scraping

    I'm writing an application that scrapes the source of client web sites
    and looks for links on the pages. I am using the file_get_contents()
    function to grab the source, but I don't know how to handle web sites
    that may be down or unavailable. I know file_get_contents() returns
    FALSE on failure, but the error message still prints to the screen. How
    do I avoid that?

    Here is a snippet of my code:

    function urlcheck($url, $sitelink) {
        // grab code from web site
        if (file_get_contents($url)) {
            $html = file_get_contents($url);
            // REGEX to pull the link code out of the array
            $relink = "/<a.+?href=[\"\'](.*?)[\"\'].+?\>/i";

            // Put the matching link code into an array called links
            preg_match_all($relink, $html, $links);

            // loop through links on the page and look for a match
            for ($i = 0; $i < count($links[0]); $i++) {
                if (strpos($links[1][$i], $sitelink) != false ||
                    strpos($links[1][$i], $sitelink) === 0) {
                    return $links[0][$i];
                    break;
                }
            }
        }
        else {
            print "Doesn't exist";
        }
    }

    Thanks
    Clint Pidlubny

  • DJ Majestik

    #2
    Re: screen scraping

    How about regexing for particular error codes? You can look for
    specific ones such as 404 and 500, which should be the usual ones you
    would get. If you see one of them in the returned code, you know you
    have an error.

    HTH

    JJ
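
A rough sketch of the status-code idea above, offered as one possible approach rather than a definitive answer. Instead of searching the page body for "404" or "500", it reads the status line from the response headers: PHP's HTTP wrapper fills the local variable $http_response_header after a request, and the `ignore_errors` context option makes file_get_contents() return the body even for 4xx/5xx responses. The helper names status_from_headers() and fetch_with_status() and the 10-second timeout are assumptions made for illustration.

```php
<?php
// Hypothetical helper: pull the final HTTP status code out of header lines.
function status_from_headers(array $headers): ?int {
    $status = null;
    foreach ($headers as $line) {
        // Redirects produce several status lines; keep the last one seen.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch a page and return [status code, body].
// Either element is null when it could not be determined.
function fetch_with_status(string $url): array {
    $context = stream_context_create([
        'http' => [
            'ignore_errors' => true, // return the body even on 4xx/5xx
            'timeout'       => 10,   // seconds; an arbitrary choice
        ],
    ]);
    // @ hides the warning raised when the host cannot be reached at all.
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper sets $http_response_header in this scope on a response.
    $status = isset($http_response_header)
        ? status_from_headers($http_response_header)
        : null;
    return [$status, $body === false ? null : $body];
}
```

A caller could then write `[$status, $html] = fetch_with_status($url);` and treat a null status, or any status of 400 and above, as "site unavailable" before running the link search.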


    • Clintster

      #3
      Re: screen scraping

      I discovered that if I store the result of the URL check in a
      variable, with the @ operator in front of the call, and then test
      that variable, the error is not output.

      i.e.

      function urlcheck($url, $sitelink) {
          // Grab code from web site; @ suppresses the warning if the site is down
          $urlup = @file($url);
          if ($urlup) {
              // file() returns an array of lines, so join them instead of
              // fetching the page a second time
              $html = implode('', $urlup);
              // REGEX to pull the link code out of the page
              $relink = "/<a.+?href=[\"\'](.*?)[\"\'].+?\>/i";

              // Put the matching link code into an array called links
              preg_match_all($relink, $html, $links);

              // Loop through links on the page and look for a match
              for ($i = 0; $i < count($links[0]); $i++) {
                  // strpos() returns 0 for a match at the start of the string,
                  // so compare strictly against false
                  if (strpos($links[1][$i], $sitelink) !== false) {
                      return $links[0][$i];
                  }
              }
          }
          else {
              print "Doesn't exist";
          }
      }
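
As a possible refinement of urlcheck() above, fetching and matching can be split into two small functions so that each can be checked on its own. This is a minimal sketch under stated assumptions: the names fetch_html() and find_link() are hypothetical, and the regular expression is a simplified pattern swapped in for illustration, not the one from the post.

```php
<?php
// Hypothetical helper: return the page source, or null if it cannot be read.
function fetch_html(string $url): ?string {
    // @ suppresses the warning; the strict comparison separates a failure
    // (false) from a page that is merely empty ("").
    $html = @file_get_contents($url);
    return $html === false ? null : $html;
}

// Hypothetical helper: return the first <a> tag whose href contains $sitelink,
// or null when no link matches.
function find_link(string $html, string $sitelink): ?string {
    // Simplified pattern (an assumption, not the original regex):
    // capture the quoted href value of each opening anchor tag.
    preg_match_all('/<a\s[^>]*href=["\']([^"\']*)["\'][^>]*>/i', $html, $links);
    foreach ($links[1] as $i => $href) {
        // strpos() returns 0 for a match at the start of the string,
        // so only a strict comparison with false means "not found".
        if (strpos($href, $sitelink) !== false) {
            return $links[0][$i];
        }
    }
    return null;
}
```

With these, the check becomes `$html = fetch_html($url);` followed by `find_link($html, $sitelink)` only when `$html` is not null, and the caller decides what to print when the site is unavailable.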
