Parse text from HTML website, dump into DB

This topic is closed.
  • IceOnFire

    Parse text from HTML website, dump into DB

    I am working on a script to extract statistics (which are updated daily) from
    a website and insert them into a MySQL database. I want to take this
    website:
    http://www.usatoday.com/sports/baske...layers0304.htm
    and strip off all the HTML tags etc., so that it looks like
    http://www.enlhoops.com/ratings/parsed.txt
    and then insert each player's stat line into the database.

    I have begun writing the script (getting the file, stripping the HTML tags
    off), but that doesn't seem to work too well. If anyone can help me get
    started, or suggest a function or anything else, that would be helpful. Thanks.

    IceOnFire


  • Michael Vilain

    #2
    Re: Parse text from HTML website, dump into DB

    In article <105cvgrjkbbh8e6@corp.supernews.com>,
    "IceOnFire" <af@iceonfire.net> wrote:

    > I am working on a script to extract statistics (which are updated daily) from
    > a website and insert them into a MySQL database. I want to take this
    > website:
    > http://www.usatoday.com/sports/baske...layers0304.htm
    > and strip off all the HTML tags etc., so that it looks like
    > http://www.enlhoops.com/ratings/parsed.txt
    > and then insert each player's stat line into the database.
    >
    > I have begun writing the script (getting the file, stripping the HTML tags
    > off), but that doesn't seem to work too well. If anyone can help me get
    > started, or suggest a function or anything else, that would be helpful. Thanks.
    >
    > IceOnFire

    Use Perl. It's better suited to this sort of thing and can run
    independently from the command line.

    CPAN modules let you extend Perl to access sites as if you were a
    browser, including accepting cookies.

    --
    DeeDee, don't press that button! DeeDee! NO! Dee...




    • Andy Jeffries

      #3
      Re: Parse text from HTML website, dump into DB

      IceOnFire wrote:
      > I am working on a script to extract statistics (which are updated daily) from
      > a website and insert them into a MySQL database. I want to take this
      > website:
      > http://www.usatoday.com/sports/baske...layers0304.htm
      > and strip off all the HTML tags etc., so that it looks like
      > http://www.enlhoops.com/ratings/parsed.txt
      > and then insert each player's stat line into the database.
      >
      > I have begun writing the script (getting the file, stripping the HTML tags
      > off), but that doesn't seem to work too well. If anyone can help me get
      > started, or suggest a function or anything else, that would be helpful. Thanks.

      Here is some example code I wrote to do a very similar thing for the
      BBC's Fantasy Football system (so I can view them on my Nokia 3650
      phone). It's not perfect (in fact it's quite dirty) but it does the
      trick and it may help get you started:




      <?php
      // Assumption: the team PIN and display name arrive as query parameters
      // (they are never set in the code as posted).
      $id   = isset($_GET['id']) ? urlencode($_GET['id']) : '';
      $name = isset($_GET['name']) ? htmlspecialchars($_GET['name']) : '';

      print '<?xml version="1.0"?>';
      ?>
      <!DOCTYPE html PUBLIC "-//WAPFORUM//DTD XHTML Mobile 1.0//EN"
        "http://www.wapforum.org/DTD/xhtml-mobile10.dtd">

      <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>BBC Fantasy Football</title>
        <style type="text/css">
          p, body, td, th { font-family: Arial, Helvetica, Sans-Serif; font-size: medium; }
          th { background-color: #efefce; }
          td { background-color: #ffffde; }
          h1 { font-size: large; }
        </style>
      </head>

      <body>
      <p align='center'><img src="bbcsport_logo.gif" alt="BBC Sport" /></p>
      <h1>Team for <?=$name?></h1>
      <div align='center'>

      <?php

      $page = file_get_contents("http://bbcfootball.fantasyleague.co.uk/team/teamscreen.asp?pin=$id");

      // Join the page into one line so "." in the patterns below can span rows.
      $page = str_replace("\n", "", $page);

      if (preg_match("/CURRENT FIRST 11(.*?)<\/table>/m", $page, $matches)) {
          print "<table><tr><th>Player</th><th width='20'>P</th><th width='30'>C</th>"
              . "<th width='20'>W</th><th width='20'>M</th></tr>";
          $table = $matches[1];

          // Pull out each table row, then pick the cells apart one row at a time.
          preg_match_all("/(<tr>.*?<\/tr>)/", $table, $matches);
          for ($n = 0; $n < count($matches[1]); $n++) {
              if (preg_match("/^.*?<td.*?\/td><td.*?>(\d+).*<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td>.*?squad_(\S).gif.*?<td.*>(\S+)<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td><td.*>(\S+)<\/td>/",
                  $matches[1][$n], $player)) {

                  // Map the squad icon letter to a position code.
                  $pos = '';
                  switch ($player[4]) {
                      case "g": $pos = 'GK'; break;
                      case "f": $pos = 'FB'; break;
                      case "c": $pos = 'CB'; break;
                      case "m": $pos = 'MF'; break;
                      case "s": $pos = 'SK'; break;
                  }

                  $club = str_replace("&nbsp;", "", $player[5]);

                  print "<tr><td align='left'>$player[2]$player[3]</td>"
                      . "<td align='center'>$pos</td><td align='center'>$club</td>"
                      . "<td align='center'>$player[7]</td><td align='center'>$player[8]</td></tr>";
              }
          }
          print "</table>";
      }
      else {
          print "<p><b>Currently updating...</b></p>";
      }

      ?>

      </div>

      </body>
      </html>
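
      For the stats page in the original question, the same idea can be
      reduced to "split the table into rows, split the rows into cells,
      strip whatever tags are left". The following is only a rough sketch:
      the function name table_rows_to_text() and the sample markup are made
      up for illustration, and the real page's layout may need a tighter
      pattern.

```php
<?php
// Hypothetical helper: turn an HTML table into rows of plain-text cells.
function table_rows_to_text($html)
{
    $rows = array();
    // "s" lets "." span newlines, "i" ignores tag case.
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/si', $html, $trs);
    foreach ($trs[1] as $tr) {
        // Match both header (th) and data (td) cells inside the row.
        preg_match_all('/<t[dh]\b[^>]*>(.*?)<\/t[dh]>/si', $tr, $cells);
        $row = array();
        foreach ($cells[1] as $cell) {
            // Drop nested tags, then decode entities such as &nbsp;.
            $text = html_entity_decode(strip_tags($cell), ENT_QUOTES, 'UTF-8');
            // A decoded &nbsp; is a UTF-8 non-breaking space; make it a plain space.
            $row[] = trim(str_replace("\xC2\xA0", ' ', $text));
        }
        if (count($row) > 0) {
            $rows[] = $row;
        }
    }
    return $rows;
}

// Made-up sample markup, not the real stats page.
$sample = '<table><tr><th>Player</th><th>PTS</th></tr>'
        . '<tr><td><b>Smith,&nbsp;J</b></td><td align="right"> 21.4 </td></tr></table>';

// Print one tab-separated line per table row.
foreach (table_rows_to_text($sample) as $row) {
    echo implode("\t", $row), "\n";
}
```

      Each inner array is one stat line, so the tab-separated output can be
      written to a text file or passed on to whatever does the database
      insert.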




      Best of luck,


      Andy
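
      For the "dump into DB" half of the question, once each stat line is
      an array of values, a prepared statement keeps the inserts simple.
      This is a sketch under assumptions: the table player_stats, its
      columns player and points, and the function name store_rows() are
      all hypothetical, and it uses PDO rather than anything shown in this
      thread.

```php
<?php
// Hypothetical helper: insert parsed stat rows with a prepared statement.
// Assumes each row is array(player name, points) and that a table
// player_stats(player, points) already exists; both are illustrative.
function store_rows(PDO $db, array $rows)
{
    $stmt = $db->prepare('INSERT INTO player_stats (player, points) VALUES (?, ?)');

    // Wrap the batch in a transaction so a failed row undoes the whole run
    // (on storage engines that support transactions).
    $db->beginTransaction();
    try {
        foreach ($rows as $row) {
            $stmt->execute(array($row[0], (float) $row[1]));
        }
        $db->commit();
    } catch (Exception $e) {
        $db->rollBack();
        throw $e;
    }
    return count($rows);
}
```

      It would be called with a connection such as
      new PDO('mysql:host=localhost;dbname=stats', $user, $pass), where the
      DSN and credentials are placeholders. Setting PDO::ATTR_ERRMODE to
      PDO::ERRMODE_EXCEPTION on that connection makes a failed insert throw
      instead of failing silently on older PHP versions.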
