Creating an HTML parser using PHP

  • hello2008
    New Member
    • Dec 2007
    • 14

    Hi,

    I am new to PHP. I need to write a PHP program that parses HTML files, reads the values from certain form-fields and inserts them as records into the database.

    The latter part is easy, but I have no clue about making an HTML parser using PHP. Can anyone help me out here?

    Thanks in advance!
  • karlectomy
    New Member
    • Sep 2007
    • 64

    #2
    Are you parsing HTML files that are already created?

    You might want to check this out:
    http://us3.php.net/manual/en/function.file.php
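As a rough sketch of how that could be used: file() reads a whole file into an array with one element per line. The file name and HTML content below are made up for illustration, and the script writes its own sample file so it can run on its own.

```php
<?php
// Hypothetical sample file, written to the temp directory so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents($path, "<html>\n<head>\n<title>Sample page</title>\n</head>\n</html>\n");

// file() returns the file as an array of lines; the flag drops the newline at the end of each.
$lines = file($path, FILE_IGNORE_NEW_LINES);

// Print each line with its zero-based index.
foreach ($lines as $number => $line) {
    echo $number . ': ' . $line . "\n";
}

unlink($path); // remove the sample file again
```

Each element of $lines can then be tested against whatever pattern you are looking for.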

    • hello2008
      New Member
      • Dec 2007
      • 14

      #3
      Originally posted by karlectomy
      Are you parsing HTML files that are already created?

      You might want to check this out:
      http://us3.php.net/manual/en/function.file.php
      Hi karlectomy,

      Thanks for replying. Yes, I am parsing HTML files that are already created. The files are generated from PDFs, and what I need to do is read each HTML file, extract the contents of all its elements, build a query from those extracted values and insert the records into the database. I am using regular expressions to match the HTML tags.

      So far my code is as follows:

      [PHP]
      <?php
      // Defaults reported when a tag is not found.
      $page_title  = "n/a";
      $page_author = "n/a";
      $meta_descr  = "n/a";
      $meta_keywd  = "n/a";

      if ($handle = @fopen("temp.html", "r")) {
          // Read in 1 KB chunks and stop once the closing </head> tag has been read.
          $content = "";
          while (!feof($handle)) {
              $content .= fread($handle, 1024);
              // Search the accumulated content so a tag split across two chunks is still found.
              if (stripos($content, "</head>") !== false) break;
          }
          fclose($handle);

          $lines = preg_split("/\r?\n|\r/", $content); // split the content into lines
          $is_title  = false;
          $is_author = false;
          $is_descr  = false;
          $is_keywd  = false;

          // preg_match() with the "i" modifier replaces eregi(), which was removed in PHP 7.
          foreach ($lines as $val) {
              if (preg_match('#<title>(.*)</title>#i', $val, $title)) {
                  $page_title = $title[1];
                  echo 'page_title: ' . $page_title . "\n";
                  $is_title = true;
              }
              if (preg_match('#<meta name="author" content="(.*)"(\s?/)?>#i', $val, $author)) {
                  $page_author = $author[1];
                  echo 'page_author: ' . $page_author . "\n";
                  $is_author = true;
              }
              if (preg_match('#<meta name="description" content="(.*)"(\s?/)?>#i', $val, $descr)) {
                  $meta_descr = $descr[1];
                  echo 'meta_descr: ' . $meta_descr . "\n";
                  $is_descr = true;
              }
              if (preg_match('#<meta name="keywords" content="(.*)"(\s?/)?>#i', $val, $keywd)) {
                  $meta_keywd = $keywd[1];
                  echo 'meta_keywd: ' . $meta_keywd . "\n";
                  $is_keywd = true;
              }
              // Stop scanning once all four values have been found.
              if ($is_title && $is_author && $is_descr && $is_keywd) break;
          }
      }
      ?>

      [/PHP]

      But this only parses the <HEAD></HEAD> section. Parsing the <BODY></BODY> section is the real challenge, and that is where I need help.

      Thanks,
      sasha

      • karlectomy
        New Member
        • Sep 2007
        • 64

        #4
        I think you're on the right track. I am also currently working on parsing large volumes of text. You have the right idea. Go with it. The conditional logic can get cluttered, so try to keep it simple; otherwise it will be difficult to debug in the future.

        It looks like you know what you're doing.
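
For the <body> part, one way to keep the logic simple is to let PHP's bundled DOM extension find the form fields instead of matching them line by line with regular expressions. This is only a sketch of that alternative, not the approach used in the code above: the sample HTML, the field names and the $fields array are all made up, and real PDF conversions may need extra handling.

```php
<?php
// Made-up HTML standing in for one of the converted files.
$html = '<html><head><title>Sample</title></head><body>'
      . '<form>'
      . '<input type="text" name="first_name" value="Ada">'
      . '<input type="text" name="city" value="London">'
      . '<textarea name="notes">Converted from PDF</textarea>'
      . '</form>'
      . '</body></html>';

$doc = new DOMDocument();
// Converted HTML is often not valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array(); // field name => field value

// <input> elements keep their value in the "value" attribute.
foreach ($doc->getElementsByTagName('input') as $input) {
    $name = $input->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $input->getAttribute('value');
    }
}

// <textarea> elements keep their value as text content.
foreach ($doc->getElementsByTagName('textarea') as $area) {
    $name = $area->getAttribute('name');
    if ($name !== '') {
        $fields[$name] = $area->textContent;
    }
}

print_r($fields);
```

The resulting array maps each field name to its value, which could then be used to build the insert query.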

        • hello2008
          New Member
          • Dec 2007
          • 14

          #5
          Thanks :)
          I am not that well versed in regular expressions, so it's taking more time than I expected. I am glad a lot of online help is available.
