How to filter the words in HTML ?

  • ChianHsieh@gmail.com

    How to filter the words in HTML ?

    Hi,

    I have a problem: I want to filter out all the words in an HTML document.

    Example:

    Before Filter:
    <div id="pp">hello man <br/>Thank's for your answer. </div>

    After Filter:
    <div id="pp"><br/></div>

    What I want is to keep all the HTML tags but remove the words.
    Are there any good packages, classes, or suggestions? Thank you very
    much.

  • Pedro Graca

    #2
    Re: How to filter the words in HTML ?

    ChianHsieh@gmail.com wrote:
    > Example:
    >
    > Before Filter:
    > <div id="pp">hello man <br/>Thank's for your answer. </div>
    >
    > After Filter:
    > <div id="pp"><br/></div>

    I love good challenges :)


    <?php
    // Keep only the markup in $x: copy everything between '<' and '>' and
    // drop the text in between. $sep is inserted before every tag that does
    // not start at offset 0.
    function get_html($x, $sep=' ') {
        $inbrackets = false; // currently inside <...>
        $inquotes = false;   // currently inside a "..." attribute value
        $html = '';
        $l = strlen($x);
        for ($i = 0; $i < $l; ++$i) {
            $y = substr($x, $i, 1);
            // A double quote inside a tag opens or closes an attribute value.
            if (($inbrackets) && ($y == '"')) {
                $inquotes = !$inquotes;
            }
            // An unquoted '<' starts a tag.
            if ((!$inquotes) && ($y == '<')) {
                if ($i > 0) {
                    $html .= $sep;
                }
                $inbrackets = true;
            }
            if ($inbrackets) {
                $html .= $y;
            }
            // An unquoted '>' ends the tag.
            if ((!$inquotes) && ($y == '>')) {
                $inbrackets = false;
            }
        }
        return $html;
    }

    $data = <<<HTML
    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
    <head><title>example</title></head>
    <body>
    <div id="pp">
    <a name="funky><name"></a>
    <!-- *VALID*^^*HTML* -->
    hello man<br/>
    Thank's for your answer.
    </div>
    </body>
    </html>
    HTML;

    $html = get_html($data, "\n");
    echo $html;
    ?>

    Or you could try a regular expression (but I'm not sure you could do one
    that accepts all valid HTML).
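
For reference, a minimal sketch of what such a regular expression might look like. The function name get_html_regex and the pattern are illustrative assumptions, not something posted in this thread. The pattern treats a tag as '<', then any mix of quoted strings or characters other than quotes and '>', then '>'. It copes with '>' inside attribute values, but it is not a full HTML parser: comments that contain quotes or '>', and script or CDATA content, would likely trip it up.

```php
<?php
// Hypothetical helper (not from the thread): keep the tags, drop the text.
function get_html_regex(string $x, string $sep = ''): string {
    // '<', then double-quoted strings, single-quoted strings, or any
    // character that is neither a quote nor '>', then the closing '>'.
    $pattern = '/<(?:"[^"]*"|\'[^\']*\'|[^\'">])*>/';
    preg_match_all($pattern, $x, $matches);
    // $matches[0] holds every full match, in document order.
    return implode($sep, $matches[0]);
}

echo get_html_regex('<div id="pp">hello man <br/>Thank\'s for your answer. </div>');
// prints: <div id="pp"><br/></div>
```

For this one input the result is the same as the "After Filter" line in the question. Passing "\n" as the second argument puts each tag on its own line.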

    --
    File not found: (R)esume, (R)etry, (R)erun, (R)eturn, (R)eboot


    • p.lepin@ctncorp.com

      #3
      Re: How to filter the words in HTML ?


      ChianHsieh@gmail.com wrote:
      > I have a problem: I want to filter out all the words
      > in an HTML document.
      >
      > Before Filter:
      > <div id="pp">hello man <br/>Thank's for your answer.
      > </div>
      >
      > After Filter:
      > <div id="pp"><br/></div>

      Forget regexes. As the saying goes, 'You cannot parse HTML
      with regexes'. There's also no reason to write your own
      HTML parser -- there already are more than enough of those.

      XSLT was meant exactly for this type of processing, and it
      doesn't really care what you're processing, as long as it's
      a DOMDocument.

      Using PHP5's DOM and XSL modules:

      <?php
      // Sample input. loadHTML() tolerates the unclosed <p>.
      $xml_str =
          '<div id="pp"><p>hello man <br/>Thank\'s for your ' .
          'answer. </div>';
      // Stylesheet: copy every node and attribute (identity template),
      // unwrap <html>, rename <body> to <result>, and drop all text nodes.
      $xsl_str =
          '<xsl:stylesheet ' .
          ' xmlns:xsl="http://www.w3.org/1999/XSL/Transform" ' .
          ' version="1.0">' .
          ' <xsl:template match="node()|@*">' .
          ' <xsl:copy>' .
          ' <xsl:apply-templates select="node()|@*"/>' .
          ' </xsl:copy>' .
          ' </xsl:template>' .
          ' <xsl:template match="html">' .
          ' <xsl:apply-templates/>' .
          ' </xsl:template>' .
          ' <xsl:template match="body">' .
          ' <result>' .
          ' <xsl:apply-templates/>' .
          ' </result>' .
          ' </xsl:template>' .
          ' <xsl:template match="text()"/>' .
          '</xsl:stylesheet>';

      // loadHTML() and loadXML() are called on instances here; static
      // calls to them fail on current PHP versions.
      $xml = new DOMDocument();
      $xml->loadHTML($xml_str);
      $xsl = new DOMDocument();
      $xsl->loadXML($xsl_str);
      $xform = new XSLTProcessor();
      $xform->importStylesheet($xsl);
      $result = $xform->transformToDoc($xml);
      header('Content-type: text/xml');
      print($result->saveXML());
      ?>

      If you're using real XHTML (as opposed to mumbo jumbo tag
      soup pretending to be XHTML), it's even better, because you
      don't have to pretend you're processing XML. XHTML *is*
      XML.
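
A sketch of that XHTML case, assuming the input is well-formed and PHP's DOM and XSL extensions are available. The sample document is made up for illustration. Because loadXML() keeps the XHTML namespace on every element, this version uses only the identity template and the empty text() template; a template such as match="html" would need a prefix bound to http://www.w3.org/1999/xhtml before it matched anything.

```php
<?php
// Assumed sample input: well-formed XHTML, so loadXML() can parse it as XML.
$xhtml_str =
    '<html xmlns="http://www.w3.org/1999/xhtml">' .
    '<head><title>example</title></head>' .
    '<body><div id="pp">hello man <br/>Thank\'s for your answer. </div></body>' .
    '</html>';

// Identity template plus an empty template for text nodes.
$xsl_str =
    '<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"' .
    ' version="1.0">' .
    ' <xsl:template match="node()|@*">' .
    '  <xsl:copy><xsl:apply-templates select="node()|@*"/></xsl:copy>' .
    ' </xsl:template>' .
    ' <xsl:template match="text()"/>' .
    '</xsl:stylesheet>';

$xml = new DOMDocument();
$xml->loadXML($xhtml_str);
$xsl = new DOMDocument();
$xsl->loadXML($xsl_str);

$xform = new XSLTProcessor();
$xform->importStylesheet($xsl);
$result = $xform->transformToDoc($xml);

// Serialize the transformed document; the markup survives, the words do not.
$out = $result->saveXML();
echo $out;
```

Note that this also empties <title>, since its content is a text node like any other.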

      --
      Pavel Lepin
