Help with a regular expression

  • YoBro

    Help with a regular expression

    Hi

    I have used some of this code from the PHP manual, but I am bloody hopeless
    with regular expressions.
    Was hoping somebody could offer a hand.

    The output of this will put the name of a form field beside name.
    I want to get the following but not sure how to modify the code below.
    1. Field Name (to appear beside NAME:)
    2. Field Type (to appear beside TYPE:)
    3. Field Value (to appear beside VALUE:)

    Make sense?
    It is part way there, just need some help finishing it.

    $filename = "form-eg.php"; // Open file to read HTML with Form code
    $fd = fopen ($filename, "rb");
    $contents = fread ($fd, filesize ($filename));
    preg_match_all ('/<input.*?name\\s*=\\s*"?([^\\s>"]*)/i', $contents,
    $matches); // get all input fields and attributes and values

    for ($i=0; $i< count($matches[0]); $i++) {
    echo "matched: ".$matches[0][$i]."<br />\n";
    echo "NAME: ".$matches[1][$i]."<br />\n";
    echo "TYPE: ".$matches[3][$i]."<br />\n";
    echo "VALUE: ".$matches[4][$i]."<br />\n\n";
    }

    fclose ($fd);

    I will also need to run another check for:
    <select
    <textarea

    But I can probably figure that out from what I already have.

    Thanks,

    YoBro


  • Pedro Graca

    #2
    Re: Help with a regular expression

    YoBro wrote:
    > I have used some of this code from the PHP manual, but I am bloody hopeless
    > with regular expressions.

    Although I've heard often enough that RXs are not the best tool for this
    job (try a HTML or XML parser) I do very well with them myself :)

    > Was hoping somebody could offer a hand.
    >
    > The output of this will put the name of a form field beside name.
    > I want to get the following but not sure how to modify the code below.
    > 1. Field Name (to appear beside NAME:)
    > 2. Field Type (to appear beside TYPE:)
    > 3. Field Value (to appear beside VALUE:)

    But I follow a different path than you.

    <?php
    // initialize result data
    $html_inputs = array();
    $html_index = 0;

    // get HTML
    $contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

    // get all "<input ... >"s -- usually I'd group them by <form>s too
    preg_match_all( '@(<input[^>]+>)@Ui', $contents, $inputs);

    // inside each "<input ... >" isolate the pairs "attr=value "
    foreach ($inputs[1] as $input) {
        // once for double quoted values
        preg_match_all( '@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);
        // save them
        foreach ($matches[0] as $k=>$dummy) {
            $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
        }

        // once for single quoted values
        preg_match_all( '@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);
        foreach ($matches[0] as $k=>$dummy) {
            $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
        }

        // and once again for unquoted values
        preg_match_all( '@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);
        foreach ($matches[0] as $k=>$dummy) {
            $html_inputs[$html_index][$matches[2][$k]] = $matches[3][$k];
        }

        // advance only after all three passes, so that every attribute
        // of one <input> is stored under the same index
        ++$html_index;
    }

    // done, deal with them any way I like
    echo '<pre>'; print_r($html_inputs); echo '</pre>';
    ?>
    --
    --= my mail box only accepts =--
    --= Content-Type: text/plain =--
    --= Size below 10001 bytes =--


    • John Dunlop

      #3
      Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

      Pedro Graca wrote:

      > Although I've heard often enough that RXs are not the best tool for this
      > job (try a HTML or XML parser) I do very well with them myself :)

      I believe the principal reason why pre-written parsers are suggested
      and recommended instead of impromptu regular expression "one-liners"
      is that the gurus who've written and developed the parsers are
      usually aware of and understand the rules; the "one-line" regex
      implementors, on the other hand -- with all due respect -- generally
      aren't and don't. I'm not going to pretend I understand everything
      SGML; I certainly don't; I'm far too young for starters.

      I'd like to pass a few comments, nevertheless, which might change
      your mind about regular expressions for parsing (X)HTML. They
      changed my mind, anyway. You'll understand though, hopefully, why I
      haven't offered any regular expression in place of yours (no, it's
      not because I couldn't be bothered :-)).

      (Trying to cope with shorthand markup when using regexes would be a
      nightmare. Unlike proper parsers, I'm going to act like a browser
      and ignore shorthand markup, for the time being, as it'd complicate
      matters even more.)

      > // get all "<input ... >"s -- usually I'd group them by <form>s too
      > preg_match_all( '@(<input[^>]+>)@Ui', $contents, $inputs);

      There's the standard mistake: the next occurrence of ">" does not
      necessarily mark the end of the tag. In HTML, a ">" can appear in
      *quoted* attribute values; it cannot appear in unquoted attribute
      values. This, for example, is a valid INPUT element (I make no
      claims to its logicality!)

      <INPUT title=">">
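To make the truncation concrete, here is a small sketch. The sample markup and the variable names are made up for illustration, and the quote-aware pattern is only one possible way of stepping over quoted strings. It still ignores comments, shorthand markup and other SGML constructs, so it is no substitute for a parser.

```php
<?php
// Hypothetical sample markup: a valid INPUT whose quoted value contains ">".
$html = '<INPUT title=">" name="q">';

// The naive pattern stops at the first ">" it meets, inside the title value.
preg_match('@<input[^>]+>@i', $html, $naive);

// A quote-aware sketch: consume whole quoted strings as single units,
// and otherwise accept any character that is neither a quote nor ">".
preg_match('@<input(?:"[^"]*"|\'[^\']*\'|[^\'">])*>@i', $html, $aware);

echo $naive[0], "\n";   // <INPUT title=">
echo $aware[0], "\n";   // <INPUT title=">" name="q">
```

The naive match loses the name attribute entirely, which is the kind of silent failure that is hard to notice on real pages.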

      Also, INPUTs have no required attributes (that is, "<INPUT>" is
      valid), but the "+" quantifier matches *one* or more of whatever came
      before. To over-simplistically match INPUTs, I'd substitute "*" for
      "+". Since you're only wanting to match those INPUTs with explicit
      type, name and value attributes, however, that's inconsequential.

      > // inside each "<input ... >" isolate the pairs "attr=value "
      > foreach ($inputs[1] as $input) {
      > // once for double quoted values
      > preg_match_all( '@(([^\s<>]+)\s*=\s*"([^"<>]+)")@', $input, $matches);

      An SGML name begins with a name start character and is followed by
      zero or more name characters. You'd match a name, for HTML4.01, with
      the pattern

      [a-zA-Z][a-zA-Z0-9._:-]*

      An attribute value may be of length zero, so, again, the quantifier
      "*" ought to be used. And inside quoted attribute values, both "<"
      and ">" can appear. Alvaro G Vicario has just pointed this out too,
      in an article in the thread "php sticky forms",

      <news:1qih21wt0xy4e$.1f5ehf0s1tf5a$.dlg@40tude.net>.
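A short sketch of what those two points mean in practice. The sample tag is hypothetical, and the second pattern is merely one way of applying the "*" quantifier and an SGML-style name, not a complete attribute parser.

```php
<?php
// Hypothetical tag with an empty value and a ">" inside a quoted value.
$tag = '<input name="q" value="" title="a > b">';

// "+" demands at least one character and bans "<" and ">" inside the value.
preg_match_all('@([^\s<>]+)\s*=\s*"([^"<>]+)"@', $tag, $strict);

// "*" accepts the empty string; only the closing quote ends the value.
preg_match_all('@([a-zA-Z][a-zA-Z0-9._:-]*)\s*=\s*"([^"]*)"@', $tag, $loose);

print_r($strict[1]);   // only "name" is found
print_r($loose[1]);    // "name", "value" and "title" are found
```

With the stricter pattern the empty value attribute and the title attribute simply disappear from the result, without any error.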

      > // once for single quoted values
      > preg_match_all( '@(([^\s<>]+)\s*=\s*\'([^\'<>]+)\')@', $input, $matches);

      Ditto.

      > // and once again for unquoted values
      > preg_match_all( '@(([^\s<>]+)\s*=\s*([^\s<>"\']+))@', $input, $matches);

      Unquoted attribute values may only contain name characters. In
      HTML4.01, the pattern

      [a-zA-Z0-9._:-]*

      matches name characters.
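One detail worth checking in any character class of this kind is where the hyphen sits. The sketch below, with made-up test strings, compares a class that has the hyphen last (a literal "-") against one that has it between "." and "_", which PCRE reads as a range.

```php
<?php
// Hyphen last: "-" is a literal member of the class.
$nameChars = '/^[a-zA-Z0-9._:-]*$/';

// Hyphen between "." and "_": a range from "." to "_", which also admits
// characters such as "/", "<", "=", ">" and "@", but not "-" itself.
$rangeByAccident = '/^[a-zA-Z0-9.-_:]*$/';

var_dump(preg_match($nameChars, 'text-field_1'));        // int(1)
var_dump(preg_match($nameChars, 'a>b'));                 // int(0)
var_dump(preg_match($rangeByAccident, 'a>b'));           // int(1)
var_dump(preg_match($rangeByAccident, 'text-field'));    // int(0)
```

The accidental range therefore both accepts characters that are not name characters and rejects the hyphen, which is one.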

      Phew!

      Refs.:




      --
      Jock


      • Pedro Graca

        #4
        Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

        John Dunlop wrote:
        > Pedro Graca wrote:
        >
        >> Although I've heard often enough that RXs are not the best tool for this
        >> job (try a HTML or XML parser) I do very well with them myself :)

        > I'd like to pass a few comments, nevertheless, which might change
        > your mind about regular expressions for parsing (X)HTML.

        Appreciate it.

        > They changed my mind, anyway.

        Changed my mind, too. Will take a little longer to change my scripts.
        But new scripts will not use regular expressions!

        > You'll understand though, hopefully, why I
        > haven't offered any regular expression in place of yours (no, it's
        > not because I couldn't be bothered :-)).

        Same reason I'm not changing them, I guess :-)

        > (Trying to cope with shorthand markup when using regexes would be a
        > nightmare. Unlike proper parsers, I'm going to act like a browser
        > and ignore shorthand markup, for the time being, as it'd complicate
        > matters even more.)

        Don't even mention that.

        (snip very good content)
        Thank you John. Thank you very much.
        --
        --= my mail box only accepts =--
        --= Content-Type: text/plain =--
        --= Size below 10001 bytes =--


        • YoBro

          #5
          Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

          Any idea where I could find some real-life working examples of doing it
          the SGML way? It is something I have never heard of before.

          The reference links appear to have no relevance to what I am trying to do.

          There is a PHP function, xml_parse; could this be used?
          The documentation is light on that topic.

          Thanks!

          • Pedro Graca

            #6
            Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

            I (Pedro Graca) wrote:
            > Changed my mind, too. Will take a little longer to change my scripts.
            > But new scripts will not use regular expressions!

            Ufffffff. This took longer than I expected.

            The XML parser included with PHP gives errors for many of the pages I
            tested (most of them were HTML pages, so it's understandable :).
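A minimal sketch of why that happens, assuming PHP's bundled XML extension (xml_parser_create and friends) is available. The HTML fragment is made up: it is legal HTML 4.01 but not well-formed XML, because the attribute values are unquoted and the input element is never closed.

```php
<?php
// Hypothetical fragment: fine as HTML 4.01, not well-formed as XML.
$html = '<form><input name=query size=25></form>';

$parser = xml_parser_create();
$ok = xml_parse($parser, $html, true);   // true: this is the final chunk

if (!$ok) {
    // Report what the XML parser objected to and where.
    echo 'XML error: ', xml_error_string(xml_get_error_code($parser)),
         ' at line ', xml_get_current_line_number($parser), "\n";
}
```

An XML parser stops at the first well-formedness error, so it is only suitable for pages that are known to be valid XHTML.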

            I found a parser for HTML I like @ http://php-html.sourceforge.net/

            #v+
            <?php
            include 'htmlparser.inc.php'; // Yes! I changed the name
                                          // also changed short php tag

            $contents = file_get_contents('http://www.faqs.org/rfcs/index.html');

            $parser = new HtmlParser($contents);
            while ($parser->parse()) {
                if (strtolower($parser->iNodeName) == 'input') {
                    #echo "\niNodeType: "; print_r($parser->iNodeType);
                    #echo "\niNodeName: "; print_r($parser->iNodeName);
                    #echo "\niNodeValue: "; print_r($parser->iNodeValue);
                    echo "\niNodeAttributes: "; print_r($parser->iNodeAttributes);
                }
            }

            echo "\n\nDone!\n";
            ?>
            #v-

            and the result of this script is:

            iNodeAttributes: Array
            (
                [name] => query
                [size] => 25
            )

            iNodeAttributes: Array
            (
                [type] => submit
                [value] => Search RFCs
            )

            iNodeAttributes: Array
            (
                [name] => display
                [size] => 9
            )

            iNodeAttributes: Array
            (
                [type] => submit
                [value] => Display RFC By Number
            )


            Done!
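For comparison, here is a sketch of the same kind of extraction with the DOM extension that ships with PHP 5 and later. DOMDocument is not something used elsewhere in this thread, so treat it as an alternative under that assumption. The sample form and the $fields array are made up for illustration; the loop also covers the select and textarea elements mentioned in the first post.

```php
<?php
// Hypothetical sample form; in practice $html could come from file_get_contents().
$html = '<form><INPUT type=text name=query title=">" value="">'
      . '<select name="lang"><option>en</option></select>'
      . '<textarea name="notes">hi</textarea></form>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect markup warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$fields = array();
foreach (array('input', 'select', 'textarea') as $tag) {
    foreach ($doc->getElementsByTagName($tag) as $node) {
        $fields[] = array(
            'tag'   => $tag,
            'name'  => $node->getAttribute('name'),
            'type'  => $node->getAttribute('type'),   // empty string when absent
            // a textarea keeps its value in its content, not in an attribute
            'value' => $tag === 'textarea' ? $node->textContent
                                           : $node->getAttribute('value'),
        );
    }
}

foreach ($fields as $f) {
    echo "NAME: {$f['name']}<br />\n";
    echo "TYPE: {$f['type']}<br />\n";
    echo "VALUE: {$f['value']}<br />\n\n";
}
```

The parser copes with the uppercase tag name, the unquoted values and the ">" inside the title value, none of which needs special handling in the calling code.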
            --
            --= my mail box only accepts =--
            --= Content-Type: text/plain =--
            --= Size below 10001 bytes =--


            • YoBro

              #7
              Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

              Hi,

              Thanks, that is very helpful.
              I have tried to download this file but my browser keeps crashing when I get
              there.

              If you have a copy, I don't suppose you could email it to me?
              ('htmlparser.inc.php')
              to: yobro@wazzup.co.nz.

              YoBro!

              • Pedro Graca

                #8
                Re: Parsing (X)HTML with regular expressions, or not (was: Help with a regular expression)

                YoBro top-posted:
                > I have tried to download this file but my browser keeps crashing when I get
                > there.
                >
                > I don't suppose if you have a copy you could email it to me?

                Try here first :)

                --
                --= my mail box only accepts =--
                --= Content-Type: text/plain =--
                --= Size below 10001 bytes =--
