Using preg_match_all to retrieve HTML data

  • darknavi
    New Member
    • Apr 2010
    • 3

    Hey guys, I am trying to make a forum signature generator and I don't have access to the databases, so I was trying to rip info from profile pages. This is the code I am trying to get:

    Code:
    <span title='0.31% of total forum posts'>133 (2.02 per day)</span>
    I want the "133", total posts... this is what I have, I guess it is completely wrong though:

    Code:
    "<span title=\'(+.?)% of total forum posts\'>(+.?) ((+.?) per day)</span>"
    Any help would be AWESOME!
  • Atli
    Recognized Expert Expert
    • Nov 2006
    • 5062

    #2
    Hey.

    I would try something more like this:
    [code=php]'#<span[^>]*>(\d+)[^<]*</span>#i'[/code]
    This should only get the "133" in your example (that's the only number you needed, right?).
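
    As a quick check, here is a minimal sketch that runs this pattern against the sample line from the first post. The variable names ($html, $pattern, $m) are just placeholders I picked for the illustration, and it uses preg_match since there is only one line to test.

    ```php
    <?php
    // Sample markup copied from the first post.
    $html = "<span title='0.31% of total forum posts'>133 (2.02 per day)</span>";

    // [^>]* skips the attributes, (\d+) captures the leading number,
    // [^<]* skips the rest of the text up to the closing tag.
    $pattern = '#<span[^>]*>(\d+)[^<]*</span>#i';

    if (preg_match($pattern, $html, $m)) {
        echo $m[1], "\n"; // the captured post count
    }
    ```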


    • darknavi
      New Member
      • Apr 2010
      • 3

      #3
      Yes, it is. Do you mind explaining how you set that up because I'd really love to learn. :D


      • Atli
        Recognized Expert Expert
        • Nov 2006
        • 5062

        #4
        Sure.
        1. In PHP's preg_* functions, every regular expression must be enclosed in a pair of delimiter characters of your choosing. I chose # because it isn't used in the expression itself.
          (The / char is a very popular choice, but here it would have to be escaped in </span>.)
          [code=regexp]##[/code]
        2. Then I add the basics. In this case, you are looking for a <span>, so I start with that.
          [code=regexp]#<span></span>#[/code]
        3. The <span> you are looking for includes attributes, all of which are irrelevant to what we are searching for. So I add a character class that matches any character except the > that would close the span tag -- [^>] -- and I tell it to match any number of characters from that class by adding an asterisk (*) to it.
          [code=regexp]#<span[^>]*></span>#[/code]
        4. We are looking for a number at the beginning of the span's value, so I add a class that matches only digits. This would normally be [0-9], but being a very frequently used class, there is a short-hand for it: \d. We are searching for one or more digits, so we use the + quantifier on the class.
          [code=regexp]#<span[^>]*>\d+</span>#[/code]
        5. Because we want to be able to retrieve the number, we create a group around it by enclosing it in parentheses. Groups are useful in many ways, but in this case its purpose is to make PHP capture its contents and add them to the output array.
          [code=regexp]#<span[^>]*>(\d+)</span>#[/code]
        6. And finally we only need to account for the rest of the span's value, so we add another class to it that searches for anything but the opening < of the closing </span>. Like before, to make it match any number of the char class, we add an asterisk.
          [code=regexp]#<span[^>]*>(\d+)[^<]*</span>#[/code]
        7. Because HTML tag names can be written in either upper or lower case, I added an i modifier to the expression, after the closing # delimiter. This makes it case-insensitive.
          [code=regexp]#<span[^>]*>(\d+)[^<]*</span>#i[/code]
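
        To see why steps 6 and 7 are needed, the intermediate patterns can be tried against the sample line from the first post. This is only a sketch; the $sample, $upper and result variable names are made up for the illustration, and the upper-case version of the line is my own variation.

        ```php
        <?php
        $sample = "<span title='0.31% of total forum posts'>133 (2.02 per day)</span>";
        $upper  = "<SPAN title='0.31% of total forum posts'>133 (2.02 per day)</SPAN>";

        // Step 5: fails, because " (2.02 per day)" sits between the digits and </span>.
        $step5 = preg_match('#<span[^>]*>(\d+)</span>#', $sample);

        // Step 6: matches, because [^<]* absorbs the trailing text.
        $step6 = preg_match('#<span[^>]*>(\d+)[^<]*</span>#', $sample, $m);

        // Without the i modifier, upper-case tags are missed ...
        $noFlag = preg_match('#<span[^>]*>(\d+)[^<]*</span>#', $upper);
        // ... with it they are found.
        $withFlag = preg_match('#<span[^>]*>(\d+)[^<]*</span>#i', $upper);

        var_dump($step5, $step6, $m[1], $noFlag, $withFlag);
        ```

        preg_match returns 1 when the pattern matches and 0 when it does not, so the dump makes it easy to compare the variants side by side.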


        To get all the results inside PHP, you would execute the expression using the preg_match_all function.
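        [code=php]<?php
        $str = "... lots of HTML from your source ...";
        // Same pattern as in step 7, including the case-insensitive i modifier.
        $regexp = '#<span[^>]*>(\d+)[^<]*</span>#i';

        // $matches[0] holds the full matches, $matches[1] the captured numbers.
        if (preg_match_all($regexp, $str, $matches)) {
            for ($i = 0; $i < count($matches[0]); $i++) {
                echo "Match #$i = {$matches[1][$i]}\n<br>";
            }
        }
        else {
            echo "No matches.";
        }
        ?>[/code]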
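
        One caveat: the pattern above matches every <span> whose text starts with a number, not just the post count. A possible refinement, assuming the profile page always marks that span with a title ending in "of total forum posts" as in the first post, is to anchor on that title text and capture the per-day figure too. The first span in $html below is invented for the example, and the variable names are placeholders.

        ```php
        <?php
        // Hypothetical page fragment; only the second span comes from the first post.
        $html = "<span class='age'>27 years</span>"
              . "<span title='0.31% of total forum posts'>133 (2.02 per day)</span>";

        // Literal parentheses are escaped as \( and \); ([\d.]+) accepts digits
        // and a decimal point; [\'"] allows either quote style around the title.
        $regexp = '#<span title=[\'"]([\d.]+)% of total forum posts[\'"]>'
                . '(\d+) \(([\d.]+) per day\)</span>#i';

        if (preg_match($regexp, $html, $m)) {
            // $m[1] = share of forum posts, $m[2] = total posts, $m[3] = posts per day
            echo "Share: {$m[1]}%, posts: {$m[2]}, per day: {$m[3]}\n";
        }
        ```

        The trade-off is that this version breaks as soon as the forum changes the title wording, while the generic pattern keeps working but may pick up unrelated spans.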

        Check out regular-expressions.info if you are interested in learning more about regular expressions. They're tricky to learn, but well worth it.

        Hope that made sense :)


        • darknavi
          New Member
          • Apr 2010
          • 3

          #5
          Thank you VERY much. The best explanation of this stuff I have found on the internet.
