A challenge? Help isolating links in a WebPage

This topic is closed.
  • Steve

    A challenge? Help isolating links in a WebPage

    Hello, I am writing a script that calls a URL and reads the resulting
    HTML into a function that strips out everything and returns ONLY the
    links, so that I can build a link index of various pages.
    I have been programming in PHP for over 2 years now and have never
    encountered a problem like the one I am having now. To me this seems
    like it should be just about the simplest thing in the world, but I
    must admit I'm stumped BIG TIME!
    For the sake of speed I chose to use preg_match_all to isolate the
    links and return them in an array.
    I have tried various regular expressions and modifications of the
    regular expressions I find on PHP.net, and scripts I've found lying
    around as well, and have read through everything I can find on them,
    including the stuff on PHP.net.
    While researching I found an open source Class called snoopy that has
    nearly the functionality I want, so like any good programmer, I used
    it as a starting point.
    The default regular expression that is used in snoopy for this
    functionality is

    preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1)(.*?)\\1|([^\s\>]+))'isx", $document, $links);

    For the benefit of all those new to regular expressions here it is
    broken down with the authors comments

    '<\s*a\s.*?href\s*=\s*        # find <a href=
    ([\"\'])?                     # find single or double quote
    (?(1)(.*?)\\1|([^\s\>]+))'    # if quote found, match up to next
                                  # matching quote, otherwise match up to next space



    Of course $document is the complete HTML result of the webpage I am
    indexing.

    This expression only returns the URL that the link points to.
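    To illustrate what this style of expression produces, here is a simplified,
    self-contained sketch (my own cut-down pattern for this post, not snoopy's
    exact one) that pulls out just the href targets:

```php
<?php
// A simplified href-only pattern (illustrative, not snoopy's exact regex):
// case-insensitive, tolerates optional quoting, captures only the URL.
$html = '<a href="page1.html">One</a> <A HREF=page2.html>Two</A>';
preg_match_all('/<a\s[^>]*href\s*=\s*["\']?([^"\'\s>]+)/i', $html, $m);
print_r($m[1]); // array of the captured URLs
?>
```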

    I need to obtain the complete link from \< \a
    href=mysite.com/mypage.html \>My Page</a>
    Excuse the extra \ escape characters, I am using google to post and I
    don't want it to turn that into an actual link (just hope it works)

    Anyways I needed the complete link so I replaced that with this

    preg_match_all('/\<a href.*?\>(.*)(<\/a\\1>)/', $document, $links);

    Again for those new to regular expressions, here goes:
    '/\<a href.*?\>    # Look for <a href
    (.*)               # Grab everything starting at the first match
    (<\/a\\1>)/'       # And continue to the </a> end of the link; \\1
    tells it to return ONLY that which matches the whole expression.


    This appears to work fine, except when I run it I seem to only get the
    first 17-20 links on the same webpage, where the first expression may
    return over 100. This told me something might be wrong, so I looked
    a LOT closer at both expressions and the pages I'm dealing with and
    realized that some of the links may use various case and spacing
    combos. The second expression doesn't appear to match anything but
    exact spacing & case. So I went back to the drawing board and came up
    with this.
    preg_match_all("'<\s*a\s.*?href.*?\>(.*)(<\/a\\1>)'", $document, $links);


    Again here it is broken down for those new to regular expressions
    '<\s*a\s.*?href.*?\>    # Find all <a href regardless of case
                            # or spacing
    (.*)                    # Grab everything just matched
    (<\/a\\1>)              # Find the closing </a> and stop

    Using the same webpage as the first two, this expression only returns
    12 results! It is actually returning fewer than the first two.
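    For what it's worth, one possible reading of the low match counts (my own
    interpretation, so treat it as an assumption): in PCRE, \\1 is a
    backreference that must re-match the exact text captured by group 1, so
    (<\/a\\1>) only succeeds when the link text is repeated inside the closing
    tag. A variant without the backreference, using the i and s modifiers and a
    non-greedy (.*?), grabs both the full element and the link text:

```php
<?php
// Full-link capture without a backreference. 'i' handles mixed case,
// 's' lets '.' cross newlines, and the non-greedy '.*?' stops each
// match at the first '</a>'.
$html = '<a href="a.html">First</a>' . "\n" . '<A HREF=b.html>Second</A>';
preg_match_all('@<a\s[^>]*href[^>]*>(.*?)</a>@is', $html, $m);
// $m[0] holds the complete <a ...>...</a> elements, $m[1] just the link text.
print_r($m[1]);
?>
```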

    Right now I am really mad at regular expressions. Could someone
    please not just give me the solution to the problem, but detail the
    thought process used to come up with that solution, and show what I'm
    doing wrong here, so that next time I use PCRE functions I can use
    correct thinking.

    Look closely at my comments; they are by no means exact. This is how I
    BELIEVE the regular expression is being evaluated, and I am open to
    criticism on that point.

    Thanx in advance, and I certainly hope this gets an informative &
    instructional thread going for the benefit of everyone new to Regular
    Expressions.
  • Hartmut König

    #2
    Re: A challenge? Help isolating links in a WebPage

    Steve wrote:
    > Hello, I am writing a script that calls a URL and reads the resulting
    > HTML into a function that strips out everything and returns ONLY the
    > links [snip]

    Why don't you do yourself a favour and use HTMLSax from Pear:


    Regards

    Hartmut

    --
    SnakeLab - Internet und webbasierte Software | /\ /

    Hartmut König (mailto:h.koenig@snakelab.de) | /\/ \ /

    ___________ http://www.snakelab.de _______/\/\| /\/ \/

    Do you know your Shop-clients ? ShopStat do ->\/_____________


    • Steve

      #3
      Re: A challenge? Help isolating links in a WebPage

      Well two reasons really.
      First off I didn't know this thing existed, and I'll now probably have
      to learn a new API :)
      And the second was to get a good discussion going on PCRE's
      preg_match_all regular expressions.
      But thank you, and I will take a closer look, since I'm running out of
      dev time waiting on this one part.

      Hartmut König <h.koenig@snakelab.de> wrote in message news:<bnj0fk$tr4$04$1@news.t-online.com>...
      > Why don't you do yourself a favour and use HTMLSax from Pear:
      > http://pear.php.net/package-info.php...ge=XML_HTMLSax


      • Steve

        #4
        Re: A challenge? Help isolating links in a WebPage

        I just downloaded it, and took a peek.
        It's totally overkill for what I need, and I would have to recode
        from the beginning to utilize it. I will however be using it on my
        next project.
        Also as stated earlier, I really want to do this with a regular
        expression.

        Hartmut König <h.koenig@snakelab.de> wrote in message news:<bnj0fk$tr4$04$1@news.t-online.com>...
        > Why don't you do yourself a favour and use HTMLSax from Pear:
        > http://pear.php.net/package-info.php...ge=XML_HTMLSax


        • Pedro

          #5
          Re: A challenge? Help isolating links in a WebPage

          Steve wrote:
          > It's totally overkill for what I need [snip]
          > Also as stated earlier, I really want to do this with a regular
          > expression.

          I used a mix of preg_match_all() and substr().

          source at http://www.geocities.com/alterpedro/phps.html
          result at http://www.geocities.com/alterpedro/php.html

          --
          I have a spam filter working.
          To mail me include "urkxvq" (with or without the quotes)
          in the subject line, or your mail will be ruthlessly discarded.


          • Pedro

            #6
            Re: A challenge? Help isolating links in a WebPage

            Pedro wrote:
            > I used a mix of preg_match_all() and substr().
            and strpos(), and preg_replace()

            > source at http://www.geocities.com/alterpedro/phps.html
            > result at http://www.geocities.com/alterpedro/php.html

            I have pasted the code to geocities, because it was much bigger
            than I felt "safe" to post here. Much of its size was the
            yahoo HTML chunk that I had to remove before the file
            got accepted ... but I had thought of that and didn't want
            to go back to some other way. Now I'm home, thinking clearer,
            and the code is better :)



            New Version!
            [ I'll remove the geocities pages in a few days ]



            <?php
            function extract_URLs($s) {
            $res = array();
            preg_match_all('@(<a .*</a>)@Uis', $s, $a);
            foreach ($a[1] as $x) {
            $gtpos = strpos($x, '>');
            $y = substr($x, 0, $gtpos);
            if ($hrefpos = strpos($x, 'href=')) {
            $z = substr($y, $hrefpos+5);
            $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
            if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
            if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
            $res[] = array(substr($x, $gtpos+1, -4), $z);
            }
            }
            unset($a);
            return $res;
            }

            ###
            ### example usage:
            ###

            $data = <<<EOT
            <a href=z>zz</a> <a href="z" bold="yes">ZZ</a>
            <a link="y">yy</a> <a title="x" href='aa'>aa</a>
            text before, <a href="href.here"><b>bold text inside</b></a> and text after
            <a href="image.png"><img src="image.png"/></a>
            EOT;

            $LINKS = extract_URLs($data);
            foreach ($LINKS as $v) {
            echo $v[0], ' --> [', $v[1], "]\n";
            }
            ?>

            :x
            --
            I have a spam filter working.
            To mail me include "urkxvq" (with or without the quotes)
            in the subject line, or your mail will be ruthlessly discarded.


            • Steve

              #7
              Re: A challenge? Help isolating links in a WebPage

              Pedro <hexkid@hotpop.com> wrote in message news:<bnk0li$11scd7$1@ID-203069.news.uni-berlin.de>...
              > [snip -- code quoted in full from the previous post]

              Looks great, and it does basically what I want...
              Care to explain to the class how it works? Especially the regular expression part?

              And thanx by the way.


              • Pedro

                #8
                Re: A challenge? Help isolating links in a WebPage

                Steve wrote:[color=blue]
                > Looks great, and it does basically what I want...
                > Care to explain to the class how it works? Especially the regular
                > expression part?
                >
                > And thanx by the way.
                >[/color]

                Let's see how I go about that ... hope it makes sense :)


                # extract URLs from a string; return an array of arrays;
                # each inner array has the text and the URL
                function extract_URLs($s) {

                # initialize return array
                $res = array();

                # grab all "<a ...</a>" bits
                preg_match_all('@(<a\s.*</a>)@Uis', $s, $a);
                #              |`----v----'||||
                #              |     |     |||`- s: dot metacharacter matches all (\n included)
                #              |     |     ||`-- i: case insensitive matches
                #              |     |     |`--- U: ungreedy, so that '<a href="1">1</a><a href="2">2</a>'
                #              |     |     |     does *NOT* match all of this
                #              |     |     `---- end pattern delimiter
                #              |     `---------- grab into $a[1]
                #              `---------------- pattern delimiter
                #
                # for the pattern inside the parentheses:
                # <a\s   literal "<a" followed by whitespace, which stops the regex
                #        from matching "abbr", "acronym", "address", "applet", and "area"
                # .*     any number of anything
                #        (except "</a>" because we're in ungreedy matching)
                # </a>   literal "</a>"

                # for all "<a ...</a>" matches
                foreach ($a[1] as $x) {

                # find the first ">" -- certainly it is the one that ends the opening "<a "
                $gtpos = strpos($x, '>');

                # and isolate that part
                $y = substr($x, 0, $gtpos);

                # if there's a "href=" there we have a good match!
                # get rid of "title" in <a title="index" href="index.html">
                if ($hrefpos = strpos($y, 'href=')) {

                # put the URL, and trailing stuff (up to, but not including, the closing ">"), in $z
                $z = substr($y, $hrefpos+5);

                # remove everything after, and including, the first whitespace
                # (whitespace is not allowed in URLs)
                # get rid of "title" in <a href="index.html" title="index">
                # if there's no match, there also is no change
                $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
                # /     start of expression
                # ^     start of string
                # (     grab
                # \S+   one or more non-whitespace characters
                # )     into $1
                # \s    discard the first whitespace
                # .*    and everything following it
                # $     up to the end of the string
                # /U    end of expression, do ungreedy match (why? I can't remember :)

                # if the URL is delimited by '"' or "'" remove those
                if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
                if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);

                # save result in array
                # $x still is the whole string "<a href='index.html' title='index'>link text</a>"
                # $gtpos is the position of the first ">": ____________________________^
                # and the last 4 characters of $x are "</a>"
                #
                # $z is the URL from the href that has been dealt with previously
                $res[] = array(substr($x, $gtpos+1, -4), $z);
                }
                }

                # I don't like leaving "large" things abandoned
                unset($a);

                return $res;
                }
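                To see that preg_replace step on its own (a standalone snippet,
                not part of the function):

```php
<?php
// Keep only the first whitespace-delimited token; this strips trailing
// attributes such as title="..." that follow the href value.
echo preg_replace('/^(\S+)\s.*$/U', '$1', '"z" bold="yes"'), "\n"; // "z"
// With no whitespace the pattern fails to match, and the string is
// returned unchanged:
echo preg_replace('/^(\S+)\s.*$/U', '$1', "'aa'"), "\n"; // 'aa'
?>
```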



                compact new version function:

                <?php
                function extract_URLs($s) {
                ### version 3
                ### changes from version 2:
                ###   the character separating "<a" from "href" (or whatever) may be any whitespace
                ###   only need to test for "href=" in the <a ...> part
                $res = array();
                preg_match_all('@(<a\s.*</a>)@Uis', $s, $a);
                foreach ($a[1] as $x) {
                $gtpos = strpos($x, '>');
                $y = substr($x, 0, $gtpos);
                if ($hrefpos = strpos($y, 'href=')) {
                $z = substr($y, $hrefpos+5);
                $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
                if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
                if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
                $res[] = array(substr($x, $gtpos+1, -4), $z);
                }
                }
                unset($a);
                return $res;
                }
                ?>
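                And a quick usage sketch of that compact function (repeated
                below so the snippet runs on its own). One caveat worth noting:
                strpos() is case-sensitive, so a link written as HREF= is
                skipped even though the regex itself is case-insensitive.

```php
<?php
// The compact extractor from above, repeated so this snippet is
// self-contained.
function extract_URLs($s) {
    $res = array();
    preg_match_all('@(<a\s.*</a>)@Uis', $s, $a);
    foreach ($a[1] as $x) {
        $gtpos = strpos($x, '>');
        $y = substr($x, 0, $gtpos);
        if ($hrefpos = strpos($y, 'href=')) { // NB: strpos() is case-sensitive
            $z = substr($y, $hrefpos + 5);
            $z = preg_replace('/^(\S+)\s.*$/U', '$1', $z);
            if ($z[0] == '"' && substr($z, -1) == '"') $z = substr($z, 1, -1);
            if ($z[0] == "'" && substr($z, -1) == "'") $z = substr($z, 1, -1);
            $res[] = array(substr($x, $gtpos + 1, -4), $z);
        }
    }
    return $res;
}

$links = extract_URLs('<a href="one.html" title="t">One</a> <a title="x" href=\'two.html\'>Two</a>');
foreach ($links as $v) {
    echo $v[0], ' --> ', $v[1], "\n"; // One --> one.html / Two --> two.html
}
?>
```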


                --
                I have a spam filter working.
                To mail me include "urkxvq" (with or without the quotes)
                in the subject line, or your mail will be ruthlessly discarded.


                • Steve

                  #9
                  Re: A challenge? Help isolating links in a WebPage

                  Pedro <hexkid@hotpop.com> wrote in message news:<bnli99$12ko2t$1@ID-203069.news.uni-berlin.de>...
                  > [snip -- annotated code quoted in full from the previous post]

                  <--Gives Pedro a Gold Star and says, Thank You for that detailed
                  report, you get a Gold Star!


                  • Pedro

                    #10
                    Re: A challenge? Help isolating links in a WebPage

                    Steve wrote:[color=blue]
                    > Thank You for that detailed report.[/color]

                    You're very welcome.

                    --
                    I have a spam filter working.
                    To mail me include "urkxvq" (with or without the quotes)
                    in the subject line, or your mail will be ruthlessly discarded.
