Hello, I am writing a script that calls a URL and reads the resulting
HTML into a function that strips out everything and returns ONLY the
links, so that I can build a link index of various pages.
I have been programming in PHP for over 2 years now and have never
encountered a problem like the one I am having now. To me this seems
like it should be just about the simplest thing in the world, but I
must admit I'm stumped BIG TIME!
For the sake of speed I chose to use preg_match_all to isolate the
links and return them in an array.
I have tried various regular expressions and modifications of the
regular expressions I found on PHP.net and in scripts I've found lying
around, and have read through everything I can find on them,
including the stuff on PHP.net.
While researching I found an open source class called Snoopy that has
nearly the functionality I want, so like any good programmer, I used
it as a starting point.
The default regular expression that is used in Snoopy for this
functionality is
preg_match_all("'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1) (.*?)\\1|([^\s\>]+))'isx",$document,$links);
For the benefit of all those new to regular expressions, here it is
broken down with the author's comments:
'<\s*a\s.*?href\s*=\s*        # find <a href=
([\"\'])?                     # find single or double quote
(?(1) (.*?)\\1 | ([^\s\>]+))' # if quote found, match up to next matching
                              # quote, otherwise match up to next space
Of course $document is the complete HTML result of the webpage I am
indexing.
This expression returns only where the link is pointing to (the href value).
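To make that concrete, here is a minimal sketch of how I am calling it. The sample $document string and the $hrefs merge loop are my own assumptions for illustration, not part of Snoopy; the pattern is the one quoted above.

```php
<?php
// Made-up sample input for illustration only (not a real page).
$document = '<a href="one.html">One</a> <A HREF=two.html>Two</A>';

preg_match_all(
    "'<\s*a\s.*?href\s*=\s*          # find <a href=
      ([\"\'])?                      # group 1: optional single or double quote
      (?(1) (.*?)\\1 | ([^\s\>]+))   # group 2: quoted value, group 3: unquoted value
     'isx",
    $document,
    $links
);

// Quoted targets land in $links[2] and unquoted ones in $links[3];
// whichever did not take part in a match is an empty string.
$hrefs = [];
foreach ($links[0] as $i => $unused) {
    $hrefs[] = $links[2][$i] !== '' ? $links[2][$i] : $links[3][$i];
}

print_r($hrefs);
```

So $links[0] holds only the opening part of each tag up to the end of the href value, never the link text or the closing tag.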
I need to obtain the complete link from \< \a
href=mysite.com/mypage.html \>My Page</a>
Excuse the extra \ escape characters; I am using Google to post and I
don't want it to turn that into an actual link (I just hope it works).
Anyway, I needed the complete link, so I replaced that with this:
preg_match_all('/\<a href.*?\>(.*)(<\/a\\1>)/',$document,$links);
Again, for those new to regular expressions, here goes:
'/\<a href.*?\>   # Look for <a href
(.*)              # Grab everything starting at the first match
(<\/a\\1>)/'      # And continue to the < /a > end of the link; \\1
                  # tells it to return ONLY that which matches the whole expression.
This appears to work fine, except that when I run it I only get the
first 17-20 links on the same webpage, where the first expression may
return over 100. This told me something might be wrong, so I looked
A LOT closer at both expressions and the pages I'm dealing with, and
realized that some of the links may use various case and spacing
combos. The second expression doesn't appear to match anything but
exact spacing & case. So I went back to the drawing board and came up
with this:
preg_match_all("'<\s*a\s.*?href.*?\>(.*)(<\/a\\1>)'",$document,$links);
Again, here it is broken down for those new to regular expressions:
'<\s*a\s.*?href.*?\>   # Find all <a href regardless of case
                       # or spacing
(.*)                   # Grab everything just matched
(<\/a\\1>)             # Find the closing < /a > and stop
Using the same webpage as the first two, this expression returns only
12 results! It is actually returning fewer than the first two.
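To take the real page out of the picture, here is a minimal sketch that runs all three expressions against a tiny made-up sample. The $document string and the $patterns labels are my own assumptions for illustration; the patterns are the ones quoted in this post. On this sample the first expression finds both links, while the second and third each return only the link whose body ends in a tag right before the closing < /a >.

```php
<?php
// Made-up two-link sample for illustration only (not the real page).
$document = '<a href="one.html">One</a>' . "\n"
          . '<a href="two.html"><img src="two.png"></a>';

// The three expressions as quoted in this post.
$patterns = [
    'first'  => "'<\s*a\s.*?href\s*=\s*([\"\'])?(?(1) (.*?)\\1|([^\s\>]+))'isx",
    'second' => '/\<a href.*?\>(.*)(<\/a\\1>)/',
    'third'  => "'<\s*a\s.*?href.*?\>(.*)(<\/a\\1>)'",
];

$counts = [];
foreach ($patterns as $label => $pattern) {
    // preg_match_all returns the number of full matches it found.
    $counts[$label] = preg_match_all($pattern, $document, $links);
    echo $label, ': ', $counts[$label], " match(es)\n";
    print_r($links[0]); // the full text of each match
}
```

A sample this small only reproduces the drop in the number of results; it does not show why the third expression returns even fewer than the second on the real page.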
Right now I am really mad at regular expressions. Could someone
please not just give me the solution to the problem, but also detail
the thought process behind that solution, and show what I'm doing
wrong here, so that next time I use PCRE functions I can use correct
thinking.
Look closely at my comments; they are by no means exact. This is how I
BELIEVE the regular expression is being evaluated, and I am open to
criticism on that point.
Thanx in advance, and I certainly hope this gets an informative &
instructional thread going for the benefit of everyone new to Regular
Expressions.