What is wrong with my script?

  • ulnanewbie
    New Member
    • Feb 2010
    • 5


    I'm using the code below to extract the text between all the <br></br> tags.

    But it does not print all of the text, and it also prints normal text that is not part of an HTML link tag.

    For example, if you have <a href="test.html " ><b>The Testing Page is here</b></a>
    <b> extrat text</b>
    I want to extract only "The Testing Page is here".



    Here the variable $myfile contains the whole HTML page.
    Code:
    while ($myfile =~ /<br.+?>(.*)<\/br>/xg) 
     {print ("a");
     print $1;
     }
    Can someone help me figure out what I am doing wrong here?

    More information: I am trying to extract all the text that is a link in the given HTML page.
    Last edited by numberwhun; Feb 8 '10, 03:09 AM. Reason: Please use code tags!
  • modmans2ndcoming
    New Member
    • Sep 2008
    • 11

    #2
    Do you mean bold tags rather than line break tags?

    There is no line break tag in your example, and the line break tag does not have a closing tag; it is self-closing. I will assume you mean the bold tag.

    The way you have written your regex, it is looking for a line break tag, so right off the bat that needs to be fixed.

    Furthermore, the way you have it written, it will only pick up on link text that sits between bold tags. Not very flexible.

    The pattern you want to look for is an anchor tag, followed by zero or more tags, followed by alphanumeric characters of any length, ending when you hit the opening bracket of a tag.

    But even with that, there is a problem if a tag is embedded in the middle of a sentence used as the link text. I'll leave that to you to figure out though, if you care to.
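
A minimal sketch of the approach described above, under some stated assumptions: instead of matching only alphanumeric characters, it captures everything between the opening and closing anchor tags and then strips any nested tags, which is one possible way around the embedded-tag problem. The helper name `link_texts` is hypothetical, and the sample string is the example from the first post. A regex like this is fragile on real-world HTML (for instance, attribute values containing `>`), so a proper parser such as the CPAN module HTML::Parser is likely the safer choice for anything beyond simple pages.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns the visible text of
# every <a>...</a> element found in an HTML string.
sub link_texts {
    my ($html) = @_;
    my @texts;
    # /i: tags in any case, /s: let . match newlines, /g: find every link.
    # \b keeps the pattern from matching other tags that start with "a".
    while ($html =~ /<a\b[^>]*>(.*?)<\/a>/gis) {
        my $text = $1;
        $text =~ s/<[^>]*>//g;      # drop nested tags such as <b> or <i>
        $text =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        push @texts, $text if length $text;
    }
    return @texts;
}

# Sample data from the first post (assumed to be the whole page here).
my $myfile = qq{<a href="test.html " ><b>The Testing Page is here</b></a>\n<b> extrat text</b>};
print "$_\n" for link_texts($myfile);
```

With the sample above, only the link text is printed; the bold text outside the anchor is skipped because it is never inside an `<a>` element.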


    • numberwhun
      Recognized Expert Moderator Specialist
      • May 2007
      • 3467

      #3
      You really need to examine what you are telling your code to extract and what you actually have in your data.

      You are telling it to match everything between <br> and </br>, but those tags do not exist in your example. Instead, remove the 'r' and try matching the <b> </b> tag set.

      Regards,

      Jeff


      • nithinpes
        Recognized Expert Contributor
        • Dec 2007
        • 410

        #4
        If you use:
        Code:
        $myfile =~ /<b>(.*)<\/b>/xg
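        $1 would have "The Testing Page is here</b></a>
        <b> extrat text".
        (This assumes the /s modifier is added or both tags sit on one line, since . does not match a newline by default.)
        This is because of the greedy nature of the * quantifier. To limit this behaviour so that it matches the minimum number of characters before finding a </b>, use: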
        Code:
        $myfile =~ /<b>(.*?)<\/b>/xg
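
To make the difference concrete, here is a small sketch comparing the two quantifiers. It assumes the sample text from the first post is placed on a single line (so that `.` can span both bold tags without the /s modifier); the variable names `$greedy`, `$lazy`, and `@all` are only for illustration.

```perl
use strict;
use warnings;

# Sample text from the thread, joined onto one line (an assumption made here
# so that . can reach the second </b> without needing the /s modifier).
my $myfile = '<a href="test.html " ><b>The Testing Page is here</b></a> <b> extrat text</b>';

# Greedy: .* grabs as much as possible, so it runs up to the LAST </b>.
my ($greedy) = $myfile =~ /<b>(.*)<\/b>/;

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST </b>.
my ($lazy) = $myfile =~ /<b>(.*?)<\/b>/;

# In list context, /g returns the captured text of every match.
my @all = $myfile =~ /<b>(.*?)<\/b>/g;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all:    ", join(' | ', @all), "\n";
```

The greedy capture spans both bold elements, while the non-greedy one yields each bold element separately. Note that this still matches bold text outside of links, so it only solves the greediness problem, not the link-only requirement.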
