regular expression for parsing html using preg_match_all

  • crescent_au@yahoo.com

    regular expression for parsing html using preg_match_all

    Hi all,

    I've been trying, unsuccessfully, to extract text from an HTML page. The
    markup I'm interested in looks like this:

    <a class=link
    href="http://www.something.com/_something.php?type=cart">Shopping
    Cart</a>
    <div><em class=newentry><a href=http://nothing.com>New
    Age</a></em></div>

    From the above markup, I want to extract "Shopping Cart". I'm not very
    good with regular expressions. I tried this:

    $lines = file_get_contents("http://theabovetag.com/page.html");
    preg_match_all("/(<a\ class\=link\ href\=(.*)>)(<\/a>)/", $lines, $matches1);

    The above RE gives me "Shopping Cart" plus "New Age" as well. I just
    want "Shopping Cart". What am I doing wrong? My RE is somehow ignoring
    the </a> tag right after Shopping Cart and instead accepting the </a>
    after New Age. Please help!

  • Richard Levasseur

    #2
    Re: regular expression for parsing html using preg_match_all


    crescent_au@yahoo.com wrote:
    > Hi all,
    >
    > I've been trying unsuccessfully to get the text from html page. Html
    > tag that I'm interested in looks like this:
    >
    > <a class=link
    > href="http://www.something.com/_something.php?type=cart">Shopping
    > Cart</a>
    > <div><em class=newentry><a href=http://nothing.com>New
    > Age</a></em></div>
    >
    > From the above tag, I want to extract "Shopping Cart". I'm not very
    > good with RE. I tried this:
    > $lines = file_get_contents("http://theabovetag.com/page.html");
    > preg_match_all("/(<a\ class\=link\ href\=(.*)>)(<\/a>)/", $lines,
    > $matches1);
    >
    > The above RE gives me "Shopping Cart" plus "New Age" as well. I just
    > want "Shopping Cart". What am I doing wrong? My RE is somehow ignoring
    > </a> tag right after Shopping Cart and instead accepting </a> after New
    > Age. Please help!

    It most likely has to do with the greediness of *. By default, a
    quantifier matches the *longest* possible string. To prevent this, put
    '?' after the quantifier to make it lazy.
    Given the string: "<a>text</a>more</a>"
    <a>.*</a> matches "<a>text</a>more</a>"
    <a>.*?</a> matches "<a>text</a>"
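
The difference between the two patterns above can be checked with a short script. This is only a minimal sketch: the variable names ($subject, $greedy, $lazy) are made up for illustration, and the sample string is the one from this reply.

```php
<?php
// Sample string from the post above: one opening tag, two closing tags.
$subject = '<a>text</a>more</a>';

// Greedy: .* grabs as much as possible, then backtracks to the LAST </a>.
preg_match('/<a>.*<\/a>/', $subject, $greedy);

// Lazy: .*? grabs as little as possible and stops at the FIRST </a>.
preg_match('/<a>.*?<\/a>/', $subject, $lazy);

echo $greedy[0], "\n"; // <a>text</a>more</a>
echo $lazy[0], "\n";   // <a>text</a>
```

Keep in mind that '.' does not match a newline unless the 's' modifier is added, so link text that wraps across lines needs a pattern such as /<a>.*?<\/a>/s.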


    • crescent_au@yahoo.com

      #3
      Re: regular expression for parsing html using preg_match_all

      > It most likely has to do with the greediness of *. Regular expressions
      > will match the *longest* possible string. To prevent this, use '?'.
      > given the string: "<a>text</a>more</a>"
      > <a>.*</a> matches "<a>text</a>more</a>"
      > <a>.*?</a> matches "<a>text</a>"

      Well, what I basically want is:
      <a class="something" href="http://something.com/abc.php">Shopping
      Cart</a>

      I want the RE to parse the HTML tag and check that it starts with '<a
      class="something" href=', then IGNORE whatever is between 'href=' and
      '>', and ends with '</a>'. I couldn't figure out how to "ignore" the
      text in between.


      • Karel de Vos

        #4
        Re: regular expression for parsing html using preg_match_all

        crescent_au@yahoo.com wrote:
        >> It most likely has to do with the greediness of *. Regular expressions
        >> will match the *longest* possible string. To prevent this, use '?'.
        >> given the string: "<a>text</a>more</a>"
        >> <a>.*</a> matches "<a>text</a>more</a>"
        >> <a>.*?</a> matches "<a>text</a>"
        >
        > Well what i basically want is:
        > <a class="something" href="http://something.com/abc.php">Shopping
        > Cart</a>
        >
        > I want the RE to parse the HTML tag and see if it starts with '<a
        > class="something" href=', then IGNORE whatever is between 'href=' and
        > '>', and ending with '</a>'. I couldn't figure out how to "ignore" the
        > text in between.

        Instead of the (greedy) .* construct, use a negated character class,
        which matches everything up to a certain character. The link text
        then ends up in the third capture group:
        /(<a\ class\=link\ href\=([^>]*)>)([^<]*)(<\/a>)/
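
Here is a rough sketch of how the negated-class approach could be applied to the markup from the first post. The $html string is a hypothetical stand-in for the page fetched with file_get_contents(), and the pattern is a variation on the one above, not the exact same expression: it uses \s+ instead of a literal space so that a line break between the attributes is also accepted, and it keeps only one capture group.

```php
<?php
// Hypothetical stand-in for the fetched page (markup from the first post).
$html = <<<'HTML'
<a class=link
href="http://www.something.com/_something.php?type=cart">Shopping
Cart</a>
<div><em class=newentry><a href=http://nothing.com>New
Age</a></em></div>
HTML;

// [^>]* skips the href value up to the end of the opening tag;
// ([^<]*) captures the link text up to the next tag.
$pattern = '/<a\s+class=link\s+href=[^>]*>([^<]*)<\/a>/';

preg_match_all($pattern, $html, $matches);

// Collapse the line break inside the link text into a single space.
$texts = array_map(
    function ($text) {
        return trim(preg_replace('/\s+/', ' ', $text));
    },
    $matches[1]
);

print_r($texts); // one element: "Shopping Cart"
```

The second link is skipped because it has no class=link attribute. Note that this pattern assumes the attributes appear in exactly this order, that the class value is unquoted, and that the href value contains no '>' character; markup that differs would need the pattern adjusted.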
