Regex question

  • remy rakic

    Regex question

    Hi all, I was trying to parse some HTML and ran into trouble with
    some regex processing (which I have never done before).

    What I am trying to do is get the content between two tags, including any
    HTML code. I can do something like this:
    "<a>([\w\s]*)</a>" on "<a>Not cool</a><a>Absolutely not</a>"
    but that obviously only gets regular text content and no HTML tags. Could
    someone tell me which regex to use to get the results
    "<really>Really not<cool/><at>all</at>" and "Absolutely not" from the string
    "<tag><tag2><a><really>Really not<cool/><at>all</at></a></tag2>...<tag3><a>Absolutely not</a></tag3></tag>"?
    (Note that I can't use XPath, since I'm not sure whether the site is XHTML
    compliant; the example above is not XML.)

    Should I process the content twice, or give up the regex approach for
    regular 'string index' parsing?
    Thanks in advance.


  • Ron Bullman

    #2
    Re: Regex question

    remy,

    How about <a>(?<1>.+?)</a>


    Ron
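For readers who want to try the suggestion, here is a minimal sketch in Python. It is not Ron's exact pattern: the `(?<1>...)` group looks like .NET syntax, which Python's `re` module does not accept, so a plain capturing group is used instead. The function name `contents_of_a` and the `re.DOTALL` flag (so that `.` also crosses line breaks) are assumptions added for illustration.

```python
import re

# Lazy quantifier: ".+?" grabs as little as possible, so each match
# ends at the first "</a>" that follows the opening "<a>".
A_TAG = re.compile(r"<a>(.+?)</a>", flags=re.DOTALL)


def contents_of_a(html: str) -> list[str]:
    """Return the raw content (text and markup) of every <a>...</a> pair."""
    return A_TAG.findall(html)


sample = (
    "<tag><tag2><a><really>Really not<cool/><at>all</at></a></tag2>"
    "...<tag3><a>Absolutely not</a></tag3></tag>"
)

for content in contents_of_a(sample):
    print(content)
```

Note that `<at>` and `</at>` are not mistaken for `<a>` and `</a>`, because the pattern requires the closing `>` immediately after the `a`.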



        • remy rakic

          #5
          Re: Regex question

          Aaah, the non-greedy option, now I know what it is used for. Thanks Ron, it
          works like a charm!
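The difference between the greedy and non-greedy forms can be seen side by side on the simpler string from the original question. This is only an illustrative sketch in Python; the variable names are made up for the example.

```python
import re

text = "<a>Not cool</a><a>Absolutely not</a>"

# Greedy: ".+" runs to the LAST "</a>", swallowing the tags in between.
greedy = re.findall(r"<a>(.+)</a>", text)

# Non-greedy: ".+?" stops at the FIRST "</a>" after each "<a>".
lazy = re.findall(r"<a>(.+?)</a>", text)

print(greedy)  # ['Not cool</a><a>Absolutely not']
print(lazy)    # ['Not cool', 'Absolutely not']
```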

