Using regex in html code

  • Nightcrawler

    Using regex in html code

    Hi all.

    I have an HTML table with multiple rows (one row example below). I
    would like to extract everything within the <td> tags into groups on a
    row-by-row basis. The process would be: find the first row, extract
    the column data, store the data in a text file, find the next row,
    extract the column data, store the data in a text file, and so on
    until we have gone through all the rows in the document.

    Please help.

    Thanks in advance.

    <tr>
    <td>1</td>
    <td>GET UP </td>
    <td>CIARA FT CHAMILLIONAIRE</td>
    <td>04:25</td>
    <td>128.66</td>
    <td></td>
    <td>Step Up [Soundtrack]</td>
    <td></td>
    <td>R&B/Rap</td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td>D:\Ciara feat. Chamillionare - Get Up.mp3</td>
    <td>Stripe, (-1.6 dB, -0.7 dB)</td>
    <td></td>
    <td></td>
    <td>2006/01/01</td>
    <td>256000</td>
    <td></td>
    <td>2</td>
    <td>2007/03/28</td>
    <td>2006/12/04</td>
    <td>2007/3/28 20:50:16</td>
    <td>00:07</td>
    <td>B</td>
    </tr>

  • Jesse Houwing

    #2
    Re: Using regex in html code

    * Nightcrawler wrote, On 23-5-2007 6:59:
    > Hi all.
    >
    > I have a html table with multiple rows (one row example below). I
    > would like to extract everything within the <td> tags into groups on a
    > row by row basis. The process would be: find the first row, then
    > extract the column data, store data in a textfile, find the next row,
    > extract the column data, store data in a textfile.... and so on till
    > we go through all the rows in the document.

    You're better off using the HTML Agility Pack.
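
For readers outside .NET, the parser-based route might look roughly like the sketch below. It uses Python's standard `html.parser` module as a stand-in, not the HTML Agility Pack itself; the class name `TableRows` and the helper `parse_rows` are made up for illustration, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only text inside an open <td> is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_rows(html):
    """Hypothetical helper: return one list of cell strings per row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each inner list can then be joined and written out as one line per row.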

    But it can be done using regex:

    <tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
    ExplicitCapture ON
    SingleLine ON
    CaseInsensitive ON

    This will give you one group which will hold all the TDs found. I've
    written it to be quite robust, but this isn't the best available
    implementation. If the HTML tables are in a well-known format, this
    should be no problem. If they come from an external source, you might
    want to test more rigorously.

    I'll try to explain:
    <tr((?!<td).)*
    Find a TR start tag and match everything after it until you reach
    a <td.

    (?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*
    Skip over the opening TD tag and capture its content into the "td"
    group until you reach a </td. Then match the closing </td> tag and any
    whitespace or newline that might follow. Repeat until all TDs in this
    row have been captured.

    ((?!</tr).)*</tr[^>"*]*>
    Match everything that follows the last <td>...</td> combination, up to
    and including the closing </tr> tag.

    Executing Regex.Matches will give you a MatchCollection. Each item in
    the MatchCollection will have one Group named "td". This group has a
    list of Captures containing all the values captured under that group
    name.
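
The per-group Captures list is a .NET feature. As a rough comparison, here is a sketch of the same row pattern in Python, whose standard `re` module keeps only the last capture of a repeated group, so the cells are recovered by re-scanning each matched row. The pattern is lightly adapted (non-capturing `(?:...)` groups stand in for ExplicitCapture and for the atomic `(?>...)` group, and the closing-tag character class is simplified); the names `ROW`, `CELL` and `rows_and_cells` are illustrative only.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE  # counterparts of SingleLine and CaseInsensitive

# Same shape as the pattern above: tempered dots such as (?:(?!</td).)*
# match any character that does not start the given closing tag.
ROW = re.compile(
    r"<tr(?:(?!<td).)*"
    r"(?:<td[^>]*>(?P<td>(?:(?!</td).)*)</td[^>]*>\s*)*"
    r"(?:(?!</tr).)*</tr[^>]*>",
    FLAGS,
)
CELL = re.compile(r"<td[^>]*>((?:(?!</td).)*)</td[^>]*>", FLAGS)


def rows_and_cells(html):
    """Return one list of cell strings per <tr> row."""
    result = []
    for row in ROW.finditer(html):
        # row.group("td") holds only the LAST cell of the row in Python,
        # so scan the matched row text again to collect every cell.
        result.append(CELL.findall(row.group(0)))
    return result
```

In .NET the second scan is unnecessary, because `match.Groups["td"].Captures` already holds every cell of the row.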

    Kind Regards,

    Jesse Houwing

    • Kevin Spencer

      #3
      Re: Using regex in html code

      You will need to split the string in order to do this. It can be done by
      using 2 regular expressions, very similar:

      (?s)<tr[^>]*>(?<content>.*?)</tr>

      Splits the table into a match for each row.

      Once you have the array of row strings, you can use:

      (?s)<td[^>]*>(?<content>.*?)</td>

      Splits the row into a match for each column.
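
A minimal sketch of this two-pass approach, written in Python rather than .NET, might look as follows. The two patterns are the ones above with Python's `(?P<name>...)` spelling for named groups; the function names, the tab separator and the output path are assumptions made for the example.

```python
import re

# Pass 1 finds rows, pass 2 finds cells inside one row's content.
ROW_RE = re.compile(r"(?s)<tr[^>]*>(?P<content>.*?)</tr>")
CELL_RE = re.compile(r"(?s)<td[^>]*>(?P<content>.*?)</td>")


def rows_to_lines(html, sep="\t"):
    """Return one separator-joined string per table row."""
    lines = []
    for row in ROW_RE.finditer(html):
        cells = [c.group("content") for c in CELL_RE.finditer(row.group("content"))]
        lines.append(sep.join(cells))
    return lines


def save_rows(html, path):
    """Write one line per row to a text file (path is caller-supplied)."""
    with open(path, "w", encoding="utf-8") as out:
        for line in rows_to_lines(html):
            out.write(line + "\n")
```

Empty cells survive as empty fields, so every line keeps the same column count as its row. Cell text that itself contains a tab or a newline would need escaping before it is written.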

      The reason it can't be done in one pass is that you need to create a match
      for each row, and the match cannot contain "sub-matches," only groups, and
      unless you know how many columns there are, you can't create a group for
      each column. If you DO know how many columns there are, you can, as in:

      (?s)<tr[^>]*>.*?(?<row1><td[^>]*>(?<row1content>.*?)</td>).*?(?<row2><td[^>]*>(?<row2content>.*?)</td>).*?</tr>
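
As an illustration of the fixed-column case, the same expression can be tried in Python with `(?P<name>...)` named groups. This is only a sketch: the group names are kept from the pattern above, while `TWO_COLS` and `first_two_cells` are names invented for the example.

```python
import re

# One named group per known column; here only the first two columns.
TWO_COLS = re.compile(
    r"(?s)<tr[^>]*>.*?"
    r"(?P<row1><td[^>]*>(?P<row1content>.*?)</td>).*?"
    r"(?P<row2><td[^>]*>(?P<row2content>.*?)</td>).*?"
    r"</tr>"
)


def first_two_cells(html):
    """Return (first cell, second cell) for every row in html."""
    return [
        (m.group("row1content"), m.group("row2content"))
        for m in TWO_COLS.finditer(html)
    ]
```

Any further columns in a row are skipped by the trailing `.*?` before `</tr>`.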

      --
      HTH,

      Kevin Spencer
      Microsoft MVP

      Printing Components, Email Components,
      FTP Client Classes, Enhanced Data Controls, much more.
      DSI PrintManager, Miradyne Component Libraries:



      • Jesse Houwing

        #4
        Re: Using regex in html code

        <SNIP>
        > The reason it can't be done in one pass is that you need to create a match
        > for each row, and the match cannot contain "sub-matches," only groups, and
        > unless you know how many columns there are, you can't create a group for
        > each column. If you DO know how many columns there are, you can, as in:

        Kevin,

        You actually can get multiple results for the same named group. The
        structure is as follows:

        MatchCollection 1 ----* Groups 1 ----* Captures

        Which - sort of - translates to:

        Rows ----* Cells ----* Cell Values

        The expression which will capture this info correctly would then be
        something like this:

        <tr((?!<td).)*(?><td[^>]*>(?<td>((?!</td).)*)</td[^>]*>\s*)*((?!</tr).)*</tr[^>"*]*>
        ExplicitCapture ON
        SingleLine ON
        CaseInsensitive ON

        I tested it and it works like a charm.

        Kind regards,

        Jesse Houwing


          • Kevin Spencer

            #6
            Re: Using regex in html code

            I've got to hand it to you, Jesse. That is possibly the most creative use
            I've ever seen of regular expressions and the System.Text.RegularExpressions
            namespace and classes. I tested it too, and while it took me a good while to
            get my head around what it was doing, and I will have to mull it over some
            more before I fully understand it, it does work beautifully. I'd love to see
            some more of your regex work some time.

            --
            HTH,

            Kevin Spencer
            Microsoft MVP


            • Jesse Houwing

              #7
              Re: Using regex in html code

              * Kevin Spencer wrote, On 24-5-2007 13:48:
              > I've got to hand it to you, Jesse. That is possibly the most creative use
              > I've ever seen of regular expressions and the System.Text.RegularExpressions
              > namespace and classes. I tested it too, and while it took me a good while to
              > get my head around what it was doing, and I will have to mull it over some
              > more before I fully understand it, it does work beautifully. I'd love to see
              > some more of your regex work some time.

              Kevin,

              Thank you :).

              Jesse
