Regular Expression HELP!

  • jn

    Regular Expression HELP!

    I'm stripping out the attributes in <TD> tags...but I want to strip out
    everything BUT the COLSPAN attribute.

    The following strips out all attributes. What do I do if I want to keep a
    certain one?

    eregi_replace("<TD[^>]*>","<TD>", $string);

    I suck at regular expressions. I need some help.

    Thanks
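
A note for readers on current PHP: eregi_replace() belongs to the POSIX regex extension, which was deprecated in PHP 5.3 and removed in PHP 7. Below is a minimal sketch of the same strip-everything step using preg_replace(). The function name strip_td_attributes is made up for illustration, and the pattern assumes attribute values never contain a literal ">".

```php
<?php
// Hypothetical helper: replaces every opening <td ...> tag with a bare <td>.
// The /i modifier gives the case-insensitive matching that eregi_replace provided.
function strip_td_attributes(string $html): string
{
    // \b keeps the pattern from matching longer tag names that merely start with "td".
    return preg_replace('/<td\b[^>]*>/i', '<td>', $html);
}

echo strip_td_attributes('<TD class="x" COLSPAN="2">a</TD>'), "\n";
```

Note that the opening tag comes back in lowercase, because the replacement string is a fixed '<td>'.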


  • Justin Koivisto

    #2
    Re: Regular Expression HELP!

    jn wrote:
    > I'm stripping out the attributes in <TD> tags...but I want to strip out
    > everything BUT the COLSPAN attribute.
    >
    > The following strips out all attributes. What do I do if I want to keep a
    > certain one?
    >
    > eregi_replace("<TD[^>]*>","<TD>", $string);

    This is what I had used one time long ago:

    preg_match_all('/<td(\s([^>]+)*)>/i', $subject, $attributes);
    preg_match_all('/[a-z]+\s*=\s*(\'|")?([^\'"]*)\\1/i', $attributes[2][0], $attributes);
    $attributes = $attributes[0];

    I'm sure someone will have something better though...

    --
    Justin Koivisto - spam@koivi.com
    PHP POSTERS: Please use comp.lang.php for PHP related questions,
    alt.php* groups are not recommended.


    • Pedro

      #3
      Re: Regular Expression HELP!

      jn wrote:
      > I'm stripping out the attributes in <TD> tags...but I want to strip out
      > everything BUT the COLSPAN attribute.
      >
      > The following strips out all attributes. What do I do if I want to keep a
      > certain one?
      >
      > eregi_replace("<TD[^>]*>","<TD>", $string);

      preg_replace is faster and more powerful than eregi_replace.

      I tried this:
      <?php

      $data = '===<td a="b" colspan="3" x="y">===';

      $regex = '<td([^>]*)( colspan=\S+)([^>]*)>';

      $newdata = preg_replace("/$regex/i", '<td$2>', $data);

      echo $newdata, "\n";

      ?>


      The output was:
      ===<td colspan="3">===


      HTH
      --
      I have a spam filter working.
      To mail me include "urkxvq" (with or without the quotes)
      in the subject line, or your mail will be ruthlessly discarded.


      • jn

        #4
        Re: Regular Expression HELP!

        "Pedro" <hexkid@hotpop.com> wrote in message
        news:bo9cfa$1c209b$1@ID-203069.news.uni-berlin.de...
        > $regex = '<td([^>]*)( colspan=\S+)([^>]*)>';
        >
        > $newdata = preg_replace("/$regex/i", '<td$2>', $data);
        >
        > The output was:
        > ===<td colspan="3">===

        That does indeed strip out everything but the colspan! But how do I strip
        out everything in TD tags that don't have the colspan at the same time?
        Maybe a pattern that matches TD tags if they don't contain colspan?

        I wish I knew this stuff...it's very useful.



        • jn

          #5
          Re: Regular Expression HELP!

          "Justin Koivisto" <spam@koivi.com> wrote in message
          news:QgWpb.584$Uz.15999@news7.onvoy.net...
          > This is what I had used one time long ago:
          >
          > preg_match_all('/<td(\s([^>]+)*)>/i', $subject, $attributes);
          > preg_match_all('/[a-z]+\s*=\s*(\'|")?([^\'"]*)\\1/i', $attributes[2][0], $attributes);
          > $attributes = $attributes[0];

          Thanks for the reply. That's pretty scary looking :)



          • Eric Ellsworth

            #6
            Re: Regular Expression HELP!

            > > $regex = '<td([^>]*)( colspan=\S+)([^>]*)>';
            > >
            > > $newdata = preg_replace("/$regex/i", '<td$2>', $data);
            > >
            > That does indeed strip out everything but the colspan! But how do I strip
            > out everything in TD tags that don't have the colspan at the same time?
            > Maybe a pattern that matches TD tags if they don't contain colspan?

            The above regex is very elegant. If you add a ? after the second group, it
            makes matching the colspan optional. That can be problematic in terms of
            what gets assigned to $1 and $2, so you can add ?: to the other groups to
            make them non-capturing, and then use $1, which should be either the
            colspan attribute or empty (but I haven't tested it, so I don't guarantee
            it).
            So the new regex would be:
            $regex = '<td(?:[^>]*)( colspan=\S+)?(?:[^>]*)>';


            $newdata = preg_replace("/$regex/i", '<td$1>', $data);

            Another approach is to use preg_replace_callback:
            http://us4.php.net/manual/en/functio...e-callback.php
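
For illustration, the callback approach could look roughly like the sketch below. It is not code from this thread: the function name keep_colspan is hypothetical, the outer pattern assumes attribute values never contain ">", and the inner pattern assumes no other attribute value contains the text " colspan=".

```php
<?php
// Hypothetical helper: rebuilds each opening <td> tag, keeping only colspan.
function keep_colspan(string $html): string
{
    return preg_replace_callback(
        '/<td\b([^>]*)>/i',              // $m[1] holds the raw attribute text
        function (array $m): string {
            // Look for colspan with a double-quoted, single-quoted, or bare value.
            $found = preg_match(
                '/\scolspan\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)/i',
                $m[1],
                $a
            );
            return $found ? '<td colspan=' . $a[1] . '>' : '<td>';
        },
        $html
    );
}

echo keep_colspan('===<td a="b" colspan="3" x="y">==='), "\n"; // ===<td colspan="3">===
echo keep_colspan('===<td a="b" rowspan="3">==='), "\n";        // ===<td>===
```

Doing the attribute search inside the callback means the outer pattern only has to find the tag, so tags with and without colspan are handled in one pass.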

            > I wish I knew this stuff...it's very useful.
            I highly recommend the book Mastering Regular Expressions, by Jeffrey
            Friedl. It's very easy to read and really helps you understand regexes.

            Cheers,

            Eric

            • jn

              #7
              Re: Regular Expression HELP!

              "Eric Ellsworth" <s@n> wrote in message
              news:U4OdnTDEyuEAozWiRTvUrg@speakeasy.net...
              > So the new regex would be:
              > $regex = '<td(?:[^>]*)( colspan=\S+)?(?:[^>]*)>';
              >
              > $newdata = preg_replace("/$regex/i", '<td$1>', $data);

              Thanks, but it stripped out everything, including the colspan. I'll try to
              tinker with it and see if I can get it to work though.
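
A likely reason for that result, shown as a small sketch using the test string from earlier in the thread: the first [^>]* is greedy, so it consumes every attribute up to the closing ">", colspan included. The optional colspan group is then free to match nothing, and $1 comes back empty.

```php
<?php
$data  = '===<td a="b" colspan="3" x="y">===';
$regex = '<td(?:[^>]*)( colspan=\S+)?(?:[^>]*)>';

// The first (?:[^>]*) runs all the way to ">" and takes colspan="3" with it.
// Because the capturing group is optional, the engine never needs to backtrack.
$result = preg_replace("/$regex/i", '<td$1>', $data);

echo $result, "\n"; // ===<td>===
```

Making the first group lazy ([^>]*?) does not help on its own: it starts by matching nothing, the optional group finds no " colspan" directly after "<td" and is skipped, and the trailing [^>]* picks up everything instead.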




              • Pedro

                #8
                Re: Regular Expression HELP!

                Eric Ellsworth wrote:
                > So the new regex would be:
                > $regex = '<td(?:[^>]*)( colspan=\S+)?(?:[^>]*)>';

                Maybe regexes aren't the best way to do this ... however I *had* to
                manage it. Here it is for your enjoyment:

                <?php
                $s = ''; ### test data
                $s.= 'CS ===<td a="b" color="blue" colspan="3" x="y">===' . "\n";
                $s.= ' ===<td a="b" color="blue" rowspan="3" x="y">===' . "\n";
                $s.= 'CS ===<td a="b" colspan="3" x="y">===' . "\n";
                $s.= ' ===<td a="b" rowspan="3" x="y">===' . "\n";
                $s.= 'CS ===<td colspan="3" x="y">===' . "\n";
                $s.= ' ===<td rowspan="3" x="y">===' . "\n";
                $s.= 'CS ===<td a="b" colspan="3">=== ' . "\n";
                $s.= ' ===<td a="b" rowspan="3">=== ' . "\n";
                $s.= 'CS ===<td colspan="3">=== ' . "\n";
                $s.= ' ===<td rowspan="3">=== ' . "\n";
                $s.= ' ===<td>===' . "\n";
                $s.= ' ====== :)' . "\n";

                $cs = '( colspan=[0-9\'"]+)?'; # optional " colspan=" followed by one or more digits or quotes
                $ns = '(?:(?! colspan=[0-9\'"]+) \S+)*'; # zero or more attributes that are *NOT* colspan, not captured
                #         ^^^-------------------^ negative lookahead assertion

                $regex = "<td$cs$ns$cs$ns$cs>"; # colspan can be immediately after td,
                                                # or in the middle of the
                                                # parameters or at the last position


                $newx = preg_replace("/$regex/i", '<td$1$2$3>', $s);

                echo "original:\n", $s, "\n\nchanged:\n", $newx, "\n";
                ?>
                > Another approach is to use preg_replace_callback:
                > http://us4.php.net/manual/en/functio...e-callback.php

                And not learn the "negative lookahead assertion"? :-))
                This was a very challenging challenge!

                --
                I have a spam filter working.
                To mail me include "urkxvq" (with or without the quotes)
                in the subject line, or your mail will be ruthlessly discarded.
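
One thing to watch with the lookahead pattern above: \S+ also matches ">", so on real HTML a single attribute token can run past the end of its tag and swallow cell content up to the next space. The sketch below is a variation on the same $cs/$ns idea, not Pedro's code. The function name is hypothetical, and it still assumes attribute values contain no ">".

```php
<?php
// Hypothetical helper: same three-slot layout ($cs $ns $cs $ns $cs) as above.
function strip_td_except_colspan(string $html): string
{
    // Optional colspan attribute with a quoted or bare value (captured).
    $cs = '(\s+colspan\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+))?';

    // Zero or more whitespace-separated tokens that do not start a colspan
    // attribute (not captured). (?!...) is the negative lookahead; [^\s>]+
    // replaces \S+ so a token cannot cross the closing ">".
    $ns = '(?:(?!\s+colspan\s*=)\s+[^\s>]+)*';

    $regex = '/<td\b' . $cs . $ns . $cs . $ns . $cs . '\s*>/i';

    // Groups that did not take part in the match expand to an empty string.
    return preg_replace($regex, '<td$1$2$3>', $html);
}

echo strip_td_except_colspan('<td a="b">foo bar<td x="y" colspan="2">'), "\n";
// <td>foo bar<td colspan="2">
```

With the original \S+ version, the same input collapses to a single <td colspan="2"> and the text "foo bar" is lost.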


                • jn

                  #9
                  Re: Regular Expression HELP!

                  "Pedro" <hexkid@hotpop.com> wrote in message
                  news:bo9lgp$1ask8u$1@ID-203069.news.uni-berlin.de...
                  > Maybe regexes aren't the best way to do this ... however I *had* to
                  > manage it. Here it is for your enjoyment:
                  >
                  > [snip]
                  >
                  > $regex = "<td$cs$ns$cs$ns$cs>"; # colspan can be immediately after td,
                  > # or in the middle of the
                  > # parameters or at the last position
                  >
                  > $newx = preg_replace("/$regex/i", '<td$1$2$3>', $s);

                  That was interesting :)

                  What I'm really doing is pasting from Excel into an "HTML Area" (
                  www.interactivetools.com). It's like a text area, but it's a little WYSIWYG
                  editor for content management systems. I'm stripping out all of the style
                  garbage Excel puts in its code, and replacing it with cleaned code. It works
                  great now, but I can't get it to preserve colspans because those get
                  stripped too.

                  I'll try some more things. Maybe I'll get it to work :)

                  Thanks guys



                  • R. Rajesh Jeba Anbiah

                    #10
                    Re: Regular Expression HELP!

                    "jn" <jsumner1@cfl.rr.com> wrote in message news:<TU7qb.164506$ox6.2297683@twister.tampabay.rr.com>...
                    > What I'm really doing is pasting from Excel into an "HTML Area" (
                    > www.interactivetools.com). It's like a text area, but it's a little WYSIWYG
                    > editor for content management systems. I'm stripping out all of the style
                    > garbage Excel puts in its code, and replacing it with cleaned code. It works
                    > great now, but I can't get it to preserve colspans because those get
                    > stripped too.
                    >
                    > I'll try some more things. Maybe I'll get it to work :)

                    Try http://weitz.de/regex-coach

                    ---
                    "Learn from yesterday, live for today, hope for tomorrow. The
                    important thing is to not stop questioning."---Albert Einstein
                    Email: rrjanbiah-at-Y!com
