regex for stripping HTML

  • Michael Vilain

    regex for stripping HTML

    Originally, I was using

    $value =~ s/<.*>//g;

    to strip HTML tags from a variable. It actually stripped everything
    from the first "<" to the last ">" after the ending tag. I found this
    regex in this group:

    $value =~ s/\<[^\<]+\>//g;

    and I'm trying to parse it out and figure out why it works. First off,
    some questions:

    - why escape the "<"? It's not one of the meta characters that has
    special meaning in a regex.

    - what's the difference between using ".*" to match any string and "+"
    to match a repeat of the character class "[^\<]".

    Just trying to deepen my understanding of regex. It's like whitewash --
    it gets more opaque with multiple coats.

    TIA,

    /MeV/

    --
    DeeDee, don't press that button! DeeDee! NO! Dee...



  • Koncept

    #2
    Re: regex for stripping HTML

    In article <vilain-8A69E5.10474828102003@comcast.ash.giganews.com>,
    Michael Vilain <vilain@spamcop.net> wrote:

    > Originally, I was using
    >
    > $value =~ s/<.*>//g;
    >
    > to strip HTML tags from a variable. It actually stripped everything
    > from the first "<" to the last ">" after the ending tag. I found this
    > regex in this group:
    >
    > $value =~ s/\<[^\<]+\>//g;
    >
    > and I'm trying to parse it out and figure out why it works. First off,
    > some questions:
    >
    > - why escape the "<"? It's not one of the meta characters that has
    > special meaning in a regex.
    >
    > - what's the difference between using ".*" to match any string and "+"
    > to match a repeat of the character class "[^\<]".
    >
    > Just trying to deepen my understanding of regex. It's like whitewash --
    > it gets more opaque with multiple coats.
    >
    > TIA,
    >
    > /MeV/

    Hello. This is what perldoc returns for the following terminal query:

    $ perldoc -q html

    Here's one "simple-minded" approach, that works for most files:

    #!/usr/bin/perl -p0777
    s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

    If you want a more complete solution, see the 3-stage striphtml
    program in
    http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .
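
For readers who want to try the FAQ pattern from inside a script rather than as a `-p0777` one-liner, here is a minimal sketch. The helper name `strip_tags` and the sample markup are made up for illustration; only the substitution itself comes from the FAQ text quoted above.

```perl
use strict;
use warnings;

# strip_tags is a hypothetical helper name; the pattern is the FAQ's.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $html;
}

# Made-up sample: the first tag spans two lines, which is why the FAQ
# one-liner slurps the whole file (-0777) instead of going line by line.
# The second tag has a ">" inside a quoted attribute value.
my $page = "<p\n   class=\"x\">Hello, <a href='a>b.html'>world</a>!</p>\n";

print strip_tags($page);
```

Because the whole input is held in one string, a tag that wraps across lines is still seen as a single match.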
    --
    Koncept <<
    "Contrary to popular belief, the most dangerous animal is not the lion or
    tiger or even the elephant. The most dangerous animal is a shark riding
    on an elephant, just trampling and eating everything they see." - Jack Handey


    • Eric J. Roode

      #3
      Re: regex for stripping HTML

      -----BEGIN PGP SIGNED MESSAGE-----
      Hash: SHA1

      "Michael Vilain <vilain@spamcop.net>" wrote in
      news:vilain-8A69E5.10474828102003@comcast.ash.giganews.com:

      > Originally, I was using
      >
      > $value =~ s/<.*>//g;
      >
      > to strip HTML tags from a variable. It actually stripped everything
      > from the first "<" to the last ">" after the ending tag. I found this
      > regex in this group:
      >
      > $value =~ s/\<[^\<]+\>//g;
      >
      > and I'm trying to parse it out and figure out why it works. First off,
      > some questions:
      >
      > - why escape the "<"? It's not one of the meta characters that has
      > special meaning in a regex.
      >
      > - what's the difference between using ".*" to match any string and "+"
      > to match a repeat of the character class "[^\<]".
      >
      > Just trying to deepen my understanding of regex. It's like whitewash --
      > it gets more opaque with multiple coats.

      Nah, it's not that hard. There's a learning curve, sure, but you'll get
      to the top of it in time.

      First, you are correct about the "<" -- no need to escape it; whoever did
      it wasn't thinking.

      Second, it helps to translate the regex sub-expressions into English
      (assuming English is your native tongue):

      <.*> means: Match a less-than character, followed by as many
      characters as possible, followed by a greater-than character.

      <[^>]+> means: Match a less-than character, followed by as many non-
      greater-than characters as possible, followed by a greater-than
      character.

      See the difference? . matches ANY character; [^>] matches only non-">"
      characters.
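
The difference is easy to see by running the patterns side by side. This is a small sketch on a made-up sample string; the third substitution uses the `[^<]` class from the original question rather than the `[^>]` class explained above, to show that it gets the same result on this input by a slightly different route.

```perl
use strict;
use warnings;

# Made-up sample input for illustration.
my $html = '<b>bold</b> and <i>italic</i>';

# Greedy dot: .* runs to the end of the string, then backs up to the
# LAST ">", so everything from the first "<" onward is removed.
(my $greedy = $html) =~ s/<.*>//g;

# Negated class: [^>]+ cannot cross a ">", so each tag matches on its own.
(my $negated = $html) =~ s/<[^>]+>//g;

# Pattern from the original question: [^<]+ cannot cross the next "<",
# so the match has to end at a ">" that comes before it.
(my $original = $html) =~ s/\<[^\<]+\>//g;

print "greedy:   '$greedy'\n";
print "negated:  '$negated'\n";
print "original: '$original'\n";
```

On this input the greedy version leaves an empty string, while the other two leave `bold and italic`.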


      Note that it is not possible in general to process HTML via regular
      expressions (at least, not simple regexes). Consider the following
      snippet of valid HTML:

      <img src="foo.jpg" alt='<<<"cool!" >>>' />
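
As a rough illustration of why that snippet is awkward, the sketch below runs two patterns over it: the simple negated-class pattern, and the "simple-minded" FAQ pattern quoted earlier in this thread. The surrounding words `before` and `after` are made up so the leftovers are visible.

```perl
use strict;
use warnings;

# The snippet above, wrapped in made-up surrounding text.
my $html = q{before <img src="foo.jpg" alt='<<<"cool!" >>>' /> after};

# Simple pattern: stops at the first ">" it sees, which sits inside
# the quoted alt text, so part of the tag is left behind.
(my $simple = $html) =~ s/<[^>]+>//g;

# FAQ pattern from earlier in the thread: steps over quoted attribute
# values as a unit, so the ">" characters inside them are ignored.
(my $faq = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "simple: $simple\n";
print "faq:    $faq\n";
```

The FAQ pattern copes with this particular tag, but the FAQ itself only claims it works for most files, so it should not be read as a general HTML parser.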

      - --
      Eric
      $_ = reverse sort $ /. r , qw p ekca lre uJ reh
      ts p , map $ _. $ " , qw e p h tona e and print

      -----BEGIN PGP SIGNATURE-----
      Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

      iQA/AwUBP59EVWPeouI eTNHoEQJRGQCguz B4DdBzsa/9dmTMRm4ExzMmxB UAoIIq
      bHd4Hbx8MdXgkJm 3sWoUu0K1
      =ADWR
      -----END PGP SIGNATURE-----


      • DOV LEVENGLICK

        #4
        Re: regex for stripping HTML

        you have to escape < because it can be used as a search delimiter

        "Michael Vilain " wrote:

        >Originally, I was using
        >
        > $value =~ s/<.*>//g;
        >
        >to strip HTML tags from a variable. It actually stripped everything
        >from the first "<" to the last ">" after the ending tag. I found this
        >regex in this group:
        >
        > $value =~ s/\<[^\<]+\>//g;
        >
        >and I'm trying to parse it out and figure out why it works. First off,
        >some questions:
        >
        >- why escape the "<"? It's not one of the meta characters that has
        >special meaning in a regex.
        >
        >- what's the difference between using ".*" to match any string and "+"
        >to match a repeat of the character class "[^\<]".
        >
        >Just trying to deepen my understanding of regex. It's like whitewash --
        >it gets more opaque with multiple coats.
        >
        >TIA,
        >
        >/MeV/

        --
        Regards,
        Dov Levenglick




        • Anno Siegel

          #5
          Re: regex for stripping HTML

          DOV LEVENGLICK <Dov.Levenglick@motorola.com> wrote in comp.lang.perl.misc:
          > "Michael Vilain " wrote:

          [DOV's top-posting re-arranged]

          > > $value =~ s/\<[^\<]+\>//g;
          > >
          > >and I'm trying to parse it out and figure out why it works. First off,
          > >some questions:
          > >
          > >- why escape the "<"? It's not one of the meta characters that has
          > >special meaning in a regex.
          >
          > you have to escape < because it can be used as a search delimiter

          This is nonsense. What are you talking about? And don't top-post.

          Anno
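
To make the escaping point concrete, here is a short sketch on a made-up sample string. With `/` as the delimiter, the backslashes before `<` and `>` change nothing; the character that does need escaping inside the pattern is the delimiter itself.

```perl
use strict;
use warnings;

# Made-up sample input for illustration.
my $text = 'a <b>c</b> d';

# With "/" as the delimiter, "<" and ">" are ordinary characters, so the
# escaped and unescaped patterns behave identically.
(my $escaped   = $text) =~ s/\<[^>]+\>//g;
(my $unescaped = $text) =~ s/<[^>]+>//g;

# What needs a backslash is the delimiter, here the "/" in "</b>".
(my $closing_slash = $text) =~ s/<\/b>//g;

# Choosing a different delimiter pair avoids that backslash.
(my $closing_brace = $text) =~ s{</b>}{}g;

print "escaped:       '$escaped'\n";
print "unescaped:     '$unescaped'\n";
print "closing_slash: '$closing_slash'\n";
print "closing_brace: '$closing_brace'\n";
```

The first two results are the same string, as are the last two.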
