regex for stripping HTML

**Koncept** · Jul 19 '05, 04:47 AM

Re: regex for stripping HTML

In article <vilain-8A69E5.10474828102003@comcast.ash.giganews.com>,
Michael Vilain <vilain@spamcop.net> wrote:
> Originally, I was using
>
> $value =~ s/<.*>//g;
>
> to strip HTML tags from a variable. It actually stripped everything
> from the first "<" to the last ">" after the ending tag. I found this
> regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
> and I'm trying to parse it out and figure out why it works. First off,
> some questions:
>
> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.
>
> - what's the difference between using ".*" to match any string and "+"
> to match a repeat of the character class "[^\<]".
>
> Just trying to deepen my understanding of regex. It's like whitewash --
> it gets more opaque with multiple coats.
>
> TIA,
>
> /MeV/

Hello. This is from the Terminal Query:

$ perldoc -q html

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs

If you want a more complete solution, see the 3-stage striphtml
program in http://www.cpan.org/authors/Tom_Christiansen/scripts/striphtml.gz .

--
Koncept <<
"Contrary to popular belief, the most dangerous animal is not the lion or
tiger or even the elephant. The most dangerous animal is a shark riding
on an elephant, just trampling and eating everything they see." - Jack Handey

**Eric J. Roode** · Jul 19 '05, 04:47 AM

Re: regex for stripping HTML

"Michael Vilain <vilain@spamcop.net>" wrote in
news:vilain-8A69E5.10474828102003@comcast.ash.giganews.com:
> Originally, I was using
>
> $value =~ s/<.*>//g;
>
> to strip HTML tags from a variable. It actually stripped everything
> from the first "<" to the last ">" after the ending tag. I found this
> regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
> and I'm trying to parse it out and figure out why it works. First off,
> some questions:
>
> - why escape the "<"? It's not one of the meta characters that has
> special meaning in a regex.
>
> - what's the difference between using ".*" to match any string and "+"
> to match a repeat of the character class "[^\<]".
>
> Just trying to deepen my understanding of regex. It's like whitewash --
> it gets more opaque with multiple coats.

Nah, it's not that hard. There's a learning curve, sure, but you'll get
to the top of it in time.

First, you are correct about the "<" -- no need to escape it; whoever did
it wasn't thinking.
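
If you want to check that for yourself, here is a minimal sketch. The sub names are made up for illustration. In Perl, a backslash in front of a non-alphanumeric character just means that literal character, so both versions should behave the same:

```perl
use strict;
use warnings;

# Hypothetical helpers: the pattern as posted, and the same pattern
# with the unnecessary backslashes removed.
sub strip_escaped   { my ($s) = @_; $s =~ s/\<[^\<]+\>//g; return $s; }
sub strip_unescaped { my ($s) = @_; $s =~ s/<[^<]+>//g;    return $s; }

my $html = '<em>hi</em> there';

print "escaped:   '", strip_escaped($html),   "'\n";   # 'hi there'
print "unescaped: '", strip_unescaped($html), "'\n";   # 'hi there'
```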

Second, it helps to translate the regex sub-expressions into English
(assuming English is your native tongue):

<.*> means: Match a less-than character, followed by as many
characters as possible, followed by a greater-than character.

<[^>]+> means: Match a less-than character, followed by as many non-
greater-than characters as possible, followed by a greater-than
character.

See the difference? . matches ANY character; [^>] matches only non-">"
characters.
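
A small sketch may make the difference concrete. The sub names and sample strings are made up for illustration. Note that the pattern quoted in the question uses [^<] rather than [^>]; the two should agree on simple markup but can differ when a bare ">" appears in the text, so all three are shown:

```perl
use strict;
use warnings;

# Hypothetical helper subs, one per pattern discussed in this thread.
sub strip_dot_star { my ($s) = @_; $s =~ s/<.*>//g;    return $s; }  # greedy "any character"
sub strip_not_lt   { my ($s) = @_; $s =~ s/<[^<]+>//g; return $s; }  # as quoted in the question
sub strip_not_gt   { my ($s) = @_; $s =~ s/<[^>]+>//g; return $s; }  # as described above

for my $html ('<b>bold</b> and <i>italic</i>', '<p>1 > 0</p>') {
    print "input: $html\n";
    print "  <.*>     gives '", strip_dot_star($html), "'\n";
    print "  <[^<]+>  gives '", strip_not_lt($html),   "'\n";
    print "  <[^>]+>  gives '", strip_not_gt($html),   "'\n";
}
```

On the first input, <.*> swallows the whole line while the other two leave 'bold and italic'. On the second, <[^<]+> also eats the "1 >" and leaves ' 0', while <[^>]+> leaves '1 > 0'.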

Note that it is not possible in general to process HTML via regular
expressions (at least, not simple regexes). Consider the following
snippet of valid HTML:

<img src="foo.jpg" alt='<<<"cool!" >>>' />
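
To see what goes wrong, here is a rough sketch that runs the simple negated-class pattern and the quote-aware pattern from perlfaq (perldoc -q html) over that snippet. The sub names and the surrounding "A ... B" text are assumptions added for illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers: the simple negated-class pattern and the
# quote-aware pattern from perlfaq.
sub strip_simple  { my ($s) = @_; $s =~ s/<[^>]+>//g; return $s; }
sub strip_perlfaq { my ($s) = @_; $s =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs; return $s; }

my $html = q{A <img src="foo.jpg" alt='<<<"cool!" >>>' /> B};

print "simple:  ", strip_simple($html),  "\n";   # A >>' /> B
print "perlfaq: ", strip_perlfaq($html), "\n";   # A  B
```

The simple pattern stops at the first ">" inside the alt attribute and leaves debris behind. The perlfaq pattern copes with this tag because it skips over quoted strings, but it is still only a heuristic; comments, scripts and the like need a real HTML parser.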

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

**DOV LEVENGLICK** · Jul 19 '05, 04:47 AM

Re: regex for stripping HTML

you have to escape < because it can be used as a search delimiter

"Michael Vilain " wrote:
>Originally, I was using
>
> $value =~ s/<.*>//g;
>
>to strip HTML tags from a variable. It actually stripped everything
>from the first "<" to the last ">" after the ending tag. I found this
>regex in this group:
>
> $value =~ s/\<[^\<]+\>//g;
>
>and I'm trying to parse it out and figure out why it works. First off,
>some questions:
>
>- why escape the "<"? It's not one of the meta characters that has
>special meaning in a regex.
>
>- what's the difference between using ".*" to match any string and "+"
>to match a repeat of the character class "[^\<]".
>
>Just trying to deepen my understanding of regex. It's like whitewash --
>it gets more opaque with multiple coats.
>
>TIA,
>
>/MeV/

--
Regards,
Dov Levenglick

**Anno Siegel** · Jul 19 '05, 04:47 AM

Re: regex for stripping HTML

DOV LEVENGLICK <Dov.Levenglick@motorola.com> wrote in comp.lang.perl.misc:
> "Michael Vilain " wrote:

[DOV's top-posting re-arranged]
> > $value =~ s/\<[^\<]+\>//g;
> >
> >and I'm trying to parse it out and figure out why it works. First off,
> >some questions:
> >
> >- why escape the "<"? It's not one of the meta characters that has
> >special meaning in a regex.
>
> you have to escape < because it can be used as a search delimiter

This is nonsense. What are you talking about? And don't top-post.

Anno
