need help with re module

**Matimus** · Jun 20 '07, 06:05 PM

Re: need help with re module

On Jun 20, 9:58 am, linuxprog <linuxp...@gmail.com> wrote:

> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to
> extract all the text, without html tags, the result should be
> something like that: helloworldok
>
> i have tried that:
>
> from re import findall
>
> chaine = """<html>hello</a>world<anytag>ok"""
>
> print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>
> ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct! what would be the correct regex to use?

This: [^(<.*>)] is a set that matches any single character except
"(", "<", ".", "*", ">" and ")". Inside square brackets those
characters are literals, not operators, so it most certainly doesn't
do what you want it to. Is it absolutely necessary that you use a
regular expression? There are a few HTML parsing libraries out there.
The easiest approach using re might be to do a search and replace on
all tags: just replace the tags with nothing.
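
To see what that set really matches, here is a small sketch. It uses Python 3 syntax (print is a function), and the names `not_in_set` and `joined` are made up for illustration only:

```python
import re

chaine = "<html>hello</a>world<anytag>ok"

# Inside [...] the characters ( < . * > ) are plain literals, so this
# class matches any single character that is NOT one of those six.
not_in_set = re.findall(r"[^(<.*>)]", chaine)

# Joining the matches shows that only the brackets were skipped;
# the tag names themselves are still there.
joined = "".join(not_in_set)
print(joined)  # htmlhello/aworldanytagok
```

So the class only skips the angle brackets one character at a time. It never treats "<...>" as a unit, which is why tag names end up in the results.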

Matt

**Matimus** · Jun 20 '07, 08:35 PM

Re: need help with re module

Here is an example:

>>> import re
>>> s = "<html>Hello</a>world<anytag>ok"
>>> matchtags = re.compile(r"<[^>]+>")
>>> matchtags.findall(s)
['<html>', '</a>', '<anytag>']
>>> matchtags.sub('', s)
'Helloworldok'

I probably shouldn't have shown you that. It may not work for all
HTML (a ">" inside a quoted attribute value, for instance, ends the
match too early), and you should probably be looking at something
like BeautifulSoup.
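
For comparison, here is a minimal parser-based sketch. BeautifulSoup is a third-party package, so this swaps in the standard library's `html.parser` module instead. It is not BeautifulSoup's API, and the `TextExtractor` class and `strip_tags` function are hypothetical names invented for this example. Python 3 syntax:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text found between tags (illustrative helper only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


def strip_tags(markup):
    # Hypothetical convenience wrapper: feed the markup, return the text.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<html>hello</a>world<anytag>ok"))  # helloworldok

# A ">" inside a quoted attribute trips up the regex but not the parser.
tricky = '<a title="x>y">link</a>'
print(re.sub(r"<[^>]+>", "", tricky))  # y">link
print(strip_tags(tricky))              # link
```

This only gathers text nodes. It makes no attempt to skip script or style content, so treat it as a starting point rather than a complete solution.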

Matt
