need help with re module

  • linuxprog

    need help with re module

    Hello,

    I have the string "<html>hello</a>world<anytag>ok" and I want to
    extract all the text, without the HTML tags. The result should be
    something like: helloworldok

    I have tried this:

    from re import findall

    chaine = """<html>hello</a>world<anytag>ok"""

    print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
    >>['html', 'hell', 'worl', 'anyt', 'ag>o']
    The result is not correct! What would be the correct regex to use?



  • Matimus

    #2
    Re: need help with re module

    On Jun 20, 9:58 am, linuxprog <linuxp...@gmail.com> wrote:
    > Hello,
    >
    > I have the string "<html>hello</a>world<anytag>ok" and I want to
    > extract all the text, without the HTML tags. The result should be
    > something like: helloworldok
    >
    > I have tried this:
    >
    > from re import findall
    >
    > chaine = """<html>hello</a>world<anytag>ok"""
    >
    > print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
    >
    > >>['html', 'hell', 'worl', 'anyt', 'ag>o']
    >
    > The result is not correct! What would be the correct regex to use?

    This: [^(<.*>)] is a set that matches any single character except
    "(", "<", ".", "*", ">" and ")". It most certainly doesn't do what you
    want it to. Is it absolutely necessary that you use a regular
    expression? There are a few HTML parsing libraries out there. The
    easiest approach using re might be to do a search and replace on all
    tags: just replace each tag with nothing.

    Matt
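
A quick way to check the point about the bracket expression is to test it against single characters. This is only a small sketch in Python 3 (the thread itself uses Python 2 syntax), and the names NEGATED_SET and matches_set are made up for the illustration:

```python
import re

# The bracket expression from the question, compiled on its own.
NEGATED_SET = re.compile(r"[^(<.*>)]")


def matches_set(ch):
    """Return True if the single character ch is matched by the set."""
    return NEGATED_SET.fullmatch(ch) is not None


# Characters the set refuses to match: exactly the six listed inside it.
excluded = [ch for ch in "(<.*>)abc /" if not matches_set(ch)]
print(excluded)  # ['(', '<', '.', '*', '>', ')']
```

Inside square brackets, "(", ".", "*" and ")" lose their special meaning, so the expression never describes "a tag"; it only describes one character that is not one of those six.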


    • Matimus

      #3
      Re: need help with re module

      Here is an example:
      >>> import re
      >>> s = "<html>Hello</a>world<anytag>ok"
      >>> matchtags = re.compile(r"<[^>]+>")
      >>> matchtags.findall(s)
      ['<html>', '</a>', '<anytag>']
      >>> matchtags.sub('', s)
      'Helloworldok'

      I probably shouldn't have shown you that. It may not work for all
      HTML, and you should probably be looking at something like
      BeautifulSoup.

      Matt
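
For readers who want the parser route without installing anything, here is a rough sketch using the standard library's html.parser module in Python 3. It is a stand-in for the BeautifulSoup suggestion, not BeautifulSoup itself, and the names TextExtractor and strip_tags are hypothetical helpers invented for this example:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text found between tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside of tags.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags dropped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<html>Hello</a>world<anytag>ok"))  # Helloworldok

# One case where the simple regex goes wrong: a ">" inside an attribute value.
tricky = '<a title="x > y">link</a>'
print(re.sub(r"<[^>]+>", "", tricky))  # leaves part of the attribute behind
print(strip_tags(tricky))              # link
```

The regex treats the first ">" it sees as the end of the tag, which is one reason it may not work for all HTML; a real parser tracks quoted attribute values and so is not fooled by that input.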
