Remove HTML tags (except anchor tag) from a string using regular expressions

  • Nico Grubert

    Remove HTML tags (except anchor tag) from a string using regular expressions

    Hello,

    I want to remove all HTML tags from a string "content" except <a ...>xxx</a>.

    My script reads like this:

    ###
    import re
    content = re.sub('<([^!>]([^>]|\n)*)>', '', content)
    ###

    It works fine. It removes all HTML tags from "content".
    Unfortunately, this also removes <a ...>xxx</a> occurrences.
    Any idea how to modify this to remove all HTML tags except <a ...>xxx</a>?

    Thanks in advance,
    Nico
  • Anand

    #2
    Re: Remove HTML tags (except anchor tag) from a string using regular expressions

    How about...

    import re
    content = re.sub('<([^!(a>)]([^(/a>)]|\n)*)>', '', content)
    Seems to work for me.

    HTH

    -Anand
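A quick way to see what the pattern in this post actually matches is to run it on a few tags. Inside square brackets, `(a>)` is not a group: `[^!(a>)]` is a set that excludes the single characters `!`, `(`, `a`, `>` and `)`. The sketch below uses Python 3 syntax; the `strip_tags` wrapper and the sample strings are made up for illustration. It suggests that anchors survive only because every tag that starts with `a`, or contains an `a` or `/` after its first character, survives.

```python
import re

# Pattern from the post above, unchanged (raw string, same regex).
PATTERN = r'<([^!(a>)]([^(/a>)]|\n)*)>'

def strip_tags(content):
    """Hypothetical helper: apply the posted pattern to a string."""
    return re.sub(PATTERN, '', content)

# Simple tags without an "a" or "/" inside are removed.
print(strip_tags('<b>bold</b>'))            # bold

# Anchor tags are kept ...
print(strip_tags('<a href="x">link</a>'))   # <a href="x">link</a>

# ... but so is every other tag that happens to contain an "a".
print(strip_tags('<table>'))                # <table>
print(strip_tags('<p class="x">text</p>'))  # <p class="x">text
```

So the pattern appears to work on small inputs, but it keeps far more than anchors.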


    • Anand

      #3
      Re: Remove HTML tags (except anchor tag) from a string using regular expressions

      I meant
      content = re.sub ('<[^!(a>)]([^>]|\n)*[^!(/a)]>', '', content)

      Sorry for the mistake.
      However, this also seems to leave tags like <b>, <p>, etc. in the output.

      -Anand
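One possible way to express "every tag except the anchor" in a single regular expression is a negative lookahead instead of a character class. The following is only a sketch under some assumptions: the names `NON_ANCHOR_TAG` and `strip_non_anchor_tags` are invented for this example, tags are assumed not to contain a literal `>` inside an attribute value, and, like the original pattern, anything starting with `<!` (comments, doctype) is left alone.

```python
import re

# "<", then NOT ("a" or "/a" followed by whitespace, "/" or ">"),
# then the rest of the tag up to the next ">".
# [^>]* also matches newlines, so tags spanning several lines are covered.
NON_ANCHOR_TAG = re.compile(r'<(?!/?a[\s/>])[^!>][^>]*>', re.IGNORECASE)

def strip_non_anchor_tags(content):
    """Remove all tags except <a ...> and </a> (hypothetical helper)."""
    return NON_ANCHOR_TAG.sub('', content)

html = '<p>See <a href="http://example.com">this</a> <b>page</b>.</p>'
print(strip_non_anchor_tags(html))
# See <a href="http://example.com">this</a> page.

# Tags that merely start with "a" are still removed.
print(strip_non_anchor_tags('<abbr>x</abbr>'))  # x
```

The lookahead checks the character after the `a`, which is what keeps `<a href=...>` while still removing `<abbr>` or `<address>`.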


      • Max M

        #4
        Re: Remove HTML tags (except anchor tag) from a string using regular expressions

        Nico Grubert wrote:

        If it's not to learn, and you simply want it to work, try out this library:

        http://zope.org/Members/chrisw/StripOGram/readme

        --

        hilsen/regards Max M, Denmark


        IT's Mad Science


        • Gabriel Cooper

          #5
          Re: Remove HTML tags (except anchor tag) from a string using regular expressions


          Max M wrote:
          > If it's not to learn, and you simply want it to work, try out this
          > library:
          >
          > http://zope.org/Members/chrisw/StripOGram/readme

          >>> stripogram.html2safehtml('''first > last''',valid_tags=('i','a','br'))
          'first > last'
          >>> stripogram.html2safehtml('''first < last''',valid_tags=('i','a','br'))
          'first first '


          Keeping in mind that bare ">" and "<" are invalid HTML (they should be &gt;
          and &lt;), why did it leave the greater-than sign, and why are there two "first"s?
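For comparison, here is a rough sketch of the same "keep only some tags" idea built on the standard library's `html.parser` (Python 3) instead of a regular expression. It is not StripOGram and does not claim to reproduce its behaviour; `TagFilter` and `keep_tags` are hypothetical names, and `valid_tags` is assumed to hold lowercase tag names. Text is re-escaped on output, so a bare `<` or `>` comes out as `&lt;` or `&gt;`.

```python
from html import escape
from html.parser import HTMLParser

class TagFilter(HTMLParser):
    """Keep whitelisted tags as written; drop all others; escape text."""

    def __init__(self, valid_tags=('a',)):
        super().__init__(convert_charrefs=True)
        self.valid_tags = set(valid_tags)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.valid_tags:
            # get_starttag_text() returns the start tag exactly as in the input.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, without a closing tag.
        if tag in self.valid_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in self.valid_tags:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        # Character references were already decoded; escape the text again.
        self.parts.append(escape(data, quote=False))

def keep_tags(content, valid_tags=('a',)):
    parser = TagFilter(valid_tags)
    parser.feed(content)
    parser.close()
    return ''.join(parser.parts)

print(keep_tags('<p>see <a href="x">this</a> &amp; <b>that</b></p>'))
# see <a href="x">this</a> &amp; that
print(keep_tags('first > last', valid_tags=('i', 'a', 'br')))
# first &gt; last
```

Because the parser tokenizes the input rather than pattern-matching it, tags split across lines or carrying unusual attributes are handled the same way as simple ones.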
