Simple allowing of HTML elements/attributes?

  • Leif K-Brooks

    Simple allowing of HTML elements/attributes?

    I'm writing a site with mod_python which will have, among other things,
    forums. I want to allow users to use some HTML (<em>, <strong>, <p>,
    etc.) on the forums, but I don't want to allow bad elements and
    attributes (onclick, <script>, etc.). I would also like to do basic
    validation (no overlapping elements like <strong><em>foo</strong></em>,
    no missing end tags). I'm not asking anyone to write a script for me,
    but does anyone have general ideas about how to do this quickly on an
    active forum?
  • David M. Cooke

    #2
    Re: Simple allowing of HTML elements/attributes?

    At some point, Leif K-Brooks <eurleif@ecritters.biz> wrote:
    > I'm writing a site with mod_python which will have, among other
    > things, forums. I want to allow users to use some HTML (<em>,
    > <strong>, <p>, etc.) on the forums, but I don't want to allow bad
    > elements and attributes (onclick, <script>, etc.). I would also like
    > to do basic validation (no overlapping elements like
    > <strong><em>foo</strong></em>, no missing end tags). I'm not asking
    > anyone to write a script for me, but does anyone have general ideas
    > about how to do this quickly on an active forum?

    You could require valid XML, and use a validating XML parser to
    check conformance. You'd have to make sure the output is correctly
    quoted (for instance, check that HTML tags in a CDATA block get quoted).
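
One way to read the suggestion above is to wrap each post in a single root element and let an XML parser reject anything that is not well-formed. The following is only a sketch, assuming Python 3; the function name `is_well_formed` and the `<post>` wrapper element are made up for illustration. The standard-library parser is non-validating, so this catches overlapping elements and missing end tags but says nothing about which elements are allowed.

```python
import xml.sax
import xml.sax.handler


def is_well_formed(fragment):
    """Return True if the fragment parses as well-formed XML.

    The <post> wrapper is a hypothetical root element, needed because a
    forum post usually has several top-level elements and bare text.
    """
    document = "<post>%s</post>" % fragment
    try:
        # A do-nothing ContentHandler: we only care whether parsing succeeds.
        xml.sax.parseString(document.encode("utf-8"),
                            xml.sax.handler.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


print(is_well_formed("<strong><em>foo</em></strong>"))  # properly nested
print(is_well_formed("<strong><em>foo</strong></em>"))  # overlapping
print(is_well_formed("<p>no end tag"))                  # missing </p>
```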

    --
    |>|\/|<
    /--------------------------------------------------------------------------\
    |David M. Cooke
    |cookedm(at)physics(dot)mcmaster(dot)ca

    • Graham Fawcett

      #3
      Re: Simple allowing of HTML elements/attributes?

      cookedm+news@physics.mcmaster.ca (David M. Cooke) wrote in message news:<qnklln9grv6.fsf@arbutus.physics.mcmaster.ca>...
      > At some point, Leif K-Brooks <eurleif@ecritters.biz> wrote:
      >
      > > I'm writing a site with mod_python which will have, among other
      > > things, forums. I want to allow users to use some HTML (<em>,
      > > <strong>, <p>, etc.) on the forums, but I don't want to allow bad
      > > elements and attributes (onclick, <script>, etc.). I would also like
      > > to do basic validation (no overlapping elements like
      > > <strong><em>foo</strong></em>, no missing end tags). I'm not asking
      > > anyone to write a script for me, but does anyone have general ideas
      > > about how to do this quickly on an active forum?
      >
      > You could require valid XML, and use a validating XML parser to
      > check conformance. You'd have to make sure the output is correctly
      > quoted (for instance, check that HTML tags in a CDATA block get quoted).

      You could use Tidy (or tidylib) to convert error-ridden input into
      valid HTML or XHTML, and then grab the BODY contents via an XML
      parser, as David suggested. I imagine that the library version of tidy
      is quick enough to meet your needs.

      Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
      XHTML. (Not something I'm familiar with, but someone must have done
      this before.)

      There's a Python wrapper for tidylib at
      http://utidylib.sourceforge.net/ .

      -- Graham

      • Alan Kennedy

        #4
        Re: Simple allowing of HTML elements/attributes?

        [Leif K-Brooks]
        >>> I'm writing a site with mod_python which will have, among other
        >>> things, forums. I want to allow users to use some HTML (<em>,
        >>> <strong>, <p>, etc.) on the forums, but I don't want to allow bad
        >>> elements and attributes (onclick, <script>, etc.). I would also like
        >>> to do basic validation (no overlapping elements like
        >>> <strong><em>foo</strong></em>, no missing end tags). I'm not asking
        >>> anyone to write a script for me, but does anyone have general ideas
        >>> about how to do this quickly on an active forum?

        "Quickly" being an important consideration for you, I'm presuming.

        (David M. Cooke)
        >> You could require valid XML, and use a validating XML parser to
        >> check conformance. You'd have to make sure the output is correctly
        >> quoted (for instance, check that HTML tags in a CDATA block get quoted).

        Hmmm, I'd imagine that the average forum user isn't going to know what
        well-formed XML is. Also, validating-XML support is one of the areas
        where Python is lacking. Lastly, wrapping HTML tags in a CDATA block
        won't deliver much benefit. You still have to send that HTML to the
        browser, which will probably render the contents of the CDATA block
        anyway.

        [Graham Fawcett]
        > You could use Tidy (or tidylib) to convert error-ridden input into
        > valid HTML or XHTML, and then grab the BODY contents via an XML
        > parser, as David suggested. I imagine that the library version of tidy
        > is quick enough to meet your needs.

        This is a good idea. Tidy is always a good way to get easily
        processable XML from badly-formed HTML. There are multiple ways to run
        Tidy from Python: use MAL's utidy library, use the command line
        executable and pipes, or in Jython use JTidy.
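
The "command line executable and pipes" route might look roughly like this. It is a sketch only, assuming Python 3 and an HTML Tidy binary named `tidy` on the PATH; the helper name `tidy_to_xhtml` and the `TIDY_ARGS` list are illustrative, not anything from the thread.

```python
import shutil
import subprocess

# Flags for the HTML Tidy command line: quiet, XHTML output, numeric
# entities (so an XML parser needs no DTD), UTF-8 in and out, and only
# the contents of <body>.
TIDY_ARGS = ["tidy", "-q", "-asxhtml", "-numeric", "-utf8",
             "--show-body-only", "yes"]


def tidy_to_xhtml(html, args=TIDY_ARGS):
    """Pipe user HTML through the tidy executable and return XHTML text."""
    if shutil.which(args[0]) is None:
        raise RuntimeError("tidy executable not found on PATH")
    proc = subprocess.run(args, input=html.encode("utf-8"),
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # Tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if proc.returncode >= 2:
        raise ValueError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")
```

With `--show-body-only` the result is a fragment, so it needs a single wrapping element before it is handed to an XML parser.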

        [Graham Fawcett]
        > Or maybe you could use XSLT to cut the "bad stuff" out of your tidied
        > XHTML. (Not something I'm familiar with, but someone must have done
        > this before.)

        However, this is not a good idea. XSLT requires an Object Model of the
        document, meaning that you're going to use a lot of cpu-time and
        memory. In extreme cases, e.g. where some black-hat attempts to upload
        a 20 Mbyte HTML file, you're opening yourself up to a
        Denial-Of-Service attack, when your server tries to build up a [D]OM
        of that document.
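
Whatever parser ends up being used, one cheap guard against the oversized upload described above is to feed the input in chunks and give up once a byte limit is passed. This is a sketch under assumptions: Python 3, a file-like `stream` of bytes, and a made-up `MAX_POST_BYTES` limit and `PostTooLarge` exception.

```python
import io
import xml.sax
import xml.sax.handler

MAX_POST_BYTES = 64 * 1024  # hypothetical limit; pick one that suits the forum
CHUNK_SIZE = 8192


class PostTooLarge(ValueError):
    """Raised when the input exceeds the configured byte limit."""


def parse_capped(stream, handler, limit=MAX_POST_BYTES):
    """Feed a byte stream to a SAX parser, aborting once limit is exceeded.

    Returns the number of bytes parsed.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    seen = 0
    while True:
        chunk = stream.read(CHUNK_SIZE)
        if not chunk:
            break
        seen += len(chunk)
        if seen > limit:
            raise PostTooLarge("post is larger than %d bytes" % limit)
        parser.feed(chunk)  # incremental: no tree of the document is built
    parser.close()
    return seen


print(parse_capped(io.BytesIO(b"<p>ok</p>"), xml.sax.handler.ContentHandler()))
```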

        The optimal solution, IMHO, is to tidy the HTML into XML, and then use
        SAX to filter out the stuff you don't want. Here is some code that
        does the latter. This should be nice and fast, and use a lot less
        memory than object-model based approaches.

        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
        import xml.sax
        import cStringIO as StringIO

        permittedElements = ['html', 'body', 'b', 'i', 'p']
        permittedAttrs = ['class', 'id', ]

        class cleaner(xml.sax.handler.ContentHandler):

            def __init__(self):
                xml.sax.handler.ContentHandler.__init__(self)
                self.outbuf = StringIO.StringIO()

            def startElement(self, elemname, attrs):
                if elemname in permittedElements:
                    attrstr = ""
                    for a in attrs.keys():
                        if a in permittedAttrs:
                            attrstr = "%s " % "%s='%s'" % (a, attrs[a])
                    self.outbuf.write("<%s%s>" % (elemname, attrstr))

            def endElement(self, elemname):
                if elemname in permittedElements:
                    self.outbuf.write("</%s>" % (elemname,))

            def characters(self, s):
                self.outbuf.write("%s" % (s,))

        testdoc = """
        <html>
        <body>
        <p>This paragraph contains <b>only</b> permitted elements.</p>
        <p>This paragraph contains <i
        onclick="javascript:pop('porno.htm')">disallowed
        attributes</i>.</p>
        <img src="http://www.blackhat.com/session_hijack.gif"/>
        <p>This paragraph contains
        <a href="http://www.jscript-attack.com/">a potential script
        attack</a></p>
        </body>
        </html>
        """

        if __name__ == "__main__":
            parser = xml.sax.make_parser()
            mycleaner = cleaner()
            parser.setContentHandler(mycleaner)
            parser.setFeature(xml.sax.handler.feature_namespaces, 0)
            parser.feed(testdoc)
            print mycleaner.outbuf.getvalue()
        #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

        Tidying the HTML to XML is left as an exercise to the reader ;-)

        HTH,

        --
        alan kennedy
        ------------------------------------------------------
        check http headers here: http://xhaus.com/headers
        email alan: http://xhaus.com/contact/alan

        • Alan Kennedy

          #5
          Re: Simple allowing of HTML elements/attributes?

          [Alan Kennedy]
          > The optimal solution, IMHO, is to tidy the HTML into XML, and then use
          > SAX to filter out the stuff you don't want. Here is some code that
          > does the latter. This should be nice and fast, and use a lot less
          > memory than object-model based approaches.

          Unfortunately, in my haste to post a demonstration of a technique
          earlier on, I posted running code that is both buggy and *INSECURE*.
          The following are problems with it:

          1. A bug in making up the attribute string results in loss of
          permitted attributes.

          2. The failure to escape character data (i.e. map '<' to '&lt;' and
          '>' to '&gt;') as it is written out gives rise to the possibility of a
          code injection attack. It's easy to circumvent the check for malicious
          code: I'll leave it to y'all to figure out how.

          3. I have a feeling that the failure to escape the attribute values
          also opens the possibility of a code injection attack. I'm not
          certain: it depends on the browser environment in which the final HTML
          is rendered.
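
Problems 2 and 3 can be reproduced in a few lines. This is an illustrative sketch, assuming Python 3; the `TextCollector` class and the sample strings are invented for the demonstration. The point is that the SAX parser decodes entity references before `characters()` sees them, so text written out unescaped turns back into markup, and an attribute value containing a quote can break out of hand-written quoting.

```python
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr


class TextCollector(xml.sax.handler.ContentHandler):
    """Hypothetical helper: records the text that characters() receives."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


collector = TextCollector()
# Problem 2: the user typed entity references, not tags, so no element
# check is triggered...
xml.sax.parseString(b"<p>&lt;script&gt;alert(1)&lt;/script&gt;</p>", collector)
text = "".join(collector.chunks)
print(text)          # ...but the parser has already decoded them
print(escape(text))  # escaping on output restores the harmless form

# Problem 3: a value containing a single quote escapes from '...' quoting.
value = "x' onmouseover='evil()"
print("class='%s'" % value)           # how the first version wrote it
print("class=%s" % quoteattr(value))  # how the corrected version writes it
```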

          Anyway, here's some updated code that closes the SECURITY HOLES in the
          earlier-posted version :-(

          #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
          import xml.sax
          from xml.sax.saxutils import escape, quoteattr
          import cStringIO as StringIO

          permittedElements = ['html', 'body', 'b', 'i', 'p']
          permittedAttrs = ['class', 'id', ]

          class cleaner(xml.sax.handler.ContentHandler):

              def __init__(self):
                  xml.sax.handler.ContentHandler.__init__(self)
                  self.outbuf = StringIO.StringIO()

              def startElement(self, elemname, attrs):
                  if elemname in permittedElements:
                      attrstr = ""
                      for a in attrs.keys():
                          if a in permittedAttrs:
                              attrstr = "%s%s" % (attrstr, " %s=%s" % (a,
                                  quoteattr(attrs[a])))
                      self.outbuf.write("<%s%s>" % (elemname, attrstr))

              def endElement(self, elemname):
                  if elemname in permittedElements:
                      self.outbuf.write("</%s>" % (elemname,))

              def characters(self, s):
                  self.outbuf.write("%s" % (escape(s),))

          testdoc = """
          <html>
          <body>
          <p class="1" id="2">This paragraph contains <b>only</b> permitted
          elements.</p>
          <p>This paragraph contains <i
          onclick="javascript:pop('porno.htm')">disallowed
          attributes</i>.</p>
          <img src="http://www.blackhat.com/session_hijack.gif"/>
          <p>This paragraph contains
          <script src="blackhat.js"/>a potential script
          attack</p>
          </body>
          </html>
          """

          if __name__ == "__main__":
              parser = xml.sax.make_parser()
              mycleaner = cleaner()
              parser.setContentHandler(mycleaner)
              parser.setFeature(xml.sax.handler.feature_namespaces, 0)
              parser.feed(testdoc)
              print mycleaner.outbuf.getvalue()
          #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
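
The code above targets Python 2 (`cStringIO`, the `print` statement). For readers on Python 3, a port of the same technique might look like the sketch below. The whitelists are the ones from the post; the names `Cleaner` and `clean`, and the decision to switch off external general entities, are assumptions added here. As in the original, text inside a dropped element is kept (escaped) rather than discarded.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.saxutils import escape, quoteattr

PERMITTED_ELEMENTS = {"html", "body", "b", "i", "p"}
PERMITTED_ATTRS = {"class", "id"}


class Cleaner(xml.sax.handler.ContentHandler):
    """Re-emit only whitelisted elements and attributes; escape all text."""

    def __init__(self):
        super().__init__()
        self.outbuf = io.StringIO()

    def startElement(self, name, attrs):
        if name in PERMITTED_ELEMENTS:
            kept = "".join(
                " %s=%s" % (a, quoteattr(attrs[a]))
                for a in attrs.getNames()
                if a in PERMITTED_ATTRS
            )
            self.outbuf.write("<%s%s>" % (name, kept))

    def endElement(self, name):
        if name in PERMITTED_ELEMENTS:
            self.outbuf.write("</%s>" % name)

    def characters(self, content):
        self.outbuf.write(escape(content))


def clean(xhtml):
    """Return xhtml (well-formed XML text) with disallowed markup removed."""
    handler = Cleaner()
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, False)
    # Assumption: untrusted input should never pull in external entities.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.setContentHandler(handler)
    parser.feed(xhtml)
    parser.close()
    return handler.outbuf.getvalue()


print(clean('<body><p class="1" onclick="x()">a &lt;b&gt; '
            '<script src="s.js"/>c</p><img src="x.gif"/></body>'))
```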

          regards,

          --
          alan kennedy
          ------------------------------------------------------
          check http headers here: http://xhaus.com/headers
          email alan: http://xhaus.com/contact/alan
