Validating XML/XHTML in email

This topic is closed.
  • Kenneth Porter

    Validating XML/XHTML in email

    I'm thinking it might be a good idea to use the "quality" of an XML/XHTML
    email's structure as a metric for spamminess. More errors are likely to
    imply spam. Does there exist a lightweight validator that can quickly
    produce a metric of how many errors exist in a message? Ideally this would
    be something I could invoke from a Perl process, perhaps over a pipe to a
    validation server (similar to the way ClamAV and SpamAssassin can be
    invoked).
  • Peter Flynn

    #2
    Re: Validating XML/XHTML in email

    Kenneth Porter wrote:
    > I'm thinking it might be a good idea to use the "quality" of an XML/XHTML
    > email's structure as a metric for spamminess. More errors are likely to
    > imply spam. Does there exist a lightweight validator that can quickly
    > produce a metric of how many errors exist in a message? Ideally this would
    > be something I could invoke from a Perl process, perhaps over a pipe to a
    > validation server (similar to the way ClamAV and SpamAssassin can be
    > invoked).

    onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l

    onsgmls is in the OpenSP package.
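
The pipeline above can also be driven from a script. The following is a minimal Python sketch, not part of the original post: it runs the same onsgmls command, merges stderr into stdout as `2>&1` does, and counts the lines containing `:E:`. The function names `count_error_lines` and `onsgmls_error_count` are hypothetical, and the default `xml.dcl` path is assumed to resolve on your system the same way it does in the shell command.

```python
import subprocess


def count_error_lines(diagnostics: str, marker: str = ":E:") -> int:
    """Count diagnostic lines containing the marker (like grep ':E:' | wc -l)."""
    return sum(1 for line in diagnostics.splitlines() if marker in line)


def onsgmls_error_count(path: str, declaration: str = "xml.dcl") -> int:
    """Run the onsgmls command from the post and count its error lines.

    Assumes onsgmls (from OpenSP) is on PATH and that `declaration`
    points at an XML SGML declaration file.
    """
    result = subprocess.run(
        ["onsgmls", "-wxml", "-s", "-E", "5000", declaration, path],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1 in the shell
        text=True,
        errors="replace",  # spam often contains bytes that do not decode
        check=False,  # a non-zero exit is expected when errors are found
    )
    return count_error_lines(result.stdout)
```

Calling `onsgmls_error_count("yourfile.xml")` should give the same number as the shell pipeline. The counting step is kept separate so it can be checked without OpenSP installed.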

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/

    • Kenneth Porter

      #3
      Re: Validating XML/XHTML in email

      Peter Flynn <peter.nosp@m.silmaril.ie> wrote in
      news:6mcaioFg2irhU1@mid.individual.net:
      > onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l
      >
      > onsgmls is in the OpenSP package.

      That sounds good. Now to see what's involved in incorporating that into a
      SpamAssassin plugin....

      • Kenneth Porter

        #4
        Re: Validating XML/XHTML in email

        Peter Flynn <peter.nosp@m.silmaril.ie> wrote in
        news:6mcaioFg2irhU1@mid.individual.net:
        > onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l
        >
        > onsgmls is in the OpenSP package.

        With that hint I found that "tidy -eq" gives a pretty good result. To
        normalize the score, I figure it makes sense to divide the resulting line
        count by the byte count of the input file.
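
The normalization described here might look like the following Python sketch, which is an illustration rather than the poster's actual code. It runs `tidy -e -q`, counts the diagnostic lines Tidy writes to stderr, and divides by the input size in bytes. The names `normalized_score` and `tidy_score` are hypothetical; skipping blank lines and returning 0.0 for empty input are assumptions made here, not something stated in the thread.

```python
import subprocess


def normalized_score(diagnostics: str, input_size_bytes: int) -> float:
    """Diagnostic lines per input byte; higher suggests messier markup."""
    if input_size_bytes <= 0:
        return 0.0  # assumption: treat an empty message as scoring zero
    # Assumption: blank lines are not counted as diagnostics.
    line_count = sum(1 for line in diagnostics.splitlines() if line.strip())
    return line_count / input_size_bytes


def tidy_score(html: bytes) -> float:
    """Feed a message body to `tidy -e -q` on stdin and score its diagnostics.

    Assumes the tidy executable is on PATH.
    """
    result = subprocess.run(
        ["tidy", "-e", "-q"],
        input=html,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,  # Tidy reports errors and warnings on stderr
        check=False,  # Tidy exits non-zero when it finds warnings or errors
    )
    diagnostics = result.stderr.decode("utf-8", errors="replace")
    return normalized_score(diagnostics, len(html))
```

Dividing by the byte count keeps a long, mostly clean message from scoring worse than a short, badly broken one. Whether the resulting ratio separates spam from legitimate mail would still need to be checked against a real corpus.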

        • Peter Flynn

          #5
          Re: Validating XML/XHTML in email

          Kenneth Porter wrote:
          > Peter Flynn <peter.nosp@m.silmaril.ie> wrote in
          > news:6mcaioFg2irhU1@mid.individual.net:
          >
          >> onsgmls -wxml -s -E 5000 xml.dcl yourfile.xml 2>&1 | grep ':E:' | wc -l
          >>
          >> onsgmls is in the OpenSP package.
          >
          > With that hint I found that "tidy -eq" gives a pretty good result. To
          > normalize the score, I figure it makes sense to divide the resulting line
          > count by the byte count of the input file.

          Ah. If it's only HTML you're handling, Tidy will be much easier to work
          with. OpenSP requires well-formed XML at least, which would mean running
          Tidy on the HTML first anyway.

          ///Peter
