spam classification breaker

  • Robin Becker

    spam classification breaker

    This article at the BBC reports on what appears to be a genetic
    algorithm or random search method for finding words that apparently fool
    Bayesian classifiers every time.

    http://news.bbc.co.uk/1/hi/technology/3458457.stm


    The author apparently had to include HTML reporting in the emails so
    that his mail client would report back automatically.

    Of course, if he'd used Python, the whole process of email generation and
    classification could have been done in a single process, which would
    probably make it easier to generate the magic words.

    Why Berkshire, Marriott, etc. should be allowed through is pretty strange
    :)
    --
    Robin Becker
  • Skip Montanaro

    #2
    RE: spam classification breaker

    >> This article at the BBC reports on what appears to be a genetic
    >> algorithm or random search method for finding words that apparently
    >> fool bayesian classifiers every time.
    >>
    >> http://news.bbc.co.uk/1/hi/technology/3458457.stm

    I noticed immediately that the author of the article used the term "ham" to
    refer to mail which was not spam. Even if SpamBayes dies an ignominious
    death in the future at the hands of some ruthless spammers, that will be our
    lasting legacy.

    Mr. Graham-Cumming could have avoided the overhead of sending himself 10,000
    mails by simply selecting words from his archived public presence on the
    net: web pages, Usenet posts or archived mailing list posts associated with
    his email address. I suspect his genetic algorithm would have been all but
    unnecessary. (Google for "John Graham-Cumming" for example.)

    This doesn't have to be a tedious process either. In the course of normal
    scumbag email harvesting, all the crawler has to do is select a few
    non-trivial words from the harvested page and associate them with the email
    address(es) on that page. After seeing the same email address a few times
    they would have a decent collection of hammy words for use in the "random
    words" block of later spam.

    Also, unlike the statement the author made:

    And, he said, this would have to be repeated for every person a spammer
    wanted to reach because they would all have a different list of key
    words.

    this wouldn't have to be done for all email addresses. Anything which
    increases the likelihood that a spam is opened will be seen as an
    improvement for the spammer. There's obviously no need for them to get a
    100% open rate on spam. If that was the case, they'd already all be out of
    business.

    These research types. They always do things in the hardest way possible...

    Skip


    • John Graham-Cumming

      #3
      Re: spam classification breaker

      Skip Montanaro <skip@pobox.com> wrote in message news:<mailman.1251.1076002061.12720.python-list@python.org>...
      > Mr. Graham-Cumming could have avoided the overhead of sending himself 10,000
      > mails by simply selecting words from his archived public presence on the
      > net: web pages, Usenet posts or archived mailing list posts associated with
      > his email address. I suspect his genetic algorithm would have been all but
      > unnecessary. (Google for "John Graham-Cumming" for example.)
      >
      > This doesn't have to be a tedious process either. In the course of normal
      > scumbag email harvesting, all the crawler has to do is select a few
      > non-trivial words from the harvested page and associate them with the email
      > address(es) on that page. After seeing the same email address a few times
      > they would have a decent collection of hammy words for use in the "random
      > words" block of later spam.

      Yes, I've tested this, and it's possible to find hammy words this
      way too. It wasn't as effective as the technique I pointed out, but
      it is practical: in my experiments I looked at the uncommon words
      found in the locus of my email address, and around 40% were pure
      ham!

      Another way would be to spider the web page associated with the domain
      in the email address. For example, to attack my address, spider
      www.jgc.org.

      All of this indicates that it should be possible to attack Bayesian
      filters with a variety of techniques that rely on the fact that they
      are naive (i.e. they'll accept a hammy word no matter where it
      appears).

      John.
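
A minimal sketch of the "naive" property described above, for illustration only. It is not SpamBayes or any real filter: the word table, the probabilities, and the names `WORD_SPAM_PROB`, `UNKNOWN_PROB` and `spam_score` are all hypothetical. It assumes each word carries an independent spam probability and that the filter simply multiplies them together, so a hammy word lowers the score wherever it appears.

```python
import math

# Hypothetical per-word spam probabilities, standing in for a trained filter.
WORD_SPAM_PROB = {
    "free": 0.90,
    "mortgage": 0.95,
    "viagra": 0.99,
    "python": 0.02,
    "berkshire": 0.05,
    "mailman": 0.03,
}
UNKNOWN_PROB = 0.5  # assumption: words never seen in training are neutral


def spam_score(text):
    """Combine per-word probabilities naively, treating words as independent."""
    log_spam = 0.0
    log_ham = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, UNKNOWN_PROB)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the two log-likelihoods back into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


pitch = "free mortgage viagra"
padded = pitch + " python berkshire mailman"

print(spam_score(pitch))   # close to 1: clearly spam
print(spam_score(padded))  # noticeably lower once hammy words are appended
```

With these made-up numbers the padded message drops below 0.5, and reordering the words makes no difference to the score, which is the weakness being described.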


      • Robin Becker

        #4
        Re: spam classification breaker

        In article <mailman.1245.1075996814.12720.python-list@python.org>, Tim
        Peters <tim.one@comcast.net> writes
        ...
        .....
        >tomatically.
        >
        >If I'm a spammer trying to get my pitches seen by you, and you're using a
        >personal Bayesian classifier, then I need to load my pitches with words that
        >are very hammy to you. If I don't have access to your personal training
        >data (if I do, I already own your machine ...), then I need to *deduce*
        >what's hammy to you. One way to do that is, as John Graham-Cumming noted
        >here, is for me to send you thousands of messages with different piles of
        >words, and note which ones did and didn't get caught by your filter. Then
        >I load my sales pitches with words from the ones that your filter didn't
        >reject, and avoid words from ones your filter did reject. In order to do
        >that, I have to know which messages you did and didn't look at. That's the
        >purpose of the HTML "web bug"/"web beacon"s in the thousands of test
        >messages. (If your email client renders HTML pages, including fetching
        >images off the net, a spammer can know when you've rendered their message,
        >by, e.g., embedding your email address as a parameter in a URL that fetches
        >a .jpg to display.)
        ..... are you asserting that spammers don't have access to the pdf that
        users are filtering? Each filter may be unique, but they can be biassed.
        --
        Robin Becker
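
The probing idea quoted above can be tried out entirely in one process, as suggested at the top of the thread. What follows is a toy simulation under stated assumptions, not the method from the BBC article: the victim's word table, the candidate list, and the names `VICTIM_WORD_PROB`, `passes_filter` and `probe` are hypothetical, the filter is a bare naive Bayes log-odds sum, and the pass/fail answer is read directly from the simulated filter rather than reported back by a web bug.

```python
import math
import random

# Hypothetical per-word spam probabilities for one recipient's filter.
# The simulated sender never reads this table, only pass/fail results.
VICTIM_WORD_PROB = {
    "python": 0.02, "berkshire": 0.05, "mailman": 0.05,
    "pills": 0.97, "loan": 0.95, "casino": 0.98,
}
PITCH = ["pills"]
CANDIDATES = ["python", "berkshire", "mailman", "loan", "casino",
              "table", "window", "orange"]


def passes_filter(words):
    """Return True if the toy filter scores the message as ham."""
    log_odds = 0.0
    for word in words:
        p = VICTIM_WORD_PROB.get(word, 0.5)  # unseen words are neutral
        log_odds += math.log(p / (1.0 - p))
    return log_odds < 0.0


def probe(num_messages=3000, words_per_message=3, seed=0):
    """Send test messages and rank candidate words by how often they got through."""
    rng = random.Random(seed)
    sent = {w: 0 for w in CANDIDATES}
    passed = {w: 0 for w in CANDIDATES}
    for _ in range(num_messages):
        extra = rng.sample(CANDIDATES, words_per_message)
        got_through = passes_filter(PITCH + extra)
        for w in extra:
            sent[w] += 1
            passed[w] += got_through
    rate = {w: passed[w] / sent[w] for w in CANDIDATES if sent[w]}
    return sorted(rate, key=rate.get, reverse=True)


ranking = probe()
print(ranking)  # words most often seen in messages that got through come first
```

In this setup the words the victim's filter treats as hammy rise to the top of the ranking purely from pass/fail feedback, without the sender ever seeing the training data.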


        • Robin Becker

          #5
          Re: spam classification breaker

          In article <mailman.1276.1076024613.12720.python-list@python.org>, Tim
          Peters <tim.one@comcast.net> writes
          >[Robin Becker]
          >> .... are you asserting that spammers don't have access to the pdf that
          >> users are filtering?
          >
          >Sorry, I couldn't make sense of that question.
          >
          >> Each filter may be unique, but they can be biassed. --
          >
          .....OK I guess I'm trying to get at the following hand-waving argument.
          Since most people agree about what is ham or spam there must be a
          general recognizer for each. My question, then, is whether it's
          possible to define a camouflage mechanism that turns ham into spam or
          vice versa. Most people reading a newspaper article would classify it as
          spam. If I insert a short ad v ert into the middle the quick
          scan process is gone, but I might be able if everything is
          set up correctly to get a forbidden word
          set into the text in plain si g ht even
          though it's specifically fo r bidden by your
          all singing and dancing B a yesian analyser. It is well known
          that word/space runs are very distracting which is why printers
          have long tried to eliminate them.
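
A small sketch of why the spaced-out words in the paragraph above would slip past a word-based filter, assuming the filter tokenizes on whitespace. The function names and the one-word `FORBIDDEN` set are hypothetical, and the whitespace-squeezing check is only a crude countermeasure that would also flag innocent text in practice.

```python
FORBIDDEN = {"forbidden"}  # hypothetical one-word blocklist


def tokens(text):
    """Split on whitespace, as a simple word-based tokenizer would."""
    return text.lower().split()


def flagged(text):
    """True if any whole token is on the blocklist."""
    return any(token in FORBIDDEN for token in tokens(text))


def flagged_after_joining(text):
    """Also look for blocklisted words once all whitespace is removed."""
    squeezed = "".join(text.lower().split())
    return any(word in squeezed for word in FORBIDDEN)


plain = "it's specifically forbidden by your analyser"
spaced = "it's specifically fo r bidden by your analyser"

print(flagged(plain))                 # the whole word is one token
print(flagged(spaced))                # the word is split into three tokens
print(flagged_after_joining(spaced))  # squeezing whitespace recovers it
```

A human reader still sees the word in the spaced version, while the tokenizer only sees "fo", "r" and "bidden".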

          I don't believe a small cost will kill all spam; every day I get large
          amounts of paper adverts, flyers, business cards etc etc. These have
          real cost, but presumably are sufficiently market oriented that they pay
          for themselves. Putting a cost on email will just reduce the volume of
          spam.
          --
          Robin Becker
