Is there a standard library for parsing emails that can cope with the
different way email clients quote?
AFAIK not - as unfortunately that's something the user can configure, and
thus no atrocity is unimaginable. Hard to write a module for that...
All you can try is to apply a heuristic like "if there are lines all
starting with a certain prefix that contains non-alphanumeric characters".
But then if the user configures to quote using
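A minimal sketch of the prefix heuristic described above, assuming a plain-text body. The name `guess_quote_prefix` and the `min_lines` threshold are made up for illustration; it simply counts leading runs of non-alphanumeric characters and reports the most common one, so it will be fooled by bullet lists and similar layouts.

```python
import re
from collections import Counter

# Leading run of characters that are neither word characters nor whitespace,
# e.g. ">", "|", ">>".
PREFIX_RE = re.compile(r"^([^\w\s]+)")

def guess_quote_prefix(lines, min_lines=2):
    """Return the most common non-alphanumeric line prefix, or None."""
    counts = Counter()
    for line in lines:
        match = PREFIX_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    prefix, seen = counts.most_common(1)[0]
    # Require the prefix on several lines before trusting it.
    return prefix if seen >= min_lines else None

body = """Thanks, that works.
| Did you try restarting?
| It helped last time.
See you tomorrow."""
print(guess_quote_prefix(body.splitlines()))
```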
Phillip B Oldham <phillip.oldham@gmail.com> writes:
> Is there a standard library for parsing emails that can cope with
> the different way email clients quote?
"Cope with" in what sense? i.e., what would the behaviour of such a
library be? What would it do?
Note also that it's not merely the mail client that does the quoting;
frequently the user composing the message will have a heavy hand in
how the quoted material appears.
--
 \     “Time flies like an arrow. Fruit flies like a banana.” -- Groucho |
  `\                                                                Marx |
_o__)                                                                    |
Ben Finney
On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
> What do you mean with "quote" here?
> 2. Prefix of quoted text like your text above in my mail
Basically, just be able to parse an email into its actual and "quoted"
parts - lines which have been prefixed to indent from a previous
email.
Most clients use ">" which is easy to check for, but I've seen some
which use "|" and some which *don't* quote at all. It's causing us
nightmares in parsing responses to system-generated emails. I was
hoping someone might've seen the problem previously and released some
code.
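For the simple cases mentioned here, a rough sketch might look like the following. It assumes the body is already plain text and that only ">" and "|" mark quoted lines; `split_quoted` is a hypothetical helper, not part of any library, and it does nothing for clients that don't quote at all.

```python
def split_quoted(body, prefixes=(">", "|")):
    """Split a plain-text body into (new_lines, quoted_lines)."""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # lstrip() tolerates clients that indent the quote marker.
        if line.lstrip().startswith(prefixes):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return new_lines, quoted_lines

reply = """Yes, please go ahead.
> Shall we close ticket 42?
| Earlier note from the system."""
new, quoted = split_quoted(reply)
print(new)
print(quoted)
```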
Both the email module and poplib ship with Python, and both are very well
documented in the official docs (with examples and all).
The email module is rather easy to use and really powerful, but you'll need to
handle for yourself the many ways email clients compose a message, and broken PHP
webmail clients that don't respect the RFCs (notably about encoding)...
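As a rough illustration of the email module doing the envelope and MIME work, here is a sketch that pulls out the text/plain body. It assumes the Python 3 API (`policy.default` and `get_body`, which are newer than this thread); the helper name `plain_text_body` and the sample message are invented for the example. Separating quoted text from new text is still up to you afterwards.

```python
import email
from email import policy

def plain_text_body(raw_message):
    """Return the text/plain body of a raw message string, or ''."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    # get_body() walks multipart messages and picks the preferred part.
    part = msg.get_body(preferencelist=("plain",))
    return part.get_content() if part is not None else ""

raw = (
    "From: user@example.com\n"
    "To: system@example.com\n"
    "Subject: Re: ticket\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Looks good.\n"
    "> Please confirm.\n"
)
print(plain_text_body(raw))
```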
On Jul 30, 3:11 pm, Phillip B Oldham <phillip.old...@gmail.com> wrote:
> On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
>
> > What do you mean with "quote" here?
> > 2. Prefix of quoted text like your text above in my mail
>
> Basically, just be able to parse an email into its actual and "quoted"
> parts - lines which have been prefixed to indent from a previous
> email.
>
> Most clients use ">" which is easy to check for, but I've seen some
> which use "|" and some which *don't* quote at all. It's causing us
> nightmares in parsing responses to system-generated emails. I was
> hoping someone might've seen the problem previously and released some
> code.
The problem is that sometimes lines might start with ">" for other
reasons, eg text copied from an interactive Python session, which
could occur in ... um ... _this_ newsgroup. :-)
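One way to reduce that particular false positive, sketched under the assumption that interactive-session lines begin with the usual ">>> " or "... " prompts. `is_probably_quoted` is a made-up name, and the check has an obvious blind spot: text quoted three levels deep as ">>> text" will be mistaken for a Python prompt.

```python
import re

# ">>>" or "..." followed by a space or end of line looks like a Python prompt.
PROMPT_RE = re.compile(r"^\s*(>>>|\.\.\.)( |$)")

def is_probably_quoted(line):
    """True for '>'-prefixed lines that do not look like a Python prompt."""
    if PROMPT_RE.match(line):
        return False
    return line.lstrip().startswith(">")

for sample in ["> quoted reply", ">>> print('hi')", "plain text", ">> nested quote"]:
    print(repr(sample), is_probably_quoted(sample))
```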
For parsing the mails I would recommend pyparsing.
Why? The email module is a great parser IMO.
>
He talks about parsing the *content*, not the email envelope and possible
mime-body.
Yes? I don't know what the OP wants to do with the content, but if it's just
filtering the lines beginning with a '>', pyparsing might be a bit
heavyweight.
On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
> Most clients use ">" which is easy to check for, but I've seen some
> which use "|" and some which *don't* quote at all. It's causing us
> nightmares in parsing responses to system-generated emails. I was hoping
> someone might've seen the problem previously and released some code.
My sympathies.
I've even seen clients that prefix new (unquoted) text with the quote
character ">".
Well, possibly it's not the mail client, but the user. Who knows?
I will sometimes quote text like this:
[quote]
Something quoted.
[end quote]
But I'm writing for a human audience, not for a program.
The simple answer is that you can catch 90% of cases by checking for ">",
and another 1% by checking for "|". If the email contains HTML, I have
found that quoted text is sometimes in another colour. As for the rest,
well, sometimes even human beings can't easily determine what's quoted
and what isn't. Good luck getting a program to do it.
On Thu, 31 Jul 2008 02:25:37 +0000, Steven D'Aprano wrote:
> On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
>
>> Most clients use ">" which is easy to check for, but I've seen some
>> which use "|" and some which *don't* quote at all. It's causing us
>> nightmares in parsing responses to system-generated emails. I was
>> hoping someone might've seen the problem previously and released some
>> code.
>
> My sympathies.
>
> I've even seen clients that prefix new (unquoted) text with the quote
> character ">".
Well, this is a new one I've never seen before: found on the python-dev
mailing list, somebody who (apparently) marks quoted text by inserting a
bare quote character on an otherwise empty line after each line of text,
similar to this:
I've even seen clients that prefix new (unquoted) text with the quote
>
character ">".
>
The user in question seems to be using gmail. I suspect a PEBCAK error.