Standard module for parsing emails?

  • Phillip B Oldham

    Standard module for parsing emails?

    Is there a standard library for parsing emails that can cope with the
    different way email clients quote?
  • Diez B. Roggisch

    #2
    Re: Standard module for parsing emails?

    Phillip B Oldham wrote:
    Is there a standard library for parsing emails that can cope with the
    different way email clients quote?
    AFAIK not - as unfortunately that's something the user can configure, and
    thus no atrocity is unimaginable. Hard to write a module for that...

    All you can try is to apply a heuristic like "if there are lines all
    starting with a certain prefix that contains non-alphanumeric characters".
    But then if the user configures to quote using

    XX

    you're doomed...
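The heuristic described above can be sketched roughly as follows. This is only an illustration, not an established library routine: the function name `guess_quote_prefix`, the regular expression, and the `min_lines` threshold are all assumptions made for the example. As predicted, an alphanumeric prefix such as "XX" defeats it.

```python
import re
from collections import Counter

# Assumption: a quote prefix is a run of 1-3 characters that are neither
# alphanumeric nor whitespace, at the start of a line, after at most
# three spaces of indentation.
PREFIX_RE = re.compile(r"^\s{0,3}([^\w\s]{1,3})")


def guess_quote_prefix(body, min_lines=2):
    """Return the most frequent leading punctuation prefix, or None.

    min_lines is a hypothetical threshold so that a single stray
    line (for example a "--" signature separator) is not mistaken
    for a quoting convention.
    """
    counts = Counter()
    for line in body.splitlines():
        match = PREFIX_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    if not counts:
        return None
    prefix, seen = counts.most_common(1)[0]
    return prefix if seen >= min_lines else None
```

A body quoted with "> " or "| " yields that character; a body quoted with "XX " yields None, because the prefix looks like ordinary text.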



    Diez


    • Ben Finney

      #3
      Re: Standard module for parsing emails?

      Phillip B Oldham <phillip.oldham@gmail.com> writes:
      Is there a standard library for parsing emails that can cope with
      the different way email clients quote?
      "Cope with" in what sense? i.e., what would the behaviour of such a
      library be? What would it do?

      Note also that it's not merely the mail client that does the quoting;
      frequently the user composing the message will have a heavy hand in
      how the quoted material appears.

      --
      \ “Time flies like an arrow. Fruit flies like a banana.” —Groucho |
      `\ Marx |
      _o__) |
      Ben Finney


      • Thomas Guettler

        #4
        Re: Standard module for parsing emails?

        Phillip B Oldham wrote:
        Is there a standard library for parsing emails that can cope with the
        different way email clients quote?
        What do you mean by "quote" here?
        1. Encode utf8/latin1 to ascii
        2. Prefix of quoted text like your text above in my mail

        Thomas


        --
        Thomas Guettler, http://www.thomas-guettler.de/
        E-Mail: guettli (*) thomas-guettler + de


        • Phillip B Oldham

          #5
          Re: Standard module for parsing emails?

          On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
          What do you mean by "quote" here?
          2. Prefix of quoted text like your text above in my mail
          Basically, just be able to parse an email into its actual and "quoted"
          parts - lines which have been prefixed to indent from a previous
          email.

          Most clients use ">" which is easy to check for, but I've seen some
          which use "|" and some which *don't* quote at all. It's causing us
          nightmares in parsing responses to system-generated emails. I was
          hoping someone might've seen the problem previously and released some
          code.
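For the easy cases described above (">" and "|" prefixes), a minimal sketch might look like the following. The function name `split_reply` and the `markers` parameter are hypothetical, chosen for illustration; replies that do not quote at all cannot be separated this way.

```python
def split_reply(body, markers=(">", "|")):
    """Split a plain-text body into (new_lines, quoted_lines).

    A line counts as quoted if, after leading whitespace is removed,
    it starts with one of the marker strings. This is a heuristic:
    it misfires on new text that happens to start with a marker.
    """
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        # str.startswith accepts a tuple of candidate prefixes.
        if line.lstrip().startswith(markers):
            quoted_lines.append(line)
        else:
            new_lines.append(line)
    return new_lines, quoted_lines
```

Usage: `new, quoted = split_reply(text)` and then process only `new` as the reply to the system-generated email.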


          • Phillip B Oldham

            #6
            Re: Standard module for parsing emails?

            If there isn't a standard library for parsing emails, is there one for
            connecting to a pop/imap resource and reading the mailbox?


            • Maric Michaud

              #7
              Re: Standard module for parsing emails?

              On Wednesday 30 July 2008 17:15:07, Phillip B Oldham wrote:
              If there isn't a standard library for parsing emails, is there one for
              connecting to a pop/imap resource and reading the mailbox?
              Both ship with Python: the email module and poplib. Both are very well
              documented in the official docs (with examples and all).

              The email module is rather easy to use and really powerful, but you'll need to
              handle yourself the many ways email clients compose a message, and broken PHP
              webmails that don't respect the RFCs (notably about encoding)...
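As a rough illustration of the email module, here is a sketch that pulls the subject and the plain-text body out of a raw message. It assumes the modern Python 3 API (`policy.default`, `get_body`, `get_content`), which is newer than this thread; the helper name `subject_and_text` is made up for the example.

```python
from email import message_from_string, policy


def subject_and_text(raw_message):
    """Return (subject, plain_text_body) for a raw RFC 5322 message.

    The body is None when the message has no text/plain part.
    Charset and transfer-encoding decoding are left to the email
    package; badly broken messages may still need manual handling.
    """
    msg = message_from_string(raw_message, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    text = part.get_content() if part is not None else None
    return msg["Subject"], text
```

The returned text is still the whole body, quoted material included; separating the quoted lines is a further step that the email module does not do.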

              --
              _____________

              Maric Michaud


              • Aspersieman

                #8
                Re: Standard module for parsing emails?

                Phillip B Oldham wrote:
                If there isn't a standard library for parsing emails, is there one for
                connecting to a pop/imap resource and reading the mailbox?
                The search [1] yielded these results:
                1) http://docs.python.org/lib/module-email.html
                2)
                Founded in 1997, DEVShed is the perfect place for web developers to learn, share their work, and build upon the ideas of others.


                I have used the email module very successfully.

                Also you can try the following to connect to mailboxes:
                1) poplib
                2) imaplib (for IMAP; smtplib only sends mail)
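A minimal sketch of reading a POP3 mailbox with poplib and handing each message to the email module. The helper names `fetch_all` and `read_mailbox`, and the choice of `POP3_SSL`, are assumptions for illustration; host, user, and password are placeholders supplied by the caller.

```python
import poplib
from email import message_from_bytes, policy


def fetch_all(conn):
    """Yield parsed messages from an open, authenticated POP3 connection."""
    count, _mailbox_size = conn.stat()
    # POP3 message numbers start at 1.
    for number in range(1, count + 1):
        _response, lines, _octets = conn.retr(number)
        # retr() returns the message as a list of byte lines.
        yield message_from_bytes(b"\r\n".join(lines), policy=policy.default)


def read_mailbox(host, user, password):
    """Connect over SSL, log in, and return all messages as a list."""
    conn = poplib.POP3_SSL(host)
    try:
        conn.user(user)
        conn.pass_(password)
        return list(fetch_all(conn))
    finally:
        conn.quit()
```

Keeping `fetch_all` separate from the connection setup makes it possible to exercise the parsing with a stand-in connection object instead of a live server.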

                For parsing the mails I would recommend pyparsing.


                [1]


                Regards

                Nicolaas

                --

                The three things to remember about Llamas:
                1) They are harmless
                2) They are deadly
                3) They are made of lava, and thus nice to cuddle.



                • Maric Michaud

                  #9
                  Re: Standard module for parsing emails?

                  On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
                  For parsing the mails I would recommend pyparsing.
                  Why? The email module is a great parser IMO.

                  --
                  _____________

                  Maric Michaud


                  • MRAB

                    #10
                    Re: Standard module for parsing emails?

                    On Jul 30, 3:11 pm, Phillip B Oldham <phillip.old...@gmail.com> wrote:
                    On Jul 30, 2:36 pm, Thomas Guettler <h...@tbz-pariv.de> wrote:
                    >
                    What do you mean by "quote" here?
                      2. Prefix of quoted text like your text above in my mail
                    >
                    Basically, just be able to parse an email into its actual and "quoted"
                    parts - lines which have been prefixed to indent from a previous
                    email.
                    >
                    Most clients use ">" which is easy to check for, but I've seen some
                    which use "|" and some which *don't* quote at all. It's causing us
                    nightmares in parsing responses to system-generated emails. I was
                    hoping someone might've seen the problem previously and released some
                    code.
                    The problem is that sometimes lines might start with ">" for other
                    reasons, eg text copied from an interactive Python session, which
                    could occur in ... um ... _this_ newsgroup. :-)
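One way to reduce that particular false positive is to special-case the interactive prompt. The sketch below is only a heuristic under a stated assumption: a line whose first non-blank characters are exactly ">>>" followed by whitespace or end of line is treated as a Python session, not a quote. It will misclassify a genuine three-level quote written as ">>> text". The name `looks_quoted` is made up for the example.

```python
import re

# Assumption: ">>>" followed by whitespace or end of line is an
# interactive Python prompt rather than nested quoting.
PROMPT_RE = re.compile(r"^\s*>>>(\s|$)")


def looks_quoted(line):
    """Return True if the line looks like '>'-quoted text."""
    if PROMPT_RE.match(line):
        return False
    return line.lstrip().startswith(">")
```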


                    • Diez B. Roggisch

                      #11
                      Re: Standard module for parsing emails?

                      Maric Michaud wrote:
                      On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
                      >For parsing the mails I would recommend pyparsing.
                      >
                      Why ? email module is a great parser IMO.
                      He talks about parsing the *content*, not the email envelope and possible
                      mime-body.

                      Diez


                      • Maric Michaud

                        #12
                        Re: Standard module for parsing emails?

                        On Wednesday 30 July 2008 19:25:31, Diez B. Roggisch wrote:
                        Maric Michaud wrote:
                        On Wednesday 30 July 2008 17:55:35, Aspersieman wrote:
                        For parsing the mails I would recommend pyparsing.
                        Why? The email module is a great parser IMO.
                        >
                        He talks about parsing the *content*, not the email envelope and possible
                        mime-body.
                        Yes? I don't know what the OP wants to do with the content, but if it's just
                        filtering the lines beginning with a '>', pyparsing might be a bit
                        overkill.

                        --
                        _____________

                        Maric Michaud


                        • Steven D'Aprano

                          #13
                          Re: Standard module for parsing emails?

                          On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
                          Most clients use ">" which is easy to check for, but I've seen some
                          which use "|" and some which *don't* quote at all. It's causing us
                          nightmares in parsing responses to system-generated emails. I was hoping
                          someone might've seen the problem previously and released some code.
                          My sympathies.

                          I've even seen clients that prefix new (unquoted) text with the quote
                          character ">".

                          Well, possibly it's not the mail client, but the user. Who knows?

                          I will sometimes quote text like this:

                          [quote]
                          Something quoted.
                          [end quote]

                          But I'm writing for a human audience, not for a program.

                          The simple answer is that you can catch 90% of cases by checking for ">",
                          and another 1% by checking for "|". If the email contains HTML, I have
                          found that quoted text is sometimes in another colour. As for the rest,
                          well, sometimes even human beings can't easily determine what's quoted
                          and what isn't. Good luck getting a program to do it.

                          (Percentages are plucked out of thin air. YMMV.)


                          --
                          Steven


                          • Steven D'Aprano

                            #14
                            Re: Standard module for parsing emails?

                            On Thu, 31 Jul 2008 02:25:37 +0000, Steven D'Aprano wrote:
                            On Wed, 30 Jul 2008 07:11:45 -0700, Phillip B Oldham wrote:
                            >
                            >Most clients use ">" which is easy to check for, but I've seen some
                            >which use "|" and some which *don't* quote at all. It's causing us
                            >nightmares in parsing responses to system-generated emails. I was
                            >hoping someone might've seen the problem previously and released some
                            >code.
                            >
                            My sympathies.
                            >
                            I've even seen clients that prefix new (unquoted) text with the quote
                            character ">".

                            Well, this is a new one I've never seen before: found on the python-dev
                            mailing list, somebody who (apparently) marks quoted text by inserting a
                            bare quote character on an otherwise empty line after each line of text,
                            similar to this:

                            I've even seen clients that prefix new (unquoted) text with the quote
                            >
                            character ">".
                            >
                            The user in question seems to be using gmail. I suspect a PEBCAK error.



                            --
                            Steven
