email header decoding fails

  • ZeeGeek

    email header decoding fails

    It seems that the decode_header function in email.Header fails when
    the string is in the following form,

    '=?gb2312?Q?=D0=C7=C8=FC?=(revised)'

    That is, when a non-encoded string follows the encoded string without
    any whitespace. In this case, the decode_header function treats the whole
    string as non-encoded. Is there a workaround for this problem?

    Thanks.
  • Gabriel Genellina

    #2
    Re: email header decoding fails

    On Thu, 10 Apr 2008 05:45:41 -0300, ZeeGeek <ZeeGeek@gmail.com> wrote:
    > On Apr 10, 4:31 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
    >> On Wed, 09 Apr 2008 23:12:00 -0300, ZeeGeek <ZeeG...@gmail.com> wrote:
    >>> It seems that the decode_header function in email.Header fails when
    >>> the string is in the following form,
    >>>
    >>> '=?gb2312?Q?=D0=C7=C8=FC?=(revised)'
    >> An 'encoded-word' that appears within a
    >> 'phrase' MUST be separated from any adjacent 'word', 'text' or
    >> 'special' by 'linear-white-space'.
    >
    > Thank you very much, Gabriel.

    The above just says "why" decode_header refuses to decode it, and why it's
    not a bug. But if you actually have to deal with those malformed headers,
    some heuristics may help. For example, if you *know* your mails typically
    specify gb2312 encoding, or iso-8859-1, you may look for things that look
    like the example above and "fix" it.
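
A minimal sketch of the kind of heuristic described above, added for illustration and not taken from the original post. It assumes the only charsets worth repairing are gb2312 and iso-8859-1, and it only handles the case from the example, where plain text directly follows an encoded word. The names KNOWN_CHARSETS and separate_trailing_text are hypothetical.

```python
import re

# Assumption: these are the charsets your mail typically declares.
KNOWN_CHARSETS = ('gb2312', 'iso-8859-1')

# An encoded word in one of the known charsets that is immediately
# followed by a non-whitespace character (the malformed case).
_GLUED_WORD = re.compile(
    r'(=\?(?:%s)\?[bBqQ]\?[^?\s]*\?=)(?=\S)'
    % '|'.join(re.escape(name) for name in KNOWN_CHARSETS),
    re.IGNORECASE,
)


def separate_trailing_text(raw):
    """Insert one space after an encoded word that abuts plain text."""
    return _GLUED_WORD.sub(r'\1 ', raw)


fixed = separate_trailing_text('=?gb2312?Q?=D0=C7=C8=FC?=(revised)')
print(fixed)
```

Headers that already have whitespace after the encoded word, or that use a charset outside the list, are left unchanged, which keeps the repair narrow.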

    --
    Gabriel Genellina


    • ZeeGeek

      #3
      Re: email header decoding fails

      On Apr 10, 5:18 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
      > On Thu, 10 Apr 2008 05:45:41 -0300, ZeeGeek <ZeeG...@gmail.com> wrote:
      >> On Apr 10, 4:31 pm, "Gabriel Genellina" <gagsl-...@yahoo.com.ar> wrote:
      >>> On Wed, 09 Apr 2008 23:12:00 -0300, ZeeGeek <ZeeG...@gmail.com> wrote:
      >>>> It seems that the decode_header function in email.Header fails when
      >>>> the string is in the following form,
      >>>>
      >>>> '=?gb2312?Q?=D0=C7=C8=FC?=(revised)'
      >>> An 'encoded-word' that appears within a
      >>> 'phrase' MUST be separated from any adjacent 'word', 'text' or
      >>> 'special' by 'linear-white-space'.
      >>
      >> Thank you very much, Gabriel.
      >
      > The above just says "why" decode_header refuses to decode it, and why it's
      > not a bug. But if you actually have to deal with those malformed headers,
      > some heuristics may help. For example, if you *know* your mails typically
      > specify gb2312 encoding, or iso-8859-1, you may look for things that look
      > like the example above and "fix" it.

      Right now what I'm doing is to use
      re.sub(r'(=\?([^\?]*\?){3}=)', r' \1 ', orig_string)
      to detect every occurrence of an encoded string and place an extra
      white space before and after it. The whole string is then compliant
      with the standard, and decode_header can decode it properly.
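
The approach above can be sketched end to end as follows. This is an illustration under stated assumptions, not the poster's actual code: it is written for Python 3, where the module is email.header rather than email.Header, and the helper names pad_encoded_words and decode_loose are hypothetical. Falling back to ASCII for chunks without a charset is also an assumption.

```python
import re
from email.header import decode_header

# The pattern from the post: "=?", three "?"-terminated fields, then "=".
ENCODED_WORD = re.compile(r'(=\?([^?]*\?){3}=)')


def pad_encoded_words(raw):
    """Surround every encoded word with a space on each side."""
    return ENCODED_WORD.sub(r' \1 ', raw)


def decode_loose(raw):
    """Pad the encoded words, decode the header, and join the pieces."""
    pieces = []
    for chunk, charset in decode_header(pad_encoded_words(raw)):
        if isinstance(chunk, bytes):
            # Assumption: chunks without a declared charset are ASCII.
            chunk = chunk.decode(charset or 'ascii', errors='replace')
        pieces.append(chunk)
    return ''.join(pieces)


print(decode_loose('=?gb2312?Q?=D0=C7=C8=FC?=(revised)'))
```

Because the padding is applied to every encoded word, text that was glued to it ends up separated from it by a space in the decoded result; strip or collapse whitespace afterwards if that matters.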
