codecs, csv issues

  • George Sakkis

    codecs, csv issues

    I'm trying to use codecs.open() and I see two issues when I pass
    encoding='utf8':

    1) Newlines are hardcoded to LINEFEED (ascii 10) instead of the
    platform-specific byte(s).

    import codecs
    f = codecs.open('tmp.txt', 'w', encoding='utf8')
    s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
    print >>f, s
    print >>f, s
    f.close()

    This doesn't happen for the default encoding (=None).

    2) csv.writer doesn't seem to work as expected when being passed a
    codecs object; it treats it as if encoding is ascii:

    import codecs, csv
    f = codecs.open('tmp.txt', 'w', encoding='utf8')
    s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
    # this works fine
    print >>f, s
    # this doesn't
    csv.writer(f).writerow([s])
    f.close()

    Traceback (most recent call last):
    ....
    csv.writer(f).writerow([s])
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0391' in
    position 0: ordinal not in range(128)

    Is this the expected behavior or are these bugs?

    George
  • Peter Otten

    #2
    Re: codecs, csv issues

    George Sakkis wrote:
    > I'm trying to use codecs.open() and I see two issues when I pass
    > encoding='utf8':
    >
    > 1) Newlines are hardcoded to LINEFEED (ascii 10) instead of the
    > platform-specific byte(s).
    >
    > import codecs
    > f = codecs.open('tmp.txt', 'w', encoding='utf8')
    > s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
    > print >>f, s
    > print >>f, s
    > f.close()
    >
    > This doesn't happen for the default encoding (=None).
    >
    > 2) csv.writer doesn't seem to work as expected when being passed a
    > codecs object; it treats it as if encoding is ascii:
    >
    > import codecs, csv
    > f = codecs.open('tmp.txt', 'w', encoding='utf8')
    > s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
    > # this works fine
    > print >>f, s
    > # this doesn't
    > csv.writer(f).writerow([s])
    > f.close()
    >
    > Traceback (most recent call last):
    > ...
    > csv.writer(f).writerow([s])
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\u0391' in
    > position 0: ordinal not in range(128)
    >
    > Is this the expected behavior or are these bugs?

    Looking into the documentation

    """
    Note: This version of the csv module doesn't support Unicode input. Also,
    there are currently some issues regarding ASCII NUL characters.
    Accordingly, all input should be UTF-8 or printable ASCII to be safe; see
    the examples in section 9.1.5. These restrictions will be removed in the
    future.
    """

    and into the source code

    if encoding is not None and \
       'b' not in mode:
        # Force opening of the file in binary mode
        mode = mode + 'b'

    I'd be willing to say that both are implementation limitations.
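
As a possible workaround for the csv limitation, the documentation note quoted above suggests feeding the module UTF-8 encoded input. The sketch below follows that advice: on Python 2 it encodes each cell to UTF-8 and writes to a plain file opened in binary mode instead of a codecs object. The helper name `write_unicode_rows` is hypothetical, and the Python 3 branch is an assumption added so the same snippet also runs on newer interpreters, where csv accepts text directly.

```python
import csv
import io
import os
import sys
import tempfile

PY2 = sys.version_info[0] == 2


def write_unicode_rows(path, rows):
    """Hypothetical helper: write rows of unicode strings as UTF-8 CSV."""
    if PY2:
        # Python 2's csv wants byte strings, so encode every cell to UTF-8
        # and write to a plain file opened in binary mode.
        with open(path, 'wb') as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([cell.encode('utf8') for cell in row])
    else:
        # Assumption for Python 3: csv accepts text, and newline='' leaves
        # the line endings under the csv module's control.
        with io.open(path, 'w', encoding='utf8', newline='') as f:
            csv.writer(f).writerows(rows)


s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
path = os.path.join(tempfile.mkdtemp(), 'tmp.csv')
write_unicode_rows(path, [[s]])

# Read the file back as raw bytes to inspect what was written.
with open(path, 'rb') as raw:
    data = raw.read()
```

Here `data` holds the UTF-8 bytes of `s` followed by the csv module's default `\r\n` line terminator.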

    Peter


    • John Machin

      #3
      Re: codecs, csv issues

      On Aug 22, 11:52 pm, George Sakkis <george.sak...@gmail.com> wrote:
      > I'm trying to use codecs.open() and I see two issues when I pass
      > encoding='utf8':
      >
      > 1) Newlines are hardcoded to LINEFEED (ascii 10) instead of the
      > platform-specific byte(s).
      >
      > import codecs
      > f = codecs.open('tmp.txt', 'w', encoding='utf8')
      > s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
      > print >>f, s
      > print >>f, s
      > f.close()

      This is documented behaviour:
      """
      Note
      Files are always opened in binary mode, even if no binary mode was
      specified. This is done to avoid data loss due to encodings using
      8-bit values. This means that no automatic conversion of '\n' is done
      on reading and writing.
      """
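
The effect of the forced binary mode can be checked by reading back the raw bytes. The sketch below is an illustration under stated assumptions, not something from the thread: it writes the same line once through `codecs.open()` and once through `io.open()` (available since Python 2.6), which opens the file in text mode and translates `'\n'` to the platform line ending on write. The file names and the `read_bytes` helper are hypothetical.

```python
import codecs
import io
import os
import tempfile

s = u'\u0391\u03b8\u03ae\u03bd\u03b1'
encoded = s.encode('utf8')
tmpdir = tempfile.mkdtemp()


def read_bytes(path):
    """Hypothetical helper: return the raw bytes stored in path."""
    with open(path, 'rb') as raw:
        return raw.read()


# codecs.open() forces binary mode when an encoding is given,
# so '\n' is written untranslated on every platform.
codecs_path = os.path.join(tmpdir, 'codecs.txt')
f = codecs.open(codecs_path, 'w', encoding='utf8')
f.write(s + u'\n')
f.close()
codecs_bytes = read_bytes(codecs_path)

# io.open() in text mode translates '\n' to os.linesep on write.
io_path = os.path.join(tmpdir, 'io.txt')
with io.open(io_path, 'w', encoding='utf8') as f:
    f.write(s + u'\n')
io_bytes = read_bytes(io_path)
```

On Windows the two results should differ (`\n` versus `\r\n`); on platforms where `os.linesep` is `'\n'` they are identical.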
