Reading in a UTF-8 file but causing a UnicodeDecodeError exception

  • Benny the Guard
    New Member
    • Jun 2007
    • 92

    Reading in a UTF-8 file but causing a UnicodeDecodeError exception

    I have a CSV file created by VisualBasic in UTF-8. If I open the file in vi/emacs I see the byte-order mark (BOM), <feff>.

    So now when I read the file:

    Code:
    f = open('myfile')
    test = f.readline()
    print test.decode('utf-8')
    It prints a control character (u'\ufeff') as its first character. Shouldn't the decode strip this? I also tried the following to see what would happen and to auto-detect the format:

    Code:
    import codecs
    for encoding in ['utf-8', 'utf-16']:
        try:
            f = codecs.open('myfile', encoding=encoding)
            test = f.readline()
            print repr(test)
        except Exception, exc:
            f = None
            print exc
    For UTF-16 this is weird because it states "UTF-16 stream does not start with BOM" even though the first character is the BOM. For UTF-8 there are no errors, but it prints the control character (u'\ufeff').

    Any ideas what is going on with this? Possibly a badly encoded file?
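The behaviour described above is reproducible with a short sketch. Note that this uses modern Python 3 and the 'utf-8-sig' codec, which was not available in the Python 2.3 era of this thread:

```python
import codecs

# Simulate the file VisualBasic wrote: a UTF-8 BOM followed by CSV data.
data = codecs.BOM_UTF8 + b"col1,col2\n"

# The plain 'utf-8' codec keeps the BOM as the code point U+FEFF...
assert data.decode("utf-8").startswith("\ufeff")

# ...while 'utf-8-sig' strips it transparently.
assert data.decode("utf-8-sig") == "col1,col2\n"
```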
  • Benny the Guard
    New Member
    • Jun 2007
    • 92

    #2
    More context. I loaded the file in emacs in hex mode and I see ef bb bf, which should indicate UTF-8. So the questions are:

    1. Why does vi show '<feff>' as the first character?
    2. Why does the first code snippet I show not strip the control character?
    3. Why, when using codecs.open and forcing UTF-8, does it show the control character as feff?
    4. Does Python have a way to read in a file with auto-detection for encoding?
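On question 4: the standard library has no general charset detector, but an encoding sniffer based on the BOM is easy to sketch. A hedged example in modern Python 3 (the function name `sniff_encoding` is my own, not a stdlib API):

```python
import codecs

# Common BOMs mapped to codec names, checked longest-first so UTF-32's
# four-byte BOMs are not mistaken for UTF-16's two-byte ones.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(raw, default="utf-8"):
    """Guess an encoding from a leading BOM; fall back to `default`."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default

assert sniff_encoding(codecs.BOM_UTF8 + b"abc") == "utf-8-sig"
assert sniff_encoding(b"plain text") == "utf-8"
```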


    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #3
      Evan Jones has a good explanation here. I did a little test on my system reading a UTF-8 file with Python 2.3.
      Code:
      # UTF-8
      import codecs

      s1 = open('unicode_example1.txt', 'r').read()
      print repr(s1.decode("UTF-8"))
      if s1.startswith(codecs.BOM_UTF8):
          # Slice off the exact BOM prefix; lstrip() would treat the
          # BOM's bytes as a set of characters and could strip too much.
          s1 = s1[len(codecs.BOM_UTF8):]
      print repr(s1)
      And the output:
      Code:
      >>> u'abcdef'
      'abcdef'
      Apparently object s1 is now a simple string.

      Code:
      >>> s1
      'abcdef'
      >>> unicode(s1, 'UTF-8')
      u'abcdef'
      >>>
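One caveat about the lstrip() idea that comes up in this thread: str.lstrip() treats its argument as a *set* of characters, not as a prefix, so it can eat payload bytes that happen to match one of the BOM's own bytes. A Python 3 sketch of the pitfall:

```python
import codecs

# bytes.lstrip(arg) strips leading bytes that are in arg, not the prefix arg.
data = codecs.BOM_UTF8 + b"\xbbhello"            # payload starts with 0xBB
assert data.lstrip(codecs.BOM_UTF8) == b"hello"  # 0xBB was stripped too!

# Slicing off the exact prefix is safe.
if data.startswith(codecs.BOM_UTF8):
    safe = data[len(codecs.BOM_UTF8):]
assert safe == b"\xbbhello"
```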


      • Benny the Guard
        New Member
        • Jun 2007
        • 92

        #4
        Thanks! This helps a lot. I agree with Evan that this seems like a bug in Python's codecs, in that it strips the BOM from UTF-16 but not UTF-8. But adding:

        lstrip(unicode(codecs.BOM_UTF8, "utf8"))

        works wonders. I had to use the unicode conversion to avoid an exception due to non-ASCII characters, but it now works nicely.
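That fix works because, once the bytes are decoded, the three BOM bytes collapse into the single code point U+FEFF, so stripping it at the text level touches exactly one character. A modern Python 3 equivalent of the same idea:

```python
import codecs

# Decode first: the 3-byte UTF-8 BOM becomes the single character U+FEFF.
text = (codecs.BOM_UTF8 + b"abc").decode("utf-8")
assert text == "\ufeffabc"

# Stripping the one-character BOM at the text level is then safe.
assert text.lstrip("\ufeff") == "abc"
```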

