Reading in a UTF-8 file but causing a UnicodeDecodeError exception

  • Benny the Guard
    New Member
    • Jun 2007
    • 92

    Reading in a UTF-8 file but causing a UnicodeDecodeError exception

    I have a CSV file created by VisualBasic in UTF-8. If I open the file in vi/emacs I see the byte-order mark (BOM), <feff>.

    So now when I read the file:

    Code:
    f = open('myfile')
    test = f.readline()
    print test.decode('utf-8')
    It prints a control character (u'\ufeff') as its first character. Shouldn't the decode strip this? I also tried the following to see what would happen and to auto-detect the format:

    Code:
    import codecs
    for encoding in ['utf-8', 'utf-16']:
        try:
            f = codecs.open('myfile', encoding=encoding)
            test = f.readline()
            print repr(test)
        except Exception, exc:
            f = None
            print exc
    For UTF-16 this is weird because it states "UTF-16 stream does not start with BOM" even though the first character is the BOM. For UTF-8 there are no errors, but it prints the control character (u'\ufeff').

    Any ideas what is going on with this? Possibly a badly encoded file?
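The behaviour described above is reproducible with a short sketch. Note that this uses modern Python 3 and the 'utf-8-sig' codec, which was not available in the Python 2.3 era of this thread:

```python
import codecs

# Simulate the file VisualBasic wrote: a UTF-8 BOM followed by CSV data.
data = codecs.BOM_UTF8 + b"col1,col2\n"

# The plain 'utf-8' codec keeps the BOM as the code point U+FEFF...
assert data.decode("utf-8").startswith("\ufeff")

# ...while 'utf-8-sig' strips it transparently.
assert data.decode("utf-8-sig") == "col1,col2\n"
```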
  • Benny the Guard
    New Member
    • Jun 2007
    • 92

    #2
    More context. I loaded the file in emacs in hex mode and I see ef bb bf, which should indicate UTF-8. So the questions are:

    1. Why does vi show '<feff>' as the first character?
    2. Why does the first code snippet I show not strip the control character?
    3. Why, when using codecs.open and forcing UTF-8, does it show the control character as feff?
    4. Does Python have a way to read in a file with auto-detection for encoding?
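On question 4: the standard library has no general charset detector, but an encoding sniffer based on the BOM is easy to sketch. A hedged example in modern Python 3 (the function name `sniff_encoding` is my own, not a stdlib API):

```python
import codecs

# Common BOMs mapped to codec names, checked longest-first so UTF-32's
# four-byte BOMs are not mistaken for UTF-16's two-byte ones.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(raw, default="utf-8"):
    """Guess an encoding from a leading BOM; fall back to `default`."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return default

assert sniff_encoding(codecs.BOM_UTF8 + b"abc") == "utf-8-sig"
assert sniff_encoding(b"plain text") == "utf-8"
```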


    • bvdet
      Recognized Expert Specialist
      • Oct 2006
      • 2851

      #3
      Evan Jones has a good explanation here. I did a little test on my system reading a UTF-8 file with Python 2.3.
      Code:
      # UTF-8
      import codecs

      s1 = open('unicode_example1.txt', 'r').read()
      print repr(s1.decode("UTF-8"))
      if s1.startswith(codecs.BOM_UTF8):
          # Slice off the exact BOM prefix; lstrip() would treat the
          # BOM's bytes as a set of characters and could strip too much.
          s1 = s1[len(codecs.BOM_UTF8):]
      print repr(s1)
      And the output:
      Code:
      >>> u'abcdef'
      'abcdef'
      Apparently object s1 is now a simple string.

      Code:
      >>> s1
      'abcdef'
      >>> unicode(s1, 'UTF-8')
      u'abcdef'
      >>>
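One caveat about the lstrip() idea that comes up in this thread: str.lstrip() treats its argument as a *set* of characters, not as a prefix, so it can eat payload bytes that happen to match one of the BOM's own bytes. A Python 3 sketch of the pitfall:

```python
import codecs

# bytes.lstrip(arg) strips leading bytes that are in arg, not the prefix arg.
data = codecs.BOM_UTF8 + b"\xbbhello"            # payload starts with 0xBB
assert data.lstrip(codecs.BOM_UTF8) == b"hello"  # 0xBB was stripped too!

# Slicing off the exact prefix is safe.
if data.startswith(codecs.BOM_UTF8):
    safe = data[len(codecs.BOM_UTF8):]
assert safe == b"\xbbhello"
```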


      • Benny the Guard
        New Member
        • Jun 2007
        • 92

        #4
        Thanks! This helps a lot. I agree with Evan that this seems like a bug in Python's codecs, in that it strips the BOM from UTF-16 but not UTF-8. But adding:

        lstrip(unicode(codecs.BOM_UTF8, "utf8"))

        works wonders. I had to use the unicode conversion to avoid an exception due to non-ASCII characters, but it now works nicely.
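That fix works because, once the bytes are decoded, the three BOM bytes collapse into the single code point U+FEFF, so stripping it at the text level touches exactly one character. A modern Python 3 equivalent of the same idea:

```python
import codecs

# Decode first: the 3-byte UTF-8 BOM becomes the single character U+FEFF.
text = (codecs.BOM_UTF8 + b"abc").decode("utf-8")
assert text == "\ufeffabc"

# Stripping the one-character BOM at the text level is then safe.
assert text.lstrip("\ufeff") == "abc"
```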

