Pickling Unicode

  • bigturtle
    New Member
    • Apr 2007
    • 19

    Pickling Unicode

    Using Python 2.6, I am trying to pickle a dictionary (for Chinese pinyin) which contains both Unicode characters in the range 128-255 and 4-byte Unicode characters. I get allergic reactions from pickle.dump() under all protocols.

    Here’s a simple test program:
    Code:
    # Program 1 (protocol 0), program 2 (protocol 2)
    import codecs
    import pickle

    PickleFile = codecs.open('PFile.utf', 'w', 'utf-8')
    Str1 = u'lǘelü'
    pickle.dump(Str1, PickleFile, protocol=0) # Error here!
    PickleFile.close()
    1. Attempting to run this gives the error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 10: ordinal not in range(128)
    This is understandable, since protocol 0 is strictly ASCII and 0xfc is the character 'ü'.

    2. With protocol=2 (or -1) I get a different, more mysterious error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
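
    Both errors actually come from the codecs wrapper rather than from pickle itself: codecs.open() encodes everything written to the file, and to encode a byte string it first decodes it as ASCII, which fails on 0xfc (protocol 0) or on the 0x80 opcode that begins a protocol-2 stream. A minimal sketch of the usual fix, assuming the pickle file can simply be opened in binary mode without a codec:

```python
import pickle

Str1 = u'l\u01d8el\xfc'                      # the same test string, escaped
PickleFile = open('PFile.pkl', 'wb')         # binary mode: a pickle is bytes
pickle.dump(Str1, PickleFile, protocol=2)    # no codec in the way, no error
PickleFile.close()

assert pickle.load(open('PFile.pkl', 'rb')) == Str1   # round-trips intact
```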

    Well, let's try to use pickle.dumps() (which DOES work) and store the resulting string in a file.
    Code:
    # Program 3. Using pickle.dumps()
    import codecs
    import pickle

    Str1 = u'lǘelü'
    PickleStr1 = pickle.dumps(Str1) # So far so good!

    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1) # Error here!
    SPickleFile.close()
    3. Running this program, I get the error “can’t decode byte 0xfc in position 10” as in program 1.

    Isn’t this horribly, and uselessly, frustrating?? The pickle module has been around long enough not to stub its toes on this dinky example. Or is there something I have missed?

    There is a long discussion of this in the Python tracker (Issue 2980: "Pickle stream for unicode object may contain non-ASCII characters"), which seems to address the problem but, as far as I can see, does not solve it.

    Thank you all for your help & understanding.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I was able to pickle and unpickle your unicode characters by using cPickle, encoding the string to UTF-8, and decoding the loaded string. I am not sure if this will help you though.
    Code:
    import codecs
    import cPickle
    
    str1 = u'lǘelü'
    print
    print str1, repr(str1)
    str1U = str1.encode("UTF-8")
    print repr(str1U)
    
    PickleStr1 = cPickle.dumps(str1U)
    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1)
    SPickleFile.close()
    
    
    f = codecs.open('SpFile.utf', 'r', 'utf-8')
    str2U = cPickle.load(f)
    print repr(str2U)
    str2 = str2U.decode("UTF-8")
    print str2, repr(str2)
    Output:
    Code:
    >>> 
    luelü u'luel\xfc'
    'luel\xc3\xbc'
    'luel\xc3\xbc'
    luelü u'luel\xfc'
    >>>
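
    A note on why this roundabout succeeds: after encode("UTF-8"), pickle only ever sees a plain byte string, and at protocol 0 Python 2 writes byte strings in repr-escaped, pure-ASCII form, so the UTF-8 codec on the file has nothing left to choke on. The encode-first/decode-last symmetry itself is easy to verify (Python 3 spelling shown):

```python
import pickle

s = u'l\u01d8el\xfc'
utf8_bytes = s.encode('utf-8')                    # flatten Unicode to bytes first
payload = pickle.dumps(utf8_bytes, protocol=0)    # pickle sees only a byte string
restored = pickle.loads(payload).decode('utf-8')  # reverse both steps
assert restored == s
```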


    • bigturtle
      New Member
      • Apr 2007
      • 19

      #3
      Finally got your solution to work. There are a couple of things I don't understand about it.

      I like the idea: you flatten the Unicode out of the string by encoding it to a UTF-8 byte string, store that in a file, then read it back in and reverse the process. Here's a version that works for me.
      Code:
      import codecs
      import cPickle
       
      str1 = u'lǘelü'
      print "Pickling"
      print "str1 [" + repr(str1) + "]"
      str1U = str1.encode("UTF-8")
      print "str1U [" + repr(str1U) + "]"
      PickleStr1 = cPickle.dumps(str1U)
      SPickleFile = codecs.open('SpFile.utf', 'w')
      SPickleFile.write(PickleStr1)
      SPickleFile.close()
      
      print "\nUnpickling"
      f = codecs.open('SpFile.utf', 'r')
      str2U = cPickle.load(f)
      print "str2U [" + repr(str2U) + "]"
      str2 = str2U.decode("UTF-8")
      print "str2 [" + repr(str2) + "]"
      Output:
      Code:
      Pickling
      str1 [u'l\u01d8el\xfc']
      str1U ['l\xc7\x98el\xc3\xbc']
      
      Unpickling
      str2U ['l\xc7\x98el\xc3\xbc']
      str2 [u'l\u01d8el\xfc']
      Comments:

      1. I can't print Unicode strings at all using "print". How do you do it?

      2. You specify "UTF-8" both on your input file and your output file, but I think this can't be right. On the output file it doesn't matter since the file is anyhow ASCII. But on the input file it's fatal. (After all, the whole point is that the contents are ASCII.) You get the error

      str2U = cPickle.load(f)
      UnpicklingError: pickle data was truncated


      3. I didn't think it was possible to dump a string using pickle.dumps() and load it back in using pickle.load(). But it works, much to my surprise! The alternative is to replace the assignment to str2U by
      Code:
      PickleStr2 = f.read()
      str2U = cPickle.loads(PickleStr2)
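
      For the record, mixing dumps() with load() is fine by design: a pickle stream is self-delimiting (it ends with a '.' stop opcode), so load() can read dumps() output from any file-like object and knows where to stop. A quick check using an in-memory stream:

```python
import io
import pickle

s = u'l\u01d8el\xfc'
f = io.BytesIO(pickle.dumps(s))   # dumps() output served up as a file
assert pickle.load(f) == s        # load() reads up to the stop opcode
```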
      Thanks for your help. Sorry not to reply earlier, but now I have settled down in China and have a bit of time.


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        bigturtle,

        Canada to China - that's a big move! I wish I could better explain Unicode behavior, but I am learning about it myself. I do not use any Unicode in my work. Python 3.0 unifies Unicode and 8-bit strings into the str type.

        I don't understand why you cannot print Unicode strings, unless the behavior of 2.6 is different from 2.3, which is what I am using.
        Code:
        >>> str1 = u'luelü'
        >>> print str1
        luelü
        >>>

        Comment

        • bigturtle
          New Member
          • Apr 2007
          • 19

          #5
          In your example, you have included in your test string 'ü' (ASCII 252 = u'\xFC') but not 'ǘ' (Unicode u'\u01D8'). It appears that there are three classes of characters:
          - 7-bit ASCII (0-127 = u'\x00' - u'\x7F')
          - "upper ASCII" (128-255 = u'\x80' - u'\xFF')
          - full 2-byte Unicode (u'\u0100' - u'\uFFFF')
          Codes 128-255 give problems to some routines because they are neither straight ASCII nor 2-byte Unicode.

          In your example all the codes above 127 give trouble for me, depending. Here's my complete source file. The second line, which declares the encoding of the source file as Unicode, is required. Note 'ǘ' in the test string.
          Code:
          #!/usr/bin/env python
          # -*- coding: utf-8 -*-
          
          str1 = u'lǘelü'
          print str1
          Output:
          Code:
              print str1
          UnicodeEncodeError: 'ascii' codec can't encode character u'\u01d8' in position 1: ordinal not in range(128)
          If I change the test string to u'lüelǘ', I get this output:
          Code:
              print str1
          UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
          This seems to show that both the non-ASCII characters make "print" choke: it chokes on whichever one comes first.

          HOWEVER, if I use the test string u'luelü' with no 2-byte Unicode char, as in your reply, it prints with no problem. Go figure!
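
          That pattern matches what Python 2's print does under the hood: it encodes the unicode string with the output stream's encoding (often plain ASCII), and the first character outside that encoding raises UnicodeEncodeError. Encoding explicitly before printing sidesteps the guesswork; a sketch of both behaviors:

```python
s = u'l\u01d8el\xfc'

encoded = s.encode('utf-8')       # i.e. print s.encode('utf-8') in Python 2
assert encoded == b'l\xc7\x98el\xc3\xbc'

try:
    s.encode('ascii')             # what a bare print attempts on an ASCII stream
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert not ascii_ok               # chokes on the first non-ASCII character
```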

          It might be useful to know what your system does with my two test strings above... if we care.

          Finally: If Python 3.0 has finally unified the string type to include Unicode, that's a real good reason for me to change. Thanks for the tip!


          • bigturtle
            New Member
            • Apr 2007
            • 19

            #6
            Yay Python 3!

            I have now switched to Python 3.0 and find that most of my problems have gone away. The pickle module works fine for Unicode, since all strings are anyhow Unicode. So no more u'...' in front of Unicode strings.

             FYI here are some things I had to watch out for. The codecs module is no longer needed for opening Unicode files: input files are specified by

            FH = open(FileName, 'r', encoding='utf-8')

            and output files the same with 'w'. Pickle files have to be specified as binary:

             PFH = open(PickleFileName, 'rb')

            or 'wb', depending.
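
             Put together, the two file flavors look like this in Python 3 (hypothetical file names, written to a temporary directory so the sketch cleans up after itself):

```python
import os
import pickle
import tempfile

tmpdir = tempfile.mkdtemp()
s = 'l\u01d8el\xfc'

with open(os.path.join(tmpdir, 'notes.txt'), 'w', encoding='utf-8') as FH:
    FH.write(s)                          # text file: give it an encoding
with open(os.path.join(tmpdir, 'notes.pkl'), 'wb') as PFH:
    pickle.dump(s, PFH)                  # pickle file: binary, no encoding

with open(os.path.join(tmpdir, 'notes.txt'), 'r', encoding='utf-8') as FH:
    assert FH.read() == s
with open(os.path.join(tmpdir, 'notes.pkl'), 'rb') as PFH:
    assert pickle.load(PFH) == s
```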

            The print command has changed to print(), and fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?

            Thanks for all your help.


            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #7
              Originally posted by bigturtle
              The print command has changed to print(), and fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?
               That's a good question, and I don't know the answer. Have you looked at sys.setdefaultencoding(name), or codecs.StreamWriter(stream[, errors]) and codecs.getwriter(encoding)? There may be a way to redefine print() to handle your encoding problem. Also look into io.
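
               Of those candidates, the io route is the one that pans out in Python 3 (sys.setdefaultencoding no longer exists there): wrap sys.stdout's underlying binary buffer in a new io.TextIOWrapper with whatever encoding you want. A sketch, demonstrated against an in-memory byte stream standing in for the console:

```python
import io
import sys

def utf8_writer(binary_stream):
    """Wrap a binary stream so everything written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', errors='replace')

# In a real program:  sys.stdout = utf8_writer(sys.stdout.buffer)
# Demonstrated here with io.BytesIO in place of the real console buffer:
raw = io.BytesIO()
out = utf8_writer(raw)
out.write(u'l\u01d8el\xfc\n')   # would raise under an ASCII-configured stdout
out.flush()
assert raw.getvalue() == b'l\xc7\x98el\xc3\xbc\n'
```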

              HTH, BV


              • Stress
                New Member
                • Feb 2009
                • 1

                #8
                Peace, friends,

                You're mixing things up: serialization (pickling) gives you a binary representation of any Python object, Unicode text included.

                If you open a file in text mode, and tell Python that it contains text encoded as UTF-8, then obviously you shouldn't be writing binary data (byte arrays, "bytes" in Python 3), such as pickled stuff, to it.

                Put your pickles in a binary file. What you read/write from/to a UTF-8 encoded file is Unicode text ("str" in Python 3, right?), that gets automatically de/encoded for you.
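
                 A short Python 3 demonstration of that boundary (hypothetical file names):

```python
import pickle

data = {u'l\u01d8el\xfc': u'pinyin'}   # Unicode text inside the object is fine
payload = pickle.dumps(data)           # ...but the pickle itself is bytes

try:
    with open('d.txt', 'w', encoding='utf-8') as fh:
        fh.write(payload)              # bytes into a text-mode file
    mixed_ok = True
except TypeError:                      # Python 3 refuses the mix outright
    mixed_ok = False
assert not mixed_ok

with open('d.pkl', 'wb') as fh:        # binary file: where a pickle belongs
    fh.write(payload)
with open('d.pkl', 'rb') as fh:
    assert pickle.load(fh) == data
```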

                OK?
                ;o)

