Pickling Unicode

  • bigturtle
    New Member
    • Apr 2007
    • 19

    Pickling Unicode

    Using Python 2.6, I am trying to pickle a dictionary (for Chinese pinyin) which contains both Unicode characters in the range 128-255 and 4-byte Unicode characters. I get allergic reactions from pickle.dump() under all protocols.

    Here’s a simple test program:
    Code:
    # Program 1 (protocol 0), program 2 (protocol 2)
    import codecs
    import pickle

    PickleFile = codecs.open('PFile.utf', 'w', 'utf-8')
    Str1 = u'lǘelü'
    pickle.dump(Str1, PickleFile, protocol=0) # Error here!
    PickleFile.close()
    1. Attempting to run this gives the error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 10: ordinal not in range(128)
    This is understandable, since protocol 0 is strictly ASCII and 0xfc is the character 'ü'.

    2. With protocol=2 (or -1) I get a different, more mysterious error:
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128)
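
    Both errors actually come from the codecs wrapper rather than from pickle itself: codecs.open() encodes everything written to the file, and to encode a byte string it first decodes it as ASCII, which fails on 0xfc (protocol 0) or on the 0x80 opcode that begins a protocol-2 stream. A minimal sketch of the usual fix, assuming the pickle file can simply be opened in binary mode without a codec:

```python
import pickle

Str1 = u'l\u01d8el\xfc'                      # the same test string, escaped
PickleFile = open('PFile.pkl', 'wb')         # binary mode: a pickle is bytes
pickle.dump(Str1, PickleFile, protocol=2)    # no codec in the way, no error
PickleFile.close()

assert pickle.load(open('PFile.pkl', 'rb')) == Str1   # round-trips intact
```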

    Well, let's try to use pickle.dumps() (which DOES work) and store the resulting string in a file.
    Code:
    # Program 3. Using pickle.dumps()
    import codecs
    import pickle

    Str1 = u'lǘelü'
    PickleStr1 = pickle.dumps(Str1) # So far so good!

    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1) # Error here!
    SPickleFile.close()
    3. Running this program, I get the error “can’t decode byte 0xfc in position 10” as in program 1.

    Isn’t this horribly, and uselessly, frustrating?? The pickle module has been around long enough not to stub its toes on this dinky example. Or is there something I have missed?

    There is a long discussion of this in the Python tracker (Issue 2980: "Pickle stream for unicode object may contain non-ASCII characters"), which seems to address the problem but, as far as I can see, does not solve it.

    Thank you all for your help & understanding.
  • bvdet
    Recognized Expert Specialist
    • Oct 2006
    • 2851

    #2
    I was able to pickle and unpickle your unicode characters by using cPickle, encoding the string to UTF-8, and decoding the loaded string. I am not sure if this will help you though.
    Code:
    import codecs
    import cPickle
    
    str1 = u'lǘelü'
    print
    print str1, repr(str1)
    str1U = str1.encode("UTF-8")
    print repr(str1U)
    
    PickleStr1 = cPickle.dumps(str1U)
    SPickleFile = codecs.open('SpFile.utf', 'w', 'utf-8')
    SPickleFile.write(PickleStr1)
    SPickleFile.close()
    
    
    f = codecs.open('SpFile.utf', 'r', 'utf-8')
    str2U = cPickle.load(f)
    print repr(str2U)
    str2 = str2U.decode("UTF-8")
    print str2, repr(str2)
    Output:
    Code:
    >>> 
    luelü u'luel\xfc'
    'luel\xc3\xbc'
    'luel\xc3\xbc'
    luelü u'luel\xfc'
    >>>
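
    A note on why this roundabout succeeds: after encode("UTF-8"), pickle only ever sees a plain byte string, and at protocol 0 Python 2 writes byte strings in repr-escaped, pure-ASCII form, so the UTF-8 codec on the file has nothing left to choke on. The encode-first/decode-last symmetry itself is easy to verify (Python 3 spelling shown):

```python
import pickle

s = u'l\u01d8el\xfc'
utf8_bytes = s.encode('utf-8')                    # flatten Unicode to bytes first
payload = pickle.dumps(utf8_bytes, protocol=0)    # pickle sees only a byte string
restored = pickle.loads(payload).decode('utf-8')  # reverse both steps
assert restored == s
```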


    • bigturtle
      New Member
      • Apr 2007
      • 19

      #3
      Finally got your solution to work. There are a couple of things I don't understand about it.

      I like the idea: you flatten the Unicode out of the string by encoding it to a UTF-8 byte string, store that in a file, then read it back in and reverse the process. Here's a version that works for me.
      Code:
      import codecs
      import cPickle
       
      str1 = u'lǘelü'
      print "Pickling"
      print "str1 [" + repr(str1) + "]"
      str1U = str1.encode("UTF-8")
      print "str1U [" + repr(str1U) + "]"
      PickleStr1 = cPickle.dumps(str1U)
      SPickleFile = codecs.open('SpFile.utf', 'w')
      SPickleFile.write(PickleStr1)
      SPickleFile.close()
      
      print "\nUnpickling"
      f = codecs.open('SpFile.utf', 'r')
      str2U = cPickle.load(f)
      print "str2U [" + repr(str2U) + "]"
      str2 = str2U.decode("UTF-8")
      print "str2 [" + repr(str2) + "]"
      Output:
      Code:
      Pickling
      str1 [u'l\u01d8el\xfc']
      str1U ['l\xc7\x98el\xc3\xbc']
      
      Unpickling
      str2U ['l\xc7\x98el\xc3\xbc']
      str2 [u'l\u01d8el\xfc']
      Comments:

      1. I can't print Unicode strings at all using "print". How do you do it?

      2. You specify "UTF-8" both on your input file and your output file, but I think this can't be right. On the output file it doesn't matter since the file is anyhow ASCII. But on the input file it's fatal. (After all, the whole point is that the contents are ASCII.) You get the error

      str2U = cPickle.load(f)
      UnpicklingError: pickle data was truncated


      3. I didn't think it was possible to dump a string using pickle.dumps() and load it back in using pickle.load(). But it works, much to my surprise! The alternative is to replace the assignment to str2U by
      Code:
      PickleStr2 = f.read()
      str2U = cPickle.loads(PickleStr2)
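
      For the record, mixing dumps() with load() is fine by design: a pickle stream is self-delimiting (it ends with a '.' stop opcode), so load() can read dumps() output from any file-like object and knows where to stop. A quick check using an in-memory stream:

```python
import io
import pickle

s = u'l\u01d8el\xfc'
f = io.BytesIO(pickle.dumps(s))   # dumps() output served up as a file
assert pickle.load(f) == s        # load() reads up to the stop opcode
```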
      Thanks for your help. Sorry not to reply earlier, but now I have settled down in China and have a bit of time.


      • bvdet
        Recognized Expert Specialist
        • Oct 2006
        • 2851

        #4
        bigturtle,

        Canada to China - that's a big move! I wish I could better explain Unicode behavior, but I am learning about it myself. I do not use any Unicode in my work. Python 3.0 unifies Unicode and 8-bit strings into the str type.

        I don't understand why you cannot print Unicode strings, unless the behavior of 2.6 is different from 2.3, which is what I am using.
        Code:
        >>> str1 = u'luelü'
        >>> print str1
        luelü
        >>>

        Comment

        • bigturtle
          New Member
          • Apr 2007
          • 19

          #5
          In your example, you have included in your test string 'ü' (ASCII 252 = u'\xFC') but not 'ǘ' (Unicode u'\u01D8'). It appears that there are three classes of characters:
          - 7-bit ASCII (0-127 = u'\x00' - u'\x7F')
          - "upper ASCII" (128-255 = u'\x80' - u'\xFF')
          - full 2-byte Unicode (u'\u0100' - u'\uFFFF')
          Codes 128-255 give problems to some routines because they are neither straight ASCII nor 2-byte Unicode.

          In your example all the codes above 127 give trouble for me, depending. Here's my complete source file. The second line, which declares the encoding of the source file as Unicode, is required. Note 'ǘ' in the test string.
          Code:
          #!/usr/bin/env python
          # -*- coding: utf-8 -*-
          
          str1 = u'lǘelü'
          print str1
          Output:
          Code:
              print str1
          UnicodeEncodeError: 'ascii' codec can't encode character u'\u01d8' in position 1: ordinal not in range(128)
          If I change the test string to u'lüelǘ', I get this output:
          Code:
              print str1
          UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 1: ordinal not in range(128)
          This seems to show that both the non-ASCII characters make "print" choke: it chokes on whichever one comes first.

          HOWEVER, if I use the test string u'luelü' with no 2-byte Unicode char, as in your reply, it prints with no problem. Go figure!
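
          That pattern matches what Python 2's print does under the hood: it encodes the unicode string with the output stream's encoding (often plain ASCII), and the first character outside that encoding raises UnicodeEncodeError. Encoding explicitly before printing sidesteps the guesswork; a sketch of both behaviors:

```python
s = u'l\u01d8el\xfc'

encoded = s.encode('utf-8')       # i.e. print s.encode('utf-8') in Python 2
assert encoded == b'l\xc7\x98el\xc3\xbc'

try:
    s.encode('ascii')             # what a bare print attempts on an ASCII stream
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
assert not ascii_ok               # chokes on the first non-ASCII character
```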

          It might be useful to know what your system does with my two test strings above... if we care.

          Finally: If Python 3.0 has finally unified the string type to include Unicode, that's a real good reason for me to change. Thanks for the tip!


          • bigturtle
            New Member
            • Apr 2007
            • 19

            #6
            Yay Python 3!

            I have now switched to Python 3.0 and find that most of my problems have gone away. The pickle module works fine for Unicode, since all strings are anyhow Unicode. So no more u'...' in front of Unicode strings.

             FYI here are some things I had to watch out for. The codecs module is no longer needed for opening Unicode files: input files are specified by

            FH = open(FileName, 'r', encoding='utf-8')

            and output files the same with 'w'. Pickle files have to be specified as binary:

             PFH = open(PickleFileName, 'rb')

            or 'wb', depending.
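
             Put together, the two file flavors look like this in Python 3 (hypothetical file names, written to a temporary directory so the sketch cleans up after itself):

```python
import os
import pickle
import tempfile

tmpdir = tempfile.mkdtemp()
s = 'l\u01d8el\xfc'

with open(os.path.join(tmpdir, 'notes.txt'), 'w', encoding='utf-8') as FH:
    FH.write(s)                          # text file: give it an encoding
with open(os.path.join(tmpdir, 'notes.pkl'), 'wb') as PFH:
    pickle.dump(s, PFH)                  # pickle file: binary, no encoding

with open(os.path.join(tmpdir, 'notes.txt'), 'r', encoding='utf-8') as FH:
    assert FH.read() == s
with open(os.path.join(tmpdir, 'notes.pkl'), 'rb') as PFH:
    assert pickle.load(PFH) == s
```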

            The print command has changed to print(), and fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?

            Thanks for all your help.


            • bvdet
              Recognized Expert Specialist
              • Oct 2006
              • 2851

              #7
              Originally posted by bigturtle
              The print command has changed to print(), and fouls up the same way it did before. Do you know how to specify the encoding on sys.stdout?
               That's a good question, and I don't know the answer. Have you looked at sys.setdefaultencoding(name), or codecs.StreamWriter(stream[, errors]) and codecs.getwriter(encoding)? There may be a way to redefine print() to handle your encoding problem. Also look into io.
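
               Of those candidates, the io route is the one that pans out in Python 3 (sys.setdefaultencoding no longer exists there): wrap sys.stdout's underlying binary buffer in a new io.TextIOWrapper with whatever encoding you want. A sketch, demonstrated against an in-memory byte stream standing in for the console:

```python
import io
import sys

def utf8_writer(binary_stream):
    """Wrap a binary stream so everything written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding='utf-8', errors='replace')

# In a real program:  sys.stdout = utf8_writer(sys.stdout.buffer)
# Demonstrated here with io.BytesIO in place of the real console buffer:
raw = io.BytesIO()
out = utf8_writer(raw)
out.write(u'l\u01d8el\xfc\n')   # would raise under an ASCII-configured stdout
out.flush()
assert raw.getvalue() == b'l\xc7\x98el\xc3\xbc\n'
```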

              HTH, BV


              • Stress
                New Member
                • Feb 2009
                • 1

                #8
                Peace, friends,

                You're mixing things up: serialization (pickling) gives you a binary representation of any Python object, Unicode text included.

                If you open a file in text mode, and tell Python that it contains text encoded as UTF-8, then obviously you shouldn't be writing binary data (byte arrays, "bytes" in Python 3), such as pickled stuff, to it.

                Put your pickles in a binary file. What you read/write from/to a UTF-8 encoded file is Unicode text ("str" in Python 3, right?), that gets automatically de/encoded for you.
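
                 A short Python 3 demonstration of that boundary (hypothetical file names):

```python
import pickle

data = {u'l\u01d8el\xfc': u'pinyin'}   # Unicode text inside the object is fine
payload = pickle.dumps(data)           # ...but the pickle itself is bytes

try:
    with open('d.txt', 'w', encoding='utf-8') as fh:
        fh.write(payload)              # bytes into a text-mode file
    mixed_ok = True
except TypeError:                      # Python 3 refuses the mix outright
    mixed_ok = False
assert not mixed_ok

with open('d.pkl', 'wb') as fh:        # binary file: where a pickle belongs
    fh.write(payload)
with open('d.pkl', 'rb') as fh:
    assert pickle.load(fh) == data
```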

                OK?
                ;o)

