usage of <string>.encode('utf-8','xmlcharrefreplace')?

  • J Peyret

    usage of <string>.encode('utf-8','xmlcharrefreplace')?

    Well, as usual I am confused by unicode encoding errors.

    I have a string with problematic characters in it which I'd like to
    put into a postgresql table.
    That results in a postgresql error so I am trying to fix things with
    <string>.encode
    >>> s = 'he Company\xef\xbf\xbds ticker'
    >>> print s
    he Company�s ticker
    >>>
    Trying for an encode:
    >>> print s.encode('utf-8')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

    OK, that's pretty much as expected, I know this is not valid utf-8.
    But I should be able to fix this with the errors parameter of the
    encode method.
    >>> error_replace = 'xmlcharrefreplace'
    >>> print s.encode('utf-8', error_replace)
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

    Same exact error I got without the errors parameter.

    Did I mistype the error handler name? Nope.
    >>> codecs.lookup_error(error_replace)
    <built-in function xmlcharrefreplace_errors>

    Same results with 'ignore' as an error handler.
    >>> print s.encode('utf-8', 'ignore')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

    And with a bogus error handler:

    >>> print s.encode('utf-8', 'bogus')
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)

    This all looks unusually complicated for Python.
    Am I missing something incredibly obvious?
    How does one use the errors parameter on strings' encode method?

    Also, why are the exceptions above complaining about the 'ascii' codec
    if I am asking for 'utf-8' conversion?

    Version and environment below. Should I try to update my python from
    somewhere?

    ./$ python
    Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
    [GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2

    Cheers
  • Carsten Haese

    #2
    Re: usage of <string>.encode('utf-8','xmlcharrefreplace')?

    On Mon, 18 Feb 2008 21:36:17 -0800 (PST), J Peyret wrote
    > Well, as usual I am confused by unicode encoding errors.
    >
    > I have a string with problematic characters in it which I'd like to
    > put into a postgresql table.
    > That results in a postgresql error so I am trying to fix things with
    > <string>.encode
    >
    > >>> s = 'he Company\xef\xbf\xbds ticker'
    > >>> print s
    > he Company�s ticker
    > >>>
    >
    > Trying for an encode:
    >
    > >>> print s.encode('utf-8')
    > Traceback (most recent call last):
    >   File "<input>", line 1, in <module>
    > UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 10: ordinal not in range(128)
    >
    > OK, that's pretty much as expected, I know this is not valid utf-8.

    Actually, the string *is* valid UTF-8, but you're confused about encoding and
    decoding. Encoding is the process of turning a Unicode object into a byte
    string. Decoding is the process of turning a byte string into a Unicode object.

    You need to decode your byte string into a Unicode object, and then encode the
    result to a byte string in a different encoding. For example:
    >>> s = 'he Company\xef\xbf\xbds ticker'
    >>> s.decode("utf-8").encode("ascii", "xmlcharrefreplace")
    'he Company&#65533;s ticker'
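
For readers on Python 3, where byte strings are `bytes` and Unicode objects are `str`, the same decode-then-encode recipe might look roughly like the sketch below. The variable names are illustrative and not from the thread, and the three error handlers are compared only to show how they differ.

```python
# Python 3 sketch: bytes -> str (decode) -> bytes (encode).
raw = b'he Company\xef\xbf\xbds ticker'   # UTF-8 bytes; EF BF BD encodes U+FFFD

text = raw.decode('utf-8')                # decoding: bytes -> Unicode text

# ASCII cannot represent U+FFFD, so the error handler decides
# what happens to that one character during encoding.
as_xml = text.encode('ascii', 'xmlcharrefreplace')
as_ignored = text.encode('ascii', 'ignore')
as_question = text.encode('ascii', 'replace')

print(as_xml)
print(as_ignored)
print(as_question)
```

The error handler only matters in the encode step here, because that is the step that meets a character the target codec cannot represent.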

    By the way, whether this is the correct fix for your PostgreSQL error is not
    clear, since you kept that error message a secret for some reason. There could
    be a better solution than transcoding the string in this way, but we won't
    know until you show us the actual error you're trying to fix. At the moment,
    it's like showing you the best way to inflate a tire with a hammer.

    Hope this helps,

    --
    Carsten Haese



    • 7stud

      #3
      Re: usage of <string>.encode('utf-8','xmlcharrefreplace')?

      To clarify a couple of points:

      On Feb 18, 11:38 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
      > A unicode string looks like this:
      >
      > s = u'\u0041'
      >
      > but your string looks like this:
      >
      > s = 'he Company\xef\xbf\xbds ticker'
      >
      > Note that there is no 'u' in front of your string.

      That means your string is a regular string.

      > If a python function requires a unicode string and a unicode string
      > isn't provided..

      For example: encode().
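
A rough sketch of what Python 2 does when encode() is called on a regular (byte) string: it first decodes the bytes with the default codec, normally ASCII, and only then encodes. That hidden decode step is where the 'ascii' error comes from, and it explains why the errors argument made no difference, even with a bogus handler name. The helper below is hypothetical and written for Python 3, where the two steps have to be spelled out.

```python
def py2_style_encode(byte_string, encoding, errors='strict'):
    """Hypothetical helper mimicking Python 2's str.encode() on a byte string."""
    # Step 1: implicit decode with the default codec (ASCII), always strict.
    text = byte_string.decode('ascii')
    # Step 2: the encode the caller asked for; 'errors' applies only here.
    return text.encode(encoding, errors)


s = b'he Company\xef\xbf\xbds ticker'
try:
    py2_style_encode(s, 'utf-8', 'xmlcharrefreplace')
    message = None
except UnicodeDecodeError as exc:
    # The failure happens in step 1, before the error handler is consulted.
    message = str(exc)

print(message)
```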


      One last point: you can't display a unicode string. The very act of
      trying to print a unicode string causes it to be converted to a
      regular string. If you try to display a unicode string without
      explicitly encode()'ing it first, i.e. converting it to a regular
      string using a specified secret code--a so called 'codec', python will
      implicitly attempt to convert the unicode string to a regular string
      using the default codec, which is usually set to ascii.
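
The implicit conversion described above can be made visible with a small sketch. It is written for Python 3, where the conversion must be requested explicitly, and the variable names are made up for illustration.

```python
text = 'he Company\ufffds ticker'    # Unicode text containing U+FFFD

# Roughly what an ASCII default codec would attempt when asked to
# turn the text into bytes for display.
try:
    text.encode('ascii')
    failed = False
except UnicodeEncodeError:
    failed = True

# Choosing the codec explicitly avoids relying on the default.
utf8_bytes = text.encode('utf-8')

print(failed)
print(utf8_bytes)
```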


      • J Peyret

        #4
        Re: usage of <string>.encode('utf-8','xmlcharrefreplace')?

        On Feb 18, 10:54 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
        > One last point: you can't display a unicode string. The very act of
        > trying to print a unicode string causes it to be converted to a
        > regular string. If you try to display a unicode string without
        > explicitly encode()'ing it first, i.e. converting it to a regular
        > string using a specified secret code--a so called 'codec', python will
        > implicitly attempt to convert the unicode string to a regular string
        > using the default codec, which is usually set to ascii.

        Yes, the string above was obtained by printing, which got it into
        ASCII format, as you picked up.
        Something else to watch out for when posting unicode issues.

        The solution I ended up with was

        1) Find out the encoding in the data file.

        In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
        the bottom of the save prompt dialog.

        ISO-8859-15 in my case.

        2) Look up encoding corresponding to ISO-8859-15 at



        3) Applying the decode/encode recipe suggested previously, for which I
        do understand the reason now.

        #converting rawdescr
        #from ISO-8859-15 (from the file)
        #to UTF-8 (what postgresql wants)
        #no error handler required.
        decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')

        postgresql insert is done using decodeddescr variable.

        Postgresql is happy, I'm happy.
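
As a small illustration of the recipe in step 3, here is a Python 3 sketch. The sample bytes are hypothetical (the thread does not show the real file contents); they use 0xA4, which is the euro sign in ISO-8859-15.

```python
# Hypothetical sample data: 0xA4 is the euro sign in ISO-8859-15.
rawdescr = b'price: 10 \xa4'

# Decode with the file's encoding, then encode to what the database expects.
decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')

print(decodeddescr)
```

No error handler is needed because every byte value read this way maps to a character, and UTF-8 can encode every character.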


        • 7stud

          #5
          Re: usage of <string>.encode('utf-8','xmlcharrefreplace')?

          On Feb 19, 12:15 am, J Peyret <jpey...@gmail.com> wrote:
          > On Feb 18, 10:54 pm, 7stud <bbxx789_0...@yahoo.com> wrote:
          >
          > > One last point: you can't display a unicode string.  The very act of
          > > trying to print a unicode string causes it to be converted to a
          > > regular string.  If you try to display a unicode string without
          > > explicitly encode()'ing it first, i.e. converting it to a regular
          > > string using a specified secret code--a so called 'codec', python will
          > > implicitly attempt to convert the unicode string to a regular string
          > > using the default codec, which is usually set to ascii.
          >
          > Yes, the string above was obtained by printing, which got it into
          > ASCII format, as you picked up.
          > Something else to watch out for when posting unicode issues.
          >
          > The solution I ended up with was
          >
          > 1) Find out the encoding in the data file.
          >
          > In Ubuntu's gedit editor, menu 'Save As...' displays the encoding at
          > the bottom of the save prompt dialog.
          >
          > ISO-8859-15 in my case.
          >
          > 2) Look up encoding corresponding to ISO-8859-15 at
          >
          >
          >
          > 3) Applying the decode/encode recipe suggested previously, for which I
          > do understand the reason now.
          >
          > #converting rawdescr
          > #from ISO-8859-15 (from the file)
          > #to UTF-8 (what postgresql wants)
          > #no error handler required.
          > decodeddescr = rawdescr.decode('iso8859_15').encode('utf-8')
          >
          > postgresql insert is done using decodeddescr variable.
          >
          > Postgresql is happy, I'm happy.

          Or, you can cheat. If you are reading from a file, you can set
          it up so that any string you read from the file automatically gets
          converted from its encoding to another encoding. You don't even have
          to be aware of the fact that a regular string has to be converted into
          a unicode string before it can be converted to a regular string with a
          different encoding. Check out the codecs module and the EncodedFile()
          function:

          import codecs

          s = 'he Company\xef\xbf\xbds ticker'

          f = open('data2.txt', 'w')
          f.write(s)
          f.close()

          f = open('data2.txt')
          # Arguments: file, new encoding, file's encoding
          f_special = codecs.EncodedFile(f, 'utf-8', 'iso8859_15')
          # If your display device understands utf-8, you
          # will see the troublesome character displayed.
          # Are you sure that character is legitimate?
          print f_special.read()

          f.close()
          f_special.close()
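
A Python 3 flavoured sketch of the same idea follows. To keep it self-contained it uses an in-memory `io.BytesIO` object as a hypothetical stand-in for a file on disk, and the sample bytes are made up (0xA4 is the euro sign in ISO-8859-15).

```python
import codecs
import io

# Hypothetical stand-in for a file on disk: bytes stored as ISO-8859-15.
source = io.BytesIO(b'price: 10 \xa4\n')

# Arguments: file object, encoding wanted when reading, encoding of the file.
recoded = codecs.EncodedFile(source, 'utf-8', 'iso8859_15')

data = recoded.read()   # bytes, transcoded to UTF-8 on the way out
print(data)
```

With a real file, the object passed to `codecs.EncodedFile()` would presumably be opened in binary mode, since the wrapper works on bytes.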



