Psycopg and queries with UTF-8 data

  • Alban Hertroys

    Psycopg and queries with UTF-8 data

    Another python/psycopg question, for which the solution is probably
    quite simple; I just don't know where to look.

    I have a query that inserts data originating from an utf-8 encoded XML
    file. And guess what, it contains utf-8 encoded characters...
    Now my problem is that psycopg will only accept queries of type str, so
    how do I get my utf-8 encoded data into the DB?

    I can't do query.encode('ascii'), that would be similar to:
    >>> x = u'\xc8'
    >>> print x.encode('ascii')
    Traceback (most recent call last):
    File "<stdin>", line 1, in ?
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in
    position 0: ordinal not in range(128)


    I also tried setting PostgreSQL's client-encoding by executing "SET
    client_encoding TO 'utf-8'", but psycopg still only accepts str-type
    strings (which is not really surprising).


    I assume that the solution will result in an ascii encoded query string,
    and that I then can use the QuotedString type to escape my strings
    (which is in my current situation not possible because that also only
    accepts str type strings and it contains utf-8 characters).

    Regards,
    Alban.
  • Diez B. Roggisch

    #2
    Re: Psycopg and queries with UTF-8 data

    Alban Hertroys wrote:
    > I have a query that inserts data originating from an utf-8 encoded XML
    > file. And guess what, it contains utf-8 encoded characters...
    > Now my problem is that psycopg will only accept queries of type str, so
    > how do I get my utf-8 encoded data into the DB?

    This sounds like the usual unicode/utf-8 confusion: unicode is an abstract
    specification of characters, while utf-8, latin1 and ascii are encodings
    of that specification, each able to represent a certain set of characters.
    ascii covers only the well-known first 128 (code points 0 to 127), latin1
    covers the first 256, which is enough for some major european languages,
    and utf-8 defines byte sequences for all possible characters defined in
    unicode, with the result that some characters take more than one byte.

    So unicode objects encapsulate an abstract unicode character sequence; how
    they accomplish that is not your concern. Strings, on the other hand, are
    pure byte sequences, and common libs work with them, with the exception of
    the usually unicode-aware xml libs. So to get a string from a unicode
    object, one has to specify an encoding, like utf-8 or latin1. If the
    unicode object contains a character that can't be encoded using the
    specified encoding, that will produce an error.
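
To make the difference between these encodings concrete, here is a small sketch. It is written for Python 3 (an assumption, since the thread itself uses Python 2), where `str` plays the role of the `unicode` type discussed here and `bytes` plays the role of the old `str`. The helper name `can_encode` is made up for illustration.

```python
def can_encode(text, codec):
    """Return True if every character in text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Three sample characters: plain ASCII, one in latin1, one outside latin1.
samples = {
    "A": "A",             # U+0041
    "E-grave": "\u00c8",  # U+00C8
    "euro": "\u20ac",     # U+20AC
}

for name, char in samples.items():
    support = {c: can_encode(char, c) for c in ("ascii", "latin-1", "utf-8")}
    print(name, support, "utf-8 bytes:", len(char.encode("utf-8")))
```

Only utf-8 accepts all three samples; the other two codecs raise UnicodeEncodeError as soon as a character falls outside their range.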


    Please do read a tutorial on unicode and python - there are several good
    ones out there, use google to your advantage.
    > I can't do query.encode('ascii'), that would be similar to:
    > >>> x = u'\xc8'
    > >>> print x.encode('ascii')
    > Traceback (most recent call last):
    > File "<stdin>", line 1, in ?
    > UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in
    > position 0: ordinal not in range(128)

    Sure: 0xC8 > 127, so it can't be encoded in ascii. Do this:

    >>> x = u'\xc8'
    >>> x
    u'\xc8'
    >>> x.encode('utf-8')
    '\xc3\x88'

    As you can see, the single character becomes two bytes. The reason is
    that this one unicode character is translated to that 2-byte sequence
    by utf-8.
    > I also tried setting PostgreSQL's client-encoding by executing "SET
    > client_encoding TO 'utf-8'", but psycopg still only accepts str-type
    > strings (which is not really surprising).

    Confusion again - please repeat:

    unicode is not utf-8!!!
    unicode is not utf-8!!!
    unicode is not utf-8!!!
    unicode is not utf-8!!!

    Do encode the unicode object in utf-8, and pass that to psycopg. If you
    set client_encoding to latin1, you have to encode the unicode object to
    that instead.
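
The rule that the bytes must match client_encoding can be sketched like this. It is Python 3 (an assumption, the thread uses Python 2) and deliberately does not call psycopg at all. The mapping `PG_TO_PYTHON_CODEC` and the helper `encode_for_client` are hypothetical names invented for this illustration, and only the two encodings mentioned in the thread are covered.

```python
# Hypothetical mapping from PostgreSQL client_encoding names to Python codec
# names; only the two encodings mentioned in this thread are listed.
PG_TO_PYTHON_CODEC = {
    "UTF8": "utf-8",
    "LATIN1": "latin-1",
}


def encode_for_client(text, client_encoding):
    """Encode text to bytes using the codec that matches client_encoding."""
    # Normalise spellings such as 'utf-8', 'UTF8' or 'latin1' to one key.
    key = client_encoding.upper().replace("-", "").replace("_", "")
    codec = PG_TO_PYTHON_CODEC[key]
    return text.encode(codec)


name = "\u00c8lise"
print(encode_for_client(name, "utf-8"))   # two bytes for the first character
print(encode_for_client(name, "latin1"))  # one byte for the first character
```

A character that latin1 cannot represent, such as the euro sign, raises UnicodeEncodeError in the latin1 case, which is the same kind of error shown earlier for ascii.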

    --
    Regards,

    Diez B. Roggisch


    • Jarek Zgoda

      #3
      Re: Psycopg and queries with UTF-8 data

      Alban Hertroys <alban@magproductions.nl> wrote:

      > I have a query that inserts data originating from an utf-8 encoded XML
      > file. And guess what, it contains utf-8 encoded characters...
      > Now my problem is that psycopg will only accept queries of type str, so
      > how do I get my utf-8 encoded data into the DB?
      >
      > I can't do query.encode('ascii'), that would be similar to:
      > >>> x = u'\xc8'
      > >>> print x.encode('ascii')
      > Traceback (most recent call last):
      > File "<stdin>", line 1, in ?
      > UnicodeEncodeError: 'ascii' codec can't encode character u'\xc8' in
      > position 0: ordinal not in range(128)

      Did you try x.encode('utf-8')?

      --
      Jarek Zgoda
      http://jpa.berlios.de/ | http://www.zgodowie.org/


      • Alban Hertroys

        #4
        Re: Psycopg and queries with UTF-8 data

        Diez B. Roggisch wrote:
        > Alban Hertroys wrote:
        >>I have a query that inserts data originating from an utf-8 encoded XML
        >>file. And guess what, it contains utf-8 encoded characters...
        >>Now my problem is that psycopg will only accept queries of type str, so
        >>how do I get my utf-8 encoded data into the DB?
        >
        >
        > This sounds like the usual unicode/utf-8 confusion: unicode is an abstract
        > specification of characters, utf-8 as well as latin1 and ascii are
        > encodings of that specification that allow for certain characters to be
        > used - namely, ascii for only well-known first 127, latin1 for some major
        > european languages, and utf-8 defines escapes for all possible characters
        > defined in unicode - with the result that some of the characters aren't one
        > byte per character anymore.

        Ah, I see now. I _thought_ it was odd that unicode('string') resulted in
        a unicode object and 'string'.encode('utf-8') did not. I understand now
        that 'unicode' is data that is actual unicode data, while 'utf-8'
        _encoded_ data is really a string, but with special characters rewritten
        to specify utf-8 escape sequences instead of the actual unicode bytes.
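
The distinction can be checked directly. A minimal sketch in Python 3 terms (an assumption, since the thread itself uses Python 2): `str` corresponds to the `unicode` type above, and the result of `.encode('utf-8')` is a `bytes` object, which corresponds to the old `str`.

```python
text = "\u00c8"                 # one character, U+00C8
encoded = text.encode("utf-8")  # its UTF-8 byte representation

# The character sequence has length 1, the byte sequence has length 2.
print(type(text).__name__, len(text))
print(type(encoded).__name__, len(encoded))

# Decoding with the same codec recovers the original character sequence.
assert encoded.decode("utf-8") == text
```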

        Thanks for clearing out my confusion.
        > Please do read a tutorial on unicode and python - there are several good
        > ones out there, use google to your advantage.

        I did, though some time ago. Apparently I missed the point being made
        (or forgot about it).
        > Confusion again - please repeat:
        >
        > unicode is not utf-8!!!
        > unicode is not utf-8!!!
        > unicode is not utf-8!!!
        > unicode is not utf-8!!!

        while confused():
            print "unicode is not utf-8!!!"
        > Do encode the unicode object in utf-8, and pass that to the psycopg. If you
        > set client_encoding to latin1, you have to encode unicode to that.

        I suppose I won't notice much of that until I read from the DB (which is
        done in PHP mostly), as the data inserted is already an ascii string by
        itself (with escaped utf-8 characters, though). I'll worry about that
        later ;)

        Many thanks,
        Alban.


        • Diez B. Roggisch

          #5
          Re: Psycopg and queries with UTF-8 data

          > Ah, I see now. I _thought_ it was odd that unicode('string') resulted in
          > a unicode object and 'string'.encode('utf-8') did not. I understand now
          > that 'unicode' is data that is actual unicode data, while 'utf-8'
          > _encoded_ data is really a string, but with special characters rewritten
          > to specify utf-8 escape sequences instead of the actual unicode bytes.

          Exactly.
          > Thanks for clearing out my confusion.

          You're welcome.
          > while confused():
          >     print "unicode is not utf-8!!!"

          Let's hope confused() is True only for a short time, otherwise you'll end up
          with quite a lot of output...
          >> Do encode the unicode object in utf-8, and pass that to the psycopg. If
          >> you set client_encoding to latin1, you have to encode unicode to that.
          >
          > I suppose I won't notice much of that until I read from the DB (which is
          > done in PHP mostly), as the data inserted is already an ascii string by
          > itself (with escaped utf-8 characters, though). I'll worry about that
          > later ;)

          Well, AFAIK php doesn't care about unicode - all it knows are strings as
          byte sequences, plain old C-style. So if you read from it, things should
          work if you set your HTTP header variables correctly _and_ other parts of
          your html page aren't in a different encoding - so make sure that typing
          them in your editor of choice results in utf-8 data being saved.
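
Why the declared encoding has to match the actual bytes can be sketched as follows (Python 3, used here only for illustration; the variable names are made up). The same two bytes give the intended character when read as utf-8 and two unrelated characters when read as latin1, and neither case raises an error.

```python
stored = "\u00c8".encode("utf-8")     # the bytes as stored: 0xC3 0x88

as_utf8 = stored.decode("utf-8")      # correct interpretation
as_latin1 = stored.decode("latin-1")  # wrong interpretation, no error raised

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
```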


          --
          Regards,

          Diez B. Roggisch
