Unicode / cx_Oracle problem

  • Richard Schulman

    Unicode / cx_Oracle problem

    Sorry to be back at the goodly well so soon, but...

    ...when I execute the following -- variable mean_eng_txt being
    utf-16LE and its datatype nvarchar2(79) in Oracle:

    cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
    VALUES (:id, :mean)""", id=id, mean=mean)

    I not surprisingly get this error message:

    "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
    type unicode"

    But when I try putting a codecs.BOM_UTF16_LE in various plausible
    places, I just end up generating different errors.

    Recommendations, please?

    TIA,
    Richard Schulman
    (Remove xx for email reply)
  • Diez B. Roggisch

    #2
    Re: Unicode / cx_Oracle problem

    Richard Schulman schrieb:
    > Sorry to be back at the goodly well so soon, but...
    >
    > ...when I execute the following -- variable mean_eng_txt being
    > utf-16LE and its datatype nvarchar2(79) in Oracle:
    >
    > cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
    > VALUES (:id, :mean)""", id=id, mean=mean)
    >
    > I not surprisingly get this error message:
    >
    > "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
    > type unicode"
    >
    > But when I try putting a codecs.BOM_UTF16_LE in various plausible
    > places, I just end up generating different errors.

    Show us the alleged plausible places, and the different errors.
    Otherwise it's crystal ball time again.

    Diez


    • Richard Schulman

      #3
      Re: Unicode / cx_Oracle problem

      >> cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
      >> VALUES (:id, :mean)""", id=id, mean=mean)
      >> ...
      >> "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
      >> type unicode"
      >>
      >> But when I try putting a codecs.BOM_UTF16_LE in various plausible
      >> places, I just end up generating different errors.

      Diez:
      > Show us the alleged plausible places, and the different errors.
      > Otherwise it's crystal ball time again.

      More usefully, let's just try to fix the code above. Here's the error
      message I get:

      NotSupportedError: Variable_TypeByValue(): unhandled data type unicode

      Traceback (innermost last):

      File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
      # LoadMeanToOra reads a UTF-16LE input file one record at a time
      File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
      cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)

      What I can't figure out is whether cx_Oracle is saying it can't handle
      Unicode for an Oracle nvarchar2 data type or whether it can handle the
      input but that it needs to be in a specific format that I'm not
      supplying.

      - Richard Schulman


      • Diez B. Roggisch

        #4
        Re: Unicode / cx_Oracle problem

        Richard Schulman schrieb:
        >> cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
        >> VALUES (:id, :mean)""", id=id, mean=mean)
        >> ...
        >> "cx_Oracle.NotSupportedError: Variable_TypeByValue(): unhandled data
        >> type unicode"
        >>
        >> But when I try putting a codecs.BOM_UTF16_LE in various plausible
        >> places, I just end up generating different errors.
        >
        > Diez:
        >> Show us the alleged plausible places, and the different errors.
        >> Otherwise it's crystal ball time again.
        >
        > More usefully, let's just try to fix the code above. Here's the error
        > message I get:
        >
        > NotSupportedError: Variable_TypeByValue(): unhandled data type unicode
        >
        > Traceback (innermost last):
        >
        > File "c:\pythonapps\LoadMeanToOra.py", line 1, in ?
        > # LoadMeanToOra reads a UTF-16LE input file one record at a time
        > File "c:\pythonapps\LoadMeanToOra.py", line 23, in ?
        > cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
        >
        > What I can't figure out is whether cx_Oracle is saying it can't handle
        > Unicode for an Oracle nvarchar2 data type or whether it can handle the
        > input but that it needs to be in a specific format that I'm not
        > supplying.

        What does

        print repr(mean)

        give you?

        It _looks_ to me (I don't have an Oracle available right now) as if it
        is a unicode object. That you have to regard as an abstract string
        representation, which means it has to be encoded in some way before
        being sent over the wire. There might exist DB-API bindings that can
        deal with unicode objects, by applying a default encoding or somehow
        figuring out what encoding the DB expects. But I don't see any
        references to unicode in PEP 249, so I presume you can't rely on
        that - which seems to be the case here.

        The Oracle NLS is a sometimes tricky beast: it tries to be clever and
        assigns an existing connection some encoding based on the
        user's/machine's locale. That can yield unexpected results, such as
        "Dusseldorf" instead of "Düsseldorf" when querying a German city list
        with an English locale.

        So you have to figure out what encoding your DB connection expects.
        You can do so by issuing some queries against the session tables, I
        believe - I don't have my Oracle resources at home, but googling will
        get you there; the important Oracle term is NLS.

        Then you need to encode the unicode string before passing it -
        something like this:

        mean = mean.encode("latin1")

        That should help.

        Diez
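A minimal sketch of the encode-before-bind idea Diez describes (modern Python 3 syntax for clarity; the thread's code is Python 2-era, and the UTF-8 session encoding here is an assumption, not the poster's actual NLS setting):

```python
# Encode the unicode text yourself before binding, rather than relying
# on driver defaults. Assumes the DB session expects UTF-8.
mean = u"D\u00fcsseldorf"          # a unicode object, as repr() would show
encoded = mean.encode("utf-8")     # byte string safe to pass to a driver
assert encoded.decode("utf-8") == mean

# latin1, as suggested above, covers Western European text only; a
# character outside it (e.g. Chinese U+54C0) raises UnicodeEncodeError.
try:
    u"\u54c0".encode("latin1")
except UnicodeEncodeError:
    pass  # expected: latin1 cannot represent this character
```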


        • Richard Schulman

          #5
          Re: Unicode / cx_Oracle problem

          On Sun, 10 Sep 2006 11:42:26 +0200, "Diez B. Roggisch"
          <deets@nospam.web.de> wrote:
          > What does print repr(mean) give you?

          That is a useful suggestion.

          For context, I reproduce the source code:

          in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16_LE")
          connection = cx_Oracle.connect("username", "password")
          cursor = connection.cursor()
          for row in in_file:
              id = row[0]
              mean = row[1]
              print "Value of row is ", repr(row) #debug line
              print "Value of the variable 'id' is ", repr(id) #debug line
              print "Value of the variable 'mean' is ", repr(mean) #debug line
              cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
              VALUES (:id, :mean)""", id=id, mean=mean)

          Here is the result from the print repr() statements:

          Value of row is u"\ufeff(3,'sadness, lament; sympathize with,
          pity')\r\n"
          Value of the variable 'id' is u'\ufeff'
          Value of the variable 'mean' is u'('

          Clearly, the values loaded into the 'id' and 'mean' variables are not
          satisfactory but are picking up the BOM.
          > ...
          > The Oracle NLS is a sometimes tricky beast: it tries to be clever
          > and assigns an existing connection some encoding based on the
          > user's/machine's locale. That can yield unexpected results, such as
          > "Dusseldorf" instead of "Düsseldorf" when querying a German city
          > list with an English locale.

          Agreed.

          > So you have to figure out what encoding your DB connection expects.
          > You can do so by issuing some queries against the session tables, I
          > believe - I don't have my Oracle resources at home, but googling
          > will get you there; the important Oracle term is NLS.

          It's very hard to figure out what to do on the basis of complexities
          on the order of

          (tiny equivalent: http://tinyurl.com/fnc54)

          But I'm not even sure I got that far. My problems so far seem prior:
          in Python or Python's cx_Oracle driver. To be candid, I'm very
          tempted at this point to abandon the Python effort and revert to an
          all-UCS2 environment, much as I dislike Java and C#'s complexities
          and the poor support available for all-Java databases.

          > Then you need to encode the unicode string before passing it -
          > something like this:
          >
          > mean = mean.encode("latin1")

          I don't see how the Chinese characters embedded in the English text
          will carry over if I do that.

          In any case, thanks for your patient and generous help.

          Richard Schulman
          Delete the antispamming 'xx' characters for email reply


          • Diez B. Roggisch

            #6
            Re: Unicode / cx_Oracle problem

            > Value of the variable 'id' is u'\ufeff'
            > Value of the variable 'mean' is u'('

            So they both are unicode objects - as I presumed.

            > It's very hard to figure out what to do on the basis of
            > complexities on the order of
            >
            > (tiny equivalent: http://tinyurl.com/fnc54)

            Yes, that is somewhat intimidating.

            > But I'm not even sure I got that far. My problems so far seem
            > prior: in Python or Python's cx_Oracle driver. To be candid, I'm
            > very tempted at this point to abandon the Python effort and
            > revert to an all-UCS2 environment, much as I dislike Java and
            > C#'s complexities and the poor support available for all-Java
            > databases.

            That actually doesn't help you much, I guess - just because JDBC
            will convert Java's unicode strings to byte strings behind the
            curtains, you will lose all encoding information nonetheless -
            especially if the DB itself isn't running an encoding that will
            allow for all possible unicode characters to be represented.

            >> Then you need to encode the unicode string before passing it -
            >> something like this:
            >>
            >> mean = mean.encode("latin1")
            >
            > I don't see how the Chinese characters embedded in the English
            > text will carry over if I do that.

            Me neither, but how could I have foreseen that? So use something
            else instead - utf-8 for example, or whatever the Oracle connection
            will grok.

            I think you should read up on what unicode and encodings are, and
            how they work in Python, and unfortunately how they work in Oracle.
            Because even if you use Java, not understanding how these things
            are connected will bite you at some point.

            Diez


            • John Machin

              #7
              Re: Unicode / cx_Oracle problem

              Richard Schulman wrote:
              > On Sun, 10 Sep 2006 11:42:26 +0200, "Diez B. Roggisch"
              > <deets@nospam.web.de> wrote:
              >
              >> What does print repr(mean) give you?
              >
              > That is a useful suggestion.
              >
              > For context, I reproduce the source code:
              >
              > in_file = codecs.open("c:\\pythonapps\\mean.my", encoding="utf_16_LE")
              > connection = cx_Oracle.connect("username", "password")
              > cursor = connection.cursor()
              > for row in in_file:
              >     id = row[0]
              >     mean = row[1]
              >     print "Value of row is ", repr(row) #debug line
              >     print "Value of the variable 'id' is ", repr(id) #debug line
              >     print "Value of the variable 'mean' is ", repr(mean) #debug line
              >     cursor.execute("""INSERT INTO mean (mean_id, mean_eng_txt)
              >     VALUES (:id, :mean)""", id=id, mean=mean)
              >
              > Here is the result from the print repr() statements:
              >
              > Value of row is u"\ufeff(3,'sadness, lament; sympathize with,
              > pity')\r\n"
              > Value of the variable 'id' is u'\ufeff'
              > Value of the variable 'mean' is u'('
              >
              > Clearly, the values loaded into the 'id' and 'mean' variables
              > are not satisfactory but are picking up the BOM.
              Well of course they're "unsatisfactory", and this is absolutely
              nothing to do with Oracle and cx_Oracle.

              row is a string of characters. row[0] is the BOM. Read my lips
              (from a previous thread):

              """
              Use utf_16 -- it will strip off the BOM for you.
              """
              and again:
              """
              | >>> codecs.open('guff.utf16le', 'r', encoding='utf_16').read()
              | u'abc\n\rdef\n\rghi' ######### Look, Mom, no BOM!
              """

              row[1] is the first ***character*** of what looks suspiciously
              like the Python representation of a tuple:

              """(3,'sadness, lament; sympathize with, pity')"""

              Who wrote that like that??? If it is at all under your control, do it
              like this:
              Encode each Unicode text field in UTF-8. Write the file as a CSV file
              using Python's csv module. Read the CSV file using the same module.
              Decode the text fields from UTF-8.

              You need to parse the incoming line into column values (the csv module
              does this for you) and then convert each column value from
              string/Unicode to a Python type that is compatible with the Oracle type
              for that column.
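A sketch of that round trip (Python 3's csv module handles unicode text directly, so the explicit UTF-8 encode/decode of the Python 2 era is shown separately on a field value; the record content is the thread's example data):

```python
import csv
import io

# Write one record as CSV, then read it back and split into column values.
rows = [["3", "sadness, lament; sympathize with, pity"]]
buf = io.StringIO()
csv.writer(buf).writerows(rows)
buf.seek(0)
parsed = next(csv.reader(buf))
assert parsed == ["3", "sadness, lament; sympathize with, pity"]

# In Python 2 each text field would be a UTF-8 byte string on disk;
# decoding recovers the unicode text, including Chinese characters.
field = u"sympathize \u54c0".encode("utf-8")
assert field.decode("utf-8") == u"sympathize \u54c0"
```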

              My guess (not having used cx_Oracle) is that the error is happening
              because the column "id" has a numeric type and you are trying to jam a
              Unicode string into it. IOW, nothing to do with the "mean" column
              (yet!).

              BTW, I've managed to decode that "eng" means English not engineering
              and "mean" means meaning i.e. not average and not stingy. Holy
              obfuscation, Batman!

              HTH,
              John


              • Richard Schulman

                #8
                Re: Unicode / cx_Oracle problem

                On 10 Sep 2006 15:27:17 -0700, "John Machin"
                <sjmachin@lexicon.net> wrote:
                > ...
                > Encode each Unicode text field in UTF-8. Write the file as a
                > CSV file using Python's csv module. Read the CSV file using
                > the same module. Decode the text fields from UTF-8.
                >
                > You need to parse the incoming line into column values (the
                > csv module does this for you) and then convert each column
                > value from string/Unicode to a Python type that is compatible
                > with the Oracle type for that column.
                > ...

                John, how am I to reconcile your suggestions above with my
                ActivePython 2.4 documentation, which states:

                <<12.20 csv -- CSV File Reading and Writing
                <<New in version 2.3.
                ...
                <<Note: This version of the csv module doesn't support Unicode
                input. Also, there are currently some issues regarding ASCII
                NUL characters. Accordingly, all input should generally be
                printable ASCII to be safe. These restrictions will be removed
                in the future.>>

                Regards,
                Richard Schulman


                • John Machin

                  #9
                  Re: Unicode / cx_Oracle problem

                  Richard Schulman wrote:
                  > On 10 Sep 2006 15:27:17 -0700, "John Machin"
                  > <sjmachin@lexicon.net> wrote:
                  >> ...
                  >> Encode each Unicode text field in UTF-8. Write the file as
                  >> a CSV file using Python's csv module. Read the CSV file
                  >> using the same module. Decode the text fields from UTF-8.
                  >>
                  >> You need to parse the incoming line into column values
                  >> (the csv module does this for you) and then convert each
                  >> column value from string/Unicode to a Python type that is
                  >> compatible with the Oracle type for that column.
                  >> ...
                  >
                  > John, how am I to reconcile your suggestions above with my
                  > ActivePython 2.4 documentation, which states:
                  >
                  > <<12.20 csv -- CSV File Reading and Writing
                  > <<New in version 2.3.
                  > ...
                  > <<Note: This version of the csv module doesn't support
                  > Unicode input. Also, there are currently some issues
                  > regarding ASCII NUL characters. Accordingly, all input
                  > should generally be printable ASCII to be safe. These
                  > restrictions will be removed in the future.>>
                  1. For "Unicode" read "UTF-16".

                  2. Unless you have \u0000 in your Unicode data, encoding it into UTF-8
                  won't cause any ASCII NUL bytes to appear. Ensuring that you don't have
                  NULs in your data is a good idea in general.

                  3. There are also evidently some issues regarding ASCII LF characters
                  embedded in fields (like when Excel users do Alt-Enter (Windows
                  version) to put a hard line break in their headings); see
                  http://docs.python.org/dev/whatsnew/modules.html of which the following
                  is an extract:
                  """
                  The CSV parser is now stricter about multi-line quoted fields.
                  Previously, if a line ended within a quoted field without a terminating
                  newline character, a newline would be inserted into the returned field.
                  This behavior caused problems when reading files that contained
                  carriage return characters within fields, so the code was changed to
                  return the field without inserting newlines. As a consequence, if
                  newlines embedded within fields are important, the input should be
                  split into lines in a manner that preserves the newline characters.
                  """

                  4. Provided your fields don't contain any of CR, LF, ctrl-Z
                  (maybe), and NUL, you should be OK. I can't understand the
                  sentence "Accordingly, all input should generally be
                  printable ASCII to be safe." -- especially the
                  "accordingly". If it was running amok with 8-bit characters
                  with ord(c) >= 128, there would have been loud shrieks from
                  several corners of the globe.

                  5. However, to be safe, you could go the next step and
                  convert the UTF-8 to base64 -- see
                  http://docs.python.org/dev/lib/module-base64.html -- BUT once
                  you've done that your encoded data doesn't even have commas
                  and quotes in it, so you can avoid the maybe unsafe csv
                  module and just write your data as
                  ",".join(base64_encoded_fields).

                  HTH,
                  John




