adodbapi / string encoding problem

  • Achim Domma

    adodbapi / string encoding problem

    Hi,

    I read a webpage via urllib2. The result of the 'read' call is of type
    'str'. This string can be written to disc via
    file('out.html', 'w').write(html). Then I write the string into a Memo field
    in an Access database, using adodbapi. If I read the text back I get a
    unicode string, which cannot be written to disc via file(...) due to encoding
    problems. How do I have to decode the unicode string to get my original data
    back?

    regards,
    Achim


  • Alex Martelli

    #2
    Re: adodbapi / string encoding problem

    Achim Domma wrote:
    > Hi,
    >
    > I read a webpage via urllib2. The result of the 'read' call is of type
    > 'str'. This string can be written to disc via
    > file('out.html', 'w').write(html). Then I write the string into a Memo field
    > in an Access database, using adodbapi. If I read the text back I get a
    > unicode string, which cannot be written to disc via file(...) due to
    > encoding problems. How do I have to decode the unicode string to get my
    > original data back?

    You have to *EN*-code Unicode into a string, in the same way the string
    had been *DE*-coded to Unicode originally, in order to be sure to get
    the same string back; specifically, you have to use the same *codec*
    (which stands for COder-DECoder). I don't know what codec adodbapi is
    using. Python's normal default codec is ASCII, which is the "minimum
    common denominator" of just about every encoding around, so if adodbapi
    hadn't surreptitiously inserted a different codec, it's impossible that
    anything would be decoded that might cause problems in encoding it back ;-).
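
A minimal sketch of the round trip described above, written for Python 3, where the thread's 'str' corresponds to bytes and 'unicode' corresponds to str. The sample bytes and the choice of ISO-8859-1 are assumptions for illustration only; nothing here shows what adodbapi or win32com actually do.

```python
# Hypothetical page content: the word "cafe" with an accented e, in ISO-8859-1.
original = b"caf\xe9"

text = original.decode("iso-8859-1")   # DE-code: bytes -> Unicode
same = text.encode("iso-8859-1")       # EN-code with the same codec
other = text.encode("utf-8")           # EN-code with a different codec

print(same == original)    # True
print(other == original)   # False, other is b'caf\xc3\xa9'

# With the ASCII default codec, the non-ASCII byte cannot be decoded at all.
try:
    original.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)            # False
```

Only the codec that was used for decoding is guaranteed to give the original bytes back.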


    Alex


    • Peter Otten

      #3
      Re: adodbapi / string encoding problem

      Achim Domma wrote:
      > I read a webpage via urllib2. The result of the 'read' call is of type
      > 'str'. This string can be written to disc via
      > file('out.html', 'w').write(html). Then I write the string into a Memo field
      > in an Access database, using adodbapi. If I read the text back I get a
      > unicode string, which cannot be written to disc via file(...) due to
      > encoding problems. How do I have to decode the unicode string to get my
      > original data back?

      You have to know the encoding of the original file.

      Assuming (1) you had western european characters including the euro sign,
      (2) they were correctly translated into unicode and (3) you want them back
      that way:
      >>> s = u"äöüÄÖÜ".encode("iso-8859-15")
      >>> s
      '\xe4\xf6\xfc\xc4\xd6\xdc'
      >>> print s
      äöüÄÖÜ
      >>> type(s)
      <type 'str'>

      Or, more generally:

      unicodeFromAccess.encode(targetEncoding)
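
A sketch of that general form carried through to the disk write, in Python 3 (bytes for the thread's 'str', str for 'unicode'). The variable names, the sample text and the ISO-8859-15 target are assumptions; the real target encoding has to be the one the page was served in.

```python
import os
import tempfile

# Stand-in for the Unicode value read back from the Memo field (hypothetical):
# a-umlaut, o-umlaut, u-umlaut, their capitals, and the euro sign.
unicode_from_access = "\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u20ac"
target_encoding = "iso-8859-15"   # assumption: encoding of the downloaded page

data = unicode_from_access.encode(target_encoding)

# Write in binary mode so nothing re-encodes the bytes on the way to disk.
path = os.path.join(tempfile.mkdtemp(), "out.html")
with open(path, "wb") as f:
    f.write(data)

with open(path, "rb") as f:
    read_back = f.read()

print(read_back == data)                      # True
print(len(unicode_from_access), len(data))    # 7 7
```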

      Peter


      • Achim Domma

        #4
        Re: adodbapi / string encoding problem

        "Alex Martelli" <aleax@aleax.it> wrote in message
        news:0ZAcb.118994$hE5.4097227@news1.tin.it...
        > You have to *EN*-code Unicode into a string, in the same way the string
        > had been *DE*-coded to Unicode originally, in order to be sure to get
        > the same string back; specifically, you have to use the same *codec*
        [...]

        Thanks Alex,

        I understand that, but looking at the adodbapi code I could not find any
        call to encode/decode. The conversion seems to happen somewhere in win32com.
        Don't know if you will ever get your data back, once it's converted to
        Variant. ;-)

        Achim



        • Achim Domma

          #5
          Re: adodbapi / string encoding problem

          "Peter Otten" <__peter__@web.de> wrote in message
          news:bkumfg$ifj$01$1@news.t-online.com...
          > You have to know the encoding of the original file.

          Why? It's of type 'str' and I would expect that I could write it to DB and
          get the same 'str' back. That's all I want. Why is it required to know the
          encoding?

          Achim



          • Peter Otten

            #6
            Re: adodbapi / string encoding problem

            Achim Domma wrote:
            >> You have to know the encoding of the original file.
            >
            > Why? It's of type 'str' and I would expect that I could write it to DB and
            > get the same 'str' back. That's all I want. Why is it required to know the
            > encoding?

            str is essentially a sequence of bytes that can store the same content in
            different ways:
            >>> utf8 = u"ä".encode("utf8")
            >>> latin = u"ä".encode("latin1")
            >>> latin
            '\xe4'
            >>> utf8
            '\xc3\xa4'

            Now imagine you store the latter byte sequence in your database and want to
            display it in your windows editor
            >>> print utf8
            Ã¤
            (you should see two strange characters)

            I had this problem occasionally when I edited Python scripts with IDLE and,
            oddly enough, my old C++ Builder 3 IDE.

            Unicode was introduced to avoid such ambiguities. Now I guess that the first
            conversion, when your string data is fed to the db api, is performed
            automatically using the default encoding of your environment, which may
            differ from the encoding of the downloaded file, thus probably messing up
            some characters.
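
A sketch of how such a mismatch garbles text, in Python 3 terms (bytes for the thread's 'str'). The use of cp1252 as the environment's default encoding is an assumption for illustration; the thread does not establish which codec the database layer really applies.

```python
utf8_bytes = "\u00e4".encode("utf-8")   # b'\xc3\xa4', a-umlaut in UTF-8

# Assumption: the layer in front of the database decodes with a Windows
# code page such as cp1252 instead of UTF-8.
stored = utf8_bytes.decode("cp1252")
print(len(stored))                               # 2 (two strange characters)

# Encoding with the codec that was actually used recovers the bytes...
print(stored.encode("cp1252") == utf8_bytes)     # True
# ...while encoding with a different codec does not.
print(stored.encode("utf-8") == utf8_bytes)      # False
```

So the data is only recoverable if the codec used for the automatic conversion is known and can represent every byte in the input.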

            Of course you could store the file in binary form (not in a memo field) in
            your db and thus bypass all encoding mechanisms, but if you still think
            that a string is a string is a string, you should reread the above or
            go for more detailed information on the matter.

            Peter



            • Achim Domma

              #7
              Re: adodbapi / string encoding problem

              "Peter Otten" <__peter__@web.de> wrote in message
              news:bkuu57$pc6$01$1@news.t-online.com...
              > str is essentially a sequence of bytes that can store the same content in
              > different ways:

              That's clear so far ...
              > Of course you could store the file in binary form (not in a memo field) in
              > your db and thus bypass all encoding mechanisms, but if you still think
              > that a string is a string is a string, you should reread the above or
              > go for more detailed information on the matter.

              ... and that's exactly what I was looking for and what I would expect. My
              string is a sequence of bytes, which I want to store in the database. And
              exactly that sequence is what I want to have back. The encoding of the data
              is stored in an extra column and handling of this information takes place in
              another part of the application. But there are points where I need the
              original data, so it's required for me to save and retrieve the string in
              exactly the way I get it from the web.

              BTW: How would you save binary data in an Access database? Access knows only
              Memo fields or am I wrong?

              Achim



              • Peter Otten

                #8
                Re: adodbapi / string encoding problem

                Achim Domma wrote:
                > BTW: How would you save binary data in an Access database? Access knows
                > only Memo fields or am I wrong?

                CREATE TABLE Bogus (TheFile BINARY);

                might do to create the "Bogus" table with a binary "TheFile" field.
                As of Access 2000, I think the BINARY datatype is not exposed in the table
                designer, so you have to type the SQL into the query designer and then
                execute the query.

                I have never used it, so the above might or might not work.

                Peter


                • Dennis Lee Bieber

                  #9
                  Re: adodbapi / string encoding problem

                  Achim Domma fed this fish to the penguins on Thursday 25 September 2003
                  04:52 am:
                  >
                  > Memo field in an Access database, using adodbapi. If I read the text
                  > back I get a unicode string, which cannot be written to disc via
                  > file(...) due to encoding problems. How do I have to decode the
                  > unicode string to get my original data back?
                  >
                  I suspect you are running on an NT-family machine. As I recall, NT
                  uses unicode internally, whereas the W9x-family still used ASCII. Many
                  of the system calls have variations with an "A" at the end of the name
                  to emphasize the use of ASCII data.

                  The conversion to unicode is probably being performed by the JET
                  engine on writes -- by detecting the lack of a unicode prefix, maybe?
                  However, retrieval is probably using the non-A system calls, leaving
                  the data in unicode (on unicode OS, on W9x it likely stays ASCII in
                  both directions).

                  Suspect you'll need to determine what unicode encoding is used by
                  Windows.

                  --
                  > ============================================================== <
                  > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
                  > wulfraed@dm.net | Bestiaria Support Staff <
                  > ============================================================== <
                  > Bestiaria Home Page: http://www.beastie.dm.net/ <
                  > Home Page: http://www.dm.net/~wulfraed/ <


                  • Alex Martelli

                    #10
                    Re: adodbapi / string encoding problem

                    Achim Domma wrote:
                    > "Peter Otten" <__peter__@web.de> wrote in message
                    > news:bkumfg$ifj$01$1@news.t-online.com...
                    >> You have to know the encoding of the original file.
                    >
                    > Why? It's of type 'str' and I would expect that I could write it to DB and
                    > get the same 'str' back. That's all I want. Why is it required to know the
                    > encoding?

                    Because the Access engine (actually known as Microsoft Jet: "Access" is
                    only, strictly a *FRONT-END* product -- marketing terminology confusion)
                    stores all text strings as Unicode; and COM (thus ADO) also uses Unicode
                    exclusively for all text strings (as a rule). If you cannot move to
                    better engines and interfaces, you're stuck with the ones you have...
                    (99 times out of 100, moving to better engines and interfaces -- e.g.
                    SQLite and PySQLite, or Firebird, etc, is preferable from most points
                    of view -- but 1% of the time one must keep supporting legacy code...).
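
For comparison, a sketch of what the byte-for-byte round trip looks like with SQLite, using the sqlite3 module from the Python 3 standard library rather than the PySQLite package of the time. The table layout, the column names and the URL are hypothetical; the separate encoding column mirrors the design Achim describes.

```python
import sqlite3

# Every possible byte value once, so no particular text encoding is assumed.
raw = bytes(range(256))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, encoding TEXT, body BLOB)")
conn.execute(
    "INSERT INTO pages (url, encoding, body) VALUES (?, ?, ?)",
    ("http://example.invalid/", "iso-8859-15", raw),
)
(body,) = conn.execute("SELECT body FROM pages").fetchone()
conn.close()

# A bytes parameter is stored as a BLOB and comes back as bytes, untouched.
print(type(body).__name__, body == raw)   # bytes True
```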


                    Alex


                    • Alex Martelli

                      #11
                      Re: adodbapi / string encoding problem

                      Achim Domma wrote:
                      > "Alex Martelli" <aleax@aleax.it> wrote in message
                      > news:0ZAcb.118994$hE5.4097227@news1.tin.it...
                      >> You have to *EN*-code Unicode into a string, in the same way the string
                      >> had been *DE*-coded to Unicode originally, in order to be sure to get
                      >> the same string back; specifically, you have to use the same *codec*
                      > [...]
                      >
                      > Thanks Alex,
                      >
                      > I understand that, but looking at the adodbapi code I could not find any
                      > call to encode/decode. The conversion seems to happen somewhere in
                      > win32com. Don't know if you will ever get your data back, once it's
                      > converted to Variant. ;-)

                      So, take control of your destiny: since you know you're using tools
                      that can only deal with Unicode (and thus will inevitably convert --
                      in ways that perhaps you don't know -- if you pass them bytestrings),
                      preempt their "unknown and unwanted" conversion by doing a Unicode
                      conversion yourself in ways you DO know and control. UTF-16 sticks
                      2 bytes into each Unicode character -- you do need to be working with
                      strings of EVEN length, though. Or else you can use, e.g., ISO-8859-1,
                      and resign yourself to spending one Unicode character per byte in
                      your "real" byte-string.
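
A sketch of both options in Python 3 terms (bytes for the thread's 'str', str for 'unicode'). The payloads are made up, and "utf-16-le" is chosen here only to avoid a byte-order mark; whether the resulting Unicode text survives a trip through Jet and ADO unchanged is an assumption that would need testing.

```python
# Every byte value once (hypothetical payload).
raw = bytes(range(256))

# ISO-8859-1 maps byte n to code point n, so any byte string decodes and
# re-encodes unchanged, at one Unicode character per byte.
as_latin1 = raw.decode("iso-8859-1")
print(len(as_latin1), as_latin1.encode("iso-8859-1") == raw)   # 256 True

# UTF-16 packs two bytes into each character, hence the even-length rule.
sample = b"<html>"                 # 6 bytes
packed = sample.decode("utf-16-le")
print(len(packed), packed.encode("utf-16-le") == sample)       # 3 True

# An odd-length byte string cannot be decoded this way.
try:
    b"<html>\n".decode("utf-16-le")
    odd_ok = True
except UnicodeDecodeError:
    odd_ok = False
print(odd_ok)                      # False
```

Not every byte pair is valid UTF-16 (pairs that land in the surrogate range D800 to DFFF can fail to decode), so the ISO-8859-1 variant is the safer of the two for arbitrary data.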

                      Or else, of course, you can use a "BLOB" field instead of a text
                      one; I think the keyword for that in the Jet engine's DDL SQL is
                      BINARY. If you DO need to use Access to manipulate your db, though
                      (and I can see deucedly few other reasons to use a Jet engine...),
                      I think that might not work -- at least back when I was having to
                      work on MS platform, I seem to recall that Access could not truly
                      support BLOB fields (except perhaps with embedded SQL, but that was
                      not considered acceptable in most Access-addicted shops, since the
                      real reason to use Access was NOT having to understand SQL...;-).


                      Alex
