adodbapi / string encoding problem

**Alex Martelli** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
> Hi,
>
> I read a webpage via urllib2. The result of the 'read' call is of type
> 'str'. This string can be written to disc via
> file('out.html', 'w').write(html). Then I write the string into a Memofield
> in an Access database, using adodbapi. If I read the text back I get a
> unicode string, which cannot be written to disc via file(...) due to
> encoding problems. How do I have to decode the unicode string to get my
> original data back?

You have to *EN*-code Unicode into a string in the same way the string
had been *DE*-coded to Unicode originally, in order to be sure to get
the same string back; specifically, you have to use the same *codec*
(which stands for COder-DECoder). I don't know what codec adodbapi is
using (Python's normal default codec is ASCII, which is the "minimum
common denominator" of just about every encoding around -- if adodbapi
hadn't surreptitiously inserted a different codec, it's impossible that
anything would be decoded that might cause problems in encoding it back;-).
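
A minimal sketch of the round trip described above, written for Python 3 (where the thread's byte `str` corresponds to `bytes` and `unicode` to `str`). The sample text and the codec names are illustrative assumptions; nothing here reflects what adodbapi does internally.

```python
# Round trip: bytes -> Unicode -> bytes, with an explicit codec each way.
original = "Grüße".encode("latin-1")   # pretend these bytes came from the web

text = original.decode("latin-1")      # DE-code the bytes to Unicode
same_codec = text.encode("latin-1")    # EN-code with the same codec
other_codec = text.encode("utf-8")     # EN-code with a different codec

print(same_codec == original)    # the same codec restores the bytes
print(other_codec == original)   # a different codec generally does not
```

The point is only that the encode step has to mirror the decode step; which codec the database layer actually applied still has to be found out separately.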

Alex

**Peter Otten** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
> I read a webpage via urllib2. The result of the 'read' call is of type
> 'str'. This string can be written to disc via
> file('out.html', 'w').write(html). Then I write the string into a Memofield
> in an Access database, using adodbapi. If I read the text back I get a
> unicode string, which cannot be written to disc via file(...) due to
> encoding problems. How do I have to decode the unicode string to get my
> original data back?

You have to know the encoding of the original file.

Assuming (1) you had western european characters including the euro sign,
(2) they were correctly translated into unicode and (3) you want them back
that way:
>>> s = u"äöüÄÖÜ".encode("iso-8859-15")
>>> s
'\xe4\xf6\xfc\xc4\xd6\xdc'
>>> print s
äöüÄÖÜ
>>> type(s)
<type 'str'>
>>>

Or more general:

unicodeFromAccess.encode(targetEncoding)
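
As a hedged illustration of the general form, here is a small Python 3 sketch. `to_bytes`, `unicode_from_access` and `target_encoding` are hypothetical names standing in for the one-liner above, and the sample value is invented rather than read from a real database.

```python
def to_bytes(unicode_from_access, target_encoding):
    """Encode text read back from the database into the wanted byte encoding."""
    return unicode_from_access.encode(target_encoding)

sample = "äöü €"   # stand-in for a value read back from Access

# ISO-8859-15 contains the euro sign, so this succeeds.
iso15 = to_bytes(sample, "iso-8859-15")

# ISO-8859-1 has no euro sign, so the same text cannot be encoded with it.
try:
    to_bytes(sample, "iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

A target encoding that cannot represent every character fails loudly, which is one practical reason the original encoding has to be known.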

Peter

**Achim Domma** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

"Alex Martelli" <aleax@aleax.it > wrote in message
news:0ZAcb.118994$hE5.4097227@news1.tin.it...
> You have to *EN*-code Unicode into a string in the same way the string
> had been *DE*-coded to Unicode originally, in order to be sure to get
> the same string back; specifically, you have to use the same *codec*
[...]

Thanks Alex,

I understand that, but looking at the adodbapi code I could not find any
call to encode/decode. The conversion seems to happen somewhere in win32com.
Don't know if you will ever get your data back, once it's converted to
Variant. ;-)

Achim

**Achim Domma** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

"Peter Otten" <__peter__@web. de> wrote in message
news:bkumfg$ifj$01$1@news.t-online.com...
> You have to know the encoding of the original file.

Why? It's of type 'str' and I would expect that I could write it to the DB and
get the same 'str' back. That's all I want. Why is it required to know the
encoding?

Achim

**Peter Otten** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
>> You have to know the encoding of the original file.
>
> Why? It's of type 'str' and I would expect that I could write it to the DB and
> get the same 'str' back. That's all I want. Why is it required to know the
> encoding?

str is essentially a sequence of bytes that can store the same content in
different ways:
>>> utf8 = u"ä".encode("utf8")
>>> latin = u"ä".encode("latin1")
>>> latin
'\xe4'
>>> utf8
'\xc3\xa4'
>>>

Now imagine you store the latter byte sequence in your database and want to
display it in your windows editor
>>> print utf8
Ã¤
(you should see two strange characters)

I had this problem occasionally when I edited Python scripts with IDLE and,
oddly enough, my old C++ Builder 3 IDE.

To avoid such ambiguities, Unicode was introduced. Now I guess that the first
conversion, when your string data is fed to the DB API, is performed
automatically using the default encoding of your environment, which may
differ from the encoding of the downloaded file, thus probably messing up
some characters.
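
A short Python 3 sketch of that guess. The choice of cp1252 as the "environment default" is purely an assumption for illustration; the thread does not establish which codec the DB layer really uses.

```python
utf8_bytes = "ä".encode("utf-8")        # b'\xc3\xa4', as in the session above

# Assumed implicit conversion: the bytes are decoded with the wrong codec.
garbled = utf8_bytes.decode("cp1252")   # two characters instead of one

# Decoding with the codec that actually produced the bytes gives the real text.
correct = utf8_bytes.decode("utf-8")

# Encoding the garbled text with that same wrong codec still returns the
# original bytes, so the bytes are recoverable once the codec is known.
restored = garbled.encode("cp1252")
```

Note that the text looks wrong in between, yet the byte sequence survives as long as both conversions use the same codec and that codec can represent every byte involved.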

Of course you could store the file in binary form (not in a memo field) in
your db and thus bypass all encoding mechanisms, but if you still think
that a string is a string is a string, you should reread the above or
go for more detailed information on the matter.

Peter

**Achim Domma** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

"Peter Otten" <__peter__@web. de> wrote in message
news:bkuu57$pc6$01$1@news.t-online.com...
> str is essentially a sequence of bytes that can store the same content in
> different ways:

That's clear so far ...
> Of course you could store the file in binary form (not in a memo field) in
> your db and thus bypass all encoding mechanisms, but if you still think
> that a string is a string is a string, you should reread the above or
> go for more detailed information on the matter.

... and that's exactly what I was looking for and what I would expect. My
string is a sequence of bytes, which I want to store in the database. And
exactly that sequence is what I want to have back. The encoding of the data
is stored in an extra column and handling of this information takes place in
another part of the application. But there are points where I need the
original data, so it's required for me to save and retrieve the string in
exactly the way I get it from the web.

BTW: How would you save binary data in an Access database? Access knows only
Memo fields or am I wrong?

Achim

**Peter Otten** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
> BTW: How would you save binary data in an Access database? Access knows
> only Memo fields or am I wrong?

CREATE TABLE Bogus (TheFile BINARY);

might do to create the "Bogus" table with a binary "TheFile" field.
As of Access 2000, I think the BINARY datatype is not exposed in the table
designer, so you have to type the SQL into the query designer and then
execute the query.

I have never used it, so the above might or might not work.

Peter

**Dennis Lee Bieber** · Jul 18 '05, 02:57 AM

Re: adodbapi / string encoding problem

Achim Domma fed this fish to the penguins on Thursday 25 September 2003
04:52 am:
>
> Memofield in an Access database, using adodbapi. If I read the text
> back I get a unicode string, which cannot be written to disc via
> file(...) due to encoding problems. How do I have to decode the
> unicode string to get my original data back?
>
I suspect you are running on an NT-family machine. As I recall, NT
uses unicode internally, whereas the W9x-family still used ASCII. Many
of the system calls have variations with an "A" at the end of the name
to emphasize the use of ASCII data.

The conversion to unicode is probably being performed by the JET
engine on writes -- by detecting the lack of a unicode prefix, maybe?
However, retrieval is probably using the non-A system calls, leaving
the data in unicode (on unicode OS, on W9x it likely stays ASCII in
both directions).

Suspect you'll need to determine what unicode encoding is used by
Windows.

--
> ============================================================== <
> wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
> wulfraed@dm.net | Bestiaria Support Staff <
> ============================================================== <
> Bestiaria Home Page: http://www.beastie.dm.net/ <
> Home Page: http://www.dm.net/~wulfraed/ <

**Alex Martelli** · Jul 18 '05, 02:59 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
> "Peter Otten" <__peter__@web.de> wrote in message
> news:bkumfg$ifj$01$1@news.t-online.com...
>> You have to know the encoding of the original file.
>
> Why? It's of type 'str' and I would expect that I could write it to the DB and
> get the same 'str' back. That's all I want. Why is it required to know the
> encoding?

Because the Access engine (actually known as Microsoft Jet: "Access" is
only, strictly a *FRONT-END* product -- marketing terminology confusion)
stores all text strings as Unicode; and COM (thus ADO) also uses Unicode
exclusively for all text strings (as a rule). If you cannot move to
better engines and interfaces, you're stuck with the ones you have...
(99 times out of 100, moving to better engines and interfaces -- e.g.
SQLite and PySQLite, or Firebird, etc, is preferable from most points
of view -- but 1% of the time one must keep supporting legacy code...).

Alex

**Alex Martelli** · Jul 18 '05, 02:59 AM

Re: adodbapi / string encoding problem

Achim Domma wrote:
> "Alex Martelli" <aleax@aleax.it> wrote in message
> news:0ZAcb.118994$hE5.4097227@news1.tin.it...
>> You have to *EN*-code Unicode into a string in the same way the string
>> had been *DE*-coded to Unicode originally, in order to be sure to get
>> the same string back; specifically, you have to use the same *codec*
> [...]
>
> Thanks Alex,
>
> I understand that, but looking at the adodbapi code I could not find any
> call to encode/decode. The conversion seems to happen somewhere in
> win32com. Don't know if you will ever get your data back, once it's
> converted to Variant. ;-)

So, take control of your destiny: since you know you're using tools
that can only deal with Unicode (and thus will inevitably convert --
in ways that perhaps you don't know -- if you pass them bytestrings),
preempt their "unknown and unwanted" conversion by doing a Unicode
conversion yourself in ways you DO know and control. UTF-16 sticks
2 bytes into each Unicode character -- you do need to be working with
strings of EVEN length, though. Or else you can use, e.g., ISO-8859-1,
and resign yourself to spending one Unicode character per byte in
your "real" byte-string.
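
A hedged Python 3 sketch of the ISO-8859-1 option: every byte value maps to exactly one Unicode character and back. The helper names `bytes_to_text` and `text_to_bytes` are hypothetical, and whether Jet stores all 256 resulting characters unchanged in a text field is an assumption this sketch does not verify.

```python
def bytes_to_text(data):
    """Map each byte to the Unicode character with the same code point."""
    return data.decode("iso-8859-1")

def text_to_bytes(text):
    """Inverse of bytes_to_text: one character back to one byte."""
    return text.encode("iso-8859-1")

payload = bytes(range(256))            # every possible byte value
as_text = bytes_to_text(payload)       # one Unicode character per byte
round_tripped = text_to_bytes(as_text)

# UTF-16 packs two bytes per character, but an arbitrary byte pair is not
# always a valid character: this pair is a lone surrogate.
try:
    b"\x00\xdc".decode("utf-16-le")
    utf16_accepts_any_bytes = True
except UnicodeDecodeError:
    utf16_accepts_any_bytes = False
```

In current Python, then, ISO-8859-1 is the more forgiving choice for arbitrary binary data, at the cost of one character per byte as noted above.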

Or else, of course, you can use a "BLOB" field instead of a text
one; I think the keyword for that in the Jet engine's DDL SQL is
BINARY. If you DO need to use Access to manipulate your db, though
(and I can see deucedly few other reasons to use a Jet engine...),
I think that might not work -- at least back when I was having to
work on MS platform, I seem to recall that Access could not truly
support BLOB fields (except perhaps with embedded SQL, but that was
not considered acceptable in most Access-addicted shops, since the
real reason to use Access was NOT having to understand SQL...;-).
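
For comparison, a minimal sketch of the BLOB approach using the standard-library `sqlite3` module as a stand-in. This is not Jet or adodbapi, so the Jet DDL keyword and Access behaviour discussed above remain untested here; the table name, column names and URL are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT, encoding TEXT, body BLOB)")

raw = "äöü".encode("iso-8859-1")   # bytes exactly as fetched from the web

# The encoding is kept in its own column; the body is stored as raw bytes.
conn.execute(
    "INSERT INTO pages (url, encoding, body) VALUES (?, ?, ?)",
    ("http://example.invalid/", "iso-8859-1", raw),
)

stored = conn.execute("SELECT body FROM pages").fetchone()[0]
conn.close()
```

With a binary column no text conversion is involved, so the bytes read back are the bytes written.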

Alex
