Unicode/utf-8 data in SQL Server

**Neil Hodgson** · Aug 8 '06, 11:55 PM

Re: Unicode/utf-8 data in SQL Server

bjorn:

I also
tried creating an Excel VB macro using the StrConv function, but (a) it
isn't entirely clear to me that this function can do this, and (b) the
third argument to the function is an LCID, a Locale ID, which is
numeric and not defined anywhere I can find it...

http://www.microsoft.com/globaldev/reference/lcid-all.mspx

Neil

**John Machin** · Aug 9 '06, 12:25 AM

Re: Unicode/utf-8 data in SQL Server

thebjorn wrote:

I'm working with a MS SQL Server database created by a program from a
fine US company who seems to have gotten run over by the Unicode truck.
In their infinite wisdom they've decided to store Unicode data directly
in regular varchar fields, utf-8 encoded! (on the bright side, it is
properly utf-8 encoded). One of our customers then wants to use a csv
file created from a report to import in Excel and is getting an
attitude when the text shows up "all garbled" (which I can
understand...)
>
One method that works is to use Python to pull down the result set from
the database, accumulate the entire result text as a big unicode string
(while decode('utf-8') all text fields in the process) separating each
field with a tab, before encode('utf-16') the result string and writing
it to a file opened in binary mode. This ensures that the file gets a
bom, that it's in a format (utf-16) that Excel can import, and
hopefully tabs are less common than commas in the source data :-( The
csv module doesn't support Unicode.

Last time I looked, *Excel* didn't support csv files in utf-N :-(

>
The customer is of the firm belief that our national characters
(æøå) are part of ascii, presumably because they're
single-byte-encoded in iso-8859-1. He has no understanding for the
issues (either by choice or experience) so there is no purpose to
trying to explain the differences... Be that as it may, he might be
satisfied with a csv file in that (iso-8859-1) encoding since the local
version of Excel can import it transparently (with significant
behind-the-scenes magic I believe...?)

No magic AFAICT. The bog-standard Windows kit in (north/west/south
Europe + the English-speaking world) uses code page 1252 (Python:
'cp1252') which is an MS-molested iso-885-1.

The customer should be very happy if you do
text.decode('ut f-8').encode('cp1 252') -- not only should the file
import into Excel OK, he should be able to view it in
Word/Notepad/whatever.

HTH,
John

**Laurent Pointal** · Aug 9 '06, 07:35 AM

Re: Unicode/utf-8 data in SQL Server

John Machin a écrit :

The customer should be very happy if you do
text.decode('ut f-8').encode('cp1 252') -- not only should the file
import into Excel OK, he should be able to view it in
Word/Notepad/whatever.

+
text.decode('ut f-8').encode('cp1 252',errors='re place')

As cp1252 may not cover all utf8 chars.

Laurent.

**John Machin** · Aug 9 '06, 09:15 AM

Re: Unicode/utf-8 data in SQL Server

Laurent Pointal wrote:

John Machin a écrit :

The customer should be very happy if you do
text.decode('ut f-8').encode('cp1 252') -- not only should the file
import into Excel OK, he should be able to view it in
Word/Notepad/whatever.

>
+
text.decode('ut f-8').encode('cp1 252',errors='re place')
>
As cp1252 may not cover all utf8 chars.

In that case, the OP may well want to use 'xmlcharrefrepl ace' or
'backslashrepla ce' as they stand out more than '?' *and* the original
Unicode is recoverable if necessary e.g.:

#>>msg = u'\u0124\u0114\ u0139\u013B\u01 50'

>>print msg

HELLO
#>>msg.encode(' cp1252', 'replace')
'?????'
#>>msg.encode(' cp1252', 'xmlcharrefrepl ace')
'ĤĔĹĻŐ'
#>>msg.encode(' cp1252', 'backslashrepla ce')
'\\u0124\\u0114 \\u0139\\u013b\ \u0150'
#>>

Cheers,
John

Unicode/utf-8 data in SQL Server

Unicode/utf-8 data in SQL Server

Comment

Comment

Comment

Comment