Unicode/utf-8 data in SQL Server

Collapse
This topic is closed.
X
X
 
  • Time
  • Show
Clear All
new posts
  • thebjorn

    Unicode/utf-8 data in SQL Server

    I'm working with a MS SQL Server database created by a program from a
    fine US company who seems to have gotten run over by the Unicode truck.
    In their infinite wisdom they've decided to store Unicode data directly
    in regular varchar fields, utf-8 encoded! (on the bright side, it is
    properly utf-8 encoded). One of our customers then wants to use a csv
    file created from a report to import in Excel and is getting an
    attitude when the text shows up "all garbled" (which I can
    understand...)

    One method that works is to use Python to pull down the result set from
    the database, accumulate the entire result text as a big unicode string
    (while decode('utf-8') all text fields in the process) separating each
    field with a tab, before encode('utf-16') the result string and writing
    it to a file opened in binary mode. This ensures that the file gets a
    bom, that it's in a format (utf-16) that Excel can import, and
    hopefully tabs are less common than commas in the source data :-( The
    csv module doesn't support Unicode.

    The customer is of the firm belief that our national characters
    (æøå) are part of ascii, presumably because they're
    single-byte-encoded in iso-8859-1. He has no understanding for the
    issues (either by choice or experience) so there is no purpose to
    trying to explain the differences... Be that as it may, he might be
    satisfied with a csv file in that (iso-8859-1) encoding since the local
    version of Excel can import it transparently (with significant
    behind-the-scenes magic I believe...?)

    The Python script mentioned above has to be run on the server, since it
    doesn't accept remote connections, I'm of course the only one with
    access, and I'd like to remove myself from the loop. I've looked at
    creating a view on the database that would cast or convert the data,
    but all I've run into are vague references to StrConv, which seems to
    be a VB function. Giving the customer a macro that he could run in
    Excel after importing the data would probably be ok as well, so I also
    tried creating an Excel VB macro using the StrConv function, but (a) it
    isn't entirely clear to me that this function can do this, and (b) the
    third argument to the function is an LCID, a Locale ID, which is
    numeric and not defined anywhere I can find it...

    Anyone have any advice?

    tia,
    -- bjorn

  • Neil Hodgson

    #2
    Re: Unicode/utf-8 data in SQL Server

    bjorn:
    I also
    tried creating an Excel VB macro using the StrConv function, but (a) it
    isn't entirely clear to me that this function can do this, and (b) the
    third argument to the function is an LCID, a Locale ID, which is
    numeric and not defined anywhere I can find it...


    Neil

    Comment

    • John Machin

      #3
      Re: Unicode/utf-8 data in SQL Server


      thebjorn wrote:
      I'm working with a MS SQL Server database created by a program from a
      fine US company who seems to have gotten run over by the Unicode truck.
      In their infinite wisdom they've decided to store Unicode data directly
      in regular varchar fields, utf-8 encoded! (on the bright side, it is
      properly utf-8 encoded). One of our customers then wants to use a csv
      file created from a report to import in Excel and is getting an
      attitude when the text shows up "all garbled" (which I can
      understand...)
      >
      One method that works is to use Python to pull down the result set from
      the database, accumulate the entire result text as a big unicode string
      (while decode('utf-8') all text fields in the process) separating each
      field with a tab, before encode('utf-16') the result string and writing
      it to a file opened in binary mode. This ensures that the file gets a
      bom, that it's in a format (utf-16) that Excel can import, and
      hopefully tabs are less common than commas in the source data :-( The
      csv module doesn't support Unicode.
      Last time I looked, *Excel* didn't support csv files in utf-N :-(
      >
      The customer is of the firm belief that our national characters
      (æøå) are part of ascii, presumably because they're
      single-byte-encoded in iso-8859-1. He has no understanding for the
      issues (either by choice or experience) so there is no purpose to
      trying to explain the differences... Be that as it may, he might be
      satisfied with a csv file in that (iso-8859-1) encoding since the local
      version of Excel can import it transparently (with significant
      behind-the-scenes magic I believe...?)
      No magic AFAICT. The bog-standard Windows kit in (north/west/south
      Europe + the English-speaking world) uses code page 1252 (Python:
      'cp1252') which is an MS-molested iso-885-1.

      The customer should be very happy if you do
      text.decode('ut f-8').encode('cp1 252') -- not only should the file
      import into Excel OK, he should be able to view it in
      Word/Notepad/whatever.

      HTH,
      John

      Comment

      • Laurent Pointal

        #4
        Re: Unicode/utf-8 data in SQL Server

        John Machin a écrit :
        The customer should be very happy if you do
        text.decode('ut f-8').encode('cp1 252') -- not only should the file
        import into Excel OK, he should be able to view it in
        Word/Notepad/whatever.
        +
        text.decode('ut f-8').encode('cp1 252',errors='re place')

        As cp1252 may not cover all utf8 chars.

        Laurent.

        Comment

        • John Machin

          #5
          Re: Unicode/utf-8 data in SQL Server

          Laurent Pointal wrote:
          John Machin a écrit :
          The customer should be very happy if you do
          text.decode('ut f-8').encode('cp1 252') -- not only should the file
          import into Excel OK, he should be able to view it in
          Word/Notepad/whatever.
          >
          +
          text.decode('ut f-8').encode('cp1 252',errors='re place')
          >
          As cp1252 may not cover all utf8 chars.
          In that case, the OP may well want to use 'xmlcharrefrepl ace' or
          'backslashrepla ce' as they stand out more than '?' *and* the original
          Unicode is recoverable if necessary e.g.:

          #>>msg = u'\u0124\u0114\ u0139\u013B\u01 50'
          >>print msg
          HELLO
          #>>msg.encode(' cp1252', 'replace')
          '?????'
          #>>msg.encode(' cp1252', 'xmlcharrefrepl ace')
          'ĤĔĹĻŐ'
          #>>msg.encode(' cp1252', 'backslashrepla ce')
          '\\u0124\\u0114 \\u0139\\u013b\ \u0150'
          #>>

          Cheers,
          John

          Comment

          Working...