Reading Windows CSV file with LCID entries under Linux.

  • Thomas Troeger

    Reading Windows CSV file with LCID entries under Linux.

    Dear all,

    I've stumbled over a problem with Windows Locale ID information and
    codepages. I'm writing a Python application that parses a CSV file,
    the format of a line in this file is "LCID;Text1;Text2". Each line can
    contain a different locale id (LCID) and the text fields contain data
    that is encoded in some codepage which is associated with this LCID. My
    current data file contains the codes 1031 for German and 1033 for
    English US (as listed in
    http://www.microsoft.com/globaldev/r...lcid-all.mspx).
    Unfortunately, I cannot find out which Codepage (like cp-1252 or
    whatever) belongs to which LCID.

    My question is: How can I convert this data into something more
    reasonable like unicode? Basically, what I want is something like
    "Text1;Text2", both fields encoded as UTF-8. Can this be done with
    Python? How can I find out which codepage I have to use for 1033 and 1031?

    Any help appreciated,
    Thomas.
  • skip@pobox.com

    #2
    Re: Reading Windows CSV file with LCID entries under Linux.


    Thomas> My question is: How can I convert this data into something more
    Thomas> reasonable like unicode? Basically, what I want is something
    Thomas> like "Text1;Text2", both fields encoded as UTF-8. Can this be
    Thomas> done with Python? How can I find out which codepage I have to
    Thomas> use for 1033 and 1031?

    There are examples at the end of the csv module documentation which show
    how to create Unicode readers and writers. You can extend the
    UnicodeReader class to peek at the LCID field and use the corresponding
    codepage to decode the remainder of the line. (This assumes your CSV
    fields do not contain embedded newlines, so that each line read is a
    complete record.)
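A minimal sketch of that per-line approach, assuming Python 3 (where the csv module already works on text, so no UnicodeReader is needed). The names `LCID_TO_CODEC`, `read_lcid_csv` and `convert_to_utf8` are made up for illustration, and the fallback codec is an assumption. LCIDs 1031 (German, Germany) and 1033 (English, US) both use the Western European ANSI code page 1252; other LCIDs would have to be added to the table by hand.

```python
import csv

# Hypothetical lookup table: LCID -> Python codec name.
# Both entries map to Windows code page 1252.
LCID_TO_CODEC = {
    1031: "cp1252",  # German (Germany)
    1033: "cp1252",  # English (United States)
}


def read_lcid_csv(binary_file, default_codec="cp1252"):
    """Yield (lcid, fields) per line, with the fields decoded to str.

    Assumes one record per line and an ASCII LCID in the first field.
    """
    for raw_line in binary_file:
        raw_line = raw_line.rstrip(b"\r\n")
        if not raw_line:
            continue
        # Peek at the LCID before decoding the rest of the line.
        lcid_bytes, _, _ = raw_line.partition(b";")
        lcid = int(lcid_bytes.decode("ascii"))
        codec = LCID_TO_CODEC.get(lcid, default_codec)
        text = raw_line.decode(codec)
        row = next(csv.reader([text], delimiter=";"))
        yield lcid, row[1:]


def convert_to_utf8(src_path, dst_path):
    """Write the text fields of src_path to dst_path as UTF-8 "Text1;Text2"."""
    with open(src_path, "rb") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst, delimiter=";")
        for _lcid, fields in read_lcid_csv(src):
            writer.writerow(fields)
```

The file is opened in binary mode on purpose: the codec is only known once the LCID of each line has been read, so decoding has to happen line by line rather than at open time.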

    Skip


    • Tim Golden

      #3
      Re: Reading Windows CSV file with LCID entries under Linux.

      Thomas Troeger wrote:
      > I've stumbled over a problem with Windows Locale ID information and
      > codepages. I'm writing a Python application that parses a CSV file,
      > the format of a line in this file is "LCID;Text1;Text2". Each line can
      > contain a different locale id (LCID) and the text fields contain data
      > that is encoded in some codepage which is associated with this LCID. My
      > current data file contains the codes 1031 for German and 1033 for
      > English US (as listed in
      > http://www.microsoft.com/globaldev/r...lcid-all.mspx).
      > Unfortunately, I cannot find out which Codepage (like cp-1252 or
      > whatever) belongs to which LCID.
      >
      > My question is: How can I convert this data into something more
      > reasonable like unicode? Basically, what I want is something like
      > "Text1;Text2", both fields encoded as UTF-8. Can this be done with
      > Python? How can I find out which codepage I have to use for 1033 and 1031?

      The GetLocaleInfo API call can do that conversion:



      You'll need to use ctypes (or write a C extension) to
      use it. Be aware that if it doesn't succeed you may need
      to fall back on code page 65001, i.e. UTF-8.
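A rough sketch of such a ctypes call, runnable on Windows only since it goes through kernel32. The helper names `ansi_codepage_for_lcid` and `codec_name` are invented here, and treating a returned code page of 0 as "no ANSI code page, use UTF-8" is an assumption about how Unicode-only locales are reported. On Linux the lookup itself is not available, so one option is to run it once on a Windows machine and carry the resulting LCID-to-codepage table over.

```python
import ctypes
import sys

# LCTYPE constant from the Windows SDK headers: default ANSI code page.
LOCALE_IDEFAULTANSICODEPAGE = 0x1004


def ansi_codepage_for_lcid(lcid):
    """Return the default ANSI code page number for an LCID (Windows only)."""
    if sys.platform != "win32":
        raise OSError("GetLocaleInfoW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # The code page is returned as a short decimal string.
    buf = ctypes.create_unicode_buffer(6)
    written = kernel32.GetLocaleInfoW(
        lcid, LOCALE_IDEFAULTANSICODEPAGE, buf, len(buf)
    )
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return int(buf.value)


def codec_name(codepage):
    """Map a Windows code page number to a Python codec name.

    Assumption: 0 means the locale has no ANSI code page, so fall back
    to UTF-8, the same codec used for code page 65001.
    """
    if codepage in (0, 65001):
        return "utf-8"
    return "cp%d" % codepage
```

On a Windows machine, `codec_name(ansi_codepage_for_lcid(1031))` would then give a codec name that can be passed straight to `bytes.decode`.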

      TJG
