Removing Unicode from Python?

  • Paradox

    Removing Unicode from Python?

    In general I love Python for text manipulation, but at our company we
    need to manipulate large text values stored in either a SQL Server
    database or text files. This data is stored in a "text" field type and
    is definitely not Unicode, though it is often very strange text since
    it comes from either OCR or some kind of electronic file extraction.
    Unfortunately, when it is retrieved into a string in Python it is
    invariably a unicode type string. The best I can do is try to encode
    it to 'latin-1', but that will often throw an error, and if I use the
    ignore parameter it will whack my data with a bunch of "?". I am
    just not understanding why Python thinks this stuff is Unicode and why
    it is failing on conversion. There is no way that a byte can not be
    between 0 and 255, right? This problem can be so haunting that I
    start to wish I had coded the solution in VB, where at least a string
    is a string is a string. Is there a way to modify Python so that all
    strings will always be single-byte strings, since we have no need for
    Unicode support? Any solutions or suggestions for my biggest Python
    annoyance would be greatly appreciated.

    Thanks Joey
  • Martin v. Löwis

    #2
    Re: Removing Unicode from Python?

    Paradox wrote:
    > In general I love Python for text manipulation but at our company we
    > have the need to manipulate large text values stored in either a SQL
    > Server database or text files. This data is stored in a "text" field
    > type and is definitely not unicode though it is often very strange
    > text since it is either OCR or some kinda electronic file extraction.
    > Unfortunately when it is retrieved into a string type in python it is
    > invariably a unicode type string. The best I can do is try and encode
    > it to 'latin-1' but that will often throw an error; if I use the
    > ignore parameter then it will whack my data with a bunch of "?".

    Can you give an example of such string? Reporting its repr() would help.

    If you want to encode arbitrary Unicode strings into byte strings, you
    can use "utf-8" as the encoding.
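
A minimal sketch of both suggestions, written for current Python 3 (where `str` is the Unicode type and `bytes` is the byte-string type; the thread itself predates that split). The sample text is a made-up stand-in for OCR output, not the poster's data. Note that `errors="replace"` is the handler that substitutes "?", while `errors="ignore"` silently drops the characters.

```python
# Hypothetical sample: curly quotes and an en dash have code points above 255,
# so they have no Latin-1 byte; the e-acute (U+00E9) does.
text = "\u201cquoted\u201d \u2013 caf\u00e9"

# repr()/ascii() show exactly which characters are in the string.
print(repr(text))
print(ascii(text))

# Strict Latin-1 encoding fails on the first character outside 0..255.
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError as exc:
    latin1_ok = False
    print("latin-1 failed:", exc)

# "replace" keeps going but loses data; "ignore" loses it without a trace.
replaced = text.encode("latin-1", errors="replace")
ignored = text.encode("latin-1", errors="ignore")

# UTF-8 can encode any Unicode string and round-trips without loss.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

Posting the `repr()` output is what lets others see which characters fall outside the target encoding.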

    Regards,
    Martin


    • Neil Hodgson

      #3
      Re: Removing Unicode from Python?

      Paradox:
      > I am
      > just not understanding why python is thinking stuff is unicode and why
      > it is failing on conversion. There is no way that a byte can not be
      > between 0 and 255 right? This problem can be so haunting that I will
      > start to wish I had coded the solution in VB where at least a string
      > is a string is a string.

      In VB a string is a BSTR is a Unicode string.


      Neil



      • George Kinney

        #4
        Re: Removing Unicode from Python?

        > There is no way that a byte can not be
        > between 0 and 255 right? This problem can be so haunting that I will
        > start to wish I had coded the solution in VB where at least a string
        > is a string is a string. Is there a way to modify Python so that all
        > strings will always be single byte strings since we have no need for
        > Unicode support? Any solutions or suggestions to my biggest Python
        > annoyance would be greatly appreciated.

        All MS products use unicode strings. All the time. It's integral to
        the OS and all its libraries.

        VB and other MS offspring allow you to ignore that fact, but they
        don't make it go away.

        Python is just doing what it should do: handle unicode strings as unicode
        strings.



        • Tim Roberts

          #5
          Re: Removing Unicode from Python?

          Brian Quinlan <brian@sweetapp.com> wrote:
          >> All MS products use unicode strings. All the time. It's integral to
          >> the OS and all its libraries.
          >
          >This statement is obviously false.

          Not really. The core Windows 2000 and XP operating systems are exclusively
          Unicode. When you call one of the ASCII APIs, it converts every string to
          Unicode, calls the Unicode API which does the real work, converts any
          output parameters back to ASCII, and returns them to you.

          As you might imagine, all of those conversions cost time. Thus,
          Microsoft's application products work natively in Unicode and use the
          Unicode APIs when they are available.
          >But the SQL Server "text" type is not a Unicode type.

          And that means, among other things, that it cannot handle international
          character sets reasonably. There is no agreement as to what the character
          0xBF is, whereas there IS standards-based agreement on the meaning of the
          Unicode code point U+00BF.
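
To illustrate the point about 0xBF, here is a small sketch that decodes that single byte under three code pages shipped with Python. The choice of code pages is only an example; nothing in the thread says which one the poster's database uses.

```python
raw = b"\xbf"

# The same byte maps to a different character in each legacy code page.
decoded = {}
for codec in ("latin-1", "cp437", "cp1251"):
    ch = raw.decode(codec)
    decoded[codec] = ch
    print(f"{codec:8} -> U+{ord(ch):04X} {ch!r}")

# The Unicode code point U+00BF, by contrast, always means the same character
# (INVERTED QUESTION MARK), whatever bytes are later used to store it.
inverted_question_mark = "\u00bf"
```

A byte value only becomes a character once an encoding is chosen, which is why the database driver has to pick one when it hands text to Python.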
          --
          - Tim Roberts, timr@probo.com
          Providenza & Boekelheide, Inc.


          • jack

            #6
            Re: Removing Unicode from Python?

            On 29 Oct 2003 23:12:39 -0800, Paradox wrote:
            > In general I love Python for text manipulation but at our company we
            > have the need to manipulate large text values stored in either a SQL
            > Server database or text files. This data is stored in a "text" field
            > type and is definitely not unicode though it is often very strange
            > text since it is either OCR or some kinda electronic file extraction.
            > Unfortunately when it is retrieved into a string type in python it is
            > invariably a unicode type string. The best I can do is try and encode
            > it to 'latin-1' but that will often throw an error; if I use the
            > ignore parameter then it will whack my data with a bunch of "?". I am
            > just not understanding why python is thinking stuff is unicode and why
            > it is failing on conversion. There is no way that a byte can not be
            > between 0 and 255 right? This problem can be so haunting that I will
            > start to wish I had coded the solution in VB where at least a string
            > is a string is a string. Is there a way to modify Python so that all
            > strings will always be single byte strings since we have no need for
            > Unicode support? Any solutions or suggestions to my biggest Python
            > annoyance would be greatly appreciated.
            >
            > Thanks Joey

            i had a similar problem with SQL Server. my solution was to create a
            sitecustomize.py file containing:

            # Python 2 only: sets the codec used for implicit str/unicode conversion
            import sys
            sys.setdefaultencoding("utf-8")


            this works for me and turns off unicode for everything. i was unable to
            find any other solution that i could understand. (i'm not a programmer and
            have only just started with python).
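
For readers on current Python 3, where `sys.setdefaultencoding` no longer exists, a rough alternative is to convert explicitly at the points where data enters and leaves the program. The sketch below assumes the stored bytes are Windows-1252; that encoding, the `SOURCE_ENCODING` name, and both helper functions are illustrative assumptions, not something stated in the thread.

```python
# Assumption: the "text" column and the files were written as Windows-1252.
SOURCE_ENCODING = "cp1252"


def to_text(raw: bytes) -> str:
    """Decode bytes read from the database or a file into a str."""
    # "replace" turns undecodable bytes into U+FFFD instead of raising.
    return raw.decode(SOURCE_ENCODING, errors="replace")


def to_bytes(text: str) -> bytes:
    """Encode a str back into the storage encoding before writing it out."""
    # "replace" turns unencodable characters into "?" instead of raising.
    return text.encode(SOURCE_ENCODING, errors="replace")


# Hypothetical sample: e-acute plus Windows "smart quotes" (bytes 0x93, 0x94).
raw = b"caf\xe9 \x93quoted\x94"
text = to_text(raw)
restored = to_bytes(text)
```

Keeping the encoding in one named constant makes it easy to change if the data turns out to use a different code page.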

            jack
            sidelined in order to prevent discrimination on the gender front
