Trouble saving unicode text to file

  • Svennglenn

    Trouble saving unicode text to file

    I'm working on a program that is supposed to save
    different information to text files.

    Because the program is in Swedish, I have to use
    Unicode text for the letters ÅÄÖ.

    When I run the following test script I get an error message.

    # -*- coding: cp1252 -*-

    titel = "åäö"
    titel = unicode(titel)

    print "Titel type", type(titel)

    fil = open("testfil.txt", "w")
    fil.write(titel)
    fil.close()


    Traceback (most recent call last):
      File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
        titel = unicode(titel)
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)


    I need to have the titel variable in Unicode format because when I
    type åäö into an entry box in Tkinter, the value automatically
    becomes a Unicode string.

    Does anyone know an easy way to save this Unicode text
    to a file?

  • Skip Montanaro

    #2
    Re: Trouble saving unicode text to file


    Svennglenn> Traceback (most recent call last):
    Svennglenn>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    Svennglenn>     titel = unicode(titel)
    Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

    Try:

    import codecs

    titel = "åäö"
    titel = unicode(titel, "iso-8859-1")
    fil = codecs.open("testfil.txt", "w", "iso-8859-1")
    fil.write(titel)
    fil.close()

    Skip


    • John Machin

      #3
      Re: Trouble saving unicode text to file

      On 7 May 2005 14:22:56 -0700, "Svennglenn" <Danielnord15@yahoo.se>
      wrote:

      >I'm working on a program that is supposed to save
      >different information to text files.
      >
      >Because the program is in swedish i have to use
      >unicode text for ÅÄÖ letters.

      "program is in Swedish": to the extent that this means "names of
      variables are in Swedish", this is quite irrelevant. The variable
      names could be in some other language, like Slovak, Slovenian, Swahili
      or Strine. Your problem(s) (PLURAL) arise from the fact that your text
      data is in Swedish, the representation of which uses a few non-ASCII
      characters. Problem 1 is the representation of Swedish in text
      constants in your program; this is causing the exception you show
      below but curiously didn't ask for help with.

      >
      >When I run the following testscript I get an error message.
      >
      ># -*- coding: cp1252 -*-
      >
      >titel = "åäö"
      >titel = unicode(titel)

      You should use titel = u"åäö"
      Works, and saves wear & tear on your typing fingers.

      >
      >print "Titel type", type(titel)
      >
      >fil = open("testfil.txt", "w")
      >fil.write(titel)
      >fil.close()
      >
      >
      >Traceback (most recent call last):
      >  File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
      >    titel = unicode(titel)
      >UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
      >
      >
      >I need to have the titel variable in unicode format because when I
      >write
      >åäö in a entry box in Tkinkter it makes the value to a unicode
      >format
      >automaticly.

      The general rule in working with Unicode can be expressed something
      like "work in Unicode all the time i.e. decode legacy text as early as
      possible; encode into legacy text (if absolutely required) as late as
      possible (corollary: if forced to communicate with another
      Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
      cp666)"

      Applying this to Problem 1 is, as you've seen, trivial: To the extent
      that you have text constants at all in your program, they should be in
      Unicode.
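
      A minimal sketch of the "decode early, encode late" rule, assuming
      Python 3 (where str is already Unicode) rather than the Python 2 used
      in this thread. The variable names and the sample bytes are
      illustrative assumptions, not part of the original program:

```python
# Legacy input: "åäö" as cp1252 bytes, e.g. as read from an old text file.
legacy_bytes = b"\xe5\xe4\xf6"

# Decode as early as possible; from here on, work only with Unicode text.
titel = legacy_bytes.decode("cp1252")

# All text processing happens on the Unicode string.
titel_upper = titel.upper()

# Encode as late as possible, only at the output boundary.
output_bytes = titel_upper.encode("utf-8")

print(repr(titel_upper), output_bytes)
```

      Only the first and last steps know about byte encodings; everything
      in between is encoding-agnostic.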

      Now after all that, Problem 2: how to save Unicode text to a file?

      Which raises a question: who or what is going to read your file? If it
      is a Unicode-aware application, and never a human, you might like to
      consider encoding the text as utf-16. If it is a Unicode-aware
      application plus an occasional human developer, or the text is not CJK
      and you want to save space, try utf-8. For general use on Windows boxes
      in the Latin1 subset of the universe, you'll no doubt want to encode as
      cp1252.
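
      To make the space trade-off concrete, here is a small sketch that
      encodes the same three-letter string with each of the encodings
      mentioned above and compares the sizes. It assumes Python 3 syntax;
      the dictionary name is illustrative:

```python
titel = "\u00e5\u00e4\u00f6"  # "åäö"

# Encode the same text three ways and compare the byte counts.
encoded = {name: titel.encode(name) for name in ("utf-16", "utf-8", "cp1252")}

for name, data in encoded.items():
    print(name, len(data), "bytes")

# utf-16 -> 8 bytes (2-byte byte-order mark + 2 bytes per character)
# utf-8  -> 6 bytes (2 bytes per character for these letters)
# cp1252 -> 3 bytes (1 byte per character)
```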

      >
      >Are there anyone who knows an easy way to save this unicode format text
      >to a file?

      Read the docs of the codecs module -- skipping over how to register
      codecs, just concentrate on using them.

      Try this:

      # -*- coding: cp1252 -*-
      import codecs
      titel = u"åäö"
      print "Titel type", type(titel)
      f1 = codecs.open('titel.u16', 'wb', 'utf_16')
      f2 = codecs.open('titel.u8', 'w', 'utf_8')
      f3 = codecs.open('titel.txt', 'w', 'cp1252')
      # much later, maybe in a different function
      # maybe even in a different module
      f1.write(titel)
      f2.write(titel)
      f3.write(titel)
      # much later
      f1.close()
      f2.close()
      f3.close()

      Note: doing it this way follows the "encode as late as possible" rule
      and documents the encoding for the whole file, in one place. Other
      approaches which might use the .encode() method of Unicode strings and
      then write the 8-bit-string results at different times and in
      different functions/modules are somewhat less clean and more prone to
      mistakes.
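
      For comparison, a sketch of the same "declare the encoding once, when
      the file is opened" idea, assuming Python 3, where the built-in open()
      accepts an encoding argument. The temporary directory and file name
      are placeholders chosen for the example:

```python
import os
import tempfile

titel = "\u00e5\u00e4\u00f6"  # "åäö"

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "titel.txt")

# The encoding is stated once, at open time; write() takes Unicode text.
with open(path, "w", encoding="cp1252") as f:
    f.write(titel)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```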

      HTH,
      John


      • John Machin

        #4
        Re: Trouble saving unicode text to file

        On Sat, 7 May 2005 17:25:28 -0500, Skip Montanaro <skip@pobox.com>
        wrote:

        >
        > Svennglenn> Traceback (most recent call last):
        > Svennglenn>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
        > Svennglenn>     titel = unicode(titel)
        > Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)
        >
        >Try:
        >
        > import codecs
        >
        > titel = "åäö"
        > titel = unicode(titel, "iso-8859-1")
        > fil = codecs.open("testfil.txt", "w", "iso-8859-1")
        > fil.write(titel)
        > fil.close()
        >

        I tried that, with this result:

        C:\junk>python skip.py
        sys:1: DeprecationWarning: Non-ASCII character '\xe5' in file skip.py
        on line 3, but no encoding declared; see
        http://www.python.org/peps/pep-0263.html for details

        1. An explicit PEP 263 declaration (which the OP already had!) should
        be used, rather than relying on the default, which doesn't work in
        general if you substituted say Polish or Russian for Swedish.

        2. My bet is that 'cp1252' is more likely to be appropriate for the OP
        than 'iso-8859-1'. The encodings are quite different in range(0x80,
        0xA0). They coincidentally give the same result for the OP's limited
        sample. However if for example the OP needs to use the euro character
        which is 0x80 in cp1252, it wouldn't show up as a problem in the
        limited scripts we've been playing with so far, but 0x80 in the script
        is sure not going to look like a euro in Tkinter if it's being decoded
        via iso-8859-1. Your rationale for using iso-8859-1 when the OP had
        already mentioned cp1252 was ... what?
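
        A short sketch of that difference, assuming Python 3 syntax (the
        variable names are illustrative). The byte 0x80 decodes to the euro
        sign under cp1252 but to an invisible C1 control character under
        iso-8859-1, while the Swedish letters decode identically under both:

```python
import unicodedata

byte = b"\x80"

as_cp1252 = byte.decode("cp1252")      # U+20AC
as_latin1 = byte.decode("iso-8859-1")  # U+0080

print(unicodedata.name(as_cp1252))      # EURO SIGN
print(unicodedata.category(as_latin1))  # Cc (a control character)

# The limited sample from the thread gives the same result either way.
swedish = b"\xe5\xe4\xf6"
same_for_sample = swedish.decode("cp1252") == swedish.decode("iso-8859-1")
print(same_for_sample)                  # True
```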




        • Ivan Van Laningham

          #5
          Re: Trouble saving unicode text to file

          Hi All--

          John Machin wrote:
          >
          >
          > The general rule in working with Unicode can be expressed something
          > like "work in Unicode all the time i.e. decode legacy text as early as
          > possible; encode into legacy text (if absolutely required) as late as
          > possible (corollary: if forced to communicate with another
          > Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
          > cp666)"
          >

          +1 QOTW

          And true, too.

          <i-especially-like-the-cp666-part>-ly y'rs,
          Ivan
          ----------------------------------------------
          Ivan Van Laningham
          God N Locomotive Works


          Army Signal Corps: Cu Chi, Class of '70
          Author: Teach Yourself Python in 24 Hours


          • Martin v. Löwis

            #6
            Re: Trouble saving unicode text to file

            Svennglenn wrote:
            > # -*- coding: cp1252 -*-
            >
            > titel = "åäö"
            > titel = unicode(titel)

            Instead of this, just write

            # -*- coding: cp1252 -*-

            titel = u"åäö"

            > fil = open("testfil.txt", "w")
            > fil.write(titel)
            > fil.close()

            Instead of this, write

            import codecs
            fil = codecs.open("testfil.txt", "w", "cp1252")
            fil.write(titel)
            fil.close()

            Instead of cp1252, consider using ISO-8859-1.

            Regards,
            Martin


            • John Machin

              #7
              Re: Trouble saving unicode text to file

              On Sun, 08 May 2005 11:23:49 +0200, "Martin v. Löwis"
              <martin@v.loewis.de> wrote:

              >Svennglenn wrote:
              >> # -*- coding: cp1252 -*-
              >>
              >> titel = "åäö"
              >> titel = unicode(titel)
              >
              >Instead of this, just write
              >
              ># -*- coding: cp1252 -*-
              >
              >titel = u"åäö"
              >
              >> fil = open("testfil.txt", "w")
              >> fil.write(titel)
              >> fil.close()
              >
              >Instead of this, write
              >
              >import codecs
              >fil = codecs.open("testfil.txt", "w", "cp1252")
              >fil.write(titel)
              >fil.close()
              >
              >Instead of cp1252, consider using ISO-8859-1.

              Martin, I can't guess the reason for this last suggestion; why should
              a Windows system use iso-8859-1 instead of cp1252?

              Regards,
              John



              • Martin v. Löwis

                #8
                Re: Trouble saving unicode text to file

                John Machin wrote:
                > Martin, I can't guess the reason for this last suggestion; why should
                > a Windows system use iso-8859-1 instead of cp1252?

                Windows users often think that windows-1252 is the same thing as
                iso-8859-1, and then exchange data in windows-1252, but declare them
                as iso-8859-1 (in particular, this is common for HTML files).
                iso-8859-1 is more portable than windows-1252, so it should be
                preferred when the data need to be exchanged across systems.

                Regards,
                Martin


                • John Machin

                  #9
                  Re: Trouble saving unicode text to file

                  On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
                  <martin@v.loewis.de> wrote:

                  >John Machin wrote:
                  >> Martin, I can't guess the reason for this last suggestion; why should
                  >> a Windows system use iso-8859-1 instead of cp1252?
                  >
                  >Windows users often think that windows-1252 is the same thing as
                  >iso-8859-1, and then exchange data in windows-1252, but declare them
                  >as iso-8859-1 (in particular, this is common for HTML files).
                  >iso-8859-1 is more portable than windows-1252, so it should be
                  >preferred when the data need to be exchanged across systems.

                  Martin, it seems I'm still a long way short of enlightenment; please
                  bear with me:

                  Terminology disambiguation: what I call "users" wouldn't know what
                  'cp1252' and 'iso-8859-1' were. They're not expected to know. They
                  just type in whatever characters they can see on their keyboard or
                  find in the charmap utility. It's what I'd call 'admins' and
                  'developers' who should know better, but often don't.

                  1. When exchanging data across systems, should not utf-8 be
                  preferred???

                  2. If the Windows *users* have been using characters that are in
                  cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
                  will cause an exception.

                  >>> import unicodedata
                  >>> euro_win = chr(128)
                  >>> euro_uc = euro_win.decode('cp1252')
                  >>> euro_uc
                  u'\u20ac'
                  >>> unicodedata.name(euro_uc)
                  'EURO SIGN'
                  >>> euro_iso = euro_uc.encode('iso-8859-1')
                  Traceback (most recent call last):
                    File "<stdin>", line 1, in ?
                  UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
                  >>>

                  I find it a bit hard to imagine that the euro sign wouldn't get a fair
                  bit of usage in Swedish data processing even if it's not their own
                  currency.

                  3. How portable is a character set that doesn't include the euro sign?

                  Regards,
                  John


                  • F. Petitjean

                    #10
                    Re: Trouble saving unicode text to file

                    On Mon, 09 May 2005 08:39:40 +1000, John Machin wrote:
                    > On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
                    ><martin@v.loewis.de> wrote:
                    >
                    >>John Machin wrote:
                    >>> Martin, I can't guess the reason for this last suggestion; why should
                    >>> a Windows system use iso-8859-1 instead of cp1252?
                    >>
                    >>Windows users often think that windows-1252 is the same thing as
                    >>iso-8859-1, and then exchange data in windows-1252, but declare them
                    >>as iso-8859-1 (in particular, this is common for HTML files).
                    >>iso-8859-1 is more portable than windows-1252, so it should be
                    >>preferred when the data need to be exchanged across systems.
                    >
                    > 1. When exchanging data across systems, should not utf-8 be
                    > preferred???
                    >
                    > 2. If the Windows *users* have been using characters that are in
                    > cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
                    > will cause an exception.
                    >
                    > >>> euro_win = chr(128)
                    > >>> euro_uc = euro_win.decode('cp1252')
                    > >>> euro_uc
                    > u'\u20ac'
                    > >>> unicodedata.name(euro_uc)
                    > 'EURO SIGN'
                    > >>> euro_iso = euro_uc.encode('iso-8859-1')
                    > Traceback (most recent call last):
                    >   File "<stdin>", line 1, in ?
                    > UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
                    > >>>
                    >
                    > I find it a bit hard to imagine that the euro sign wouldn't get a fair
                    > bit of usage in Swedish data processing even if it's not their own
                    > currency.

                    For Western European countries, another codec exists that includes the
                    'EURO SIGN'. It is spelled 'iso8859_15' (with an alias 'iso-8859-15'
                    according to section 4.9.2, Standard Encodings, of the Python library
                    reference).

                    >>> euro_iso = euro_uc.encode('iso8859_15')
                    >>> euro_iso
                    '\xa4'

                    > 3. How portable is a character set that doesn't include the euro sign?

                    I think it is due to historical constraints: ISO Latin-1 existed before
                    the EURO SIGN appeared.

                    > Regards,
                    > John


                    • Fredrik Lundh

                      #11
                      Re: Trouble saving unicode text to file

                      John Machin wrote:

                      > I find it a bit hard to imagine that the euro sign wouldn't get a fair
                      > bit of usage in Swedish data processing even if it's not their own
                      > currency.

                      it's spelled "Euro" or "EUR" in Swedish.

                      (if you live in a country that uses letters to represent its own currency,
                      you tend to prefer letters for "foreign" currencies as well)

                      (I just noticed that there's no euro sign on my swedish keyboard. I've
                      never missed it ;-)

                      </F>




                      • Max M

                        #12
                        Re: Trouble saving unicode text to file

                        Fredrik Lundh wrote:

                        > (I just noticed that there's no euro sign on my swedish keyboard. I've
                        > never missed it ;-)

                        It's probably "AltGR + E" like here in DK

                        --

                        hilsen/regards Max M, Denmark


                        IT's Mad Science


                        • Simon Brunning

                          #13
                          Re: Trouble saving unicode text to file

                          On 5/9/05, Max M <maxm@mxm.dk> wrote:
                          > Fredrik Lundh wrote:
                          >
                          > > (I just noticed that there's no euro sign on my swedish keyboard. I've
                          > > never missed it ;-)
                          >
                          > It's probably "AltGR + E" like here in DK

                          My UK keyboard has it as AltGr + 4, FWIW.

                          --
                          Cheers,
                          Simon B,
                          simon@brunningonline.net,


                          • Fredrik Lundh

                            #14
                            Re: Trouble saving unicode text to file

                            Max M wrote:

                            >> (I just noticed that there's no euro sign on my swedish keyboard. I've
                            >> never missed it ;-)
                            >
                            > It's probably "AltGR + E" like here in DK

                            ah, there it is. almost entirely worn out. and it doesn't work. but a little
                            fooling around reveals that AltGr+5 does work. oh well, you learn
                            something new every day.

                            </F>




                            • Martin v. Löwis

                              #15
                              Re: Trouble saving unicode text to file

                              John Machin wrote:
                              > Terminology disambiguation: what I call "users" wouldn't know what
                              > 'cp1252' and 'iso-8859-1' were. They're not expected to know. They
                              > just type in whatever characters they can see on their keyboard or
                              > find in the charmap utility. It's what I'd call 'admins' and
                              > 'developers' who should know better, but often don't.

                              I was talking about 'users' of Python, so they are 'developers'.
                              They often don't know what cp1252 is.

                              > 1. When exchanging data across systems, should not utf-8 be
                              > preferred???

                              It depends on the data, of course. People writing UTF-8 into
                              text files often find that their editors don't display them
                              correctly, in which case UTF-8 might not be the best choice.
                              For example, the Python source code in CVS is required to be
                              iso-8859-1, primarily because this is what interoperates best
                              across all development platforms.

                              For data in XHTML, the answer would be different: every XML
                              processor is supposed to support UTF-8.

                              > 2. If the Windows *users* have been using characters that are in
                              > cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
                              > will cause an exception.

                              Correct.

                              > I find it a bit hard to imagine that the euro sign wouldn't get a fair
                              > bit of usage in Swedish data processing even if it's not their own
                              > currency.

                              Yes, so the question is how to represent it. It all depends on the
                              application, but it is safer to only assume iso-8859-1 for the moment,
                              unless it is guaranteed that all code that reads the file in really
                              knows what cp1252 is, and what \x80 means in that charset.

                              > 3. How portable is a character set that doesn't include the euro sign?

                              Well, how portable is ASCII? It doesn't support certain characters,
                              sure. If you don't need these characters, this is not a problem. If
                              you do need the extra characters, you need to think carefully about
                              which encoding best meets your needs. I was merely suggesting that
                              cp1252 is often used without that thought, causing mojibake later.

                              If representation of the euro sign is an issue, the choices are
                              iso-8859-15, cp1252, and UTF-8. Of those three, I would pick
                              cp1252 last if at all possible, because it is specific to a
                              vendor (i.e. non-standard).
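
                              As a rough illustration of those three choices, the sketch
                              below encodes the euro sign with each of them and confirms
                              that iso-8859-1 cannot represent it. It assumes Python 3
                              syntax, and the names are illustrative:

```python
euro = "\u20ac"  # EURO SIGN

# The three encodings named above that can represent the euro sign.
candidates = ("iso-8859-15", "cp1252", "utf-8")
as_bytes = {name: euro.encode(name) for name in candidates}

for name, data in as_bytes.items():
    print(name, data)

# iso-8859-1 has no euro sign, so encoding to it fails.
try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print("iso-8859-1 can encode the euro sign:", latin1_ok)
```

                              Each encoding puts the euro sign at a different byte value,
                              which is why a reader that guesses the wrong one shows the
                              wrong character.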

                              Regards,
                              Martin
