Print formatted Strings with Umlauts

  • Joerg Lehmann

    Print formatted Strings with Umlauts

    I am using Python 2.2.3 (Fedora Core 1). The problem is that strings containing
    umlauts do not work as I would expect. Here is my example:

    >>> a = 'äöü'
    >>> b = '123'
    >>> print "%-5s %-5s\n%-5s %-5s" % (a,a,b,b)
    äöü äöü
    123   123

    I would expect the displayed width of a and b to be the same: 5 characters.
    I also see that len(a) is 6 (2 bytes per umlaut), whereas len(b) is 3:

    >>> print len(a), len(b)
    6 3
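
The length difference described above comes from the encoding of the byte string, not from the umlauts themselves. A minimal sketch in Python 3 (an assumption; the thread itself uses Python 2, where `'äöü'` is a byte string) that encodes the same three characters as UTF-8 and as ISO-8859-1. The variable names are illustrative only.

```python
# Python 3 sketch: the same three umlauts, encoded two ways.
text = "\u00e4\u00f6\u00fc"  # the string "äöü" as Unicode code points

utf8_bytes = text.encode("utf-8")         # what a UTF-8 terminal sends
latin1_bytes = text.encode("iso-8859-1")  # what a Latin-1 terminal sends

print(len(text))          # number of code points
print(len(utf8_bytes))    # UTF-8 needs two bytes per umlaut
print(len(latin1_bytes))  # ISO-8859-1 needs one byte per umlaut
```

In Python 2, `%-5s` applied to a byte string pads by bytes, so a six-byte UTF-8 value already exceeds the five-column field and receives no padding.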

    I have tried to set the encoding in site.py to 'latin-1', but it did not change
    my results. Is there no way to store umlauts in 1 byte??? What is the right way
    to print strings containing umlauts in a tabular way (same field width)?

    Thanks!
    --
    Joerg Lehmann
  • Amy G

    #2
    Re: Print formatted Strings with Umlauts

    Upgrading to 2.3 will probably solve this problem. I am using 2.3 and here
    is what I get when I try it.

    >>> a = 'äöü'
    >>> len(a)
    3

    >>> b = '123'
    >>> print "%-5s %-5s\n%-5s %-5s" % (a,a,b,b)

    äöü   äöü
    123   123

    "Joerg Lehmann" <joerg.lehmann@mail.com> wrote in message
    news:91317660.0402111249.4ccc6e24@posting.google.com...
    > I am using Python 2.2.3 (Fedora Core 1). The problem is that strings containing
    > umlauts do not work as I would expect. Here is my example:
    >
    > >>> a = 'äöü'
    > >>> b = '123'
    > >>> print "%-5s %-5s\n%-5s %-5s" % (a,a,b,b)
    > äöü äöü
    > 123   123
    >
    > I would expect the displayed width of a and b to be the same: 5 characters.
    > I also see that len(a) is 6 (2 bytes per umlaut), whereas len(b) is 3:
    >
    > >>> print len(a), len(b)
    > 6 3
    >
    > I have tried to set the encoding in site.py to 'latin-1', but it did not change
    > my results. Is there no way to store umlauts in 1 byte??? What is the right way
    > to print strings containing umlauts in a tabular way (same field width)?
    >
    > Thanks!
    > --
    > Joerg Lehmann



    • Jeff Epler

      #3
      Re: Print formatted Strings with Umlauts

      If you work with Unicode strings instead of byte strings in the UTF-8
      encoding, you'll get the desired results for characters in the German
      character set:

      >>> b = '123'
      >>> a = u'\344\366\374'
      >>> print (u"%-5s %-5s\n%-5s %-5s" % (a, a, b, b)).encode("utf-8")
      äöü   äöü
      123   123
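
For readers on Python 3 (an assumption; the session above is Python 2), `str` is already a Unicode string, so the same format expression pads by code points without the `u` prefix or an explicit `.encode()`. A minimal sketch:

```python
# Python 3 sketch: str is Unicode, so %-5s pads by code points, not bytes.
a = "\u00e4\u00f6\u00fc"  # "äöü"
b = "123"

table = "%-5s %-5s\n%-5s %-5s" % (a, a, b, b)
print(table)

# Both rows have the same number of code points.
rows = table.split("\n")
print(len(rows[0]), len(rows[1]))
```

Equal code-point counts line up on screen only when every character occupies exactly one terminal column, which holds for the precomposed umlauts used here.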

      However, this isn't good enough in general. For instance, in the
      presence of Unicode combining characters, you won't get what you want:

      >>> u = u'\N{COMBINING DIAERESIS}'
      >>> a = 'a%so%su%s' % (u,u,u)
      >>> print a.encode("utf-8")
      äöü
      >>> print (u"%-5s %-5s\n%-5s %-5s" % (a, a, b, b)).encode("utf-8")
      äöü äöü
      123   123
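
One partial workaround for the combining-character case, sketched here in Python 3 as an assumption rather than something proposed in the thread, is to normalize to NFC before formatting. `unicodedata.normalize` is part of the standard library; it only helps when a precomposed form of the character exists, as it does for ä, ö and ü.

```python
import unicodedata

# Build "äöü" from base letters plus COMBINING DIAERESIS, as in the example above.
diaeresis = "\N{COMBINING DIAERESIS}"
decomposed = "a%so%su%s" % (diaeresis, diaeresis, diaeresis)

# NFC composes base letter + combining mark into one code point where possible.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # six code points: three letters, three marks
print(len(composed))    # three code points after composition

# Padding now behaves as it does for the precomposed string.
padded = "%-5s|" % composed
print(len(padded))
```

Sequences with no precomposed equivalent stay decomposed after NFC, so this does not solve the general problem.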


      You'll also run into problems with characters that have "Wide" or
      "Ambiguous" East Asian Width properties in Unicode. For example:

      >>> a = u'\N{FULLWIDTH LATIN SMALL LETTER U}' * 3
      >>> print (u"%-5s %-5s\n%-5s %-5s" % (a, a, b, b)).encode("utf-8")
      ｕｕｕ   ｕｕｕ
      123   123
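
A rough way to estimate how many terminal columns a string occupies, sketched in Python 3 under stated assumptions: combining marks count as zero columns, characters whose East Asian Width is "W" or "F" count as two, and everything else (including "Ambiguous") counts as one. The helper names `display_width` and `ljust_display` are hypothetical; `unicodedata.combining` and `unicodedata.east_asian_width` are standard-library functions. Real terminals may disagree, especially on ambiguous-width characters.

```python
import unicodedata

def display_width(text):
    """Estimate the number of terminal columns text occupies (heuristic)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn on the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide and fullwidth characters take two columns
        else:
            width += 1  # assumption: everything else, incl. "Ambiguous", is narrow
    return width

def ljust_display(text, columns):
    """Left-justify text by estimated display width instead of len()."""
    return text + " " * max(0, columns - display_width(text))

wide = "\N{FULLWIDTH LATIN SMALL LETTER U}" * 3
combined = "a\u0308o\u0308u\u0308"  # base letters plus COMBINING DIAERESIS

for value in (wide, combined, "123"):
    cell = ljust_display(value, 8)
    print(len(value), display_width(value), display_width(cell))
```

Each padded cell ends up with an estimated width of 8 columns even though the three inputs have different `len()` values.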

      Jeff


      • Martin v. Löwis

        #4
        Re: Print formatted Strings with Umlauts

        Joerg Lehmann wrote:
        > I am using Python 2.2.3 (Fedora Core 1). ...
        > I have tried to set the encoding in site.py to 'latin-1', but it did not change
        > my results. Is there no way to store umlauts in 1 byte???

        There is, but Fedora Core 1 does not use it. Instead, it uses an
        encoding where an umlaut character needs two bytes (namely, UTF-8).
        Changing site.py does not change the way your system represents
        these characters.
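
To see which encoding the system, rather than `site.py`, has selected, a short Python 3 sketch (an assumption; the thread uses Python 2) can query the locale and the standard output stream. `locale.getpreferredencoding` and `sys.stdout.encoding` are standard; the reported values depend entirely on the environment, so no particular output is implied.

```python
import locale
import sys

# Encoding chosen by the user's locale settings (e.g. LANG), without
# changing the process locale as a side effect.
preferred = locale.getpreferredencoding(False)

# Encoding Python uses when writing text to standard output; may be None
# if stdout has been replaced by an object without an encoding.
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("locale encoding:", preferred)
print("stdout encoding:", stdout_encoding)
```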

        > What is the right way
        > to print strings containing umlauts in a tabular way (same field width)?

        As Jeff explains: In the specific case, using Unicode strings would
        help. He is also right that, in general, it is very difficult to find
        out how many columns a single character uses, as some characters have
        width 0, and other characters have width 2 (in a mono-spaced terminal;
        for variable-spaced output, adding space characters to achieve
        formatting will never work reliably).

        Regards,
        Martin


        • Joerg Lehmann

          #5
          Re: Print formatted Strings with Umlauts

          "Martin v. Löwis" <martin@v.loewis.de> wrote in message news:<c0f8tb$5iu$05$1@news.t-online.com>...
          > Joerg Lehmann wrote:
          > > I am using Python 2.2.3 (Fedora Core 1). ...
          > > I have tried to set the encoding in site.py to 'latin-1', but it did not change
          > > my results. Is there no way to store umlauts in 1 byte???
          >
          > There is, but Fedora Core 1 does not use it. Instead, it uses an
          > encoding where an umlaut character needs two bytes (namely, UTF-8).
          > Changing site.py does not change the way your system represents
          > these characters.
          >
          > > What is the right way
          > > to print strings containing umlauts in a tabular way (same field width)?
          >
          > As Jeff explains: In the specific case, using Unicode strings would
          > help. He is also right that, in general, it is very difficult to find
          > out how many columns a single character uses, as some characters have
          > width 0, and other characters have width 2 (in a mono-spaced terminal;
          > for variable-spaced output, adding space characters to achieve
          > formatting will never work reliably).
          >
          > Regards,
          > Martin

          I have found a fix myself. I'm not sure if this is "the right way",
          but it solves my problem:

          I changed the settings in /etc/sysconfig/i18n from UTF-8 to
          ISO-8859-1:

          LANG="en_US.ISO-8859-1"
          SUPPORTED="en_US.ISO-8859-1:en_US:en"
          SYSFONT="latarcyrheb-sun16"

          This fixed my problem; umlauts are stored in one byte now.

          Thanks for your inspirations.

          PS: Installing Python 2.3 (rpm for Fedora from www.python.org) did not
          help.
          --
          Joerg Lehmann
