choosing a server codeset

  • Frank Swarbrick

    choosing a server codeset

    Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
    application uses code page 1252 will it perform better because no code page
    translation is required? I assume so. What type of performance hit might I
    expect when connecting to a UTF-8 database? What advantages would I get by
    using a UTF-8 database? Obviously it can store the entire Unicode 'plane'
    (or whatever that's called), but if my PC can't display it anyway what do I
    really care? And I guess that storing XML data requires UTF-8? But I don't
    think we plan on utilizing this.

    What else should we know to make our decision?

    Thanks,
    Frank

  • Dan van Ginhoven

    #2
    Re: choosing a server codeset

    Hi Frank!

    If the database contains national characters other than A-Z and a-z, then
    under UTF-8 a table column declared as CHAR(8) will have room for only 4 to 8
    characters, since characters like ÅÄÖÉÜ take 2 bytes each in UTF-8. If you
    don't work with multiple national languages, go for a character set that
    suits your situation. If you need to work with XML data, put it in a
    separate database.
    /dg
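
The byte arithmetic in Dan's reply can be checked directly. What follows is a minimal Python sketch, not DB2 code: it assumes that Python's `cp1252` codec is a fair stand-in for the IBM-1252 codeset, and the helper name `chars_that_fit` is hypothetical.

```python
# Compare how many bytes each character needs in code page 1252 and in UTF-8.
# Assumption: Python's "cp1252" codec stands in for DB2's IBM-1252 codeset.

def chars_that_fit(ch: str, column_bytes: int, encoding: str) -> int:
    """Return how many copies of the single character `ch` fit in a column
    of `column_bytes` bytes when stored in `encoding`."""
    return column_bytes // len(ch.encode(encoding))

# Print each character with its byte length in cp1252 and in UTF-8.
for ch in "AÅÄÖÉÜ":
    print(ch, len(ch.encode("cp1252")), len(ch.encode("utf-8")))

# A CHAR(8) column whose length is measured in bytes:
print(chars_that_fit("A", 8, "utf-8"))  # unaccented letters
print(chars_that_fit("Å", 8, "utf-8"))  # accented letters
```

A real value usually mixes both kinds of character, which is why the capacity of the column lands somewhere between the two figures.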

    • Colin Booth

      #3
      Re: choosing a server codeset

      Hi

      Some characters that are single-byte in 1252 are multi-byte in UTF-8. With
      a standard UK keyboard I think that there are 3 or 4 characters that are
      multi-byte in UTF-8.

      I like and prefer UTF-8, but the applications must be coded for UTF-8. E.g.
      if you have an 8-byte character column and an 8-byte (1252) entry field,
      and you fill the entry field using at least one of the UTF-8 multi-byte
      characters, you will get a data truncation error. Also, you need to be
      careful about the number of characters in a column, as the byte count is
      not necessarily the character count.

      Things are becoming much more global. I have moved to France but still have
      some accounts and investments in the UK. I also purchase some things from
      the UK, and my address contains accents.


      Colin
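
The truncation scenario Colin describes can be imitated outside the database. This is a rough Python sketch under stated assumptions: `store_in_column` is a hypothetical helper, the `ValueError` only mimics a database truncation error, and Python's `cp1252` codec stands in for IBM-1252.

```python
# Simulate storing a value in a column whose length is counted in bytes.
# Assumptions: `store_in_column` is a hypothetical helper, and raising
# ValueError only imitates a database's data truncation error.

def store_in_column(value: str, column_bytes: int, encoding: str = "utf-8") -> bytes:
    """Encode `value` and refuse it if it needs more than `column_bytes` bytes."""
    encoded = value.encode(encoding)
    if len(encoded) > column_bytes:
        raise ValueError(
            f"{len(value)} characters need {len(encoded)} bytes in {encoding}, "
            f"but the column holds only {column_bytes}"
        )
    return encoded

value = "£1234.50"  # 8 characters typed into an 8-character entry field

# In code page 1252 every character here is one byte, so the value fits.
print(len(store_in_column(value, 8, "cp1252")))

# In UTF-8 the pound sign takes 2 bytes, so the same value needs 9 bytes.
try:
    store_in_column(value, 8, "utf-8")
except ValueError as err:
    print("truncation:", err)
```

An application that validates input by character count alone would accept this value and only discover the problem at insert time, which is one reading of "coded for UTF-8".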


      • Frank Swarbrick

        #4
        Re: choosing a server codeset

        On 1/16/2008 at 3:40 PM, Colin Booth wrote:
        > I like and prefer UTF-8, but the applications must be coded for UTF-8.
        > E.g. if you have an 8-byte character column and an 8-byte (1252) entry
        > field, and you fill the entry field using at least one of the UTF-8
        > multi-byte characters, you will get a data truncation error. Also, you
        > need to be careful about the number of characters in a column, as the
        > byte count is not necessarily the character count.

        I question your comment "the applications must be coded for UTF-8". I just
        wrote an OpenCobol application with embedded DB2 SQL. No special "UTF-8"
        coding, whatever that might mean. All it does is connect to the database,
        retrieve the "string" and "hex" values of a set of VARCHAR(25) columns, and
        display those values.

        I run this against two databases:
        TEST1 is a database defined as codeset IBM-1252.
        UTFDB is a database defined as codeset UTF-8.

        Here are the results:

        CONNECT TO test1
        5B544553545D
        +0006: [TEST]
        7C544553547C
        +0006: |TEST|
        A654455354A6
        +0006: ¦TEST¦
        80
        +0001: €

        CONNECT TO utfdb
        5B544553545D
        +0006: [TEST]
        7C544553547C
        +0006: |TEST|
        C2A654455354C2A6
        +0006: ¦TEST¦
        E282AC
        +0001: €

        (+0001: € <== that actually shows as the euro symbol in Notepad.)

        As you can see, in the UTF-8 database the euro symbol was stored as
        x'E282AC'. But since my application used code page 1252, DB2 was smart
        enough to translate it to x'80', which is the value for the euro in code
        page 1252.
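
The hex values in the two listings can be reproduced without a database. The sketch below is Python rather than the OpenCobol program from the post, and it assumes Python's `cp1252` codec matches IBM-1252 for these characters; the decode-then-encode step only illustrates the effect of the conversion, not how DB2 implements it.

```python
# Reproduce the hex values from the two result listings.
# Assumption: Python's "cp1252" codec stands in for IBM-1252.

for text in ["[TEST]", "|TEST|", "¦TEST¦", "€"]:
    as_1252 = text.encode("cp1252").hex().upper()
    as_utf8 = text.encode("utf-8").hex().upper()
    print(f"{text:8} 1252={as_1252:14} UTF-8={as_utf8}")

# Effect of the conversion for a 1252 client: decode the stored UTF-8 bytes,
# then re-encode the resulting text in the client's code page.
stored = bytes.fromhex("E282AC")
sent_to_client = stored.decode("utf-8").encode("cp1252")
print(sent_to_client.hex().upper())
```

The ASCII-only strings come out identical in both encodings, which is why only the broken bar and the euro differ between TEST1 and UTFDB.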

        Now of course when there is a symbol that exists in UTF-8 and not in 1252
        then there will be a problem.
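
To make that problem case concrete, here is a small Python analogy. It is an assumption that Python's codec error handling resembles what a database client would see; how DB2 itself reports or substitutes unconvertible characters is not shown in this thread.

```python
# What happens when a character has no equivalent in code page 1252.
# Assumption: Python's codec error handling is only an analogy for the
# database's code page conversion.

text = "Ω TEST"  # GREEK CAPITAL LETTER OMEGA is not part of code page 1252

# UTF-8 can encode any Unicode character.
print(text.encode("utf-8").hex().upper())

# A strict conversion to 1252 fails outright.
try:
    text.encode("cp1252")
except UnicodeEncodeError as err:
    print("cannot convert:", err.reason)

# A lossy conversion replaces the character with a substitute.
print(text.encode("cp1252", errors="replace"))
```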

        I guess your point is, and it's a good one, that if a CHAR or VARCHAR column
        is defined in a UTF-8 database then you, in a sense, have to "over define"
        the length to take into account the possibility of multi-byte characters?
        For instance, a 1-character field that could possibly contain a multi-byte
        UTF-8 character (such as the euro symbol) would have to be defined in the
        database as, say, CHAR(3).
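
One way to put a number on the "over define" idea is to compute the widest UTF-8 encoding of any character the 1252 client can send. This is a sketch under assumptions: Python's `cp1252` codec stands in for IBM-1252, column lengths are taken to be counted in bytes, and both function names are hypothetical.

```python
# Estimate the byte length to declare in a UTF-8 database for a column that
# must hold `max_chars` characters drawn from code page 1252.
# Assumptions: Python's "cp1252" codec stands in for IBM-1252, and both
# function names are hypothetical.

def max_utf8_bytes_per_char(codepage: str = "cp1252") -> int:
    """Return the largest UTF-8 byte length of any character in `codepage`."""
    widest = 1
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codepage)
        except UnicodeDecodeError:
            continue  # this byte value is not assigned in the code page
        widest = max(widest, len(ch.encode("utf-8")))
    return widest

def utf8_column_bytes(max_chars: int, codepage: str = "cp1252") -> int:
    """Worst-case number of UTF-8 bytes for `max_chars` characters."""
    return max_chars * max_utf8_bytes_per_char(codepage)

print(max_utf8_bytes_per_char())  # widest single character, e.g. the euro
print(utf8_column_bytes(1))       # the 1-character field from the post
print(utf8_column_bytes(25))      # a 25-character field
```

This is a worst-case bound; text that is mostly ASCII needs close to one byte per character.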

        This does bring to mind a question I have been pondering. Is there any harm
        in defining 'string' fields to be much larger than the largest string length
        that you would ever expect? Like an address line. It might be 50 or so
        characters. Is there harm in defining it as VARCHAR(250) or even
        VARCHAR(32000)? Does it waste space or any other resource?

        Thanks for your help.

        Frank

