choosing a server codeset

**Dan van Ginhoven** · Jan 16 '08, 06:05 PM

Re: choosing a server codeset

Hi Frank!

If the database contains national characters other than A-Z and a-z, then
in a UTF-8 database a table column declared as Char(8) will have room for
only 4 to 8 characters, since characters like ÅÄÖÉÜ take 2 bytes each in
UTF-8. If you don't work with multiple national languages, go for a
character set that suits your situation. If you need to work with XML data,
put it in a separate database.
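
To make the 4-8 figure concrete, here is a minimal Python sketch. The helper name `chars_that_fit` is made up for illustration, and it assumes Python's "cp1252" codec is a fair stand-in for IBM-1252 for these letters. It simply counts how many characters fit into a byte budget:

```python
def chars_that_fit(text: str, byte_limit: int, encoding: str = "utf-8") -> int:
    """Count how many leading characters of text fit in byte_limit bytes."""
    used = 0
    count = 0
    for ch in text:
        size = len(ch.encode(encoding))  # bytes this character needs
        if used + size > byte_limit:
            break
        used += size
        count += 1
    return count


print(chars_that_fit("ABCDEFGH", 8))            # plain ASCII, 1 byte each
print(chars_that_fit("ÅÄÖÉÜÅÄÖ", 8))            # 2 bytes each in UTF-8
print(chars_that_fit("ÅÄÖÉÜÅÄÖ", 8, "cp1252"))  # 1 byte each in code page 1252
```

With an 8-byte limit, all eight ASCII letters fit, but only four of the accented letters fit when they are encoded as UTF-8.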
/dg

"Frank Swarbrick" <Frank.Swarbrick@efirstbank.com> wrote in message
news:478CEE61.6F0F.0085.0@efirstbank.com...

> Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
> application uses code page 1252 will it perform better because no code
> page translation is required? I assume so. What type of performance hit
> might I expect when connecting to a UTF-8 database? What advantages would
> I get by using a UTF-8 database? Obviously it can store the entire Unicode
> 'plane' (or whatever that's called), but if my PC can't display it anyway
> what do I really care? And I guess that storing XML data requires UTF-8?
> But I don't think we plan on utilizing this.
>
> What else should we know to make our decision?
>
> Thanks,
> Frank

**Colin Booth** · Jan 16 '08, 10:45 PM

Re: choosing a server codeset

Frank Swarbrick wrote:

> Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
> application uses code page 1252 will it perform better because no code
> page translation is required? I assume so. What type of performance hit
> might I expect when connecting to a UTF-8 database? What advantages would
> I get by using a UTF-8 database? Obviously it can store the entire Unicode
> 'plane' (or whatever that's called), but if my PC can't display it anyway
> what do I really care? And I guess that storing XML data requires UTF-8?
> But I don't think we plan on utilizing this.
>
> What else should we know to make our decision?
>
> Thanks,
> Frank

Hi

Some characters that are single-byte in 1252 are multi-byte in UTF-8. With
a standard UK keyboard I think that there are 3 or 4 characters that are
multi-byte in UTF-8.

I like and prefer UTF-8, but the applications must be coded for UTF-8. E.g.
if you have an 8-byte character column and an 8-byte (1252) entry field and
fill the entry field using at least 1 of the UTF-8 multi-byte characters,
you will get a data truncation error. Also, you need to be careful about
the number of characters in a column, as the byte count is not necessarily
the character count.
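
The truncation scenario can be imitated in a few lines of Python. This is only a simulation: `store_utf8` is a hypothetical stand-in for the database insert, not DB2's actual behavior or error message, and Python's "cp1252" codec is assumed to match the entry field's code page.

```python
def store_utf8(value: str, column_bytes: int) -> bytes:
    """Pretend to insert value into a column that holds column_bytes bytes."""
    encoded = value.encode("utf-8")
    if len(encoded) > column_bytes:
        raise ValueError(
            f"{len(encoded)} bytes do not fit in a {column_bytes}-byte column"
        )
    return encoded


entry = "£1234.50"                   # fills an 8-byte (1252) entry field
print(len(entry))                    # characters typed
print(len(entry.encode("cp1252")))   # bytes on the code page 1252 side
print(len(entry.encode("utf-8")))    # bytes once converted to UTF-8

try:
    store_utf8(entry, 8)
except ValueError as err:
    print("rejected:", err)
```

The pound sign is one byte in code page 1252 but two in UTF-8, so a value that exactly fills the 8-byte entry field needs 9 bytes in the column.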

Things are becoming much more global. I have moved to France but still have
some accounts and investments in the UK. I also purchase some things from
the UK, and my address contains accents.

Colin

**Frank Swarbrick** · Jan 18 '08, 06:05 PM

Re: choosing a server codeset

On 1/16/2008 at 3:40 PM, in message <fmm14k$lnc$1@news.tiscali.fr>,
Colin Booth <colinsbooth@gmail.com> wrote:

> Frank Swarbrick wrote:
>
>> Are there advantages to choosing, say, IBM-1252 over UTF-8? If my PC
>> application uses code page 1252 will it perform better because no code
>> page translation is required? I assume so. What type of performance hit
>> might I expect when connecting to a UTF-8 database? What advantages
>> would I get by using a UTF-8 database? Obviously it can store the entire
>> Unicode 'plane' (or whatever that's called), but if my PC can't display
>> it anyway what do I really care? And I guess that storing XML data
>> requires UTF-8? But I don't think we plan on utilizing this.
>>
>> What else should we know to make our decision?
>>
>> Thanks,
>> Frank
>
> Hi
>
> Some characters that are single-byte in 1252 are multi-byte in UTF-8.
> With a standard UK keyboard I think that there are 3 or 4 characters that
> are multi-byte in UTF-8.
>
> I like and prefer UTF-8, but the applications must be coded for UTF-8.
> E.g. if you have an 8-byte character column and an 8-byte (1252) entry
> field and fill the entry field using at least 1 of the UTF-8 multi-byte
> characters, you will get a data truncation error. Also, you need to be
> careful about the number of characters in a column, as the byte count is
> not necessarily the character count.
>
> Things are becoming much more global. I have moved to France but still
> have some accounts and investments in the UK. I also purchase some things
> from the UK, and my address contains accents.

I question your comment "the applications must be coded for UTF-8". I just
wrote an OpenCobol application with embedded DB2. No special "UTF-8"
coding, whatever that might mean. All it does is connect to the database,
retrieve the "string" and "hex" values of a set of VARCHAR(25) columns, and
display those values.

I run this against two databases:
TEST1 is a database defined as codeset IBM-1252.
UTFDB is a database defined as codeset UTF-8.

Here are the results:

CONNECT TO test1
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
A654455354A6
+0006: ¦TEST¦
80
+0001: €

CONNECT TO utfdb
5B544553545D
+0006: [TEST]
7C544553547C
+0006: |TEST|
C2A654455354C2A6
+0006: ¦TEST¦
E282AC
+0001: €

(+0001: € <== that actually shows as the euro symbol in Notepad.)

As you can see, for the UTF-8 database the euro symbol was stored as
x'E282AC'. But since my application used code page 1252, DB2 was smart
enough to translate it to x'80', which is the value for euro in code page
1252.
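
The same byte-level translation can be reproduced outside the database. The Python sketch below is not what DB2 does internally; it just re-encodes the stored hex values, assuming Python's "cp1252" codec agrees with code page 1252 for these characters. The function name is made up for illustration.

```python
def utf8_hex_to_cp1252_hex(stored_hex: str) -> str:
    """Decode UTF-8 bytes given as hex and re-encode them in code page 1252."""
    text = bytes.fromhex(stored_hex).decode("utf-8")
    return text.encode("cp1252").hex().upper()


print(utf8_hex_to_cp1252_hex("E282AC"))            # euro symbol
print(utf8_hex_to_cp1252_hex("C2A654455354C2A6"))  # broken bar, TEST, broken bar
```

The two results are 80 and A654455354A6, the same hex values the program displayed when connected to the IBM-1252 database.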

Now of course when a symbol exists in UTF-8 but not in 1252, there will be
a problem.
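
A quick way to see which characters fall into that category is to try encoding them. This Python sketch uses a hypothetical helper, `not_in_cp1252`, and assumes Python's "cp1252" codec covers the same repertoire as code page 1252. How the DB2 client reports or substitutes such characters is not shown here.

```python
def not_in_cp1252(text: str) -> list[str]:
    """Return the characters of text that code page 1252 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


print(not_in_cp1252("€ ¦ Ω 日"))

# With errors="replace", Python substitutes "?" for what it cannot encode.
print("Ω".encode("cp1252", errors="replace"))
```

The euro and the broken bar convert cleanly, while the Greek and Japanese characters are reported as missing.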

I guess your point is, and it's a good one, that if a CHAR or VARCHAR column
is defined in a UTF-8 database then you, in a sense, have to "over-define"
the length to take into account the possibility of multi-byte characters?
For instance, a 1-character field that could possibly contain a multi-byte
UTF-8 character (such as the euro symbol) would have to be defined in the
database as, say, CHAR(3).
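
The CHAR(3) figure can be checked by brute force. This Python sketch walks every byte value of code page 1252 and records the widest UTF-8 encoding. It assumes Python's "cp1252" codec, which leaves a handful of byte values unassigned, and both function names are made up for illustration. The result only covers data limited to the 1252 repertoire; characters outside it can take up to 4 bytes in UTF-8.

```python
def max_utf8_bytes_per_cp1252_char() -> int:
    """Widest UTF-8 encoding of any character defined in code page 1252."""
    widest = 0
    for value in range(256):
        try:
            ch = bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            continue  # byte value not assigned in Python's cp1252 codec
        widest = max(widest, len(ch.encode("utf-8")))
    return widest


def utf8_column_bytes(max_chars: int) -> int:
    """Worst-case byte length for max_chars characters from code page 1252."""
    return max_chars * max_utf8_bytes_per_cp1252_char()


print(max_utf8_bytes_per_cp1252_char())
print(utf8_column_bytes(1))
print(utf8_column_bytes(25))
```

For 1252-only data the widest character takes 3 bytes, so a 1-character field needs 3 bytes and a 25-character field needs 75 in the worst case.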

This does bring to mind a question I have been pondering. Is there any harm
in defining 'string' fields to be much larger than the largest string length
that you would ever expect? Like an address line. It might be 50 or so
characters. Is there harm in defining it as VARCHAR(250) or even
VARCHAR(32000)? Does it waste space or any other resource?

Thanks for your help.

Frank
