character sets

**Jon Skeet [C# MVP]** · Sep 12 '08, 06:05 AM

Re: character sets

On Sep 12, 3:05 am, "Paul W" <nos...@pw-review.com> wrote:

> I have an application that reads data in from a text file and stores it in a
> database. My problem is that there are some characters in the file that
> aren't being handled properly. For instance, one of the characters has an
> ASCII code of 150 (it looks like a dash '-')

There's no such thing as "ASCII code of 150" - ASCII only goes as far
as 127.

I *suspect* that Encoding.Default is what you're after, but read
http://pobox.com/~skeet/csharp/unicode.html and
http://pobox.com/~skeet/csharp/debuggingunicode.html for more
information.
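
As a small illustration of the ASCII limit (a sketch only; the class and method names are made up for this example), decoding a byte above 127 with Encoding.ASCII yields a '?' substitute rather than a real character:

```csharp
using System;
using System.Text;

static class AsciiRange
{
    // Decodes a single byte as ASCII and returns the resulting character.
    public static char DecodeAsAscii(byte value)
    {
        return Encoding.ASCII.GetString(new[] { value })[0];
    }

    static void Main()
    {
        Console.WriteLine(DecodeAsAscii(45));   // '-' : 45 is a valid 7-bit ASCII code
        Console.WriteLine(DecodeAsAscii(150));  // '?' : 150 is outside ASCII, so it is replaced
    }
}
```

This matches the '?' that shows up when the file is read as ASCII.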

Jon

**Morten Wennevik [C# MVP]** · Sep 12 '08, 06:15 AM

RE: character sets

"Paul W" wrote:

> Hi all,
>
> I have an application that reads data in from a text file and stores it in a
> database. My problem is that there are some characters in the file that
> aren't being handled properly. For instance, one of the characters has an
> ASCII code of 150 (it looks like a dash '-'), when I'm debugging this
> character is displayed as the square box that Windows uses for unsupported
> characters and when it's copied to the database it's stored as '?'.
>
> I've played with the encoding while reading the file but the default
> encoding still works the best for all of the data. I can copy this
> character to a simple text editor like Notepad and it's displayed properly.
> The problem seems to be that the .net character set used is OEM when what I
> want is the ANSI character set. Can anyone help me with reading in all of
> the characters in the file. Thanks in advance.
>
> --Paul

Hi Paul,

It looks like the default encoding is not the correct one. An ANSI
character should be readable in any codepage although it may not display the
correct character. For comparison, ANSI character 150 is û on my system. If
you open the file in Notepad and select Save As ... does it opt for ANSI,
UTF8 or Unicode? If ANSI, do you get the file from another country/system
running potentially other codepages?

--
Happy Coding!
Morten Wennevik [C# MVP]

**Morten Wennevik [C# MVP]** · Sep 12 '08, 06:35 AM

RE: character sets

You will indeed get ? characters for extended ASCII characters if you try to
read ANSI-encoded text as ASCII. So as Jon pointed out, Encoding.Default may
very well be what you need. Encoding.Default uses the default ANSI code page
for your locale. To specify a particular code page, use
Encoding.GetEncoding(nameOfEncoding).
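
A minimal sketch of picking a code page explicitly, to show why the choice matters for byte 150. The class and method names are hypothetical. It assumes .NET Core 3.0 or later, where legacy code pages have to be registered first; on .NET Framework the static constructor is unnecessary and can be removed.

```csharp
using System;
using System.Text;

static class CodePageDemo
{
    // Assumption: modern .NET, where code pages such as 1252 and 437
    // are only available after registering this provider.
    static CodePageDemo()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes one byte using the encoding identified by a code page number.
    public static char Decode(byte value, int codePage)
    {
        return Encoding.GetEncoding(codePage).GetString(new[] { value })[0];
    }

    static void Main()
    {
        char ansi = Decode(150, 1252); // Windows-1252, the usual "ANSI" page in Western Europe and the US
        char oem = Decode(150, 437);   // IBM437, the usual OEM page on US systems

        Console.WriteLine("1252: U+{0:X4}", (int)ansi); // U+2013, an en dash
        Console.WriteLine("437:  U+{0:X4}", (int)oem);  // U+00FB, the letter û
    }
}
```

Encoding.GetEncoding also accepts a name such as "windows-1252" in place of the number. Note that û is what code page 437 assigns to 150, while Windows-1252 assigns it the dash-like character described in the question.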

--
Happy Coding!
Morten Wennevik [C# MVP]

**Paul W** · Sep 12 '08, 01:35 PM

Re: character sets

I've tried all of the Encoding settings available: Encoding.ASCII gives me
'?', Encoding.UTF8 and Encoding.Default give me the square box, and all other
settings give no useful data at all from the file. I'll take a look at
those pages, thanks for sending the links.

--Paul

**Paul W** · Sep 12 '08, 01:45 PM

Re: character sets

See my response to Jon regarding the encoding. The reason I mention the
ANSI character set is because I have an editor that provides the character
codes for both OEM and ANSI. OEM shows the same character you are seeing,
which is then actually displayed as the square box. ANSI shows character 150
to be the one actually in the file. This is all very confusing to me but I
believe I've got the correct encoding because the character code I'm
receiving is correct. I believe the problem is the character set. Is there
a way to switch between OEM and ANSI? Thanks for your help.

--Paul

**Jon Skeet [C# MVP]** · Sep 12 '08, 02:45 PM

Re: character sets

When you say "the character code I'm receiving is correct" what
*exactly* do you mean? If possible, provide a short but complete
example which demonstrates the problem. Obviously in this case *we*
won't be able to run the code because we don't have the file, but it
could still help a lot.

Jon

**Paul W** · Sep 13 '08, 12:25 AM

Re: character sets

I don't think a sample of code would help here. What I mean by "the
character code I'm receiving is correct" is that the value of 150 that I
mentioned before is the correct value. In the ANSI character set, that
value maps to a character similar to a '-' and this character displays
exactly as expected in other text editors such as Notepad. However, in the
OEM character set, the character code 150 maps to something different
completely and ultimately is displayed as a square box just like all
unsupported characters are displayed in Windows.

I hope I'm making more sense now. The numeric value I'm receiving is the
correct one, the problem is that the character set, OEM, doesn't map that
value to an appropriate character. There are a couple of other characters
in the data files that do this as well. I don't remember the actual values
off hand though. If I could get my program to use the ANSI character set
instead of the OEM character set my problem would be solved.

Thanks again for taking the time to help me work through this problem.

--Paul

**Jon Skeet [C# MVP]** · Sep 13 '08, 01:35 AM

Re: character sets

Paul W <nospam@pw-review.com> wrote:

> I don't think a sample of code would help here.

Well I really do, I'm afraid.

> What I mean by "the character code I'm receiving is correct" is that
> the value of 150 that I mentioned before is the correct value.

Where are you getting that value from? If you could show it in code, it
would really help.

> In the ANSI character set

Are you aware that there's no one fixed ANSI character encoding?
There's a whole collection of character encodings which use ASCII for
the 7 bit part and then do different things for the next 128 values.

> that value maps to a character similar to a '-' and this character
> displays exactly as expected in other text editors such as Notepad.
> However, in the OEM character set, the character code 150 maps to
> something different completely and ultimately is displayed as a
> square box just like all unsupported characters are displayed in
> Windows.

Unicode 150 (all .NET strings are in Unicode) is a control character
(start of guarded area). So if you're reading

> I hope I'm making more sense now.

Not really, because we still need the code.

> The numeric value I'm receiving is the correct one

It's not the correct one in Unicode, which is what you need to read in
for .NET. We also don't know what you mean by "the numeric value I'm
receiving" because we don't know how you're reading it.

> the problem is that the character set, OEM, doesn't map that
> value to an appropriate character.

OEM character encodings aren't getting involved at all here.

> There are a couple of other characters
> in the data files that do this as well. I don't remember the actual values
> off hand though. If I could get my program to use the ANSI character set
> instead of the OEM character set my problem would be solved.
>
> Thanks again for taking the time to help me work through this problem.

If you could just show us the code you're using to read in the file,
I'm sure we could get to the bottom of it - but without code, there's
nothing I can really suggest other than that using Encoding.Default
probably *will* be the solution when you've got the right code to use
it.
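
To make the "Unicode 150" point concrete, here is a rough sketch (hypothetical names, not code from this thread) of what byte 150 turns into when it is carried over to a .NET string unchanged, and when it is decoded as UTF-8:

```csharp
using System;
using System.Text;

static class Byte150Demo
{
    // Decodes the single byte 150 (0x96) with the given encoding.
    public static char Decode(Encoding encoding)
    {
        return encoding.GetString(new byte[] { 150 })[0];
    }

    static void Main()
    {
        // ISO-8859-1 maps every byte to the Unicode code point with the same number,
        // so 150 becomes U+0096, which is a control character.
        char latin1 = Decode(Encoding.GetEncoding("iso-8859-1"));
        Console.WriteLine("U+{0:X4} control={1}", (int)latin1, char.IsControl(latin1));

        // A lone 0x96 is not valid UTF-8, so the decoder substitutes U+FFFD.
        char utf8 = Decode(Encoding.UTF8);
        Console.WriteLine("U+{0:X4}", (int)utf8);
    }
}
```

Neither result is a printable dash, which may be why a placeholder box appears in the debugger. The number 150 only means "dash" under a specific code page.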

--
Jon Skeet - <skeet@pobox.com>
Web site: http://www.pobox.com/~skeet
Blog: http://www.msmvps.com/jon.skeet
C# in Depth: http://csharpindepth.com

**Paul W** · Sep 13 '08, 02:15 AM

Re: character sets

You were correct, Jon. I thought the following two lines of code were the
same:

using (StreamReader sr = new StreamReader(fileName))

using (StreamReader sr = new StreamReader(fileName, Encoding.Default))

But they aren't. The second one is working now. I had tried all of the
Encoding choices except the Default one, thinking that it would produce the
same results as omitting the encoding. Thanks for all your help.
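
A sketch of the difference between the two constructors, using a throwaway temp file rather than the real data file. All names here are hypothetical. Because Encoding.Default depends on the machine and the runtime (on .NET Core and later it is UTF-8), the example names Windows-1252 explicitly and assumes .NET Core 3.0 or later for the provider registration; on .NET Framework the static constructor can be dropped.

```csharp
using System;
using System.IO;
using System.Text;

static class ReaderDemo
{
    // Assumption: modern .NET, where code page 1252 must be registered before use.
    static ReaderDemo()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Writes the raw bytes 'a', 0x96, 'b' to a temp file and reads them back as text.
    // Passing null uses the StreamReader(string) constructor, i.e. no encoding argument.
    public static string ReadBack(Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, new byte[] { (byte)'a', 150, (byte)'b' });
            using (StreamReader sr = encoding == null
                ? new StreamReader(path)
                : new StreamReader(path, encoding))
            {
                return sr.ReadToEnd();
            }
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        string noEncoding = ReadBack(null);
        string ansi = ReadBack(Encoding.GetEncoding(1252));

        Console.WriteLine("no encoding:  U+{0:X4}", (int)noEncoding[1]); // U+FFFD (read as UTF-8)
        Console.WriteLine("windows-1252: U+{0:X4}", (int)ansi[1]);       // U+2013 (en dash)
    }
}
```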

--Paul

**Morten Wennevik [C# MVP]** · Sep 16 '08, 06:55 AM

Re: character sets

To sum this up, as far as I know, all text reader/writer classes will use
UTF-8 unless told otherwise. If there is an overload taking an Encoding
parameter, consider using that overload whenever the encoding matters.
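
The same default applies on the writing side. As a rough sketch (hypothetical names, temp file only), a StreamWriter created without an encoding writes an en dash as its three-byte UTF-8 sequence, while passing an Encoding changes the bytes on disk:

```csharp
using System;
using System.IO;
using System.Text;

static class WriterDemo
{
    // Writes text with a StreamWriter and returns the bytes that ended up on disk.
    // Passing null uses the StreamWriter(string) constructor, i.e. no encoding argument.
    public static byte[] WriteAndDump(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            using (StreamWriter sw = encoding == null
                ? new StreamWriter(path)
                : new StreamWriter(path, false, encoding))
            {
                sw.Write(text);
            }
            return File.ReadAllBytes(path);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        // No encoding given: UTF-8, so U+2013 is stored as E2-80-93.
        Console.WriteLine(BitConverter.ToString(WriteAndDump("\u2013", null)));

        // Explicit UTF-16: a different byte sequence for the same character.
        Console.WriteLine(BitConverter.ToString(WriteAndDump("\u2013", Encoding.Unicode)));
    }
}
```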

--
Happy Coding!
Morten Wennevik [C# MVP]
