character sets

  • Paul W

    character sets

    Hi all,

    I have an application that reads data in from a text file and stores it in a
    database. My problem is that there are some characters in the file that
    aren't being handled properly. For instance, one of the characters has an
    ASCII code of 150 (it looks like a dash '-'). When I'm debugging, this
    character is displayed as the square box that Windows uses for unsupported
    characters, and when it's copied to the database it's stored as '?'.

    I've played with the encoding while reading the file, but the default
    encoding still works the best for all of the data. I can copy this
    character to a simple text editor like Notepad and it's displayed properly.
    The problem seems to be that the .NET character set used is OEM when what I
    want is the ANSI character set. Can anyone help me with reading in all of
    the characters in the file? Thanks in advance.

    --Paul


  • Jon Skeet [C# MVP]

    #2
    Re: character sets

    On Sep 12, 3:05 am, "Paul W" <nos...@pw-review.com> wrote:
    > I have an application that reads data in from a text file and stores it in a
    > database. My problem is that there are some characters in the file that
    > aren't being handled properly. For instance, one of the characters has an
    > ASCII code of 150 (it looks like a dash '-')

    There's no such thing as "ASCII code of 150" - ASCII only goes as far
    as 127.

    I *suspect* that Encoding.Default is what you're after, but read
    http://pobox.com/~skeet/csharp/unicode.html and
    http://pobox.com/~skeet/csharp/debuggingunicode.html for more
    information.

    Jon

    • Morten Wennevik [C# MVP]

      #3
      RE: character sets


      Hi Paul,

      It looks like the default encoding is not the correct one. An ANSI
      character should be readable in any codepage, although it may not display the
      correct character. For comparison, ANSI character 150 is û on my system. If
      you open the file in Notepad and select Save As ..., does it opt for ANSI,
      UTF8 or Unicode? If it is ANSI, do you get the file from another country/system
      potentially running other codepages?

      --
      Happy Coding!
      Morten Wennevik [C# MVP]

      • Morten Wennevik [C# MVP]

        #4
        RE: character sets


        You will indeed get '?' characters for extended ASCII characters if you try to
        read ANSI-encoded text as ASCII. So, as Jon pointed out, Encoding.Default may
        very well be what you need. Encoding.Default uses the default ANSI codepage
        for your locale. To specify a particular codepage, use
        Encoding.GetEncoding(nameOfEncoding).
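A minimal sketch of the GetEncoding route, under some loud assumptions: the thread never confirms which codepage the file uses, so "windows-1252" is a guess, and the sample file and the helper name ReadAll are made up for illustration. On .NET Core / .NET 5+ the codepage encodings have to be registered first; on the .NET Framework that line can be dropped.

```csharp
using System;
using System.IO;
using System.Text;

public static class CodePageReadSketch
{
    // Reads a whole file, decoding its bytes with the named encoding.
    public static string ReadAll(string fileName, string encodingName)
    {
        // Needed on .NET Core / .NET 5+, where codepage encodings are opt-in.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding = Encoding.GetEncoding(encodingName);
        using (StreamReader sr = new StreamReader(fileName, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Hypothetical sample file: 'a', byte 150 (0x96), 'b'.
        string fileName = Path.Combine(Path.GetTempPath(), "charset-sample.txt");
        File.WriteAllBytes(fileName, new byte[] { 0x61, 0x96, 0x62 });

        // Assumption: the data is Windows-1252, where byte 150 is an en dash.
        string text = ReadAll(fileName, "windows-1252");
        Console.WriteLine((int)text[1]); // 8211, i.e. U+2013 EN DASH
    }
}
```

If the file came from a machine with a different locale, the name passed to GetEncoding would need to match that machine's codepage instead.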

        --
        Happy Coding!
        Morten Wennevik [C# MVP]


        • Paul W

          #5
          Re: character sets


          "Jon Skeet [C# MVP]" <skeet@pobox.com> wrote:
          > There's no such thing as "ASCII code of 150" - ASCII only goes as far
          > as 127.
          >
          > I *suspect* that Encoding.Default is what you're after, but read
          > http://pobox.com/~skeet/csharp/unicode.html and
          > http://pobox.com/~skeet/csharp/debuggingunicode.html for more
          > information.

          I've tried all of the Encoding settings available. Encoding.ASCII gives me
          '?', Encoding.UTF8 and Encoding.Default give me the square box, and all other
          settings give no useful data at all from the file. I'll take a look at
          those pages, thanks for sending the links.
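For reference, the '?' and the box can be reproduced without any file. This is only a sketch: the class and method names are made up, and how the replacement character is drawn (box, question-mark diamond, and so on) depends on the font. Byte 150 is outside the 7-bit ASCII range, and on its own it is not a valid UTF-8 sequence, so each decoder substitutes its own replacement character.

```csharp
using System;
using System.Text;

public static class Byte150Sketch
{
    // Decodes the single byte 150 (0x96) with the given encoding.
    public static string Decode(Encoding encoding)
    {
        return encoding.GetString(new byte[] { 150 });
    }

    public static void Main()
    {
        // ASCII cannot represent 150, so the decoder falls back to '?'.
        Console.WriteLine(Decode(Encoding.ASCII));        // ?

        // A lone 0x96 is an invalid UTF-8 sequence, so it becomes U+FFFD.
        Console.WriteLine((int)Decode(Encoding.UTF8)[0]); // 65533
    }
}
```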

          --Paul



          • Paul W

            #6
            Re: character sets

            "Morten Wennevik [C# MVP]" wrote:
            > It looks like the default encoding is not the correct one. An ANSI
            > character should be readable in any codepage, although it may not display
            > the correct character. For comparison, ANSI character 150 is û on my system.
            > If you open the file in Notepad and select Save As ..., does it opt for ANSI,
            > UTF8 or Unicode? If it is ANSI, do you get the file from another country/system
            > potentially running other codepages?

            See my response to Jon regarding the encoding. The reason I mention the
            ANSI character set is that I have an editor that provides the character
            codes for both OEM and ANSI. OEM shows the same character you are seeing,
            which is then actually displayed as the square box. ANSI shows character 150
            to be the one actually in the file. This is all very confusing to me, but I
            believe I've got the correct encoding because the character code I'm
            receiving is correct. I believe the problem is the character set. Is there
            a way to switch between OEM and ANSI? Thanks for your help.

            --Paul



            • Jon Skeet [C# MVP]

              #7
              Re: character sets

              When you say "the character code I'm receiving is correct" what
              *exactly* do you mean? If possible, provide a short but complete
              example which demonstrates the problem. Obviously in this case *we*
              won't be able to run the code because we don't have the file, but it
              could still help a lot.

              Jon


              • Paul W

                #8
                Re: character sets



                I don't think a sample of code would help here. What I mean by "the
                character code I'm receiving is correct" is that the value of 150 that I
                mentioned before is the correct value. In the ANSI character set, that
                value maps to a character similar to a '-' and this character displays
                exactly as expected in other text editors such as Notepad. However, in the
                OEM character set, the character code 150 maps to something different
                completely and ultimately is displayed as a square box just like all
                unsupported characters are displayed in Windows.

                I hope I'm making more sense now. The numeric value I'm receiving is the
                correct one, the problem is that the character set, OEM, doesn't map that
                value to an appropriate character. There are a couple of other characters
                in the data files that do this as well. I don't remember the actual values
                off hand though. If I could get my program to use the ANSI character set
                instead of the OEM character set my problem would be solved.

                Thanks again for taking the time to help me work through this problem.

                --Paul



                • Jon Skeet [C# MVP]

                  #9
                  Re: character sets

                  Paul W <nospam@pw-review.com> wrote:
                  > I don't think a sample of code would help here.

                  Well I really do, I'm afraid.

                  > What I mean by "the character code I'm receiving is correct" is that
                  > the value of 150 that I mentioned before is the correct value.

                  Where are you getting that value from? If you could show it in code, it
                  would really help.

                  > In the ANSI character set

                  Are you aware that there's no one fixed ANSI character encoding?
                  There's a whole collection of character encodings which use ASCII for
                  the 7 bit part and then do different things for the next 128 values.

                  > that value maps to a character similar to a '-' and this character
                  > displays exactly as expected in other text editors such as Notepad.
                  > However, in the OEM character set, the character code 150 maps to
                  > something different completely and ultimately is displayed as a
                  > square box just like all unsupported characters are displayed in
                  > Windows.

                  Unicode 150 (all .NET strings are in Unicode) is a control character
                  (start of guarded area). So if you're reading

                  > I hope I'm making more sense now.

                  Not really, because we still need the code.

                  > The numeric value I'm receiving is the correct one

                  It's not the correct one in Unicode, which is what you need to read in
                  for .NET. We also don't know what you mean by "the numeric value I'm
                  receiving" because we don't know how you're reading it.

                  > the problem is that the character set, OEM, doesn't map that
                  > value to an appropriate character.

                  OEM character encodings aren't getting involved at all here.

                  > There are a couple of other characters
                  > in the data files that do this as well. I don't remember the actual values
                  > off hand though. If I could get my program to use the ANSI character set
                  > instead of the OEM character set my problem would be solved.
                  >
                  > Thanks again for taking the time to help me work through this problem.

                  If you could just show us the code you're using to read in the file,
                  I'm sure we could get to the bottom of it - but without code, there's
                  nothing I can really suggest other than that using Encoding.Default
                  probably *will* be the solution when you've got the right code to use
                  it.
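To make the "Unicode 150" point concrete, here is a small sketch. The thread never shows how the value 150 was obtained, so the ISO-8859-1 decode below is just one hypothetical way a .NET string could end up holding U+0096; the class and method names are made up.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Unicode150Sketch
{
    // Decodes one byte as ISO-8859-1, which maps each byte value to the
    // Unicode code point with the same number.
    public static char FromLatin1(byte b)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(new byte[] { b })[0];
    }

    public static void Main()
    {
        char c = (char)150; // U+0096, START OF GUARDED AREA

        Console.WriteLine(char.IsControl(c));          // True
        Console.WriteLine(char.GetUnicodeCategory(c)); // Control

        // The byte value survives unchanged, so the result is still a
        // control character rather than a dash.
        Console.WriteLine((int)FromLatin1(150));       // 150
    }
}
```

A control character has no glyph of its own, which would be consistent with the square box seen in the debugger.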

                  --
                  Jon Skeet - <skeet@pobox.com>
                  Web site: http://www.pobox.com/~skeet
                  Blog: http://www.msmvps.com/jon.skeet
                  C# in Depth: http://csharpindepth.com


                  • Paul W

                    #10
                    Re: character sets


                    You were correct, Jon. I thought the following two lines of code were the
                    same:

                    using (StreamReader sr = new StreamReader(fileName))

                    using (StreamReader sr = new StreamReader(fileName, Encoding.Default))

                    But they aren't. The second one is working now. I had tried all of the
                    Encoding choices except the Default one, thinking that it would produce the
                    same results as omitting the encoding. Thanks for all your help.
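The difference between the two constructors can be sketched as follows. The one-argument StreamReader constructor decodes as UTF-8 (with byte-order-mark detection), while Encoding.Default on the .NET Framework is the system ANSI codepage. On .NET Core / .NET 5+ Encoding.Default is UTF-8 as well, so this sketch passes Windows-1252 explicitly as a stand-in; that codepage, the sample file and the method names are all assumptions for illustration.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderDefaultSketch
{
    // Reads the file with the one-argument constructor (no encoding given).
    public static string ReadWithoutEncoding(string fileName)
    {
        using (StreamReader sr = new StreamReader(fileName))
        {
            return sr.ReadToEnd();
        }
    }

    // Reads the file with an explicitly supplied encoding.
    public static string ReadWithEncoding(string fileName, Encoding encoding)
    {
        using (StreamReader sr = new StreamReader(fileName, encoding))
        {
            return sr.ReadToEnd();
        }
    }

    public static void Main()
    {
        // Needed on .NET Core / .NET 5+ before asking for codepage 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252); // assumed ANSI codepage

        // Hypothetical sample file: 'a', byte 150 (0x96), 'b'.
        string fileName = Path.Combine(Path.GetTempPath(), "charset-sample.txt");
        File.WriteAllBytes(fileName, new byte[] { 0x61, 0x96, 0x62 });

        // UTF-8 decoding rejects the lone 0x96 and substitutes U+FFFD.
        Console.WriteLine((int)ReadWithoutEncoding(fileName)[1]);    // 65533

        // Windows-1252 maps 0x96 to U+2013 EN DASH.
        Console.WriteLine((int)ReadWithEncoding(fileName, ansi)[1]); // 8211
    }
}
```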



                    --Paul




                    • Morten Wennevik [C# MVP]

                      #11
                      Re: character sets


                      To sum this up: as far as I know, all text reader/writer classes will use
                      UTF-8 unless told otherwise. If there is an overload taking an Encoding as
                      a parameter, consider using it whenever the encoding matters.

                      --
                      Happy Coding!
                      Morten Wennevik [C# MVP]
