How to find out a file's encoding?

  • Paw Pedersen

    I'm reading from a file that is sometimes saved as ANSI and sometimes as
    UTF-8.
    Because it contains Danish characters like "æøå", I have to read it
    with the "windows-1252" encoding if it is saved as ANSI:
    Encoding ediEncoding = Encoding.GetEncoding("windows-1252");

    StreamReader streamReader = new StreamReader(originalStrm, ediEncoding);

    But if it's saved as UTF-8, it has to be read with the default encoding.

    Is there a way to find out how the file is saved, so I can programmatically
    choose the right encoding?

    Thanks in advance.

    Regards Paw


  • Jon Skeet [C# MVP]

    #2

    Paw Pedersen <news@paws.dk> wrote:
    > I'm reading from a file that is sometimes saved as ANSI and sometimes as
    > UTF-8.
    > Because it contains Danish characters like "æøå", I have to read it
    > with the "windows-1252" encoding if it is saved as ANSI:
    > Encoding ediEncoding = Encoding.GetEncoding("windows-1252");
    >
    > StreamReader streamReader = new StreamReader(originalStrm, ediEncoding);
    >
    > But if it's saved as UTF-8, it has to be read with the default encoding.
    >
    > Is there a way to find out how the file is saved, so I can programmatically
    > choose the right encoding?

    Not really. Every valid UTF-8 file is also a Windows-1252 file. There
    are ways you could *guess*, but they won't always work.

    --
    Jon Skeet - <skeet@pobox.com>

    If replying to the group, please do not mail me too


    • Paw Pedersen

      #3

      Thanks. Could you give me an idea of how to guess?

      Regards Paw

      • Morten Wennevik

        #4

        Hi Paw,

        Just a thought, if you read using UTF-8 and find no æ ø or å, you probably should have used windows-1252.

        --
        Happy coding!
        Morten Wennevik [C# MVP]
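
A minimal sketch of the check suggested above, assuming the raw file bytes are already in memory. The class and method names are hypothetical, and testing for the upper-case letters as well is an added assumption, not part of the suggestion.

```csharp
using System;
using System.Text;

public static class DanishLetterHeuristic
{
    // æ, ø, å plus their upper-case forms (the upper-case forms are an assumption).
    private static readonly char[] DanishLetters =
    {
        '\u00E6', '\u00F8', '\u00E5',
        '\u00C6', '\u00D8', '\u00C5'
    };

    // Decodes the bytes as UTF-8 and reports whether any Danish letter shows up.
    // Bytes that are not valid UTF-8 decode to U+FFFD, so a windows-1252 file
    // will normally yield no Danish letters here.
    public static bool Utf8DecodeContainsDanishLetters(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return text.IndexOfAny(DanishLetters) >= 0;
    }
}
```

A file that contains no Danish letters at all gives the same answer as a windows-1252 file, so the result can only be treated as a hint.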


        • Paw Pedersen

          #5

          I don't know whether there should be any æ, ø or å in the file.
          If it is correct that "Every valid UTF-8 file is also a Windows-1252 file",
          maybe I could just always read it with the "windows-1252" encoding and then
          replace the character sequences that represent æ, ø and å? For example, the
          char "ø" is converted to "Ã¸" when a UTF-8 file is read with the
          "windows-1252" encoding.
          But are there other chars that will be translated incorrectly?
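
To see what happens when UTF-8 bytes are decoded as windows-1252, here is a small sketch. It assumes a modern .NET runtime, where code page encodings have to be registered through `CodePagesEncodingProvider` before `Encoding.GetEncoding("windows-1252")` works (on the classic .NET Framework the encoding is available without that call). `MojibakeDemo` and `MisreadUtf8AsCp1252` are hypothetical names.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    private static readonly Encoding Cp1252;

    static MojibakeDemo()
    {
        // Assumption: running on .NET Core / .NET 5+, where code page
        // encodings must be registered before they can be looked up.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Cp1252 = Encoding.GetEncoding("windows-1252");
    }

    // Encodes the text as UTF-8 and then decodes those bytes with the wrong
    // encoding (windows-1252), which is what happens to a misread UTF-8 file.
    public static string MisreadUtf8AsCp1252(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Cp1252.GetString(utf8Bytes);
    }
}
```

With this sketch, "ø" comes back as "Ã¸", "æ" as "Ã¦" and "å" as "Ã¥". Every other non-ASCII character is mangled as well (for example "€" becomes "â‚¬"), so a replacement table for three letters would not cover the general case.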

          • Jon Skeet [C# MVP]

            #6

            Paw Pedersen <news@paws.dk> wrote:
            > Thanks. Could you give me an idea of how to guess?

            Well, if the file starts with the UTF-8 version of the byte order mark,
            it's likely to be UTF-8.
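
The byte order mark check can be sketched as follows. The UTF-8 form of the byte order mark is the three bytes EF BB BF; `BomSniffer` and `StartsWithUtf8Bom` are hypothetical names, and the sketch assumes the file content (or at least its first three bytes) has already been read into a byte array.

```csharp
using System;

public static class BomSniffer
{
    // U+FEFF (the byte order mark) encoded as UTF-8.
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Returns true if the buffer starts with the UTF-8 byte order mark.
    public static bool StartsWithUtf8Bom(byte[] data)
    {
        if (data == null || data.Length < Utf8Bom.Length)
        {
            return false;
        }
        for (int i = 0; i < Utf8Bom.Length; i++)
        {
            if (data[i] != Utf8Bom[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

The bytes could come from `File.ReadAllBytes(path)`, for example. A missing byte order mark proves nothing, since many UTF-8 files are written without one.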

            If you find invalid UTF-8 sequences, it can't be UTF-8 (assuming a
            valid UTF-8 encoder).
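
One way to look for invalid UTF-8 sequences, as a sketch: construct a `UTF8Encoding` with `throwOnInvalidBytes` set to true and try to decode the whole buffer. `Utf8Validator` and `IsValidUtf8` are hypothetical names, and the sketch assumes the file is small enough to hold in memory.

```csharp
using System;
using System.Text;

public static class Utf8Validator
{
    // A strict decoder: throws instead of substituting U+FFFD for bad bytes.
    private static readonly UTF8Encoding StrictUtf8 =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    // Returns true if the whole buffer decodes as UTF-8 without errors.
    public static bool IsValidUtf8(byte[] data)
    {
        try
        {
            StrictUtf8.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

A windows-1252 "ø" is the single byte F8, which is not valid UTF-8, so such a file is rejected. Pure ASCII content passes the check, but it decodes to the same text under either encoding, so the ambiguity is harmless there.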

            You may want to have a look at Windows-1252 and guess at which
            characters aren't likely to crop up in valid files, too.
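
One concrete reading of this hint, offered as an assumption rather than as the method meant above: Windows-1252 leaves the five byte values 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned, so a file containing any of them was probably not written as Windows-1252. `Cp1252Heuristics` and `ContainsUnassignedCp1252Bytes` are hypothetical names.

```csharp
using System;

public static class Cp1252Heuristics
{
    // Byte values that have no character assigned in Windows-1252.
    private static readonly byte[] Unassigned = { 0x81, 0x8D, 0x8F, 0x90, 0x9D };

    // Returns true if the buffer contains a byte that Windows-1252 does not define.
    public static bool ContainsUnassignedCp1252Bytes(byte[] data)
    {
        foreach (byte b in data)
        {
            if (Array.IndexOf(Unassigned, b) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

These values fall in the range UTF-8 uses for continuation bytes, so they do occur in UTF-8 text: "Í" is encoded as C3 8D, for example. Finding one is a hint towards UTF-8, not proof.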

            See http://www.cl.cam.ac.uk/~mgk25/unicode.html for more information
            about UTF-8.


            • Jon Skeet [C# MVP]

              #7

              Paw Pedersen <news@paws.dk> wrote:
              > I don't know whether there should be any æ, ø or å in the file.
              > If it is correct that "Every valid UTF-8 file is also a Windows-1252 file",
              > maybe I could just always read it with the "windows-1252" encoding and then
              > replace the character sequences that represent æ, ø and å? For example, the
              > char "ø" is converted to "Ã¸" when a UTF-8 file is read with the
              > "windows-1252" encoding.
              > But are there other chars that will be translated incorrectly?

              That seems a very bad idea to me. It *might* work, but it's asking for
              trouble. Whenever you decode binary data into text using the wrong
              decoding, you're likely to get burned. Just MHO though.
