Determine File Encoding

  • Marc Jennings

    Determine File Encoding

    Hi there,

    Can anyone point out any really obvious flaws in the methodology below
    to determine the likely encoding of a file, please? I know the number
    of types of encoding is small, but that is only because the
    possibilities I need to work with are a small list.
    > private string determineFileEncoding(FileStream strm)
    > {
    >     long originalSize = strm.Length;
    >     StreamReader rdr = new StreamReader(strm);
    >
    >     strm.Position = 0;
    >     System.Text.UTF8Encoding unic = new System.Text.UTF8Encoding();
    >     byte[] inputFile = unic.GetBytes(rdr.ReadToEnd());
    >     if(inputFile.Length == originalSize)
    >     {
    >         return "UTF8";
    >     }
    >
    >     strm.Position = 0;
    >     System.Text.UnicodeEncoding unic2 = new System.Text.UnicodeEncoding();
    >     byte[] inputFile2 = unic2.GetBytes(rdr.ReadToEnd());
    >     if(inputFile2.Length == originalSize)
    >     {
    >         return "Unicode";
    >     }
    >
    >     strm.Position = 0;
    >     System.Text.UTF7Encoding unic3 = new System.Text.UTF7Encoding();
    >     byte[] inputFile3 = unic3.GetBytes(rdr.ReadToEnd());
    >     if(inputFile3.Length == originalSize)
    >     {
    >         return "UTF7";
    >     }
    >
    >     System.Text.ASCIIEncoding unic4 = new System.Text.ASCIIEncoding();
    >     byte[] inputFile4 = unic3.GetBytes(rdr.ReadToEnd());
    >     if(inputFile4.Length == originalSize)
    >     {
    >         return "Ascii";
    >     }
    >
    >     return "Not known";
    > }

    Thanks in advance
    Marc.
  • Nick Malik [Microsoft]

    #2
    Re: Determine File Encoding

    Why read the entire file to determine the encoding? Can't you tell from the
    indicator bytes at the beginning?

    Forgive me if I don't know much about encoding, but your algorithm appears
    wildly inefficient on its face.
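
For anyone wondering what checking the "indicator bytes" might look like, here is a minimal sketch, assuming the file starts with one of the standard Unicode byte order marks. The class and method names (`BomSniffer`, `DetectBom`) are made up for illustration and are not part of the framework. A `null` result only means "no BOM found"; it says nothing about what the encoding actually is.

```csharp
using System;
using System.IO;

static class BomSniffer
{
    // Returns a label for the byte order mark at the start of the data,
    // or null when no known BOM is present.
    public static string DetectBom(byte[] head)
    {
        // UTF-32 LE is tested before UTF-16 LE because it begins with the same two bytes.
        if (head.Length >= 4 && head[0] == 0xFF && head[1] == 0xFE && head[2] == 0x00 && head[3] == 0x00)
            return "UTF-32 LE";
        if (head.Length >= 4 && head[0] == 0x00 && head[1] == 0x00 && head[2] == 0xFE && head[3] == 0xFF)
            return "UTF-32 BE";
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16 LE";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16 BE";
        return null;
    }

    // Reads at most four bytes from the stream instead of the whole file.
    public static string DetectBom(Stream stream)
    {
        byte[] buffer = new byte[4];
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;          // end of stream reached early
            total += n;
        }
        byte[] head = new byte[total];
        Array.Copy(buffer, head, total);
        return DetectBom(head);
    }
}
```

Only the first four bytes are read, so the cost does not grow with the file size. One known ambiguity: a UTF-16 LE file whose first character after the BOM is U+0000 would be reported as UTF-32 LE by this sketch.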

    --
    --- Nick Malik [Microsoft]
    MCSD, CFPS, Certified Scrummaster


    Disclaimer: Opinions expressed in this forum are my own, and not
    representative of my employer.
    I do not answer questions on behalf of my employer. I'm just a
    programmer helping programmers.

    • Marc Jennings

      #3
      Re: Determine File Encoding

      I have to forgive you for not knowing too much about encoding. I know
      even less. I agree that the algorithm *is* wildly inefficient, but
      the fact is that I have not got a clue. :-) Such are the joys of
      learning from Google.


      • KH

        #4
        Re: Determine File Encoding

        Check out the StreamReader constructors that take a bool argument to
        determine the encoding from the byte order mark. Also check out the
        Encoding.GetPreamble() method.
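
A rough sketch of both suggestions, assuming the data is available as a `Stream` or a byte array. `StreamReader(Stream, Encoding, bool)`, `CurrentEncoding` and `Encoding.GetPreamble()` are real framework members; the wrapper class `BomDetection` and its method names are invented for this example.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetection
{
    // Lets StreamReader choose the encoding from a BOM.
    // The fallback encoding is what you get back when no BOM is found.
    public static Encoding DetectFromBom(Stream stream, Encoding fallback)
    {
        using (var reader = new StreamReader(stream, fallback, true))
        {
            reader.Peek();                  // detection happens on the first read
            return reader.CurrentEncoding;
        }
    }

    // True when the data begins with the preamble (BOM) the given encoding would write.
    public static bool StartsWithPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        if (preamble.Length == 0 || data.Length < preamble.Length) return false;
        for (int i = 0; i < preamble.Length; i++)
        {
            if (data[i] != preamble[i]) return false;
        }
        return true;
    }
}
```

Note that `DetectFromBom` disposes the reader and therefore closes the stream, and that a result equal to the fallback does not prove the file really uses that encoding.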


        • Joerg Jooss

          #5
          Re: Determine File Encoding

          The most obvious flaw would be that generally speaking this is
          impossible to achieve ;-)

          The second flaw is that your code is just plain wrong. You're using a
          UTF-8 StreamReader regardless of the actual encoding. This object will
          be able to read UTF-8 and ASCII, but UTF-16 will break for sure.

          The third flaw is that you assume "the number of types of encoding is
          small". I'd say
          http://msdn.microsoft.com/library/de.../en-us/intl/unicode_81rn.asp
          is not really a short list, although many of these
          encodings are not likely to be found in your typical American or
          Western European PC environment.
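
To make the first two flaws concrete, here is a small sketch that mirrors only the first length comparison from the original method and sets it beside a stricter validity check. The names `RoundTripFlaw`, `PassesUtf8LengthTest` and `IsValidUtf8` are hypothetical; the use of an in-memory stream instead of a `FileStream` is an assumption made to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

static class RoundTripFlaw
{
    // Mirrors the first test in the original method:
    // decode with a default StreamReader, re-encode as UTF-8, compare lengths.
    public static bool PassesUtf8LengthTest(byte[] fileBytes)
    {
        using (var reader = new StreamReader(new MemoryStream(fileBytes)))
        {
            string text = reader.ReadToEnd();
            return new UTF8Encoding().GetBytes(text).Length == fileBytes.Length;
        }
    }

    // Stricter check: decoding throws on any byte sequence that is not valid UTF-8.
    public static bool IsValidUtf8(byte[] fileBytes)
    {
        try
        {
            new UTF8Encoding(false, true).GetString(fileBytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With these helpers, the bytes of "hi" encoded as UTF-16 without a BOM pass the length test (they also happen to be valid UTF-8), while a genuine UTF-8 file that starts with a BOM fails it, because the reader strips the three BOM bytes before the text is re-encoded. The strict check rejects the Latin-1 bytes for "café", but it still cannot tell BOM-less UTF-16 "hi" from UTF-8, which is the "generally impossible" part.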

          Cheers,
          --

          mailto:news-reply@joergjooss.de


          • Joerg Jooss

            #6
            Re: Determine File Encoding

            KH wrote:

            > Check out the StreamReader constructors that take a bool argument to
            > determine the encoding from the byte order mark. Also check out the
            > Encoding.GetPreamble() method.

            That works only for certain UTFs and maybe some rather obscure stuff,
            but today's popular 8-bit encodings like ISO-8859-x or Windows-125x
            don't use preambles or BOMs.
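
This is easy to check with `Encoding.GetPreamble()`. The helper below is a hypothetical one-liner (`PreambleCheck.HasPreamble` is not a framework API), and ISO-8859-1 stands in for the 8-bit family because it is available without registering extra code pages on newer runtimes.

```csharp
using System;
using System.Text;

static class PreambleCheck
{
    // True when the encoding defines a preamble (BOM) that a reader could look for.
    public static bool HasPreamble(Encoding encoding)
    {
        return encoding.GetPreamble().Length > 0;
    }
}
```

`HasPreamble(Encoding.UTF8)` and `HasPreamble(Encoding.Unicode)` are true, while `HasPreamble(Encoding.ASCII)` and `HasPreamble(Encoding.GetEncoding("iso-8859-1"))` are false, so BOM-based detection has nothing to find in such files.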

            Cheers,
            --

            mailto:news-reply@joergjooss.de


            • Marc Jennings

              #7
              Re: Determine File Encoding

              On Wed, 01 Jun 2005 11:59:09 -0700, "Joerg Jooss"
              <news-reply@joergjooss.de> wrote:

              **snip**
              >
              >The most obvious flaw would be that generally speaking this is
              >impossible to achieve ;-)
              >
              >The second flaw is that your code is just plain wrong. You're using a
              >UTF-8 StreamReader regardless of the actual encoding. This object will
              >be able to read UTF-8 and ASCII, but UTF-16 will break for sure.
              >
              >The third flaw is that you assume "the number of types of encoding is
              >small". I'd say
              >http://msdn.microsoft.com/library/de.../en-us/intl/unicode_81rn.asp
              >is not really a short list, although many of these
              >encodings are not likely to be found in your typical American or
              >Western European PC environment.
              >
              >Cheers,

              Agreed in the general case, but perhaps I should have made my
              situation a little clearer. The files that I need to deal with will
              only be one of a very small subset of all the possible encodings out
              there.

              At least now I know my thinking is more flawed than I thought it
              was....


              • cody

                #8
                Re: Determine File Encoding

                There is no way to determine the encoding of a file unless you know
                exactly what text to expect in it, or the file contains marker bytes,
                or it has a special file extension.
                But you can try a statistical approach. If the bytes at even positions
                are mostly bigger than the bytes at odd positions (or the other way
                around, depending on the byte order), you have Unicode. If there are no
                null chars, no chars below ASCII #32 except \r and \n, and no bytes
                above #127, you almost certainly have ASCII encoding.
                In all other cases you may have UTF-8.
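
A rough sketch of that kind of heuristic, with several assumptions that are mine rather than part of the post: zero bytes are counted instead of comparing byte magnitudes, tab is allowed alongside \r and \n, and "more than half of the byte pairs" is an arbitrary threshold. The name `EncodingGuesser` is hypothetical. Treat the result as a guess, never as proof.

```csharp
using System;

static class EncodingGuesser
{
    public static string Guess(byte[] data)
    {
        int zerosEven = 0, zerosOdd = 0, controls = 0, highBytes = 0;
        for (int i = 0; i < data.Length; i++)
        {
            byte b = data[i];
            if (b == 0)
            {
                if (i % 2 == 0) zerosEven++; else zerosOdd++;
            }
            else if (b < 32 && b != '\r' && b != '\n' && b != '\t')
            {
                controls++;             // control character other than CR, LF, TAB
            }
            if (b >= 0x80) highBytes++; // outside the 7-bit ASCII range
        }

        int pairs = data.Length / 2;
        // Mostly-Latin text in UTF-16 has a zero in the same half of nearly every byte pair.
        if (pairs > 0 && zerosOdd > pairs / 2 && zerosEven == 0) return "UTF-16 LE (guess)";
        if (pairs > 0 && zerosEven > pairs / 2 && zerosOdd == 0) return "UTF-16 BE (guess)";
        if (zerosEven + zerosOdd == 0 && controls == 0 && highBytes == 0) return "ASCII (guess)";
        return "unknown, possibly UTF-8 or an 8-bit code page";
    }
}
```

The UTF-16 branches only work for text that is mostly in the Latin range; text in other scripts has few zero bytes and falls through to "unknown".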

                • Joerg Jooss

                  #9
                  Re: Determine File Encoding

                  The best approach is to have some kind of "protocol" that allows you
                  to transport metadata such as the character encoding. If this is not
                  possible (as in the case of plain files), let the user decide by allowing
                  him or her to select and switch between all supported encodings.
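
One way the "let the user decide" option could look, as a sketch only: decode the same bytes with every supported encoding and show the results side by side so the user can pick the one that reads correctly. `EncodingPreview` and `Decode` are hypothetical names, and the list of candidates is whatever the application chooses to support.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class EncodingPreview
{
    // Decodes the same bytes with each candidate and returns (encoding name, decoded text) pairs.
    public static List<KeyValuePair<string, string>> Decode(byte[] data, IEnumerable<Encoding> candidates)
    {
        var previews = new List<KeyValuePair<string, string>>();
        foreach (Encoding candidate in candidates)
        {
            previews.Add(new KeyValuePair<string, string>(candidate.WebName, candidate.GetString(data)));
        }
        return previews;
    }
}
```

For the four bytes 63 61 66 E9, the ISO-8859-1 preview reads "café", while the UTF-8 preview ends in the replacement character U+FFFD, which makes the right choice obvious to a human even though a program cannot be sure.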

                  Cheers,
                  --

                  mailto:news-reply@joergjooss.de


                  • Roby Eisenbraun Martins

                    #10
                    Re: Determine File Encoding

                    Hello Marc,

                    If you open a file using StreamReader it will load a CurrentEncoding
                    with the correct file encoding and convert the bytes to the correct
                    characters.

                    • Jon Skeet [C# MVP]

                      #11
                      Re: Determine File Encoding

                      Roby Eisenbraun Martins
                      <RobyEisenbraunMartins@discussions.microsoft.com> wrote:
                      > If you open a file using StreamReader it will load a CurrentEncoding
                      > with the correct file encoding and convert the bytes to the correct
                      > characters.

                      Only if you're lucky. It won't be able to guess correctly between
                      different ANSI character sets, for instance.

                      It's definitely best to take the guesswork out, either by explicitly
                      stating the encoding, making sure there *is* only one encoding, or
                      allowing the user to override any guesswork which has been performed.
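
A minimal sketch of the "explicit encoding, with a guess only as fallback" idea. The class `TextLoader`, the method `ReadAll` and the optional parameter are hypothetical; the `StreamReader(Stream, Encoding, bool)` constructor is the real framework one.

```csharp
using System;
using System.IO;
using System.Text;

static class TextLoader
{
    // If the caller states the encoding, it is used as-is with BOM detection switched off.
    // Otherwise the reader falls back to BOM detection with UTF-8 as the default guess.
    public static string ReadAll(Stream stream, Encoding explicitEncoding = null)
    {
        StreamReader reader = explicitEncoding != null
            ? new StreamReader(stream, explicitEncoding, false)
            : new StreamReader(stream, Encoding.UTF8, true);
        using (reader)
        {
            return reader.ReadToEnd();
        }
    }
}
```

Reading the ISO-8859-1 bytes for "café" without stating the encoding yields "caf" followed by U+FFFD, because the BOM-less file is silently treated as UTF-8. Passing `Encoding.GetEncoding("iso-8859-1")` explicitly returns the text intact.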

                      --
                      Jon Skeet - <skeet@pobox.com>

                      If replying to the group, please do not mail me too
