Determining the encoding of a text file

  • Rajorshi

    Determining the encoding of a text file

    Hello!
    How do I determine the encoding of a text file? That is,
    given a text file, I want to know which encoding it is in:
    UTF-8, UTF-16, Latin, etc. It would be very helpful if
    you could tell me how to do this in Python on Linux, but
    just the method is acceptable.
    Thanks in advance!
  • Skip Montanaro

    #2
    Re: Determining the encoding of a text file


    rajorshi> How do I determine the encoding of a text file ? That is,
    rajorshi> given a text file I want to know the encoding it is in UTF8 or
    rajorshi> UTF16 or Latin etc. It would be very helpful if you could tell
    rajorshi> me how to do this in python on Linux. But just the method is
    rajorshi> acceptable.

    In general this is not possible. You can guess using heuristics, but there is
    no predefined file attribute that indicates a file's encoding.

    If you have a small set of candidate encodings you can generally do a decent
    job guessing the encoding of a string by considering them in order. I placed
    an example on my Python Bits page: <http://www.musi-cal.com/~skip/python/>. I
    don't claim it's perfect and it's really only concerned with distinguishing
    utf-8 and a few encodings which are similar to iso-8859-1, but it does a
    decent job for me given the types of inputs I see.

    Skip
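
The "try a small set of candidates in order" idea can be sketched roughly as follows. This is only an illustration, not the code from Skip's Python Bits page; the function name `guess_encoding` and the candidate list are assumptions. Strict encodings go first and latin-1 goes last, because latin-1 accepts every byte sequence and so never fails.

```python
def guess_encoding(data, candidates=("ascii", "utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `data`.

    `data` is a bytes object; `candidates` is an assumed, ordered list of
    encodings to try, from most restrictive to most permissive.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings fit")
```

Because the final candidate always succeeds, the result is a guess, not a proof: bytes that decode cleanly as cp1252 may still have been written in some other single-byte encoding.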


    • David Opstad

      #3
      Re: Determining the encoding of a text file

      In article <85b5e3f8.0403010224.939e8f8@posting.google.com>,
      rajorshi@fastmail.fm (Rajorshi) wrote:

      > How do I determine the encoding of a text file ? That is,
      > given a text file I want to know the encoding it is in
      > UTF8 or UTF16 or Latin etc. It would be very helpful if
      > you could tell me how to do this in python on Linux. But
      > just the method is acceptable.

      If the first byte in the file is 0xFE and the second is 0xFF, then it's
      likely the file is encoded in big-endian UTF-16. If the first byte is
      0xFF and the second is 0xFE, then it's likely to be little-endian UTF-16.

      Once you've eliminated those possibilities, then it gets trickier...

      Dave
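
The byte-order-mark check Dave describes might look like the sketch below. The function name `sniff_bom` is an assumption, and the UTF-8 and UTF-32 signatures are additions beyond the two UTF-16 cases mentioned in the post. The UTF-32 marks are tested first because the little-endian UTF-32 mark starts with the same two bytes (0xFF 0xFE) as the little-endian UTF-16 mark.

```python
import codecs

# Longer signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return the encoding implied by a leading BOM in `data`, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first four bytes of the file are needed, so something like `sniff_bom(open(path, "rb").read(4))` is enough (with `path` standing in for your file name). A result of None says nothing either way: plenty of UTF-8 and UTF-16 files are written without a mark.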


      • J.R.

        #4
        Re: Determining the encoding of a text file


        "Rajorshi" <rajorshi@fastmail.fm> wrote in message
        news:85b5e3f8.0403010224.939e8f8@posting.google.com...
        > Hello!
        > How do I determine the encoding of a text file ? That is,
        > given a text file I want to know the encoding it is in
        > UTF8 or UTF16 or Latin etc. It would be very helpful if
        > you could tell me how to do this in python on Linux. But
        > just the method is acceptable.
        > Thanks in advance!

        The Python integrated development environment IDLE, which is distributed
        along with Python, shows one approach to decoding a string. You can find
        it in the file $PYTHON/lib/idlelib/IOBinding.py; look for the decode()
        method.

        It's not perfect, but you could combine it with Skip's example when
        writing your own.
        Additionally, if you want to guess a Chinese encoding, the Perl library
        http://www.mandarintools.com/download/codelib.zip
        may be a useful reference; it supports GB2312-80, HZ, Big5, UTF-8, etc.

        J.R.
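
One technique in this spirit, offered as an assumption and not as a transcription of IDLE's code, is to look for an encoding declared inside the file itself and fall back to the locale's preferred encoding when there is none. The sketch below uses the "coding: name" comment convention from PEP 263, which only helps for files that carry such a comment (Python source, for example); the function names are hypothetical.

```python
import locale
import re

# Emacs/Vim-style "coding: name" or "coding=name" comment, per PEP 263.
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def declared_encoding(data):
    """Return the encoding named in the first two lines of `data`, or None."""
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return None


def decode_with_fallback(data):
    """Decode with the declared encoding, else the locale's preferred one."""
    encoding = declared_encoding(data)
    if encoding is None:
        # No declaration found: assume the file matches the user's locale.
        encoding = locale.getpreferredencoding(False)
    # errors="replace" keeps going on bad bytes instead of raising.
    return data.decode(encoding, errors="replace"), encoding
```

If the declared name is not an encoding Python knows, `decode` raises LookupError, so a caller combining this with a candidate list may want to catch that and move on to the next guess.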



        • Rajorshi

          #5
          Re: Determining the encoding of a text file

          Thanks for your suggestions!


          "J.R." <j.r.gao@motorola.com> wrote in message news:<c20r4m$jn$1@newshost.mot.com>...
          > "Rajorshi" <rajorshi@fastmail.fm> wrote in message
          > news:85b5e3f8.0403010224.939e8f8@posting.google.com...
          > > Hello!
          > > How do I determine the encoding of a text file ? That is,
          > > given a text file I want to know the encoding it is in
          > > UTF8 or UTF16 or Latin etc. It would be very helpful if
          > > you could tell me how to do this in python on Linux. But
          > > just the method is acceptable.
          > > Thanks in advance!
          >
          > The python integrated development environment IDLE, which is distributed
          > alone with python, shows one approach how to decode a
          > string. You could find it in the file $PYTHON/lib/idlelib/IOBinding.py, find
          > the decode().
          >
          > But it's not perfect, you could integrate with Skip's example writing your
          > one.
          > Additional, if you want to guess the Chinese encoding, the perl lib
          > http://www.mandarintools.com/download/codelib.zip
          > may be for your reference, it can support GB2312-80, Hz, Big5, UTF-8, etc.
          >
          > J.R.
