Encoding for Devanagari Script.

  • Atul.

    Encoding for Devanagari Script.

    Hello All,

    I wanted to know what encoding I should use to open files with
    Devanagari characters. I was thinking of UTF-8 but was not sure. Any
    leads on this? Has anyone used it earlier?

    Thanks in Advance.

    Regards,
    Atul.
  • Fredrik Lundh

    #2
    Re: Encoding for Devanagari Script.

    Atul. wrote:
    > I wanted to know what encoding I should use to open files with
    > Devanagari characters. I was thinking of UTF-8 but was not sure. Any
    > leads on this? Has anyone used it earlier?

    Are we talking about existing files? If you don't know what encoding
    the files use, you could always try using the UTF-8 codec; it's very
    likely to complain if you're attempting to decode something that isn't
    UTF-8.

    If that doesn't work, it's a bit trickier -- there are several ways to
    encode Unicode, and then there's ISCII as well. If you cannot sort it
    out, try running this:
    >>> f = open("myfile.txt", "rb")
    >>> f.read(32)
    on one of your files, and post the result, and chances are that someone
    will be able to identify the encoding.

    </F>
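
The "try the UTF-8 codec and see whether it complains" check can be sketched as below. This is only an illustration under assumptions, not code from the thread: it is written for Python 3, `guess_encoding` is a hypothetical helper name, and it only distinguishes BOM-marked Unicode files and valid UTF-8 from everything else (ISCII, for instance, just comes back as "unknown"). Pass it the whole file content, since a short prefix such as `f.read(32)` may cut a multi-byte character in half and fail to decode.

```python
import codecs


def guess_encoding(raw: bytes) -> str:
    """Return a rough guess at the encoding of `raw` (hypothetical helper)."""
    # A UTF-8 byte order mark is a strong hint; "utf-8-sig" strips it on decode.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: let the strict UTF-8 codec accept or reject the data.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown"
    return "utf-8"


# Devanagari text encoded as UTF-8 without a BOM.
print(guess_encoding("देवनागरी".encode("utf-8")))
# Bytes that are not valid UTF-8 (lone continuation bytes).
print(guess_encoding(b"\xa4\xa5\xa6"))
```

A result of "unknown" is the point at which posting the first few raw bytes, as suggested above, becomes the useful next step.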


    • Terry Reedy

      #3
      Re: Encoding for Devanagari Script.



      Atul. wrote:
      > Hello All,
      >
      > I wanted to know what encoding I should use to open files with
      > Devanagari characters. I was thinking of UTF-8 but was not sure. Any
      > leads on this? Has anyone used it earlier?

      You cannot hurt your machine by giving that a try.

      This is a general comment for all beginners. Before posting, open the
      interactive interpreter (or IDLE) and try a few things. If the result
      puzzles you, copy and paste it into a post. Or, if more appropriate,
      open the Python manuals and search a bit, or try a search engine.


      • Atul.

        #4
        Re: Encoding for Devanagari Script.

        Hi Fredrik and Terry,

        Well, I got this in IDLE. I think I have done something wrong.

        >>> import codecs
        >>> f = open("C:\Documents and Settings\admin\My Documents\corpus\dainaikAikya collected by sushant.txt","r", "utf_8")
        Traceback (most recent call last):
          File "<pyshell#1>", line 1, in <module>
            f = open("C:\Documents and Settings\admin\My Documents\corpus\dainaikAikya collected by sushant.txt","r", "utf_8")
        TypeError: an integer is required

        After that I tried binary read mode and read the first 32
        bytes, and this is what I got:

        >>> f = open("C:\Documents and Settings\\admin\\My Documents\\corpus\\dainaikAikya collected by sushant.txt","rb")
        >>> f.read(32)
        '\xef\xbb\xbf\xe0\xa4\xa8\xe0\xa4\xb5\xe0\xa5\x80 \xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d\xe0\xa4\xb2\xe0\xa5\x80,'

        Now, based on my knowledge of Unicode, I think this is a UTF-8 file
        (the first 3 bytes are \xef\xbb\xbf). Please correct me if I am wrong.
        How do I read this?

        Atul.

        PS: I wrote the above code using the information from the Library
        Reference PDF, section 4.8 "Codecs". Am I doing something wrong?
        Please do let me know.
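
The 32 bytes shown above do begin with the UTF-8 byte order mark, and they decode cleanly as UTF-8. The sketch below is an illustration added for clarity, assuming Python 3 (where byte strings are written with a `b` prefix); the variable names are arbitrary. It decodes the same bytes twice: plain "utf-8" keeps the BOM as a leading U+FEFF character, while "utf-8-sig" drops it.

```python
# The 32 bytes reported by f.read(32) in the post above.
sample = (
    b"\xef\xbb\xbf\xe0\xa4\xa8\xe0\xa4\xb5\xe0\xa5\x80 "
    b"\xe0\xa4\xa6\xe0\xa4\xbf\xe0\xa4\xb2\xe0\xa5\x8d"
    b"\xe0\xa4\xb2\xe0\xa5\x80,"
)

# Plain UTF-8 succeeds but leaves the BOM in the text as U+FEFF.
with_bom = sample.decode("utf-8")

# "utf-8-sig" decodes the same data and strips a leading BOM if present.
without_bom = sample.decode("utf-8-sig")

print(len(sample))                      # number of bytes read
print(with_bom[0] == "\ufeff")          # BOM survives plain utf-8 decoding
print(without_bom == "नवी दिल्ली,")      # the Devanagari text, BOM removed
```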





        • Tim Golden

          #5
          Re: Encoding for Devanagari Script.

          Atul. wrote:
          > Hi Fredrik and Terry,
          >
          > Well, I got this in IDLE. I think I have done something wrong.
          >
          > >>> import codecs
          > >>> f = open("C:\Documents and Settings\admin\My Documents\corpus\dainaikAikya collected by sushant.txt","r", "utf_8")
          >
          > Traceback (most recent call last):
          >   File "<pyshell#1>", line 1, in <module>
          >     f = open("C:\Documents and Settings\admin\My Documents\corpus\dainaikAikya collected by sushant.txt","r", "utf_8")
          > TypeError: an integer is required
          >
          > PS: I wrote the above code using the information from the Library
          > Reference PDF, section 4.8 "Codecs". Am I doing something wrong?
          > Please do let me know.

          Only slightly. You're importing the codecs module
          but you're not using it. So you're *actually* using
          the built-in open function, which doesn't have an
          encoding parameter. It does have a third param
          which is to do with the buffer size.

          Just change your code to use codecs.open("...")
          and, I suggest, either use raw strings for your
          filename (r"c:\docume...") or use the other kind
          of slash ("c:/documen..."). Otherwise you might
          run into some problems.

          TJG
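
Both suggestions above can be tried out in a few lines. This is a hedged sketch rather than the exact code from the thread: it assumes Python 3, where the built-in `open()` accepts an `encoding` keyword (in the Python 2 of this thread, `codecs.open(path, "r", "utf-8-sig")` would play that role), and it uses a hypothetical temporary file instead of the original corpus path. The first part shows why raw strings matter: "\a" in an ordinary string literal is the BEL control character, so a segment like "\admin" loses its backslash.

```python
import codecs
import os
import tempfile

# "\a" is an escape sequence (BEL), so this is not the path it appears to be.
escaped = "C:\admin"   # 'C', ':', BEL, 'd', 'm', 'i', 'n'
raw = r"C:\admin"      # raw string: the backslash is kept
forward = "C:/admin"   # forward slashes need no escaping at all

print(len(escaped), len(raw), len(forward))

# Hypothetical sample file standing in for the corpus file in the thread:
# a UTF-8 BOM followed by Devanagari text.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "नवी दिल्ली".encode("utf-8"))

    # Pass the encoding by keyword; "utf-8-sig" also strips the leading BOM.
    with open(path, "r", encoding="utf-8-sig") as f:
        text = f.read()

print(text == "नवी दिल्ली")
```

Passing the encoding by keyword avoids the mistake in the original call, where the third positional argument was taken as the buffer size.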


          • Atul.

            #6
            Re: Encoding for Devanagari Script.

            Thanks, Tim that did work. I will proceed with my playing around now.

            Thanks a ton.

            Atul.
