utf-8 read/write file

  • Bruno

    utf-8 read/write file

    Hi!

    I have a big .txt file which I want to read, process and write to another .txt file.
    I have written a script for that, but I'm having a problem with Croatian characters
    (Š,Đ,Ž,Č,Ć).
    How can I read from and write to a file in UTF-8 encoding?
    I read the file with fileinput.input.

    Thanks
  • Benjamin

    #2
    Re: utf-8 read/write file

    On Oct 8, 12:49 pm, Bruno <Br...@hi.t-com.hr> wrote:
    > Hi!
    >
    > I have a big .txt file which I want to read, process and write to another .txt file.
    > I have written a script for that, but I'm having a problem with Croatian characters
    > (Š,Đ,Ž,Č,Ć).

    Can you show us what you have so far?

    > How can I read from and write to a file in UTF-8 encoding?

    import codecs
    data = codecs.open("my-utf8-file.txt", encoding="utf-8").read()

    > I read the file with fileinput.input.
    >
    > Thanks
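
A minimal sketch of the read, process, write loop described in the question, assuming Python 3 and an input file that really is UTF-8. The names `process_line` and `convert` are hypothetical, and the uppercase step only stands in for the real text processing. On the Python 2 versions used in this thread, `codecs.open` (or `io.open` from 2.6 on) accepts the same `encoding` argument.

```python
import os
import tempfile


def process_line(line):
    # Hypothetical placeholder for the "simple text processing" step.
    return line.upper()


def convert(src_path, dst_path, encoding="utf-8"):
    # Iterating over the file object reads one line at a time,
    # so a big file is never held in memory all at once.
    with open(src_path, "r", encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in src:
            dst.write(process_line(line))


# Small round trip through temporary files.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "file1.txt")
dst_path = os.path.join(tmp_dir, "file2.txt")

with open(src_path, "w", encoding="utf-8") as f:
    f.write("život šđžčć\n")

convert(src_path, dst_path)

with open(dst_path, encoding="utf-8") as f:
    result = f.read()
```

Passing the encoding explicitly on both the reading and the writing side matters here: without it, the platform default is used, which on Windows is often not UTF-8.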


    • gigs

      #3
      Re: utf-8 read/write file

      Benjamin wrote:
      > On Oct 8, 12:49 pm, Bruno <Br...@hi.t-com.hr> wrote:
      >> Hi!
      >>
      >> I have a big .txt file which I want to read, process and write to another .txt file.
      >> I have written a script for that, but I'm having a problem with Croatian characters
      >> (Š,Đ,Ž,Č,Ć).
      >
      > Can you show us what you have so far?
      >
      >> How can I read from and write to a file in UTF-8 encoding?
      >
      > import codecs
      > data = codecs.open("my-utf8-file.txt", encoding="utf-8").read()
      >
      >> I read the file with fileinput.input.
      >>
      >> Thanks

      I have tried with codecs, but when I use encoding="utf-8" I get this error on
      the word: život

      Traceback (most recent call last):
        File "C:\Users\Administrator\Desktop\getcontent.py", line 43, in <module>
          encoding="utf-8").readlines()
        File "C:\Python25\Lib\codecs.py", line 626, in readlines
          return self.reader.readlines(sizehint)
        File "C:\Python25\Lib\codecs.py", line 535, in readlines
          data = self.read()
        File "C:\Python25\Lib\codecs.py", line 424, in read
          newchars, decodedbytes = self.decode(data, self.errors)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 0:
      unexpected code byte


      I just need to read from file1.txt, process some words (it's simple text
      processing) and write them to file2.txt without losing the Croatian characters (šđžčć).


      • Kent Johnson

        #4
        Re: utf-8 read/write file

        On Oct 8, 5:55 pm, gigs <g...@hi.t-com.hr> wrote:
        > Benjamin wrote:
        >> On Oct 8, 12:49 pm, Bruno <Br...@hi.t-com.hr> wrote:
        >>> Hi!
        >>>
        >>> I have a big .txt file which I want to read, process and write to another .txt file.
        >>> I have written a script for that, but I'm having a problem with Croatian characters
        >>> (Š,Đ,Ž,Č,Ć).
        >
        > UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 0:
        > unexpected code byte

        Are you sure you have UTF-8 data? I guess your file is encoded in
        CP1250 or CP1252; in both of these charsets 0x9e represents LATIN
        SMALL LETTER Z WITH CARON.

        Kent
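
The diagnosis above can be checked directly. This is a small sketch, assuming Python 3, that decodes the offending byte from the traceback under the three encodings mentioned; the variable names are illustrative only.

```python
# The first bytes of the word that triggered the error, as stored in the file.
raw = b"\x9eivot"

# Both Windows code pages map 0x9e to LATIN SMALL LETTER Z WITH CARON.
as_cp1250 = raw.decode("cp1250")
as_cp1252 = raw.decode("cp1252")

# In UTF-8, 0x9e is a continuation byte and cannot start a character,
# so decoding fails exactly as in the traceback.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The two code pages disagree on other Croatian letters, for example 0xe8.
c_cp1250 = b"\xe8".decode("cp1250")
c_cp1252 = b"\xe8".decode("cp1252")

# Once decoded correctly, the text can be re-encoded as UTF-8.
utf8_bytes = as_cp1250.encode("utf-8")
```

Because CP1252 has no č or ć, a Croatian text file that decodes sensibly is more likely CP1250, though that remains a guess without seeing the file.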


        • gigs

          #5
          Re: utf-8 read/write file

          Kent Johnson wrote:
          > On Oct 8, 5:55 pm, gigs <g...@hi.t-com.hr> wrote:
          >> Benjamin wrote:
          >>> On Oct 8, 12:49 pm, Bruno <Br...@hi.t-com.hr> wrote:
          >>>> Hi!
          >>>> I have a big .txt file which I want to read, process and write to another .txt file.
          >>>> I have written a script for that, but I'm having a problem with Croatian characters
          >>>> (Š,Đ,Ž,Č,Ć).
          >> UnicodeDecodeError: 'utf8' codec can't decode byte 0x9e in position 0:
          >> unexpected code byte
          >
          > Are you sure you have UTF-8 data? I guess your file is encoded in
          > CP1250 or CP1252; in both of these charsets 0x9e represents LATIN
          > SMALL LETTER Z WITH CARON.
          >
          > Kent

          This data probably wasn't in UTF-8. Today I got another file that is UTF-8, and it's working.

          thanks
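
When the encoding of incoming files is not known in advance, as happened here, one option is to try UTF-8 first and fall back to a legacy code page. The helper below is a hypothetical sketch, assuming Python 3 and that CP1250 is the only other candidate; it is not something proposed in the thread.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1250")):
    # Try each candidate in order and report which one succeeded.
    # UTF-8 goes first because text in a legacy code page is rarely
    # valid UTF-8 by accident, while the reverse often decodes "fine"
    # into the wrong characters.
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


# The same word stored in the two encodings discussed above.
text_a, used_a = decode_with_fallback("život".encode("utf-8"))
text_b, used_b = decode_with_fallback("život".encode("cp1250"))
```

This only guesses between the listed candidates; it cannot tell CP1250 from other single-byte code pages, so the fallback list should reflect where the files actually come from.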
