convert strings to utf-8

This topic is closed.
  • Niclas

    convert strings to utf-8

    Hi

    I'm having trouble working with the special characters in Swedish (Å Ä Ö
    å ä ö). The script parses a webpage and extracts information from it.
    This works fine and I get all the data correctly. The information is
    then added to an RSS file (using xml.dom.minidom.Document() to create the
    file), and this is where it goes wrong. Letters like Å ä ö get messed up and
    the RSS file does not validate. How can I convert the data to UTF-8
    without losing the special letters?

    Thanks in advance
  • Diez B. Roggisch

    #2
    Re: convert strings to utf-8

    Niclas wrote:
    > Hi
    >
    > I'm having trouble working with the special characters in Swedish (Å Ä Ö
    > å ä ö). The script parses a webpage and extracts information from it.
    > This works fine and I get all the data correctly. The information is
    > then added to an RSS file (using xml.dom.minidom.Document() to create the
    > file), and this is where it goes wrong. Letters like Å ä ö get messed up and
    > the RSS file does not validate. How can I convert the data to UTF-8
    > without losing the special letters?

    Show us the code and some example text (although I know it is difficult
    to get that right over news/mail).

    The basic idea is this:

    scraped_byte_string = scrape_the_website()

    output = scraped_byte_string.decode('website-encoding').encode('utf-8')
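
A minimal sketch of that decode-then-encode round trip, assuming Python 3 (where `bytes.decode` returns `str`, the counterpart of the `unicode` type used elsewhere in this thread). The sample bytes and the `latin-1` source encoding are illustrative assumptions; the real value depends on what the website actually serves.

```python
# Hypothetical stand-in for scraped data: "Å ä ö" encoded as Latin-1 bytes.
scraped_bytes = b"\xc5 \xe4 \xf6"

# Step 1: decode the raw bytes using the encoding the website really uses
# (assumed to be Latin-1 here).
text = scraped_bytes.decode("latin-1")

# Step 2: encode the resulting text as UTF-8 for output.
utf8_bytes = text.encode("utf-8")

print(repr(text))
print(utf8_bytes)
```

Each Swedish letter takes one byte in Latin-1 and two bytes in UTF-8, so the output is longer than the input even though the text is the same.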



    Diez


    • Martin v. Löwis

      #3
      Re: convert strings to utf-8

      Niclas wrote:
      > I'm having trouble working with the special characters in Swedish (Å Ä Ö
      > å ä ö). The script parses a webpage and extracts information from it.
      > This works fine and I get all the data correctly. The information is
      > then added to an RSS file (using xml.dom.minidom.Document() to create the
      > file), and this is where it goes wrong. Letters like Å ä ö get messed up and
      > the RSS file does not validate. How can I convert the data to UTF-8
      > without losing the special letters?

      You should convert the strings from the webpage to Unicode strings.
      You can tell that a string s is Unicode if

      print isinstance(s, unicode)

      prints True. Make sure *every* string you put into the Document
      actually is a Unicode string. Then it will just work fine.
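
A rough sketch of that advice, assuming Python 3, where `str` is the Unicode string type and `bytes` is the byte string type. The helper `to_text`, the element name `title`, and the Latin-1 source encoding are hypothetical choices for illustration, not taken from the original script.

```python
from xml.dom import minidom


def to_text(value, encoding="latin-1"):
    """Return value as text, decoding it first if it is still bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


doc = minidom.Document()
title = doc.createElement("title")
# Every string handed to the document goes through to_text() first,
# so the DOM only ever holds text, never raw bytes.
title.appendChild(doc.createTextNode(to_text(b"\xc5 \xe4 \xf6")))
doc.appendChild(title)

# Serialising with an explicit encoding returns UTF-8 bytes.
xml_bytes = doc.toxml(encoding="utf-8")
print(xml_bytes)
```

Funnelling every value through a single conversion helper keeps the "is this already text?" check in one place.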

      Regards,
      Martin


      • Niclas

        #4
        Re: convert strings to utf-8

        Thank you!

        Solved it with this:
        unicode( data.decode('latin_1') )
        and when I write it to the file...
        f = codecs.open(path, encoding='utf-8', mode='w+')
        f.write(self.__rssDoc.toxml())



        • Diez B. Roggisch

          #5
          Re: convert strings to utf-8

          Niclas wrote:
          > Thank you!
          >
          > solved it with this:
          > unicode( data.decode('latin_1') )

          The unicode() call around this is superfluous. Either do

          unicode(bytestring, encoding)

          or

          bytestring.decode(encoding)
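
As a sketch of why the extra wrapper adds nothing, here are the two spellings side by side, assuming Python 3, where `str(bytestring, encoding)` is the counterpart of `unicode(bytestring, encoding)`. The sample bytes and the Latin-1 encoding are assumptions for illustration.

```python
bytestring = b"\xe5\xe4\xf6"  # "åäö" encoded as Latin-1 (sample data)
encoding = "latin-1"

a = str(bytestring, encoding)          # constructor form
b = bytestring.decode(encoding)        # method form
c = str(bytestring.decode(encoding))   # wrapping already-decoded text

# All three hold the same text; the wrapper in c changes nothing.
print(a == b == c)
```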

          > and when I write it to the file...
          > f = codecs.open(path, encoding='utf-8', mode='w+')
          > f.write(self.__rssDoc.toxml())

          Looks good, yes.
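
A small sketch of that write step, assuming Python 3. It swaps in the built-in `open` with an `encoding` argument in place of `codecs.open`, since the built-in does the same encoding on write there. The temporary path and the sample text are hypothetical.

```python
import os
import tempfile

text = "\u00c5 \u00e4 \u00f6"  # "Å ä ö" as text, standing in for the XML
path = os.path.join(tempfile.mkdtemp(), "feed.xml")  # hypothetical path

# The file object encodes text to UTF-8 as it is written.
with open(path, mode="w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes actually stored.
with open(path, "rb") as raw:
    data = raw.read()

print(data)
```

The caller only ever passes text to `write`; the encoding to bytes happens inside the file object.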

          Diez


          • John Nagle

            #6
            Re: convert strings to utf-8

            Diez B. Roggisch wrote:
            > Niclas wrote:
            >> Thank you!
            >>
            >> solved it with this:
            >> unicode( data.decode('latin_1') )
            >
            > The unicode around this is superfluous.

            Worse, it's an error. UTF-8 needs to go into a stream
            of 8-bit bytes, not a Unicode string.
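
To illustrate the text-versus-bytes distinction behind this remark, here is a minimal sketch assuming Python 3 and an in-memory byte stream (`io.BytesIO`, chosen here only for illustration). It shows that a raw byte stream accepts encoded bytes and rejects unencoded text; a file opened with an explicit encoding performs the encode step itself.

```python
import io

text = "\u00e5\u00e4\u00f6"  # "åäö" as text (sample data)
stream = io.BytesIO()        # a stream of 8-bit bytes

try:
    stream.write(text)       # text has not been encoded yet
    rejected = False
except TypeError:
    rejected = True          # a byte stream refuses unencoded text

# Encoding first produces bytes the stream accepts.
stream.write(text.encode("utf-8"))

print(rejected, stream.getvalue())
```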

            John Nagle
