Validate string as UTF-8?

  • Tony Nelson

    Validate string as UTF-8?

    I'd like to have a fast way to validate large amounts of string data as
    being UTF-8.

    I don't see a fast way to do it in Python, though:

    unicode(s,'utf-8').encode('utf-8')

    seems to notice at least some of the time (the unicode() part works but
    the encode() part bombs). I don't consider a RE based solution to be
    fast. GLib provides a routine to do this, and I am using GTK so it's
    included in there somewhere, but I don't see a way to call GLib
    routines. I don't want to write another extension module.

    Is there a (fast) Python function to validate UTF-8 data?

    Is there some other fast way to validate UTF-8 data?

    Is there a general way to call GLib functions?
    _______________ _______________ _______________ _______________ ____________
    TonyN.:' *firstname*nlsnews@georgea*lastname*.com
    ' <http://www.georgeanelson.com/>
  • david mugnai

    #2
    Re: Validate string as UTF-8?

    On Sun, 06 Nov 2005 18:58:50 +0000, Tony Nelson wrote:

    [snip]
    > Is there a general way to call GLib functions?

    ctypes?
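For readers wondering what the ctypes route might look like, here is a minimal sketch, not a tested recipe. It assumes GLib is installed as a shared library that `ctypes.util.find_library("glib-2.0")` can locate; the wrapper name `glib_validate_utf8` is hypothetical, and it returns `None` when the library cannot be loaded. It is written for Python 3.

```python
import ctypes
import ctypes.util
from typing import Optional


def glib_validate_utf8(data: bytes) -> Optional[bool]:
    """Validate data with GLib's g_utf8_validate(); None if GLib is unavailable."""
    # Assumption: the library is discoverable under the name "glib-2.0".
    path = ctypes.util.find_library("glib-2.0")
    if path is None:
        return None
    try:
        glib = ctypes.CDLL(path)
    except OSError:
        return None
    # C prototype:
    # gboolean g_utf8_validate(const gchar *str, gssize max_len, const gchar **end)
    glib.g_utf8_validate.argtypes = [
        ctypes.c_char_p,
        ctypes.c_ssize_t,
        ctypes.POINTER(ctypes.c_char_p),
    ]
    glib.g_utf8_validate.restype = ctypes.c_int
    # Passing None for `end` means we do not ask where validation stopped.
    return bool(glib.g_utf8_validate(data, len(data), None))
```

Whether this beats the built-in decoder depends on the platform and on call overhead, so it is worth timing before adopting it.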



    • Fredrik Lundh

      #3
      Re: Validate string as UTF-8?

      Tony Nelson wrote:
      > I'd like to have a fast way to validate large amounts of string data as
      > being UTF-8.

      define "validate".
      > I don't see a fast way to do it in Python, though:
      >
      > unicode(s,'utf-8').encode('utf-8')

      if "validate" means "make sure the byte stream doesn't use invalid
      sequences", a plain

      unicode(s, "utf-8")

      should be sufficient.
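As a minimal sketch of this suggestion: the thread uses Python 2's `unicode(s, "utf-8")`; on Python 3, where `unicode()` no longer exists, `bytes.decode("utf-8")` plays the same role. The helper name `is_valid_utf8` is made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        # Strict error handling is the default, so any invalid or
        # truncated sequence raises UnicodeDecodeError.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("h\u00e9llo".encode("utf-8")))
print(is_valid_utf8(b"\xff\xfe"))
```

The decoded string is thrown away here; if the text is needed afterwards, it is cheaper to keep the result of the decode than to validate and decode separately.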

      </F>




      • Diez B. Roggisch

        #4
        Re: Validate string as UTF-8?

        Tony Nelson wrote:
        > I'd like to have a fast way to validate large amounts of string data as
        > being UTF-8.
        >
        > I don't see a fast way to do it in Python, though:
        >
        > unicode(s,'utf-8').encode('utf-8')
        >
        > seems to notice at least some of the time (the unicode() part works but
        > the encode() part bombs). I don't consider a RE based solution to be
        > fast. GLib provides a routine to do this, and I am using GTK so it's
        > included in there somewhere, but I don't see a way to call GLib
        > routines. I don't want to write another extension module.

        I somehow doubt that the encode bombs. Can you provide some more
        details? Maybe of some allegedly not working strings?

        Besides that, it's unnecessary - the unicode(s, "utf-8") should be
        sufficient. If there are any undecodable byte sequences in there, that
        should find them.
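To illustrate how the decoder reports undecodable bytes, here is a small Python 3 sketch (an assumption on my part, since the thread itself uses Python 2). It relies on the `start` attribute of `UnicodeDecodeError`; the function name `first_invalid_offset` is hypothetical.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte the decoder rejected.
        return exc.start
    return None


print(first_invalid_offset(b"abc\xffdef"))
print(first_invalid_offset(b"all good"))
```

Knowing the offset can help when supplying the "allegedly not working strings" asked for above.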

        Regards,

        Diez


        • Tony Nelson

          #5
          Re: Validate string as UTF-8?

          In article <pan.2005.11.06.19.59.16.731919@gnx.it>,
          david mugnai <asdrubale@gnx.it> wrote:
          > On Sun, 06 Nov 2005 18:58:50 +0000, Tony Nelson wrote:
          >
          > [snip]
          >
          > > Is there a general way to call GLib functions?
          >
          > ctypes?
          > http://starship.python.net/crew/theller/ctypes/

          Umm. Might be easier to write an extension module.


          • Waitman Gobble

            #6
            Re: Validate string as UTF-8?

            I have done this using a system call to the program "recode": recode a
            file to UTF-8 and do a diff on the original and recoded files. Not an
            elegant solution, but it did seem to function properly.
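The same recode-and-compare idea can be sketched without an external program. This is not the `recode` tool itself but an in-process Python 3 analogue, and `roundtrip_matches` is a made-up name: invalid bytes are replaced with U+FFFD during decoding, so the re-encoded output differs from the input exactly when the input was not valid UTF-8.

```python
def roundtrip_matches(data: bytes) -> bool:
    """Decode leniently, re-encode, and compare against the original bytes."""
    # errors="replace" substitutes U+FFFD for anything undecodable, so the
    # round trip can only reproduce the input if nothing was replaced.
    recoded = data.decode("utf-8", errors="replace").encode("utf-8")
    return recoded == data


print(roundtrip_matches("na\u00efve".encode("utf-8")))
print(roundtrip_matches(b"na\xefve"))
```

This does more work than a plain strict decode, so it is mainly useful when the cleaned-up text is wanted anyway.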

            Take care,

            Waitman Gobble


            • Tony Nelson

              #7
              Re: Validate string as UTF-8?

              In article <mailman.176.1131307306.18701.python-list@python.org>,
              "Fredrik Lundh" <fredrik@pythonware.com> wrote:
              > Tony Nelson wrote:
              >
              > > I'd like to have a fast way to validate large amounts of string data as
              > > being UTF-8.
              >
              > define "validate".

              I mean that all the data conforms to the UTF-8 encoding format. I can
              live with it if someone has made data that impersonates UTF-8 but isn't
              really Unicode.

              > > I don't see a fast way to do it in Python, though:
              > >
              > > unicode(s,'utf-8').encode('utf-8')
              >
              > if "validate" means "make sure the byte stream doesn't use invalid
              > sequences", a plain
              >
              > unicode(s, "utf-8")
              >
              > should be sufficient.

              You are correct. I misunderstood what was happening in my code. I
              apologise for wasting bandwidth and your time (and I wasted my own time
              as well).

              Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough
              for my purpose, adding about 25% to the time to load a file.
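For files too large to hold in memory comfortably, the same check can be done chunk by chunk. The following is a Python 3 sketch under that assumption, using the standard `codecs` incremental decoder; the function name `stream_is_valid_utf8` and the chunk size are illustrative choices, not anything from the thread.

```python
import codecs
import io
from typing import BinaryIO


def stream_is_valid_utf8(stream: BinaryIO, chunk_size: int = 1 << 16) -> bool:
    """Validate a binary stream as UTF-8 without reading it all at once."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict by default
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers a multi-byte sequence split across chunks.
            decoder.decode(chunk)
        # final=True makes a dangling partial sequence an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


# Usage with an in-memory stream; a file opened in "rb" mode works the same way.
print(stream_is_valid_utf8(io.BytesIO("h\u00e9llo".encode("utf-8")), chunk_size=2))
```

The added cost relative to a plain read will vary with the data and the machine, so the 25% figure above should not be assumed to carry over.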
