Validate string as UTF-8?

**david mugnai** · Nov 6 '05, 08:05 PM

Re: Validate string as UTF-8?

On Sun, 06 Nov 2005 18:58:50 +0000, Tony Nelson wrote:

[snip]
> Is there a general way to call GLib functions?

ctypes?

http://starship.python.net/crew/theller/ctypes/
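As a rough illustration of the ctypes route, here is a minimal sketch that calls GLib's g_utf8_validate() directly. It assumes a modern Python 3, where ctypes ships in the standard library, and a system where the library can be located under the name "glib-2.0"; the wrapper name glib_utf8_validate is hypothetical, and it returns None when GLib cannot be loaded.

```python
import ctypes
import ctypes.util
from typing import Optional


def glib_utf8_validate(data: bytes) -> Optional[bool]:
    """Validate data with GLib's g_utf8_validate(); None if GLib is unavailable."""
    # Assumption: the shared library is discoverable as "glib-2.0" on this system.
    name = ctypes.util.find_library("glib-2.0")
    if name is None:
        return None
    try:
        glib = ctypes.CDLL(name)
    except OSError:
        return None
    # C prototype:
    #   gboolean g_utf8_validate(const gchar *str, gssize max_len, const gchar **end);
    glib.g_utf8_validate.argtypes = [ctypes.c_char_p, ctypes.c_ssize_t, ctypes.c_void_p]
    glib.g_utf8_validate.restype = ctypes.c_int
    # Passing NULL for "end" means we do not ask where validation stopped.
    return bool(glib.g_utf8_validate(data, len(data), None))


if __name__ == "__main__":
    print(glib_utf8_validate(b"plain ascii"))
    print(glib_utf8_validate(b"\xff\xfe"))
```

Passing an explicit length means GLib does not rely on a terminating NUL, but it also treats any NUL byte inside the data as invalid.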

**Fredrik Lundh** · Nov 6 '05, 08:15 PM

Re: Validate string as UTF-8?

Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.

define "validate".
> I don't see a fast way to do it in Python, though:
>
> unicode(s,'utf-8').encode('utf-8')

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

unicode(s, "utf-8")

should be sufficient.

</F>
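For readers on Python 3, where unicode() no longer exists, the same check can be sketched with bytes.decode(). This is only an illustration under that assumption; the helper name is_valid_utf8 is made up for the example and is not from the thread.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # "strict" error handling is the default
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("naïve café".encode("utf-8")))  # well-formed input
    print(is_valid_utf8(b"\xc3\x28"))  # lead byte followed by a bad continuation byte
```

The decoded string is thrown away here; only the absence of an exception matters.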

**Diez B. Roggisch** · Nov 6 '05, 08:25 PM

Re: Validate string as UTF-8?

Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.
>
> I don't see a fast way to do it in Python, though:
>
> unicode(s,'utf-8').encode('utf-8')
>
> seems to notice at least some of the time (the unicode() part works but
> the encode() part bombs). I don't consider a RE based solution to be
> fast. GLib provides a routine to do this, and I am using GTK so it's
> included in there somewhere, but I don't see a way to call GLib
> routines. I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more
details? Maybe some of the strings that allegedly don't work?

Besides that, it's unnecessary - the unicode(s, "utf-8") should be
sufficient. If there are any undecodable byte sequences in there, that
should find them.

Regards,

Diez
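To see which byte sequence the decoder objects to, the exception itself carries the position. The following is a small sketch, assuming Python 3 and its UnicodeDecodeError attributes start and reason; the function name first_utf8_error is hypothetical.

```python
from typing import Optional, Tuple


def first_utf8_error(data: bytes) -> Optional[Tuple[int, str]]:
    """Return (byte offset, reason) for the first invalid sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte in data.
        return exc.start, exc.reason
    return None


if __name__ == "__main__":
    print(first_utf8_error(b"fine"))
    print(first_utf8_error(b"ab\xffcd"))
```

Printing the offset and reason for a string that "allegedly doesn't work" is usually enough to tell whether the data or the surrounding code is at fault.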

**Tony Nelson** · Nov 6 '05, 08:25 PM

Re: Validate string as UTF-8?

In article <pan.2005.11.06.19.59.16.731919@gnx.it>,
david mugnai <asdrubale@gnx.it> wrote:
> On Sun, 06 Nov 2005 18:58:50 +0000, Tony Nelson wrote:
>
> [snip]
>
> > Is there a general way to call GLib functions?
>
> ctypes?
> http://starship.python.net/crew/theller/ctypes/

Umm. Might be easier to write an extension module.
________________________________________________________________________
TonyN.:' *firstname*nlsnews@georgea*lastname*.com
' <http://www.georgeanelson.com/>

**Waitman Gobble** · Nov 6 '05, 08:45 PM

Re: Validate string as UTF-8?

I have done this using a system call to the program "recode". Recode a
file to UTF-8 and do a diff on the original and recoded files. Not an
elegant solution, but it did seem to function properly.

Take care,

Waitman Gobble
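The same compare-after-recoding idea can be sketched in-process, without the external tools. This is an analogue rather than what recode does: it assumes Python 3, decodes with replacement characters instead of failing, re-encodes, and compares the result with the original bytes. The name survives_round_trip is hypothetical.

```python
def survives_round_trip(data: bytes) -> bool:
    """Return True if data is unchanged after a UTF-8 decode/encode round trip."""
    # Invalid sequences become U+FFFD, so the re-encoded bytes differ from the input.
    recoded = data.decode("utf-8", errors="replace").encode("utf-8")
    return recoded == data


if __name__ == "__main__":
    print(survives_round_trip("héllo".encode("utf-8")))
    print(survives_round_trip(b"h\xe9llo"))  # Latin-1 bytes, not valid UTF-8
```

A plain strict decode gives the same answer with less work; the round trip is mainly useful when the recoded copy is wanted anyway.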

**Tony Nelson** · Nov 6 '05, 08:55 PM

Re: Validate string as UTF-8?

In article <mailman.176.1131307306.18701.python-list@python.org>,
"Fredrik Lundh" <fredrik@pythonware.com> wrote:
> Tony Nelson wrote:
>
> > I'd like to have a fast way to validate large amounts of string data as
> > being UTF-8.
>
> define "validate".

That all the data conforms to the UTF-8 encoding format. I can live with
it if someone has made data that passes as UTF-8 but isn't really Unicode.

> > I don't see a fast way to do it in Python, though:
> >
> > unicode(s,'utf-8').encode('utf-8')
>
> if "validate" means "make sure the byte stream doesn't use invalid
> sequences", a plain
>
> unicode(s, "utf-8")
>
> should be sufficient.

You are correct. I misunderstood what was happening in my code. I
apologise for wasting bandwidth and your time (and I wasted my own time
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough
for my purpose, adding about 25% to the time to load a file.
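When the data is too large to hold in memory at once, the same decode-based check can be run chunk by chunk. This is a minimal sketch assuming Python 3 and the standard codecs incremental decoder; the function name stream_is_valid_utf8 and the 64 KiB default chunk size are illustrative choices, not anything from this thread.

```python
import codecs
import io
from typing import BinaryIO


def stream_is_valid_utf8(stream: BinaryIO, chunk_size: int = 1 << 16) -> bool:
    """Validate a binary stream as UTF-8 without loading it all at once."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict by default
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # The decoder buffers a multi-byte sequence split across chunks.
            decoder.decode(chunk)
        # final=True reports a sequence that is cut off at the end of input.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = io.BytesIO("é".encode("utf-8") * 10)
    print(stream_is_valid_utf8(sample, chunk_size=3))
```

With a real file, open it in binary mode ("rb") and pass the file object in place of the BytesIO sample.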
