>willie wrote:
wrote:
# Sorry for the confusion. My use case is a web app that
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.
# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.
name = post.input('nam e') # utf-8 string
# preferable
if bytes(name) 50:
send_http_heade rs()
display_page_be gin()
display_error_m sg('the name is too long')
display_form(na me)
display_page_en d()
# If I have a form with many input elements,
# I have to convert each to a byte string
# before i can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:
buf = name.encode('UT F-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
>Marc 'BlackJack' Rintsch:
>>
>>
> >In <mailman.313.11 58732191.10491. python-l...@python.org >, willie
> ># What's the correct way to get the
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.
> ># byte count of a unicode (UTF-8) string?
> ># I couldn't find a builtin method
> ># and the following is memory inefficient.
> >ustr = "example\xC2\x9 D".decode('U TF-8')
> >num_chars = len(ustr) # 8
> >buf = ustr.encode('UT F-8')
> >num_bytes = len(buf) # 9
> >That is the correct way.
># Apologies if I'm being dense, but it seems
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rational
># for that or correct my misunderstandin g?
># unusual that I'd have to make a copy of a
># unicode string, converting it into a byte
># string, before I can determine the size (in bytes)
># of the unicode string. Can someone provide the rational
># for that or correct my misunderstandin g?
>You initially asked "What's the correct way to get the byte countof a
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation ?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?
>unicode (UTF-8) string".
>
>It appears you meant "How can I find how many bytes there are in the
>UTF-8 representation of a Unicode string without manifesting the UTF-8
>representation ?".
>
>The answer is, "You can't", and the rationale would have to be that
>nobody thought of a use case for counting the length of the UTF-8 form
>but not creating the UTF-8 form. What is your use case?
# only deals with UTF-8 strings. I want to prevent silent
# truncation of the data, so I want to validate the number
# of bytes that make up the unicode string before sending
# it to the database to be written.
# For instance, say I have a name column that is varchar(50).
# The 50 is in bytes not characters. So I can't use the length of
# the unicode string to check if it's over the maximum allowed bytes.
name = post.input('nam e') # utf-8 string
# preferable
if bytes(name) 50:
send_http_heade rs()
display_page_be gin()
display_error_m sg('the name is too long')
display_form(na me)
display_page_en d()
# If I have a form with many input elements,
# I have to convert each to a byte string
# before i can see how many bytes make up the
# unicode string. That's very memory inefficient
# with large text fields - having to duplicate each
# one to get its size in bytes:
buf = name.encode('UT F-8')
num_bytes = len(buf)
# That said, I'm not losing any sleep over it,
# so feel free to disregard any of this if it's
# way off base.
Comment