byte count unicode string

  • willie

    byte count unicode string

    >willie wrote:
    >Marc 'BlackJack' Rintsch:
    >>
    > >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:
    > ># What's the correct way to get the
    > ># byte count of a unicode (UTF-8) string?
    > ># I couldn't find a builtin method
    > ># and the following is memory inefficient.
    > >ustr = "example\xC2\x9D".decode('UTF-8')
    > >num_chars = len(ustr) # 8
    > >buf = ustr.encode('UTF-8')
    > >num_bytes = len(buf) # 9
    > >That is the correct way.
    ># Apologies if I'm being dense, but it seems
    ># unusual that I'd have to make a copy of a
    ># unicode string, converting it into a byte
    ># string, before I can determine the size (in bytes)
    ># of the unicode string. Can someone provide the rationale
    ># for that or correct my misunderstanding?
    >You initially asked "What's the correct way to get the byte count of a
    >unicode (UTF-8) string".
    >
    >It appears you meant "How can I find how many bytes there are in the
    >UTF-8 representation of a Unicode string without manifesting the UTF-8
    >representation?".
    >
    >The answer is, "You can't", and the rationale would have to be that
    >nobody thought of a use case for counting the length of the UTF-8 form
    >but not creating the UTF-8 form. What is your use case?
    # Sorry for the confusion. My use case is a web app that
    # only deals with UTF-8 strings. I want to prevent silent
    # truncation of the data, so I want to validate the number
    # of bytes that make up the unicode string before sending
    # it to the database to be written.

    # For instance, say I have a name column that is varchar(50).
    # The 50 is in bytes not characters. So I can't use the length of
    # the unicode string to check if it's over the maximum allowed bytes.

    name = post.input('name') # utf-8 string

    # preferable
    if bytes(name) > 50:
        send_http_headers()
        display_page_begin()
        display_error_msg('the name is too long')
        display_form(name)
        display_page_end()

    # If I have a form with many input elements,
    # I have to convert each to a byte string
    # before I can see how many bytes make up the
    # unicode string. That's very memory inefficient
    # with large text fields - having to duplicate each
    # one to get its size in bytes:

    buf = name.encode('UTF-8')
    num_bytes = len(buf)


    # That said, I'm not losing any sleep over it,
    # so feel free to disregard any of this if it's
    # way off base.
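
For readers wondering whether the byte count can be had without building the encoded copy at all: one possible sketch, assuming Python 3 (where `str` is the Unicode type), adds up the UTF-8 width of each code point. The helper name `utf8_len` is hypothetical, not a builtin. In CPython it is likely slower than `len(s.encode('utf-8'))`; all it saves is the temporary buffer.

```python
def utf8_len(text):
    """Return how many bytes `text` would occupy in UTF-8, without encoding it."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:        # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:     # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:   # U+0800..U+FFFF: 3 bytes
            # Lone surrogates are counted as 3 here, although
            # str.encode('utf-8') would refuse to encode them.
            total += 3
        else:                # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


# Same text as the example in the thread: 7 ASCII characters plus U+009D.
ustr = "example\x9d"
print(len(ustr), utf8_len(ustr))  # 8 9
```

For strings without lone surrogates the result should agree with `len(ustr.encode('utf-8'))`.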
  • John Machin

    #2
    Re: byte count unicode string

    willie wrote:
    >Marc 'BlackJack' Rintsch:
    >>
    > >In <mailman.313.1158732191.10491.python-l...@python.org>, willie wrote:
    > ># What's the correct way to get the
    > ># byte count of a unicode (UTF-8) string?
    > ># I couldn't find a builtin method
    > ># and the following is memory inefficient.
    >
    > >ustr = "example\xC2\x9D".decode('UTF-8')
    >
    > >num_chars = len(ustr) # 8
    >
    > >buf = ustr.encode('UTF-8')
    >
    > >num_bytes = len(buf) # 9
    >
    > >That is the correct way.
    >
    ># Apologies if I'm being dense, but it seems
    ># unusual that I'd have to make a copy of a
    ># unicode string, converting it into a byte
    ># string, before I can determine the size (in bytes)
    ># of the unicode string. Can someone provide the rationale
    ># for that or correct my misunderstanding?
    >
    >You initially asked "What's the correct way to get the byte count of a
    >unicode (UTF-8) string".
    >
    >It appears you meant "How can I find how many bytes there are in the
    >UTF-8 representation of a Unicode string without manifesting the UTF-8
    >representation?".
    >
    >The answer is, "You can't", and the rationale would have to be that
    >nobody thought of a use case for counting the length of the UTF-8 form
    >but not creating the UTF-8 form. What is your use case?
    >
    # Sorry for the confusion. My use case is a web app that
    # only deals with UTF-8 strings. I want to prevent silent
    # truncation of the data, so I want to validate the number
    # of bytes that make up the unicode string before sending
    # it to the database to be written.
    >
    # For instance, say I have a name column that is varchar(50).
    # The 50 is in bytes not characters. So I can't use the length of
    # the unicode string to check if it's over the maximum allowed bytes.
    What is the database API expecting to get as an arg: a Python unicode
    object, or a Python str (8-bit, presumably encoded in utf-8) ?
    >
    name = post.input('name') # utf-8 string
    You are confusing the hell out of yourself. You say that your web app
    deals only with UTF-8 strings. Where do you get "the unicode string"
    from??? If name is a utf-8 string, as your comment says, then len(name)
    is all you need!!!

    *PLEASE* print type(name), repr(name) so that we can see what type it
    is!!
    If it says the type is str, then it's an 8-bit string, (presumably)
    encoded in utf-8.
    If it says the type is unicode, then please explain "web app that only
    deals with UTF-8 strings" ...
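
To illustrate why the type matters, here is a minimal sketch, assuming Python 3, where the 8-bit type is `bytes` and the Unicode type is `str` (the thread's `str` and `unicode` respectively). The helper `byte_length` is a hypothetical name: for data that is already encoded, `len()` is the byte count; for text, the count depends on the encoding chosen.

```python
def byte_length(value, encoding="utf-8"):
    """Return the size in bytes of `value`, whether or not it is already encoded."""
    if isinstance(value, bytes):
        # Already an 8-bit string: len() is the byte count.
        return len(value)
    if isinstance(value, str):
        # Unicode text: the byte count only exists relative to an encoding.
        return len(value.encode(encoding))
    raise TypeError("expected str or bytes, got %s" % type(value).__name__)


# The same name, once as UTF-8 bytes and once as decoded text.
for name in (b"example\xc2\x9d", "example\x9d"):
    print(type(name).__name__, repr(name), byte_length(name))
```

Both forms report 9 bytes, but only the `bytes` value gives that answer from a plain `len()`.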
    >
    # preferable
    if bytes(name) > 50:
        send_http_headers()
        display_page_begin()
        display_error_msg('the name is too long')
        display_form(name)
        display_page_end()
    >
    # If I have a form with many input elements,
    # I have to convert each to a byte string
    # before I can see how many bytes make up the
    # unicode string. That's very memory inefficient
    # with large text fields - having to duplicate each
    # one to get its size in bytes:
    They'd be garbage collected unless you worked very hard to hang on to
    them. How large is "large"?
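
As a rough sketch of the point about temporaries, assuming Python 3 and a form already decoded to `str` values: each encoded copy below exists only for the duration of the `len()` call, so at most one field is duplicated at a time. The names `too_long`, `fields` and `limits` are hypothetical; the 50-byte limit is the `varchar(50)` from the thread.

```python
def too_long(fields, limits, encoding="utf-8"):
    """Return the names of fields whose encoded size exceeds their byte limit."""
    errors = []
    for key, text in fields.items():
        # The encoded bytes object is a temporary; nothing keeps a
        # reference to it once len() has returned.
        size = len(text.encode(encoding))
        if size > limits[key]:
            errors.append(key)
    return errors


fields = {"name": "\u00e9" * 30, "city": "Oslo"}  # 30 characters, 60 UTF-8 bytes
limits = {"name": 50, "city": 50}
print(too_long(fields, limits))  # ['name']
```

The 30-character name passes a character-count check against 50 but fails the byte-count check, which is the silent-truncation case described above.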
