What is mb_internal_encoding() exactly?

  • Erwin Moller

    What is mb_internal_encoding() exactly?


    Hi,

    [Excuse me for a rather lengthy post. I try to explain as well as I can
    what I do and do not understand about multibyte encoding.]

    Background: I am working on a multilanguage project now, so I decided to
    switch to UTF-8 completely to avoid trouble with Unicode characters.

    I hope somebody can review my approach and comment on it.
    I am working on:
    Server: Apache/2.2.3 (Debian) PHP/5.2.0-8+etch11
    I am testing on FF2/FF3/IE7.


    What I did so far:
    Please point out anything that is wrong/vague/stupid. ;-)

    1) Every page contains this header:
    Content-Type: text/html; charset=UTF-8
    and has the following doctype:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd">
    (All HTML is checked against W3C validator, so far so good.)

    2) My database (Postgres 8.1) is created using UTF-8 encoding.
    (As I didn't override anything for any table or column, all my text-like
    fields use UTF-8)

    3) I do NOT specify any character encoding in a META-tag.
    (The W3C advises against it; they say the header takes precedence over
    META-tags, and using the META tag may confuse some clients.)

    4) Whenever I need strlen($aString) or something similar, I use the
    multibyte variant mb_strlen($aString, 'UTF-8').

    5) When I need to display an arbitrary string (from the database, for
    example), I use:
    htmlspecialchars($someStrFromDB, ENT_QUOTES, 'UTF-8');
    If I must put a value in a text element or textarea in a form, I use the
    same.

    6) I use ADODB5 as database abstraction layer. It has a built-in
    qstr method that makes the passed string safe for use in SQL.

    7) I get my multibyte characters from here for testing:
    http://freenet-homepage.de/prilop/multilingual-1.html


    So far, so good (as far as I can tell).
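
For points 4 and 5 above, a minimal sketch of the difference between byte count and character count, and of escaping with an explicit charset. The sample string and the variable names are made up for illustration; only strlen(), mb_strlen() and htmlspecialchars() come from the post.

```php
<?php
// Sketch only: $name stands in for any UTF-8 string, e.g. one read from the database.
$name = "G\xC3\xBCnstige <Preise>"; // "Günstige <Preise>" written as explicit UTF-8 bytes

$byteCount = strlen($name);             // counts bytes: "ü" takes two bytes in UTF-8
$charCount = mb_strlen($name, 'UTF-8'); // counts characters

// Escape for HTML output, telling PHP that the input is UTF-8.
$escaped = htmlspecialchars($name, ENT_QUOTES, 'UTF-8');

echo $byteCount, "\n"; // 18
echo $charCount, "\n"; // 17
echo $escaped, "\n";   // Günstige &lt;Preise&gt;
```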


    php.net says the following for mb_strlen:
    int mb_strlen ( string $str [, string $encoding ] )
    Parameters
    str: The string being checked for length.
    encoding : The encoding parameter is the character encoding. If it is
    omitted, the internal character encoding value will be used.
    --I do not understand what this 'internal character encoding value' is.

    The page points to: mb_internal_encoding()
    Which reads:
    Set/Get the internal character encoding

    Return Values: If encoding is set, then Returns TRUE on success or FALSE
    on failure. If encoding is omitted, then the current character encoding
    name is returned.
    If I echo mb_internal_encoding() it says: ISO-8859-1
    I wonder where PHP got that value from.

    I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.

    My main questions are:
    1) What is this mb_internal_encoding exactly?
    Is that something set during compilation?
    Should I overwrite it to UTF-8, or is using the extra parameter in all
    mb_* functions good enough (and set it to UTF-8)?

    2) Should I put accept-charset="UTF-8" in all my forms, or is that set
    implicitly by my header (which always contains: Content-Type: text/html;
    charset=UTF-8)?

    3) Is it wise to save all my PHP files in UTF-8?

    I hope somebody can enlighten me a little on these issues. :-)
    Thanks for your time!

    Regards,
    Erwin Moller


    --
    ============================
    Erwin Moller
    Now dropping all postings from googlegroups.
    Why? http://improve-usenet.org/
    ============================
  • Curtis

    #2
    Re: What is mb_internal_encoding() exactly?

    I was also investigating this the other day. As for your question about
    where PHP gets the internal encoding setting: it comes from the
    [mbstring] section of the php.ini config. If the directives are
    commented out, it seems to default to ISO-8859-1.

    Other than that, I'm just as curious as you. :-)
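
One way to check this on a given server is to print the value next to the ini settings that can feed it. This is only a sketch: which directive actually applies depends on the PHP version (as far as I know, mbstring.internal_encoding is deprecated from PHP 5.6 on in favour of default_charset), so the output will differ between installations.

```php
<?php
// Sketch: report the encoding mbstring falls back to, and the ini values behind it.
$current     = mb_internal_encoding();                // getter form: returns the encoding name
$fromMbIni   = ini_get('mbstring.internal_encoding'); // may be an empty string or false
$fromCharset = ini_get('default_charset');            // used by newer PHP versions

printf("mb_internal_encoding():     %s\n", $current);
printf("mbstring.internal_encoding: %s\n", var_export($fromMbIni, true));
printf("default_charset:            %s\n", var_export($fromCharset, true));
```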

    --
    Curtis


    • AqD

      #3
      Re: What is mb_internal_encoding() exactly?

      On Sep 17, 5:58 pm, Erwin Moller
      <Since_humans_read_this_I_am_spammed_too_m...@spamyourself.com> wrote:
      > 1) Every page contains this header:
      > Content-Type: text/html; charset=UTF-8
      > and has the following doctype:
      > <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
      > "http://www.w3.org/TR/html4/strict.dtd">
      > (All HTML is checked against W3C validator, so far so good.)

      Yes.
      >
      > 2) My database (Postgres 8.1) is created using UTF-8 encoding.
      > (As I didn't override anything for any table or column, all my text-like
      > fields use UTF-8)

      If you're using MySQL, be careful: you have to set the client
      encoding for the connection. If you don't (a lot of 'unicode' projects
      don't), MySQL treats your UTF-8 SQL statements as latin1 and
      converts them wrongly inside the db.

      To set the encoding, you need to call a function such as
      mysqli_set_charset. It also affects the string escape method.
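
A minimal sketch of that call, assuming the mysqli extension and an already opened connection. The helper name useUtf8Connection() is hypothetical, and the type declarations need a much newer PHP than the 5.2 mentioned in this thread.

```php
<?php
// Sketch, assuming the mysqli extension is loaded and $db is an open connection.
// useUtf8Connection() is a hypothetical helper name, not part of any library.
function useUtf8Connection(mysqli $db): void
{
    // mysqli::set_charset() changes the connection (client) encoding; it is also
    // what real_escape_string() takes into account afterwards.
    if (!$db->set_charset('utf8')) {
        throw new RuntimeException('Could not switch the connection to utf8: ' . $db->error);
    }
}
```

It would be called once, right after connecting and before any query or escaping is done.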
      >
      > 3) I do NOT specify any character encoding in a META-tag.
      > (The W3C advises against it; they say the header takes precedence over
      > META-tags, and using the META tag may confuse some clients.)

      Some clients like IE4? ;) Basically all websites here (mis-)use the
      meta tag for charset instead of setting the header. As long as the
      encoding is ASCII-compatible (like UTF-8), it should be fine.

      I stopped listening to their advice and reading their references a
      long time ago. If you want something to work, it's better to test it with
      real implementations (i.e. the browsers).
      >
      > 4) Whenever I need strlen($aString) or something similar, I use the
      > multibyte variant mb_strlen($aString, 'UTF-8').

      Same for substrings and any other operations on string characters. But
      there are performance issues and I hope you won't run into them ;)
      >
      > 5) When I need to display an arbitrary string (from the database, for
      > example), I use:
      > htmlspecialchars($someStrFromDB, ENT_QUOTES, 'UTF-8');
      > If I must put a value in a text element or textarea in a form, I use the
      > same.

      Yes.
      >
      > 6) I use ADODB5 as database abstraction layer. It has a built-in
      > qstr method that makes the passed string safe for use in SQL.

      It is safe only for the correct encoding. You need to set the encoding as
      I wrote above. If ADODB doesn't provide a method to change the encoding,
      you can run the query "SET NAMES utf8" after connecting - I'm not sure
      how this works with the escape function though.
      >
      > 7) I get my multibyte characters from here for testing:
      > http://freenet-homepage.de/prilop/multilingual-1.html
      >
      > php.net says the following for mb_strlen:
      > int mb_strlen ( string $str [, string $encoding ] )
      > Parameters
      > str: The string being checked for length.
      > encoding: The encoding parameter is the character encoding. If it is
      > omitted, the internal character encoding value will be used.
      >
      > --I do not understand what this 'internal character encoding value' is.
      >
      > The page points to: mb_internal_encoding()
      > Which reads:
      > Set/Get the internal character encoding

      It's the default encoding for certain mbstring functions, not really
      "internal". The mbstring extension (except for some regex functions)
      can be used to deal with strings in more than one encoding at the
      same time.
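
To illustrate working with more than one encoding in the same script, here is a small sketch. The sample word and variable names are made up; the point is only that each call can name the encoding of the string it is given.

```php
<?php
// Sketch: the same text held in two encodings within one script.
$utf8   = "Gr\xC3\xBC\xC3\x9Fe";                             // "Grüße" as UTF-8 bytes
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8'); // same text, one byte per character

// Naming the encoding per call gives the right character count for each string.
$utf8Chars   = mb_strlen($utf8, 'UTF-8');        // 5
$latin1Chars = mb_strlen($latin1, 'ISO-8859-1'); // 5

// The byte lengths differ, which is why the encoding has to be known.
$utf8Bytes   = strlen($utf8);   // 7
$latin1Bytes = strlen($latin1); // 5
```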
      >
      > Return Values: If encoding is set, then Returns TRUE on success or FALSE
      > on failure. If encoding is omitted, then the current character encoding
      > name is returned.
      >
      > If I echo mb_internal_encoding() it says: ISO-8859-1
      > I wonder where PHP got that value from.
      >
      > I tried saving my PHP file in UTF-8, but it stays on ISO-8859-1.
      >
      > My main questions are:
      > 1) What is this mb_internal_encoding exactly?
      > Is that something set during compilation?
      > Should I overwrite it to UTF-8, or is using the extra parameter in all
      > mb_* functions good enough (and set it to UTF-8)?

      It is set in php.ini.

      You can also set it at the beginning of your code. Don't use the extra
      parameter unless you want to deal with other encodings - as I said, some
      regex functions don't have it, because they save state between
      calls and the encoding cannot change in between.
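
A sketch of setting the default once and then leaving the parameter out. The sample word is made up, and the switch to ISO-8859-1 is there only to reproduce the value reported in the original post. For the regex functions there is a separate setter, mb_regex_encoding(), which is not shown here.

```php
<?php
// Sketch: set the default once, early in the request, then omit the parameter.
$word = "Gr\xC3\xBC\xC3\x9Fe"; // "Grüße" as UTF-8 bytes

mb_internal_encoding('ISO-8859-1'); // the value the original poster saw
$asLatin1 = mb_strlen($word);       // 7: every byte is counted as one character

mb_internal_encoding('UTF-8');      // what a UTF-8 project would set
$asUtf8 = mb_strlen($word);         // 5

// mb_substr() and the other mb_* string functions pick up the same default.
$firstThree = mb_substr($word, 0, 3); // "Grü"
```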
      >
      > 2) Should I put accept-charset="UTF-8" in all my forms, or is that set
      > implicitly by my header (which always contains: Content-Type: text/html;
      > charset=UTF-8)?

      No need.

      > 3) Is it wise to save all my PHP files in UTF-8?

      Yes, and do not save them with the UTF-8 signature (the byte order mark).
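
The reason, as far as I understand it, is that the three signature bytes sit in front of the opening PHP tag and are sent to the browser as output, which can get in the way of later header() calls. A small sketch for checking a file's contents follows; hasUtf8Bom() is a hypothetical helper name.

```php
<?php
// Sketch: detect the UTF-8 signature (byte order mark) at the start of a file's contents.
// hasUtf8Bom() is a hypothetical helper name.
function hasUtf8Bom(string $contents): bool
{
    // The UTF-8 BOM is the three bytes EF BB BF.
    return strncmp($contents, "\xEF\xBB\xBF", 3) === 0;
}

$withBom    = "\xEF\xBB\xBF<?php echo 'hi';";
$withoutBom = "<?php echo 'hi';";

var_dump(hasUtf8Bom($withBom));    // bool(true)
var_dump(hasUtf8Bom($withoutBom)); // bool(false)
```

On real files the string would come from something like file_get_contents() on each source file.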


      • Taras_96

        #4
        Re: What is mb_internal_encoding() exactly?

        On Sep 18, 2:08 am, AqD <aquila.d...@gmail.com> wrote:
        > On Sep 17, 5:58 pm, Erwin Moller wrote:
        >> 3) I do NOT specify any character encoding in a META-tag.
        >> (The W3C advises against it; they say the header takes precedence over
        >> META-tags, and using the META tag may confuse some clients.)
        >
        > Some clients like IE4? ;) Basically all websites here (mis-)use the
        > meta tag for charset instead of setting the header. As long as the
        > encoding is ASCII-compatible (like UTF-8), it should be fine.
        >
        > I stopped listening to their advice and reading their references a
        > long time ago. If you want something to work, it's better to test it with
        > real implementations (i.e. the browsers).

        I think the meta option is provided because in some environments you
        don't have full control of the headers being generated (e.g. hosted
        solutions). I could be wrong on this.

        I don't know why a client would get confused if it got the character
        encoding in both the header and a meta tag... perhaps if they were
        different?
        >
        >> 6) I use ADODB5 as database abstraction layer. It has a built-in
        >> qstr method that makes the passed string safe for use in SQL.
        >
        > It is safe only for the correct encoding. You need to set the encoding as
        > I wrote above. If ADODB doesn't provide a method to change the encoding,
        > you can run the query "SET NAMES utf8" after connecting - I'm not sure
        > how this works with the escape function though.

        mysql_real_escape_string takes into account the character encoding
        the database is expecting... not sure about your DBAL though.
        >> --I do not understand what this 'internal character encoding value' is.
        >>
        >> The page points to: mb_internal_encoding()
        >> Which reads:
        >> Set/Get the internal character encoding
        >
        > It's the default encoding for certain mbstring functions, not really
        > "internal". The mbstring extension (except for some regex functions)
        > can be used to deal with strings in more than one encoding at the
        > same time.

        That's what I gathered. 'Internal encoding' is a bit misleading; I
        tend to think of it more as a 'default' encoding. Many of the mb
        functions take a character encoding as an optional parameter; if
        you don't supply this parameter, they assume that the encoding
        of the input string is the 'internal' (i.e. default) one.

        HTH

        Taras


        • AqD

          #5
          Re: What is mb_internal_encoding() exactly?

          On Sep 19, 7:41 pm, Taras_96 <taras...@gmail.com> wrote:
          > I think the meta option is provided because in some environments you
          > don't have full control of the headers being generated (e.g. hosted
          > solutions). I could be wrong on this.
          >
          > I don't know why a client would get confused if it got the character
          > encoding in both the header and a meta tag... perhaps if they were
          > different?

          If they differ, the browser should use the encoding from the header (I
          tested this before). But the meta tag only works with ASCII/ISO-8859-1
          based encodings, not UCS-2 or UCS-4.
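
One way to see the point about ASCII-based encodings: in a two-byte encoding the tag no longer looks like ASCII text at the byte level. The sketch below uses UTF-16BE as a stand-in for UCS-2 and a plain strpos() as a stand-in for a parser that starts out reading ASCII-compatible bytes; both are assumptions made for illustration, not a description of how any particular browser works.

```php
<?php
// Sketch: why a <meta> charset declaration is hard to find in a two-byte encoding
// for anything that begins by scanning ASCII-compatible bytes.
$tag = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

$asUtf8  = $tag;                                           // pure ASCII, so identical in UTF-8
$asUtf16 = mb_convert_encoding($tag, 'UTF-16BE', 'UTF-8'); // every character becomes two bytes

$foundInUtf8  = strpos($asUtf8, '<meta') !== false;  // true
$foundInUtf16 = strpos($asUtf16, '<meta') !== false; // false: the bytes are 00 3C 00 6D ...

echo bin2hex(substr($asUtf16, 0, 4)), "\n"; // 003c006d
```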
          >
          >
          >
          > mysql_real_escape_string takes into account the character encoding
          > the database is expecting... not sure about your DBAL though.

          True, but most developers only set the database encoding, not the connection
          encoding, which MySQL assumes to be latin1, so they end up
          storing data in the wrong encoding in the database even though the text on
          the webpages looks correct ;) The problem is still very "popular" now - you
          can check the code of some open-source projects such as phpBB and
          XOOPS.
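
The effect can be imitated in plain PHP without a database. This is a sketch under assumptions: mb_convert_encoding() with ISO-8859-1 stands in for the server-side conversion (MySQL's latin1 is not byte-for-byte identical to ISO-8859-1, but the bytes used here behave the same), and the sample word is made up.

```php
<?php
// Sketch of the failure mode: UTF-8 bytes go over a connection that is assumed
// to be latin1, so the server converts them "from latin1" a second time.
$original = "caf\xC3\xA9"; // "café" as UTF-8 bytes (5 bytes, 4 characters)

// What ends up stored when the UTF-8 bytes are treated as ISO-8859-1 text.
$stored = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), "\n";           // 5
echo strlen($stored), "\n";             // 7
echo mb_strlen($stored, 'UTF-8'), "\n"; // 5, i.e. "cafÃ©" instead of "café"

// Reading back through the same mis-declared connection reverses the damage,
// which is why the page can look right while the stored data is wrong.
$roundTrip = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');
var_dump($roundTrip === $original); // bool(true)
```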
