Unicode BOM marks

  • Francis Girard

    Unicode BOM marks

    Hi,

    For the first time in my programming life, I have to take care of character
    encoding. I have a question about the BOM marks.

    If I understand well, in the UTF-8 binary representation, some systems add
    a BOM mark at the beginning of the file (Windows?) and some don't (Linux?).
    Therefore, the exact same text encoded in the same UTF-8 will result in two
    different binary files, of slightly different lengths.
    Right ?

    I guess that this leading BOM mark consists of special marking bytes that
    can't in any way be decoded as valid text.
    Right ?
    (I really really hope the answer is yes, otherwise we're in hell when moving
    files from one platform to another, even with the same Unicode encoding.)

    I also guess that this leading BOM mark is silently ignored by any unicode
    aware file stream reader to which we already indicated that the file follows
    the UTF-8 encoding standard.
    Right ?

    If so, is it the case with the python codecs decoder ?

    In the python documentation, I see these constants. The documentation is not
    clear about which encodings these constants apply to. Here's my understanding :

    BOM : UTF-8 only or UTF-8 and UTF-32 ?
    BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
    BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?
    BOM_UTF8 : UTF-8 only
    BOM_UTF16 : UTF-16 only
    BOM_UTF16_BE : UTF-16 only
    BOM_UTF16_LE : UTF-16 only
    BOM_UTF32 : UTF-32 only
    BOM_UTF32_BE : UTF-32 only
    BOM_UTF32_LE : UTF-32 only

    Why should I need these constants if codecs decoder can handle them without my
    help, only specifying the encoding ?

    Thank you

    Francis Girard




    Python tells me to use an encoding declaration at the top of my files (the
    message is referring to http://www.python.org/peps/pep-0263.html).

    I expected to see there a list of acceptable encodings.

  • Martin v. Löwis

    #2
    Re: Unicode BOM marks

    Francis Girard wrote:[color=blue]
    > If I understand well, in the UTF-8 binary representation, some systems add
    > a BOM mark at the beginning of the file (Windows?) and some don't (Linux?).
    > Therefore, the exact same text encoded in the same UTF-8 will result in two
    > different binary files, of slightly different lengths.
    > Right ?[/color]

    Mostly correct. I would prefer if people referred to the thing not as
    "BOM" but as "UTF-8 signature", at least in the context of UTF-8, as
    UTF-8 has no byte-order issues that a "byte order mark" would deal with.
    (it is correct to call it "BOM" in the context of UTF-16 or UTF-32).

    Also, "some systems" is inadequate. It is not so much the operating
    system that decides to add or leave out the UTF-8 signature, but much
    more the application writing the file. Any high-quality tool will accept
    the file with or without signature, whether it is a tool on Windows
    or a tool on Unix.

    I personally would write my applications so that they put the signature
    into files that cannot be concatenated meaningfully (since the
    signature simplifies encoding auto-detection) and leave out the
    signature from files which can be concatenated (as concatenating the
    files will put the signature in the middle of a file).

    [color=blue]
    > I guess that this leading BOM mark consists of special marking bytes that
    > can't in any way be decoded as valid text.
    > Right ?[/color]

    Wrong. The BOM mark decodes as U+FEFF:
    [color=blue][color=green][color=darkred]
    >>> codecs.BOM_UTF8.decode("utf-8")[/color][/color][/color]
    u'\ufeff'

    This is what makes it a byte order mark: in UTF-16, you can tell the
    byte order by checking whether it is FEFF or FFFE. The character U+FFFE
    is an invalid character, which cannot be decoded as valid text
    (although the Python codec will decode it as invalid text).
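    On a modern Python 3 (where the u'\ufeff' above is written "\ufeff"), these
    facts can be checked directly against the codecs constants:

```python
import codecs

# The UTF-8 signature decodes as the ordinary (invisible) character U+FEFF:
assert codecs.BOM_UTF8.decode("utf-8") == "\ufeff"

# In UTF-16, the byte order shows in which byte of FEFF comes first:
assert codecs.BOM_UTF16_LE == b"\xff\xfe"  # little-endian
assert codecs.BOM_UTF16_BE == b"\xfe\xff"  # big-endian
```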
    [color=blue]
    > I also guess that this leading BOM mark is silently ignored by any unicode
    > aware file stream reader to which we already indicated that the file follows
    > the UTF-8 encoding standard.
    > Right ?[/color]

    No. It should eventually be ignored by the application, but whether the
    stream reader special-cases it or not depends on application needs.
    [color=blue]
    > If so, is it the case with the python codecs decoder ?[/color]

    No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
    it to the application when it finds it, and it will never generate the
    signature on its own. So processing the UTF-8 signature is left to the
    application in Python.
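    A minimal sketch (in current Python 3 syntax) of what "left to the
    application" means in practice:

```python
import codecs

raw = codecs.BOM_UTF8 + "Hello".encode("utf-8")
text = raw.decode("utf-8")     # the plain utf-8 codec keeps the signature
assert text == "\ufeffHello"

if text.startswith("\ufeff"):  # so the application strips it itself
    text = text[1:]
assert text == "Hello"
```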
    [color=blue]
    > In the python documentation, I see these constants. The documentation is not
    > clear about which encodings these constants apply to. Here's my understanding :
    >
    > BOM : UTF-8 only or UTF-8 and UTF-32 ?[/color]

    UTF-16.
    [color=blue]
    > BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
    > BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?[/color]

    UTF-16
    [color=blue]
    > Why should I need these constants if codecs decoder can handle them without my
    > help, only specifying the encoding ?[/color]

    Well, because the codecs don't. It might be useful to add a
    "utf-8-signature" codec some day, which generates the signature on
    encoding, and removes it on decoding.
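    (Later Python versions did grow exactly such a codec, under the name
    "utf-8-sig"; a quick sketch of its behaviour:)

```python
# "utf-8-sig" strips the signature on decode and generates it on encode.
raw = b"\xef\xbb\xbfHello"
assert raw.decode("utf-8-sig") == "Hello"
assert b"Hello".decode("utf-8-sig") == "Hello"   # also fine without it
assert "Hi".encode("utf-8-sig") == b"\xef\xbb\xbfHi"
```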

    Regards,
    Martin


    • Francis Girard

      #3
      Re: Unicode BOM marks

      On Monday, March 7, 2005, at 21:54, "Martin v. Löwis" wrote:

      Hi,

      Thank you for your very informative answer. Some interspersed remarks follow.
      [color=blue]
      >
      > I personally would write my applications so that they put the signature
      > into files that cannot be concatenated meaningfully (since the
      > signature simplifies encoding auto-detection) and leave out the
      > signature from files which can be concatenated (as concatenating the
      > files will put the signature in the middle of a file).
      >[/color]

      Well, no. There's no such thing as a text file that can't be concatenated !
      Sooner or later, someone will use "cat" on the text files your application
      generated. That will be a lot of fun for the new unicode aware "super-cat".
      [color=blue][color=green]
      > > I guess that this leading BOM mark consists of special marking bytes that
      > > can't in any way be decoded as valid text.
      > > Right ?[/color]
      >
      > Wrong. The BOM mark decodes as U+FEFF:[color=green][color=darkred]
      > >>> codecs.BOM_UTF8.decode("utf-8")[/color][/color]
      >
      > u'\ufeff'[/color]

      I meant "valid text" to denote human-readable, actual, real natural language
      text. My intent with this question was to make sure that we can easily
      distinguish a UTF-8 file with the signature from one without. Your answer
      implies a "yes".
      [color=blue][color=green]
      > > I also guess that this leading BOM mark is silently ignored by any
      > > unicode aware file stream reader to which we already indicated that the
      > > file follows the UTF-8 encoding standard.
      > > Right ?[/color]
      >
      > No. It should eventually be ignored by the application, but whether the
      > stream reader special-cases it or not is depends on application needs.
      >[/color]

      Well, for most of us, I think, the need is to transparently decode the input
      into a unique internal unicode encoding (UTF-16 for both Java and Qt ; the Qt
      docs saying there might be a need to switch to UTF-32 some day) and then be
      able to manipulate this internal text with the usual tools your programming
      system provides. By "transparent", I mean, at least, to be able to
      automatically process the two variants of the same UTF-8 encoding. We should
      only have to specify "UTF-8" and the streamer takes care of the rest.
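      On later Pythons the "utf-8-sig" codec provides exactly this transparency;
      a sketch using a temporary file (current Python 3 I/O syntax):

```python
import codecs
import os
import tempfile

# Both variants -- with and without the signature -- decode to the same text.
for payload in (codecs.BOM_UTF8 + b"salut", b"salut"):
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    with open(path, encoding="utf-8-sig") as f:
        assert f.read() == "salut"
    os.remove(path)
```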

      BTW, the python "unicode" built-in function documentation says it returns a
      "unicode" string, which scarcely means anything. What is the python
      "internal" unicode encoding ?
      [color=blue]
      >
      > No; the Python UTF-8 codec is unaware of the UTF-8 signature. It reports
      > it to the application when it finds it, and it will never generate the
      > signature on its own. So processing the UTF-8 signature is left to the
      > application in Python.
      >[/color]
      Ok.
      [color=blue][color=green]
      > > In the python documentation, I see these constants. The documentation is
      > > not clear about which encodings these constants apply to. Here's my
      > > understanding :
      > >
      > > BOM : UTF-8 only or UTF-8 and UTF-32 ?[/color]
      >
      > UTF-16.
      >[color=green]
      > > BOM_BE : UTF-8 only or UTF-8 and UTF-32 ?
      > > BOM_LE : UTF-8 only or UTF-8 and UTF-32 ?[/color]
      >
      > UTF-16
      >[/color]
      Ok.
      [color=blue][color=green]
      > > Why should I need these constants if codecs decoder can handle them
      > > without my help, only specifying the encoding ?[/color]
      >
      > Well, because the codecs don't. It might be useful to add a
      > "utf-8-signature" codec some day, which generates the signature on
      > encoding, and removes it on decoding.
      >[/color]
      Ok.

      My sincere thanks,

      Francis Girard
      [color=blue]
      > Regards,
      > Martin[/color]


      • Jeff Epler

        #4
        Re: Unicode BOM marks

        On Mon, Mar 07, 2005 at 11:56:57PM +0100, Francis Girard wrote:[color=blue]
        > BTW, the python "unicode" built-in function documentation says it returns a
        > "unicode" string, which scarcely means anything. What is the python
        > "internal" unicode encoding ?[/color]

        The language reference says fairly little about unicode objects. Here's
        what it does say: [http://docs.python.org/ref/types.html#l2h-48]
        Unicode
        The items of a Unicode object are Unicode code units. A Unicode
        code unit is represented by a Unicode object of one item and can
        hold either a 16-bit or 32-bit value representing a Unicode
        ordinal (the maximum value for the ordinal is given in
        sys.maxunicode, and depends on how Python is configured at
        compile time). Surrogate pairs may be present in the Unicode
        object, and will be reported as two separate items. The built-in
        functions unichr() and ord() convert between code units and
        nonnegative integers representing the Unicode ordinals as
        defined in the Unicode Standard 3.0. Conversion from and to
        other encodings are possible through the Unicode method encode
        and the built-in function unicode().

        In terms of the CPython implementation, the PyUnicodeObject is laid out
        as follows:
        typedef struct {
            PyObject_HEAD
            int length;        /* Length of raw Unicode data in buffer */
            Py_UNICODE *str;   /* Raw Unicode buffer */
            long hash;         /* Hash value; -1 if not set */
            PyObject *defenc;  /* (Default) Encoded version as Python
                                  string, or NULL; this is used for
                                  implementing the buffer protocol */
        } PyUnicodeObject;
        Py_UNICODE is some "C" integral type that can hold values up to
        sys.maxunicode (probably one of unsigned short, unsigned int, unsigned
        long, wchar_t).
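        The "surrogate pairs" remark can be seen by encoding a character above
        U+FFFF to UTF-16 (shown in current Python 3, where unichr() is spelled
        chr()):

```python
# U+10000 is the first character beyond the 16-bit range; in UTF-16 it is
# stored as the surrogate pair D800 DC00.
s = "\U00010000"
assert ord(s) == 0x10000
assert s.encode("utf-16-be") == b"\xd8\x00\xdc\x00"
```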

        Jeff



        • Martin v. Löwis

          #5
          Re: Unicode BOM marks

          Francis Girard wrote:[color=blue]
          > Well, no. There's no such thing as a text file that can't be concatenated !
          > Sooner or later, someone will use "cat" on the text files your application
          > generated. That will be a lot of fun for the new unicode aware "super-cat".[/color]

          Well, no. For example, Python source code is not typically concatenated,
          nor is source code in any other language. The same holds for XML files:
          concatenating two XML documents (using cat) gives an ill-formed document
          - whether the files start with an UTF-8 signature or not.

          As for the "super-cat": there is actually no problem with putting U+FFFE
          in the middle of some document - applications are supposed to filter it
          out. The precise processing instructions in the Unicode standard vary
          from Unicode version to Unicode version, but essentially, you are
          supposed to ignore the BOM if you see it.
          [color=blue]
          > BTW, the python "unicode" built-in function documentation says it returns a
          > "unicode" string, which scarcely means anything. What is the python
          > "internal" unicode encoding ?[/color]

          A Unicode string is a sequence of integers. The numbers are typically
          represented as base-2, but the details depend on the C compiler.
          It is specifically *not* UTF-16, big or little endian (i.e. a single
          number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
          depending on a compile-time choice (which can be determined by looking
          at sys.maxunicode, which in turn can be either 65535 or 1114111).

          The programming interface to the individual characters is formed by
          the unichr and ord builtin functions, which expect and return integers
          between 0 and sys.maxunicode.
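          On a current Python 3 (where unichr is spelled chr, and all builds are
          "wide" since 3.3), that interface looks like:

```python
import sys

assert sys.maxunicode == 0x10FFFF  # i.e. 1114111; narrow builds are gone
assert chr(0xFEFF) == "\ufeff"
assert ord("\ufeff") == 0xFEFF
```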

          Regards,
          Martin


          • Francis Girard

            #6
            Re: Unicode BOM marks

            Hi,
            [color=blue]
            > Well, no. For example, Python source code is not typically concatenated,
            > nor is source code in any other language.[/color]

            We did it with C++ files in order to have only one compilation unit, to
            accelerate compilation time over the network. Also, all the languages with
            some "include" directive will have to take care of it. I guess a
            unicode-aware C preprocessor already does.
            [color=blue]
            > As for the "super-cat": there is actually no problem with putting U+FFFE
            > in the middle of some document - applications are supposed to filter it
            > out. The precise processing instructions in the Unicode standard vary
            > from Unicode version to Unicode version, but essentially, you are
            > supposed to ignore the BOM if you see it.[/color]

            Ok. I'm re-assured.
            [color=blue]
            > A Unicode string is a sequence of integers. The numbers are typically
            > represented as base-2, but the details depend on the C compiler.
            > It is specifically *not* UTF-16, big or little endian (i.e. a single
            > number is *not* a sequence of bytes). It may be UCS-2 or UCS-4,
            > depending on a compile-time choice (which can be determined by looking
            > at sys.maxunicode, which in turn can be either 65535 or 1114111).
            >
            > The programming interface to the individual characters is formed by
            > the unichr and ord builtin functions, which expect and return integers
            > between 0 and sys.maxunicode.[/color]

            Ok. I guess that Python gives the flexibility of being configurable (when
            compiling Python) to internally represent unicode strings with a fixed 2 or
            4 bytes per character (UCS-2 or UCS-4).

            Thank you
            Francis Girard


            • John Roth

              #7
              Re: Unicode BOM marks


              "Martin v. Löwis" <martin@v.loewis.de> wrote in message
              news:422cf441$0$12162$9b622d9e@news.freenet.de...[color=blue]
              > Francis Girard wrote:[color=green]
              >> Well, no text files can't be concatenated ! Sooner or later, someone will
              >> use "cat" on the text files your application did generate. That will be a
              >> lot of fun for the new unicode aware "super-cat".[/color]
              >
              > Well, no. For example, Python source code is not typically concatenated,
              > nor is source code in any other language. The same holds for XML files:
              > concatenating two XML documents (using cat) gives an ill-formed document
              > - whether the files start with an UTF-8 signature or not.[/color]

              And if you're talking HTML and XML, the situation is even worse, since
              the application absolutely needs to be aware of the signature. HTML might
              have a <meta ... > directive close to the front to tell you what the
              encoding is supposed to be, and then again, it might not. You should be
              able to depend on the first character being a <, but you might not be
              able to. FitNesse, for example, sends FIT a file that consists of the
              HTML between the <body> and </body> tags, and nothing else. This
              situation makes character set detection in PyFit, um, interesting.
              (Fortunately, I have other ways of dealing with FitNesse, but it's still
              an issue for batch use.)
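              A hypothetical sniffing helper (the name and structure are mine, not
              PyFit's) for the cases described above might look like:

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    # Check for a signature first; otherwise fall back to UTF-8, since a
    # leading '<' fits any ASCII-compatible encoding.
    for bom, enc in ((codecs.BOM_UTF8, "utf-8-sig"),
                     (codecs.BOM_UTF16_LE, "utf-16"),
                     (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            return enc
    return "utf-8"

assert sniff_encoding(b"\xef\xbb\xbf<html>") == "utf-8-sig"
assert sniff_encoding(b"<body>hi</body>") == "utf-8"
```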
              [color=blue]
              > As for the "super-cat": there is actually no problem with putting U+FFFE
              > in the middle of some document - applications are supposed to filter it
              > out. The precise processing instructions in the Unicode standard vary
              > from Unicode version to Unicode version, but essentially, you are
              > supposed to ignore the BOM if you see it.[/color]

              It would be useful for "super-cat" to filter all but the first one, however.

              John Roth[color=blue]
              >
              >
              > Regards,
              > Martin
              >[/color]


              • Steve Horsley

                #8
                Re: Unicode BOM marks

                Francis Girard wrote:[color=blue]
                > On Monday, March 7, 2005, at 21:54, "Martin v. Löwis" wrote:
                >
                > Hi,
                >
                > Thank you for your very informative answer. Some interspersed remarks follow.
                >
                >[color=green]
                >>I personally would write my applications so that they put the signature
                >>into files that cannot be concatenated meaningfully (since the
                >>signature simplifies encoding auto-detection) and leave out the
                >>signature from files which can be concatenated (as concatenating the
                >>files will put the signature in the middle of a file).
                >>[/color]
                >
                >
                > Well, no. There's no such thing as a text file that can't be concatenated !
                > Sooner or later, someone will use "cat" on the text files your application
                > generated. That will be a lot of fun for the new unicode aware "super-cat".
                >[/color]

                It is my understanding that the BOM (U+feff) is actually the
                Unicode character "Non-breaking zero-width space". I take
                this to mean that the character can appear invisibly
                anywhere in text, and its appearance as the first character
                of a text is pretty harmless. Concatenating files will
                leave invisible space characters in the middle of the text,
                but presumably not in the middle of words, so no harm is
                done there either.

                I suspect that the fact that an explicitly invisible
                character feff has an invalid character code fffe for its
                byte-reversed counterpart is no accident, and that the
                character was intended from inception to also serve as a
                byte order indication.
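                That reading can be checked against the Unicode character
                database (current Python shown):

```python
import unicodedata

# U+FEFF really is named as a zero-width no-break space...
assert unicodedata.name("\ufeff") == "ZERO WIDTH NO-BREAK SPACE"

# ...while its byte-swapped counterpart U+FFFE is a noncharacter with no name.
try:
    unicodedata.name("\ufffe")
    raise AssertionError("U+FFFE unexpectedly has a name")
except ValueError:
    pass
```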

                Steve


                • Martin v. Löwis

                  #9
                  Re: Unicode BOM marks

                  Steve Horsley wrote:[color=blue]
                  > It is my understanding that the BOM (U+feff) is actually the Unicode
                  > character "Non-breaking zero-width space".[/color]

                  My understanding is that this used to be the case. According to the
                  current Unicode standard, the application should now specify specific
                  processing: both simply dropping it and reporting an error are
                  acceptable behaviours. Applications that need the ZWNBSP behaviour
                  (i.e. want to indicate that there should be no break at this point)
                  should use U+2060 (WORD JOINER).
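                  The recommended replacement can likewise be checked (a quick
                  sketch on a current Python):

```python
import unicodedata

assert unicodedata.name("\u2060") == "WORD JOINER"
assert unicodedata.category("\u2060") == "Cf"  # an invisible format character
```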

                  Regards,
                  Martin
