How to determine if a file is UTF8 encoded?

  • Thomas Podlesak

    How to determine if a file is UTF8 encoded?

    I need to check whether a file is UTF-8 encoded. I only found the PHP
    functions 'iconv' and 'recode', but it seems it's not possible to determine
    the encoding with them. Isn't there a PHP function similar to the 'file'
    command on Linux?
  • daemon

    #2
    Re: How to determine if a file is UTF8 encoded?

    Thomas Podlesak wrote:
    > I need a check, if a file is utf8 encoded. I only found the php-functions
    > 'iconv' and 'recode'. But it seems it's not possible to determine the
    > encoding with them. Isn't there any similar function to the 'file'-command
    > on linux for php?

    OK, I barely understand your request. UTF-8 and UTF-16 are character
    encodings; they determine how the characters in your file are stored as
    bytes. An HTML file can tell the browser which encoding to use by
    declaring:

    <meta http-equiv="Content-Type" content="application/xhtml+xml;
    charset=UTF-8" />

    Likewise, if you're sending a page generated by PHP, you can declare the
    content type and character encoding through the headers:

    header("Content-Type: application/xhtml+xml; charset=utf-8");

    Other than that, the rest is up to your web server, which just serves the
    file as it is. If you wanted to change its encoding, you would have to
    convert it using the functions you mentioned above.

    • Thomas Podlesak

      #3
      Re: How to determine if a file is UTF8 encoded?

      That's not my problem, daemon.

      The problem is this: the client uploads a CSV file, and the PHP script has
      to ensure that the uploaded file is UTF-8 encoded.
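
For the upload case described here, a minimal sketch might look like the following. The form field name `csv` and the helper name `read_utf8_upload` are hypothetical, and the check relies on PCRE rejecting malformed UTF-8 when the `u` modifier is set. It only shows that the bytes are well-formed UTF-8, not that the client actually intended that encoding.

```php
<?php
/**
 * Hypothetical helper: returns the contents of an uploaded file if they are
 * well-formed UTF-8, or null otherwise.
 * $file is one entry of $_FILES, e.g. $_FILES['csv'] (assumed field name).
 */
function read_utf8_upload(array $file)
{
    // Reject uploads that PHP itself flagged as failed.
    if (!isset($file['error'], $file['tmp_name']) || $file['error'] !== UPLOAD_ERR_OK) {
        return null;
    }
    $data = file_get_contents($file['tmp_name']);
    if ($data === false) {
        return null;
    }
    // With the u modifier, preg_match() returns false for malformed UTF-8.
    if (preg_match('//u', $data) !== 1) {
        return null;
    }
    // Drop a leading UTF-8 byte order mark, which some editors prepend.
    if (strncmp($data, "\xEF\xBB\xBF", 3) === 0) {
        $data = substr($data, 3);
    }
    return $data;
}
```

In a real upload handler you would probably also call `is_uploaded_file($file['tmp_name'])` before reading; it is left out here so the function can be tried from the command line.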




      • Philip Ronan

        #4
        Re: How to determine if a file is UTF8 encoded?

        "Thomas Podlesak" wrote:
        > I need a check, if a file is utf8 encoded. I only found the php-functions
        > 'iconv' and 'recode'. But it seems it's not possible to determine the
        > encoding with them. Isn't there any similar function to the 'file'-command
        > on linux for php?

        Try this: <http://php.net/mb-detect-encoding> (but make sure you read the
        notes at the bottom, especially <http://php.net/mb-detect-encoding#50087> ).
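
As a rough illustration of the mbstring route: `mb_detect_encoding()` guesses among candidate encodings, so it helps to pass an explicit candidate list and turn on strict mode, while `mb_check_encoding()` answers the narrower yes/no question directly. This sketch assumes the mbstring extension is loaded; both function names are made up for the example.

```php
<?php
// Hypothetical wrapper around mbstring's validity check.
// Assumes the mbstring extension is available.
function is_utf8_mb($text)
{
    // mb_check_encoding() returns true only if every byte sequence in
    // $text is valid in the given encoding.
    return mb_check_encoding($text, 'UTF-8');
}

// Hypothetical wrapper: with an explicit candidate list and strict mode,
// mb_detect_encoding() returns one of the candidates that fits the bytes,
// or false if none does.
function guess_encoding_mb($text)
{
    return mb_detect_encoding($text, array('UTF-8', 'ISO-8859-1'), true);
}
```

Note that ISO-8859-1 accepts every possible byte, so a guess of "ISO-8859-1" really only means "not valid UTF-8".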

        --
        phil [dot] ronan @ virgin [dot] net



        • Chung Leong

          #5
          Re: How to determine if a file is UTF8 encoded?

          Try this:

          $text = file_get_contents("test.txt");
          echo preg_match('/./u', $text);

          The u modifier tells PCRE the input is UTF-8. If it's not properly
          encoded, then preg_match() returns false.
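
One detail worth spelling out, as a sketch rather than a definitive recipe: `preg_match()` has three possible results here. It returns 1 on a match, 0 when the pattern simply finds nothing (an empty file, or one holding only newlines, since `.` does not match a newline by default), and false when the subject is not valid UTF-8. Using an empty pattern and comparing strictly avoids mixing up the last two. The function name below is an assumption for illustration.

```php
<?php
// Hypothetical helper: true if $text is well-formed UTF-8.
function is_valid_utf8($text)
{
    // The empty pattern matches any valid subject, including "",
    // so the only way to get something other than 1 is an error.
    return preg_match('//u', $text) === 1;
}

var_dump(preg_match('/./u', ""));        // int(0): valid but nothing to match
var_dump(preg_match('/./u', "\xFF"));    // bool(false): malformed UTF-8
var_dump(is_valid_utf8(""));             // bool(true)
var_dump(is_valid_utf8("\xFF"));         // bool(false)
```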


          • Ewoud Dronkert

            #6
            Re: How to determine if a file is UTF8 encoded?

            Chung Leong wrote:
            > echo preg_match('/./u', $text);

            That will match on any single utf8 character, which could potentially be
            followed by non-utf8 data... Also, I'm not sure about its behaviour when
            encountering such data.

            --
            E. Dronkert


            • Chung Leong

              #7
              Re: How to determine if a file is UTF8 encoded?


              Ewoud Dronkert wrote:
              > Chung Leong wrote:
              > > echo preg_match('/./u', $text);
              >
              > That will match on any single utf8 character, which could potentially be
              > followed by non-utf8 data... Also, I'm not sure about its behaviour when
              > encountering such data.
              >
              > --
              > E. Dronkert

              PCRE validates the string before it runs the expression.

              pcre.c:8037
              if (valid_utf8((uschar *)subject, length) >= 0)
                return PCRE_ERROR_BADUTF8;

              If it isn't valid all the way through, then there's no match.
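
This can be observed from PHP as well. A small sketch, assuming a reasonably recent PHP version: a subject that starts with valid UTF-8 but contains a stray byte later still makes `preg_match()` fail, and `preg_last_error()` reports `PREG_BAD_UTF8_ERROR`. The sample string is made up for the example.

```php
<?php
// Valid "abc" followed by a lone 0xE9 byte (Latin-1 "é"), then more ASCII.
$subject = "abc\xE9def";

$result = preg_match('/./u', $subject);

// The pattern could match the leading "a", but PCRE checks the whole
// subject first, so the call fails instead of returning 1.
var_dump($result);                                   // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)

// Without the u modifier the same bytes are treated as plain octets.
var_dump(preg_match('/./', $subject));               // int(1)
```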


              • Ewoud Dronkert

                #8
                Re: How to determine if a file is UTF8 encoded?

                Chung Leong wrote:
                > PCRE validates the string before it runs the expression.
                > If it isn't valid all the way through, then there's no match.

                OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii
                of them? So those would also be considered utf8.

                --
                E. Dronkert


                • Chung Leong

                  #9
                  Re: How to determine if a file is UTF8 encoded?

                  Ewoud Dronkert wrote:
                  > Chung Leong wrote:
                  >
                  > > PCRE validates the string before it runs the expression.
                  > > If it isn't valid all the way through, then there's no match.
                  >
                  > OK, but aren't charsets like latin1 (8859-1) subsets of utf8, and us-ascii
                  > of them? So those would also be considered utf8.

                  Latin-1 is a subset of Unicode, true enough, with matching
                  code points. But when encoded as UTF-8, characters in the range
                  U+0080 - U+00FF become 2-byte sequences. So 8859-1 text with
                  accented letters and such won't be identified as UTF-8. Text with
                  characters only in the basic Latin range (i.e. ASCII) would be
                  identical to UTF-8.

                  It's of course possible to construct a text encoded in 8859-1, KOI8-R,
                  or whatever, that would appear as valid UTF-8. It'd be total gibberish
                  though. In a UTF-8 byte sequence, a lead byte (top two bits on, 0xC0
                  and up) has to be followed by a continuation byte (top bit on, bit 6
                  off, 0x80 - 0xBF). In an 8-bit charset, that means every accented or
                  Cyrillic letter would have to be followed by a character from the
                  0x80 - 0xBF range, which holds mostly symbols and control codes. That
                  doesn't happen in real text.
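
To make the byte-level argument concrete, here is a small sketch with hand-written byte strings. The sample words and the helper name are assumptions for illustration, not from the thread. The same word is valid UTF-8 in one form and not in the other, while a contrived Latin-1 string can still pass.

```php
<?php
function looks_like_utf8($bytes)   // hypothetical helper name
{
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "caf\xE9";        // "café" in ISO-8859-1: é is the single byte E9
$utf8   = "caf\xC3\xA9";    // "café" in UTF-8: é is the two bytes C3 A9
$ascii  = "cafe";           // pure ASCII is byte-identical in both

var_dump(looks_like_utf8($latin1)); // bool(false): E9 is a lead byte with nothing after it
var_dump(looks_like_utf8($utf8));   // bool(true)
var_dump(looks_like_utf8($ascii));  // bool(true)

// A Latin-1 text that happens to be valid UTF-8: read as ISO-8859-1, the
// bytes C3 A9 are "Ã©", a letter followed by a symbol. Valid, but gibberish.
$contrived = "caf\xC3\xA9";
var_dump(looks_like_utf8($contrived)); // bool(true)
```

The last case is byte-for-byte the same as `$utf8`, which is the point: validation can only say the bytes are well-formed UTF-8, not which encoding the author had in mind.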


                  • Thomas Podlesak

                    #10
                    Re: How to determine if a file is UTF8 encoded?

                    In another group somebody recommended the PECL Fileinfo extension.


                    Seems OK to me.
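
For reference, a sketch of how the Fileinfo extension can be asked for a file's character encoding. It assumes the fileinfo extension is enabled and uses the `FILEINFO_MIME_ENCODING` flag. The result is libmagic's guess (much like the `file` command), so treat it as a hint rather than proof. The function name and the file name are placeholders.

```php
<?php
// Hypothetical helper: returns libmagic's charset guess for $path,
// e.g. "utf-8", "us-ascii", "iso-8859-1" or "binary".
function guess_file_charset($path)
{
    $finfo = new finfo(FILEINFO_MIME_ENCODING);
    return $finfo->file($path);
}

// Example call; "upload.csv" is a placeholder path.
// echo guess_file_charset('upload.csv');
```

A file containing only ASCII is typically reported as "us-ascii", which is also valid UTF-8, so a check built on this should probably accept both values.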

                    Thanks for your help!
                    Thomas
