Determine file type (binary or text)

  • Sami Viitanen

    Determine file type (binary or text)

    Hello,

    How can I check if a file is binary or text?

    There was some easy way but I forgot it..


    Thanks in adv.


  • bromden

    #2
    Re: Determine file type (binary or text)

    > How can I check if a file is binary or text?
    >>> import os
    >>> f = os.popen('file -bi test.py', 'r')
    >>> f.read().startswith('text')
    1

    (btw, f.read() returns 'text/x-java; charset=us-ascii\n')

    --
    bromden[at]gazeta.pl


    • bromden

      #3
      Re: Determine file type (binary or text)

      > >>> f = os.popen('file -bi test.py', 'r')
      > >>> f.read().startswith('text')

      Sorry, it's not general, since "file -i" returns
      "application/x-shellscript" for shell scripts.
      It's better to do it like this:

      >>> import os
      >>> f = os.popen('file test.py', 'r')
      >>> f.read().find(' text') != -1

      --
      bromden[at]gazeta.pl


      • Sami Viitanen

        #4
        Re: Determine file type (binary or text)

        Works well in Unix but I'm making a script that works on both
        Unix and Windows.

        Win doesn't have that 'file -bi' command.

        • Michael Peuser

          #5
          Re: Determine file type (binary or text)

          Hi,
          yes there is more than just Unix in the world ;-)
          Windows directories have no means to specify their content type in any way.
          The approved method is using three-letter extensions, though this rule is
          not strictly followed (lots of files without extensions nowadays!)

          When I had a similar problem I read 1000 characters, counted the amount of
          <32 and >255 characters and classified it "binary" when this quota exceeded
          20%. I have no idea whether it will work well with Chinese Unicode files or
          some funny depositories or project files that store uncompressed texts....
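
A minimal sketch of the sampling heuristic described above, assuming Python 3 and bytes read in binary mode. The names `looks_binary` and `file_looks_binary` are made up for illustration. Two assumptions differ from the description: a byte can never exceed 255, so only values below 32 are counted, and tab, LF, FF and CR are treated as ordinary text.

```python
def looks_binary(data, threshold=0.20):
    """Guess whether a sample of bytes is binary, using a control-byte ratio."""
    if not data:
        return False  # assumption: an empty sample counts as text
    # Assumption: tab, LF, FF and CR are normal in text, so they are not counted.
    allowed = {9, 10, 12, 13}
    suspicious = sum(1 for b in data if b < 32 and b not in allowed)
    return suspicious / len(data) > threshold


def file_looks_binary(path, sample_size=1000):
    """Apply the heuristic to the first sample_size bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))
```

The 20% threshold and the 1000-byte sample come from the post. UTF-16 text, which is full of zero bytes, would likely be classified as binary by this sketch.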

          Kindly
          Michael P

          • Karl Scalet

            #6
            Re: Determine file type (binary or text)

            Michael Peuser wrote:
            > Hi,
            > yes there is more than just Unix in the world ;-)
            > Windows directories have no means to specify their content type in any way.

            That's even more true with Linux/Unix, as there is no need to do
            any stuff like line-terminator conversion.

            > The approved method is using three-letter extensions, though this rule is
            > not strictly followed (lots of files without extensions nowadays!)
            >
            > When I had a similar problem I read 1000 characters, counted the amount of
            > <32 and >255 characters and classified it "binary" when this quota exceeded
            > 20%. I have no idea whether it will work well with Chinese Unicode files or
            > some funny depositories or project files that store uncompressed texts....

            Based on the idea from Mr. "bromden", why not use mimetypes.MimeTypes()
            and guess_type('file://...') and analyze the returned string?
            This should work on windows / linux / unix / whatever.
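
A rough sketch of the `mimetypes` idea, assuming Python 3 and the module-level `mimetypes.guess_type` function. The helper name `looks_like_text_by_name` is hypothetical. Note that `guess_type` only looks at the file name or URL (its extension), never at the file's contents, so this is portable but no more reliable than the extension itself.

```python
import mimetypes


def looks_like_text_by_name(path):
    """Return True/False from the guessed MIME type, or None if unknown."""
    mime_type, _encoding = mimetypes.guess_type(path)
    if mime_type is None:
        return None  # no known extension, so no guess is possible
    return mime_type.startswith("text/")
```

A caller still has to decide what to do with `None`, which is what files without an extension produce.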


            Karl


            • Peter Hansen

              #7
              Re: Determine file type (binary or text)

              Sami Viitanen wrote:
              >
              > How can I check if a file is binary or text?
              >
              > There was some easy way but I forgot it..

              First you need to define what you mean by binary and text.
              Is a file "text" simply because it contains only the
              printable (in ASCII) bytes between 31 and 127, plus
              CR and/or LF, or do you have a more complex definition
              in mind?
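
As an illustration of that simple definition, here is a small sketch, assuming Python 3 and reading "between 31 and 127" as the byte values 32 through 126. The function name `is_printable_ascii` is invented for this example.

```python
def is_printable_ascii(data):
    """True if every byte is printable ASCII (32..126) or CR/LF."""
    return all(32 <= b <= 126 or b in (10, 13) for b in data)
```

Under this definition a tab character or any accented letter already makes the data "not text", which shows how much the answer depends on the definition chosen.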

              Better yet, what do you need the information for? Maybe
              the answer to that will show us the proper path to take.


              • Trent Mick

                #8
                Re: Determine file type (binary or text)

                [Sami Viitanen wrote]
                > Hello,
                >
                > How can I check if a file is binary or text?
                >
                > There was some easy way but I forgot it..

                Generally I define a text file as "it has no null bytes". I think this
                is a pretty safe definition (I would be interested to hear practical
                experience to the contrary). Assuming that, then:

                def is_binary(filename):
                    """Return true iff the given filename is binary.

                    Raises an EnvironmentError if the file does not exist or cannot be
                    accessed.
                    """
                    fin = open(filename, 'rb')
                    try:
                        CHUNKSIZE = 1024
                        while True:
                            chunk = fin.read(CHUNKSIZE)
                            # bytes literal, since the file is read in binary mode
                            if b'\0' in chunk:  # found null byte
                                return True
                            if len(chunk) < CHUNKSIZE:
                                break  # done
                    finally:
                        fin.close()

                    return False

                Cheers,
                Trent


                --
                Trent Mick
                TrentM@ActiveState.com


                • Grant Edwards

                  #9
                  Re: Determine file type (binary or text)

                  In article <AFm_a.9725$g4.189983@news1.nokia.com>, Sami Viitanen wrote:

                  > How can I check if a file is binary or text?

                  In order to provide an answer, you'll have to define "binary"
                  and "text".

                  > There was some easy way but I forgot it..

                  To _me_ a file isn't "binary" or "text". Those are two modes
                  you can use to read a file. The file itself is neutral on the
                  matter. At least under Windows and Unix. VMS and FILES-11
                  contained a _lot_ more meta-data and actually did have several
                  different fundamental file types (fixed length records,
                  variable length records, byte-stream, etc.).
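
To illustrate the point that "binary" and "text" are modes of reading, here is a small sketch, assuming Python 3. The helper `read_both_ways` is a made-up name. It writes the same bytes once and then reads them back in both modes.

```python
import os
import tempfile


def read_both_ways(raw):
    """Write raw bytes to a temporary file, then read it in both modes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, "rb") as f:  # binary mode: bytes come back unchanged
            as_bytes = f.read()
        with open(path, "r", encoding="ascii") as f:  # text mode: decoded str
            as_text = f.read()
    finally:
        os.remove(path)
    return as_bytes, as_text


as_bytes, as_text = read_both_ways(b"line one\r\nline two\r\n")
print(repr(as_bytes))
print(repr(as_text))
```

With Python 3's default newline handling, text mode translates `\r\n` to `\n` while binary mode returns the stored bytes untouched. The file on disk is the same in both cases.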

                  --
                  Grant Edwards                 Yow! Will it improve my
                  grante at visi.com            CASH FLOW?


                  • Peter Hansen

                    #10
                    Re: Determine file type (binary or text)

                    Trent Mick wrote:
                    >
                    > [Sami Viitanen wrote]
                    > > Hello,
                    > >
                    > > How can I check if a file is binary or text?
                    > >
                    > > There was some easy way but I forgot it..
                    >
                    > Generally I define a text file as "it has no null bytes". I think this
                    > is a pretty safe definition (I would be interested to hear practical
                    > experience to the contrary).

                    "Contains only printable characters" is probably a more useful definition
                    of text in many cases. I can't say off the top of my head exactly when
                    either definition might be a problem.... wait, how about this one: in
                    CVS, if you don't have a file that is effectively line-oriented, human
                    readable information, you probably don't want to let it be treated as
                    "text" and stored as diffs. In that situation, "contains primarily
                    printable characters organized in lines" is probably a more thorough,
                    though less deterministic, definition.

                    -Peter


                    • Graham Fawcett

                      #11
                      Re: Determine file type (binary or text)

                      Trent Mick wrote:

                      >[Sami Viitanen wrote]
                      >
                      >>Hello,
                      >>
                      >>How can I check if a file is binary or text?
                      >>
                      >>There was some easy way but I forgot it..
                      >
                      >Generally I define a text file as "it has no null bytes". I think this
                      >is a pretty safe definition (I would be interested to hear practical
                      >experience to the contrary).

                      Dangerous assumption. Even if many or most binary files contain NULs, it
                      doesn't mean that they all do.

                      It is trivial to create a non-text file that has no NULs.

                      f = open('no_zeroes.bin', 'rb')
                      for x in range(1, 256):
                          f.write(chr(x))
                      f.close()

                      Sami, I would suggest that you need to stop thinking in terms of tools,
                      and instead think in terms of the problem you're trying to solve. Why do
                      you need to (or think you need to) determine whether a file is "binary"
                      or "text"? Why would your application fail if it received a
                      (binary/text) file when it expected a (text/binary) one?

                      My guess is that the trait you are trying to identify will prove not to
                      be "binary or text", but something more application-specific.

                      -- Graham

                      P.S. Sami, it's very bad form to "make up" an e-mail address, such as
                      <none@none.net>. I'm sure the owners of the none.net domain would agree.
                      Can't you provide a real address?




                      • Peter Hansen

                        #12
                        Re: Determine file type (binary or text)

                        Grant Edwards wrote:
                        >
                        > In article <3F3A8275.8B6EE8C4@engcorp.com>, Peter Hansen wrote:
                        >
                        > > "Contains only printable characters" is probably a more useful definition
                        > > of text in many cases.
                        >
                        > The definition of "printable" is dependent on the character
                        > set, that will have to be specified.

                        That's why I said "printable (in ASCII)" in another message, so I
                        definitely agree. The problem was rather under-specified. :-)


                        • John Machin

                          #13
                          Re: Determine file type (binary or text)

                          "Michael Peuser" <mpeuser@web.de> wrote in message news:<bhdaks$f92$07$1@news.t-online.com>...
                          >
                          > When I had a similar problem I read 1000 characters, counted the amount of
                          > <32 and >255 characters and classified it "binary" when this quota exceeded

                          How many characters > 255 did you get? Did you mean 127? If so, what
                          about accented characters ... like umlauts?

                          On a slightly more serious note, CR, LF, HT and FF would have to be
                          considered "text" but their ordinal values are < 32.

                          What was the problem that you thought you were solving?


                          • John Machin

                            #14
                            Re: Determine file type (binary or text)

                            Trent Mick <trentm@ActiveState.com> wrote in message news:<mailman.1060797503.18604.python-list@python.org>...

                            > Generally I define a text file as "it has no null bytes". I think this
                            > is a pretty safe definition (I would be interested to hear practical
                            > experience to the contrary).

                            Data file written by C program which has an off-by-one error and is
                            including a trailing '\0' byte ...


                            • John Machin

                              #15
                              Re: Determine file type (binary or text)

                              Graham Fawcett <fawcett@teksavvy.com> wrote in message news:<mailman.1060799361.14244.python-list@python.org>...
                              >
                              > It is trivial to create a non-text file that has no NULs.
                              >
                              > f = open('no_zeroes.bin', 'rb')
                              > for x in range(1, 256):
                              >     f.write(chr(x))
                              > f.close()

                              I tried this but it didn't work. It said:

                              IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.

                              So I thought I had to be persistent but after doing it a few more times it said:

                              SerialIdiotError: What I tell you three times is true.
                              NotLispingError: You need 'wb' as in 'wascally wabbit'

                              This is very strange behaviour -- does my computer have worms?
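
For reference, a minimal Python 3 sketch of the quoted snippet with the mode corrected to 'wb', followed by a null-byte check on the result. The helper names `write_no_zero_bytes` and `has_null_byte` and the temporary directory are assumptions made for this example.

```python
import os
import tempfile


def write_no_zero_bytes(path):
    """Write each byte value from 1 to 255 once; the file contains no NUL."""
    # 'wb' (not 'rb') opens the file for writing in binary mode.
    with open(path, "wb") as f:
        f.write(bytes(range(1, 256)))


def has_null_byte(path):
    """Return True if the file contains at least one NUL byte."""
    with open(path, "rb") as f:
        return b"\0" in f.read()


path = os.path.join(tempfile.mkdtemp(), "no_zeroes.bin")
write_no_zero_bytes(path)
print(os.path.getsize(path))  # 255
print(has_null_byte(path))    # False
```

The file is clearly not text, yet a "no null bytes" test would accept it, which is the point of the quoted example.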
