Determine file type (binary or text)

  • Graham Fawcett

    #16
    Re: Determine file type (binary or text)

    John Machin wrote:
    >Graham Fawcett <fawcett@teksavvy.com> wrote in message news:<mailman.1060799361.14244.python-list@python.org>...
    >
    >>It is trivial to create a non-text file that has no NULs.
    >>
    >>    f = open('no_zeroes.bin', 'rb')
    >>    for x in range(1, 256):
    >>        f.write(chr(x))
    >>    f.close()
    >>
    >
    >I tried this but it didn't work. It said:
    >
    >IOError: [Errno 2] No such file or directory: 'no_zeroes.bin'.
    >
    >So I thought I had to be persistent but after doing it a few more times it said:
    >
    >SerialIdiotError: What I tell you three times is true.
    >NotLispingError: You need 'wb' as in 'wascally wabbit'
    >
    >This is very strange behaviour -- does my computer have worms?
    >

    No, but my brain does. Glad you caught my typo.

    However, it looks like your computer definitely has an AttitudeError!

    -- Graham
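
For reference, a corrected version of the quoted snippet might look like the sketch below. It assumes Python 3, where a file opened in 'wb' mode takes a bytes object rather than the result of chr(). The helper name write_no_zeroes and the scratch directory are illustrative and do not come from the thread.

```python
import os
import tempfile


def write_no_zeroes(path):
    """Write the 255 byte values 1..255 (every value except NUL) to path."""
    # 'wb' creates or truncates the file for binary writing; the 'rb' in the
    # quoted snippet opens for reading and fails if the file does not exist.
    with open(path, "wb") as f:
        f.write(bytes(range(1, 256)))


if __name__ == "__main__":
    # Hypothetical location: a scratch directory rather than the current one.
    path = os.path.join(tempfile.mkdtemp(), "no_zeroes.bin")
    write_no_zeroes(path)
    with open(path, "rb") as f:
        data = f.read()
    # Report the file length and whether any NUL byte is present.
    print(len(data), b"\x00" in data)
```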




    • Peter Hansen

      #17
      Re: Determine file type (binary or text)

      John Machin wrote:
      >
      > Trent Mick <trentm@ActiveState.com> wrote in message news:<mailman.1060797503.18604.python-list@python.org>...
      >
      > > Generally I define a text file as "it has no null bytes". I think this
      > > is a pretty safe definition (I would be interested to hear practical
      > > experience to the contrary).
      >
      > Data file written by C program which has an off-by-one error and is
      > including a trailing '\0' byte ...

      To be fair, I'd call that a "binary" file in any case, or at least
      a defective text file...
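
The "no null bytes" definition discussed above can be sketched as follows. This is an illustration in Python 3, not code from the thread, and the function name has_no_nul is made up here. The two sample inputs are the counterexamples raised in this discussion: a byte string covering every value except NUL, and otherwise plain text with a stray trailing '\0'.

```python
def has_no_nul(data):
    """Return True if the bytes object contains no NUL (0x00) byte."""
    return b"\x00" not in data


# Every byte value except NUL: passes the test although it is hardly text.
all_but_nul = bytes(range(1, 256))

# Plain ASCII text followed by an accidental trailing NUL: fails the test.
text_with_trailing_nul = b"hello, world\n\x00"

print(has_no_nul(all_but_nul))
print(has_no_nul(text_with_trailing_nul))
```

Both cases show where the definition and a reader's intuition can part ways, which is the point being argued in the posts above.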


      • Brian Lenihan

        #18
        Re: Determine file type (binary or text)

        Peter Hansen <peter@engcorp.com> wrote in message news:<3F3A8275.8B6EE8C4@engcorp.com>...

        > "Contains only printable characters" is probably a more useful definition
        > of text in many cases. I can't say off the top of my head exactly when
        > either definition might be a problem.... wait, how about this one: in
        > CVS, if you don't have a file that is effectively line-oriented, human
        > readable information, you probably don't want to let it be treated as
        > "text" and stored as diffs. In that situation, "contains primarily
        > printable characters organized in lines" is probably a more thorough,
        > though less deterministic, definition.

        We check for binary files in our CVS commitprep script like this:

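        look for the -kb arg
        open the file in binary mode, read 4k from the file and...

        non_text = 0
        for a in bytearray(buff):  # iterate over the byte values as integers
            # count control bytes outside 8..13 and anything above 126
            if (a < 8) or (a > 13 and a < 32) or (a > 126):
                non_text = non_text + 1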

        If 10 percent of the characters are found to be non-text, we reject
        the file if it was not committed with the -kb flag, or print a warning
        if the file appears to be text but is being checked in as a binary.

        We don't bother checking for charsets other than ascii, because
        localized files have to be checked in as binaries or bad things
        (tm) happen.
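
A minimal sketch of the check described above, assuming Python 3. The function name, the 4096-byte sample size, and the decision to treat an empty file as text are assumptions made for illustration; the actual commit script is not shown in the thread.

```python
def looks_binary(path, sample_size=4096, threshold=0.10):
    """Guess whether a file is binary from the share of non-text bytes.

    A byte counts as non-text if it is a control character outside the
    range 8..13 (backspace through carriage return) or is above 126.
    """
    with open(path, "rb") as f:
        buff = f.read(sample_size)
    if not buff:
        # Assumption: an empty file is treated as text.
        return False
    non_text = sum(1 for a in buff if a < 8 or 13 < a < 32 or a > 126)
    return non_text / len(buff) >= threshold
```

A commit hook could then compare this result with whether -kb was given, rejecting the commit in one case and printing a warning in the other, as the post describes.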


        • Sami Viitanen

          #19
          Re: Determine file type (binary or text)

          Thanks for the answers.

          To be more specific I'm making a script that should
          identify binary files as binary and text files as text.

          The script is for automating CVS commands and
          with CVS you have to add the -kb flag to
          add (or import) binary files. (because it can't itself
          determine what type the file is). If binary file is not
          added with -kb the results are awful.

          Script example usage:
          -import.py <directory_name>

          Script makes list of all files under that directory
          and then determines each file's type. After that
          all files are added with Add command and binary
          files get that additional -kb automatically.
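
One way the script described above might be organised is sketched below, assuming Python 3. It only builds the cvs add argument lists and does not run them. The function names and the use of a NUL-byte test as the binary check are placeholders for illustration, and details such as adding directories before their files are left out.

```python
import os


def contains_nul(path, sample_size=4096):
    """Placeholder binary check: True if the first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(sample_size)


def build_add_commands(directory, is_binary=contains_nul):
    """Return one 'cvs add' argument list per file under directory."""
    commands = []
    for root, dirs, files in os.walk(directory):
        # Skip CVS's own administrative directories.
        dirs[:] = sorted(d for d in dirs if d != "CVS")
        for name in sorted(files):
            path = os.path.join(root, name)
            if is_binary(path):
                commands.append(["cvs", "add", "-kb", path])
            else:
                commands.append(["cvs", "add", path])
    return commands
```

Passing the check in as is_binary keeps the directory walk separate from whichever definition of "binary" ends up being used.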


          "Sami Viitanen" <none@none.net> wrote in message
          news:AFm_a.9725$g4.189983@news1.nokia.com...
          > Hello,
          >
          > How can I check if a file is binary or text?
          >
          > There was some easy way but I forgot it..
          >
          >
          > Thanks in adv.
          >
          >



          • Grant Edwards

            #20
            Re: Determine file type (binary or text)

            In article <gwH_a.1649$k4.34358@news2.nokia.com>, Sami Viitanen wrote:

            > To be more specific I'm making a script that should
            > identify binary files as binary and text files as text.

            That's "more specific"? ;)

            --
            Grant Edwards    grante        Yow! I hope I
                               at          bought the right
                            visi.com       relish... zzzzzzzzz...


            • Peter Hansen

              #21
              Re: Determine file type (binary or text)

              "David C. Fox" wrote:
              >
              > Sami Viitanen wrote:
              > > Thanks for the answers.
              > >
              > > To be more specific I'm making a script that should
              > > identify binary files as binary and text files as text.
              > >
              > > The script is for automating CVS commands and
              > > with CVS you have to add the -kb flag to
              > > add (or import) binary files. (because it can't itself
              > > determine what type the file is). If binary file is not
              > > added with -kb the results are awful.
              > >
              >
              > You should note that the question of when to use -kb is not simply based
              > on the contents of the file, but on whether you want CVS/RCS to try to
              > merge conflicting versions.
              >
              > For example, I recently added some files containing pickled objects
              > (used as test data sets for a regression test) to the CVS repository for
              > my project. Although the pickle files are in fact all printable text, a
              > CVS/RCS merge of two valid pickle files won't yield a valid pickle file.
              > Therefore, I used -kb to ensure that the developer would always be
              > forced to choose a version in the event of a version conflict.

              Exactly. We had the same issue with the project files for the Codewright
              text editor. They are sort of like Windows .INI files, but merging such
              files leads to complete disaster, including inability to run Codewright
              until the files are manually fixed or removed!

              -Peter
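
The point made above, that -kb is a policy choice as well as a content question, could be folded into a script with an explicit override list. The sketch below assumes Python 3. The function name and the patterns are made-up examples (the thread does not give file names or extensions), and the result of the content-based check is passed in as a parameter.

```python
from fnmatch import fnmatch

# Hypothetical patterns for files that must never be merged, even if
# their contents are printable text.
NEVER_MERGE = ["*.pickle", "*.pkl"]


def wants_kb(filename, content_is_binary, never_merge=NEVER_MERGE):
    """Return True if the file should be added with -kb.

    content_is_binary is the result of whatever content-based check is in
    use; the pattern list takes precedence over it.
    """
    if any(fnmatch(filename, pattern) for pattern in never_merge):
        return True
    return content_is_binary
```

With such a list, a printable pickle file would still be added with -kb, while ordinary text files fall through to the content check.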


              • JanC

                #22
                OT email address (was: Determine file type (binary or text))

                Graham Fawcett <fawcett@teksavvy.com> wrote:

                > P.S. Sami, it's very bad form to "make up" an e-mail address, such as
                > <none@none.net>. I'm sure the owners of the none.net domain would agree.

                Very true.

                > Can't you provide a real address?

                Some non-real addresses are allowed/harmless too:
                - everything ending with the .invalid TLD
                e.g.: none@none.invalid
                - me@privacy.net (the owner of the domain gave his permission)

                --
                JanC

                "Be strict when sending and tolerant when receiving."
                RFC 1958 - Architectural Principles of the Internet - section 3.9
