Identifying File type by reading files

  • hokiegal99

    Identifying File type by reading files

    This is not really a Python-centric question, but I am using
    Python to solve this problem (as of now), so I thought it appropriate to
    pose the question here.

    I have some functions that search for files that contain certain
    strings. If a file found to contain these strings does not already
    have a filename extension (such as '.doc' or '.xls'), the function
    appends one and renames the file. So, if a file named 'report'
    was found to have the string 'Microsoft' and the string
    'Word.Document.' (notice the '.' at the end of both words) and it does
    not already have an extension, then a rename would take place that
    would name the file 'report.doc'.

    These functions work very well on most files (98% guessed correctly).
    However, I would like the functions to be more precise (100%). So,
    what should I look for in a file to determine whether or not it is a
    MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
    list of some of the strings I use to ID files, but I can't help but
    wonder that there must be a more precise way of doing this. I know of
    the Unix 'file' command. It is not very useful for me as it doesn't
    distinguish between MS Office documents... all .xls, .docs, .ppts are
    MS documents to it.

    Are there certain sets of binary data that are unique to files that
    would be a better way of identifying them? For example, on line N
    of an MS doc file, beginning at position X, a binary string that is L
    digits in length that begins with B and ends with E will *ALWAYS* be
    present... someone tell me that I'm not dreaming and that something
    like the above example exists???

    A few of my string searches today:

    doc = string.find(file(os.path.join(root,fname), 'rb').read(),
    'Word.Document.')
    xls = string.find(file(os.path.join(root,fname), 'rb').read(),
    'Excel.Sheet.')
    pdf = string.find(file(os.path.join(root,fname), 'rb').read(),
    'PDF-1.')
    jpg = string.find(file(os.path.join(root,fname), 'rb').read(), 'JFIF')

    Any suggestions or information that better describes how to positively
    ID files w/o the possibility of mistake would be very helpful to me. As
    of now, some of my files, though not many (~ 2%) will be given the
    wrong extension, but the logic of the functions is such that they
    append any extension that probably applies to the file so at that
    point it is a simple process of elimination to determine which
    extension is actually the correct one. Normally, I never have more
    than 2 unique extensions attached to the same file.

    Thank you!!!
  • Andrew Dalke

    #2
    Re: Identifying File type by reading files

    hokiegal99:
    > what should I look for in a file to determine whether or not it is a
    > MS Word file or an Excel file or a PDF file, etc., etc.? Below is a
    > list of some of the strings I use to ID files, but I can't help but
    > wonder that there must be a more precise way of doing this. I know of
    > the Unix 'file' command. It is not very useful for me as it doesn't
    > distinguish between MS Office documents... all .xls, .docs, .ppts are
    > MS documents to it.

    That likely means you have an incomplete 'magic' file. This is the
    file used by the 'file' command to figure out the file type. Take a
    look at http://www.unixhideout.com/freebsd/share/misc/magic for
    a more complete (I think) version.
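
To illustrate what a 'magic' entry boils down to, here is a minimal Python sketch (not the 'file' command's own code) that compares the first bytes of a file against a few well-known signatures. The names SIGNATURES, guess_by_signature and guess_file are made up for this example, and the table is deliberately tiny. Note that the OLE2 container signature is shared by Word, Excel and PowerPoint files, which is consistent with 'file' reporting all of them as generic MS documents.

```python
# A few leading-byte signatures ("magic numbers").
# The labels are illustrative, not an official registry.
SIGNATURES = [
    (b"%PDF-", "pdf"),                            # PDF header
    (b"\xff\xd8\xff", "jpeg"),                    # JPEG start-of-image marker
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # MS Office compound file
]


def guess_by_signature(data):
    """Return the label of the first signature that data starts with, else None."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return None


def guess_file(path, nbytes=16):
    """Read only the first nbytes of the file at path and classify them."""
    with open(path, "rb") as handle:
        return guess_by_signature(handle.read(nbytes))
```

Checking a fixed offset like this avoids false hits from a string such as 'JFIF' or 'PDF-1.' that merely appears somewhere in the body of an unrelated file, and it avoids reading the whole file into memory for every test.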

    That's dated 1995 and is close to the one on my Mac. It doesn't support
    the newer MS Word and Excel formats. I'm having trouble
    finding the most recent, definitive version. One link pointed me
    to ftp://ftp.astron.com/pub/file/ but I haven't investigated it further.

    There's also a pymagic, http://thomas.mangin.me.uk/software/python.html
    which may help for a pure Python implementation of 'file'.
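
If a pure Python route is taken, one possible way to separate Word, Excel and PowerPoint files is sketched below. It is a heuristic under stated assumptions, not a full parser: all three formats share the OLE2 container signature, and the container's directory stores stream names as UTF-16LE text, so the sketch simply searches the raw bytes for the conventional stream names ('WordDocument', 'Workbook' or 'Book', 'PowerPoint Document'). The function name guess_office_extension and the STREAM_HINTS table are invented for this example.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"

# Conventional main-stream names and the extension they suggest.
# Order matters: the first name found wins.
STREAM_HINTS = [
    ("WordDocument", ".doc"),
    ("PowerPoint Document", ".ppt"),
    ("Workbook", ".xls"),   # newer Excel binary files
    ("Book", ".xls"),       # older Excel binary files
]


def guess_office_extension(data):
    """Guess .doc/.xls/.ppt for OLE2 data, or return None if undecided."""
    if not data.startswith(OLE2_MAGIC):
        return None  # not an OLE2 compound file at all
    for name, ext in STREAM_HINTS:
        # Directory entries hold their names encoded as UTF-16LE.
        if name.encode("utf-16-le") in data:
            return ext
    return None
```

A document with an embedded object of another type (for example a spreadsheet containing a Word object) can still fool a byte search like this, so properly walking the container's directory would be more reliable than this shortcut.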

    Andrew
    dalke@dalkescientific.com

