How to store and compare fingerprint scans?

  • MMcCarthy
    Recognized Expert MVP
    • Aug 2006
    • 14387

    How to store and compare fingerprint scans?

    I have an upcoming project which requires the recording of thumbprint/fingerprint scans in a database (currently proposed as SQL Server). Now I have a few questions I need to explore:
    • How does SQL Server handle the embedding of images? Is it better to embed the scanned images in the database, or to store the path and keep the images in a folder? My concern is that the number of files could eventually reach 7 million, and the data would need to be queried.
    • What software (or otherwise) would I need to be able to "read" a thumbprint/fingerprint image and compare it to all records in the database for duplication?


    I am open to any suggestions, from storage datatype to software, to solve these problems. I should mention that I haven't finalised the hardware method of getting the scans. I'm currently exploring the capabilities of laptops with built-in thumbprint security, as to whether they can be used to read and store multiple thumbprints.

    All suggestions welcome.

    Mary
  • Den Burt

    #2
    Storing Images versus Links

    Although I haven't touched on fingerprint comparison yet, I have recently thought about it. Obviously I can't help much with that portion yet. I think you may be onto something with the laptop, since laptops have that capability and may have an API you can tie into.

    For the images, you should take a look at the following paragraph/article. Personally, for many images I generally use a pointer to the image due to size constraints.

    Although the article references VB as a language, the concept and database information pertain to SQL Server as well.


    The obvious advantage to storing images as a file pointer is that only the file path is saved. As a result, your database won't grow as dramatically as it would if you stored the image in a BLOB field. In the example described earlier, with 100 records of 50K images stored in BLOB fields, the database grew to more than 4 MB. The same database using file pointers instead was under 100K. In speed comparisons, the file pointer method is the winner, completing the test in five seconds. These advantages generally make file pointers the preferred method of saving images.
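    As a rough illustration of the two approaches, here's a quick Python sketch (my own, not from the article; the driver, table and column names are all hypothetical):

    Code:
    # Sketch of the two storage approaches; everything named is hypothetical.
    # Option A stores only a path; option B stores the image bytes themselves.
    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={SQL Server};SERVER=localhost;"
        "DATABASE=Scans;Trusted_Connection=yes;"
    )
    cur = conn.cursor()

    # Option A: file pointer -- the row stays tiny, the image lives on disk.
    cur.execute("INSERT INTO ScanPaths (ScanId, ImagePath) VALUES (?, ?)",
                1, r"\\fileserver\scans\000000001.png")

    # Option B: BLOB -- the bytes go into a varbinary(max) column.
    with open(r"C:\scans\000000001.png", "rb") as f:
        cur.execute("INSERT INTO ScanBlobs (ScanId, Image) VALUES (?, ?)",
                    1, f.read())

    conn.commit()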
    Last edited by MMcCarthy; Nov 4 '10, 03:48 AM.


    • Mariostg
      Contributor
      • Sep 2010
      • 332

      #3
      A couple of things that come to mind for image comparison would be an MD5 checksum and NumPy.
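      For instance, a quick Python sketch of the MD5 idea; note a checksum only catches byte-identical files, so it would serve as a pre-filter for exact duplicates (e.g. accidental re-imports) rather than real fingerprint matching:

      Code:
      # MD5 as a cheap exact-duplicate pre-filter. Two different scans of
      # the same finger will almost never hash the same, so this only
      # flags byte-identical files.
      import hashlib

      def file_md5(path):
          """Return the hex MD5 digest of a file, read in chunks."""
          h = hashlib.md5()
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(8192), b""):
                  h.update(chunk)
          return h.hexdigest()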


      • MMcCarthy
        Recognized Expert MVP
        • Aug 2006
        • 14387

        #4
        From further research it seems I need to look at a middleware solution. The images of the fingerprints could be stored in some folder, with only a pointer in the database. However, the comparisons depend on "minutiae" and pattern data, which would be stored in the database.

        So the comparisons are not of the actual images but rather of the patterns and minutiae points unique to each image. Extracting that data and converting it to a format suitable for storage would require some kind of middleware solution. So that's what I'm concentrating on at the moment.
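        To make that concrete, here is a rough sketch in Python (not a real matcher: it assumes the middleware has already extracted (x, y, angle) minutiae points, and it ignores the rotation, translation and distortion a real system must handle):

        Code:
        # Toy minutiae comparison: score two templates by counting points
        # in A that have a counterpart in B within distance and angle
        # tolerances.
        import math

        def match_score(template_a, template_b, dist_tol=10.0, angle_tol=0.3):
            """Fraction of minutiae in template_a with a close match in template_b."""
            matched = 0
            for (xa, ya, ta) in template_a:
                for (xb, yb, tb) in template_b:
                    if (math.hypot(xa - xb, ya - yb) <= dist_tol
                            and abs(ta - tb) <= angle_tol):
                        matched += 1
                        break
            return matched / max(len(template_a), 1)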


        • mshmyob
          Recognized Expert Contributor
          • Jan 2008
          • 903

          #5
          In relation to the storage of the image files: I am a big believer in storing files outside the database.

          If you are using a SQL Server 2008 installation, it has a new feature called FILESTREAM that is specifically optimized for working with large binary data such as images.

          If you want more info on the pros and cons of the three approaches, I can write something up for you.

          cheers.


          • NeoPa
            Recognized Expert Moderator MVP
            • Oct 2006
            • 32633

            #6
            That sounds as if it might be interesting, Mshmyob.

            Bear in mind that the number of files could possibly be so large that the file system would not be an appropriately efficient way to access them (Windows does a reasonably efficient job for reasonable numbers of files, but is not designed to handle numbers in the millions well).


            • mshmyob
              Recognized Expert Contributor
              • Jan 2008
              • 903

              #7
              True Neo, but keep in mind that a database is even less efficient at handling large image (or media) files. Also, a proper SQL Server installation will be using Windows Server and probably RAID 10. I would also probably incorporate a separate RAID 0 for striping these large numbers of files, thereby increasing read/write performance just for those files (fault tolerance would have to be planned out, of course).

              The file system should also be NTFS.

              Comparison of the three techniques:

              1. Storing the image file (BLOB) in the SQL Server database
              • SQL Server has an 8K page size (which limits the maximum size of each record).
              • Therefore SQL Server cannot store image files in a row like normal records; it is forced to break the BLOB into 8K chunks and store them in a B-tree structure with pointers.
              • Databases can become extremely large and unmanageable.
              • BLOB maximum size is 2 GB.
              • Advantage: the BLOBs are transactionally consistent with the data (this has to do with backing up, transaction logs, etc.).

              2. Storing the image file (BLOB) in the file system
              • Just add a link in the record to where the BLOB is.
              • Gives storage simplicity and good performance.
              • Disadvantage: not transactionally consistent, i.e. the BLOB is not synchronized with the data. Not good for backups.

              3. Using FILESTREAM
              Combines the benefits of the two above:
              • Stores in the file system.
              • BLOB size is limited only by the file system.
              • Full transactional consistency exists between the BLOB and the database record to which it's attached.
              • BLOBs are included in backup and restore.
              • BLOB objects are accessible via both T-SQL and NTFS streaming APIs.
              • Great streaming performance is provided for large BLOB types.
              • The Windows system cache is used for caching the BLOB data, thus freeing up the SQL Server buffer cache required for in-database BLOB storage.
              • Disadvantages: database mirroring cannot be enabled; snapshots cannot include FILESTREAM data; FILESTREAM data cannot be encrypted (TDE).


              The SQL Server Books Online (BOL) has more details. This is just a quick summary.
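              For what it's worth, a minimal sketch of option 3 from the client side (connection string, table and column names are hypothetical; the point is that a FILESTREAM column reads like any varbinary(max) column through plain T-SQL):

              Code:
              # A FILESTREAM column is queryable through ordinary T-SQL, so
              # existing data-access code keeps working. All names here are
              # hypothetical.
              import pyodbc

              conn = pyodbc.connect(
                  "DRIVER={SQL Server};SERVER=localhost;"
                  "DATABASE=Fingerprints;Trusted_Connection=yes;"
              )
              cur = conn.cursor()

              cur.execute("SELECT ScanImage FROM Scans WHERE ScanId = ?", 42)
              row = cur.fetchone()
              if row is not None:
                  image_bytes = row[0]  # the scan, as a Python bytes object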

              cheers,


              • MMcCarthy
                Recognized Expert MVP
                • Aug 2006
                • 14387

                #8
                Very interesting information, mshmyob. I had forgotten to consider how the storage would affect the backup. As you say, the database will be on Windows Server. Anyway, I will do further research on these three methods as you suggest.

                Comment

                • NeoPa
                  Recognized Expert Moderator MVP
                  • Oct 2006
                  • 32633

                  #9
                  Originally posted by Mshmyob:
                  True Neo, but keep in mind that a database is even less efficient at handling large image (or media) files. Also, a proper SQL Server installation will be using Windows Server and probably RAID 10. I would also probably incorporate a separate RAID 0 for striping these large numbers of files, thereby increasing read/write performance just for those files (fault tolerance would have to be planned out, of course).
                  First let me say thank you for the response. It's very helpful.

                  I would question some of the statements though. Not because they're not true generally, but because I expect the sheer number (rather than size) of the files would make this quite an unusual scenario.

                  In a file system, the indexing is basically linear. This works very well for smaller numbers, and with the caching involved in machines with much more RAM, even pretty darn well for quite large numbers. When humongous numbers are used, though, I would expect (theoretically, as I have never had anything quite like this to deal with) that the performance would drop off sharply. If each entry must be checked until a match is found, then this would suffer, particularly when the limits of the caching were reached.

                  A database, on the other hand, far from being less efficient at this, could index the filenames and bring to bear all the optimisations that have been developed over the years to make such a search as quick as possible. Caching would also come into play here of course, but I would expect the searching capabilities of a database system to outperform those of a file system, particularly when scaled up to the extremely large loads anticipated. Maybe my understanding of how things work is off somewhere, but I would expect to see things as I describe if my ideas are borne out.

                  For the reasons just described, I would question the performance benefit suggested in point #2.

                  All that said, your post was still very helpful and I really don't want this to sound like I'm being ungrateful. I'm actually quite interested in anything you may say to indicate where some of my basic thinking may be awry. I need to know if I'm on the wrong lines.

                  PS. I almost forgot the RAID comments quoted. Everything you say about RAID is true, but I would expect this to leave the playing field level, as the benefits would apply equally to all possible solutions. Again, let me know if I'm missing a point here. It's perfectly possible.
                  Last edited by NeoPa; Nov 6 '10, 03:24 PM. Reason: Added PS.


                  • mshmyob
                    Recognized Expert Contributor
                    • Jan 2008
                    • 903

                    #10
                    Neo, you are right in everything you say. I have never worked with this many files either, but I have worked with over 130,000 audio files, and the issues you mention are valid.

                    I have found that creating directories for related sets of files makes for a massive performance boost. So, for instance, there could be a directory for each letter of the alphabet, and any file starting with that letter goes into the appropriate directory.

                    In my case I created a directory for each musical genre, each of those directories had a directory for each artist, and then came the actual audio files.

                    This scenario gave responses as instantaneous as a regular-sized directory.

                    Without researching it, I assumed that Windows works one directory level at a time; therefore the root had about 20 directories (for genres), the next set of directories was artists (so the largest subdirectory had a couple of hundred artist directories at most), and then came the files themselves (the largest held a few hundred files).

                    Therefore, by the time it got to the largest directory in terms of files, it only needed to index a few hundred files.

                    I could be wrong in my thinking, but it worked for me (lol).

                    You could also spread the files over multiple physical disks so you don't have 7 million on one disk; Windows would then only need to look at subsets of the files.

                    About the backup: it is, to me, a very important part of a mission-critical application. Using the file system without the new FILESTREAM may generate transaction inconsistencies, which make rolling forward and rolling back useless; therefore, in my opinion, the backup process is almost useless.

                    cheers,


                    • mshmyob
                      Recognized Expert Contributor
                      • Jan 2008
                      • 903

                      #11
                      I would also like to build on the so-called disadvantages I mentioned of using FILESTREAM.

                      No database mirroring allowed: not an issue, since you are using RAID. The data is still redundant and recoverable; you just can't mirror the individual database using SQL Server itself.

                      No snapshots allowed: who cares (lol).

                      Can't use TDE: only the FILESTREAM data is left unencrypted; the other data still is.

                      cheers,


                      • NeoPa
                        Recognized Expert Moderator MVP
                        • Oct 2006
                        • 32633

                        #12
                        I very much agree with what you say, Mshmyob. I was considering the solution of using a directory structure earlier, but couldn't think of a suitable approach; the more I think about it, the less impressed I am with that earlier thinking. Even assuming a worst-case scenario, there would need to be a pretty long number associated as a PK with all this data, and a very basic string of its first digits (and, at each further level, subsequent digits) could be used for subfolder names. Certainly it is only the individual folders that need to be kept below some maximum number of files.
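                        A minimal sketch of that scheme, assuming a zero-padded numeric PK (fan-out, names and extension are illustrative):

                        Code:
                        # Derive a nested folder path from the leading digits of a
                        # zero-padded numeric primary key, so no single directory
                        # grows unboundedly.
                        import os

                        def scan_path(root, pk, width=9):
                            """e.g. scan_path("/scans", 1234567) -> /scans/001/23/4/001234567.png"""
                            key = str(pk).zfill(width)
                            # roughly 1000-way, then 100-way, then 10-way fan-out per level
                            return os.path.join(root, key[:3], key[3:5], key[5:6], key + ".png")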

                        Also, whatever RAID system is used (other than simple striping of course) will, as you quite rightly say, provide the requisite level of data redundancy for the project.

                        Thanks for all your help :-)


                        • mshmyob
                          Recognized Expert Contributor
                          • Jan 2008
                          • 903

                          #13
                          We'll let Mary decide which way to go, since she is the one being paid the BIG bucks to come up with the proper solution :-)

                          cheers,


                          • MMcCarthy
                            Recognized Expert MVP
                            • Aug 2006
                            • 14387

                            #14
                            OK Guys

                            To be a bit more precise.

                             If I want to store 1 million records with 1 thumbprint and 1 photo each, what is the best way to store those images to achieve the fastest search speed on the database?

                             In other words, putting database size to one side, would storing the images in SQL Server rather than in a file give me greater search speed?

                             I believe the biggest problem I have is that, as each record is added to SQL Server, the database would first have to be searched for existing fingerprints.

                             So all data transfer (importation) and searching would be done by SQL Server on the server. Consider the fingerprint data as a unique index.
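                             For illustration, the kind of insert-time check I mean would look something like this sketch (the storage accessor names are hypothetical, and match_score is along the lines of the sketch in post #4):

                             Code:
                             # Sketch only: search the stored templates before inserting
                             # a new one. 'records' stands in for whatever data access
                             # layer fronts SQL Server; the threshold is illustrative.
                             def add_if_new(records, next_id, new_template, threshold=0.4):
                                 """records: list of (record_id, template) pairs already stored."""
                                 for record_id, stored in records:
                                     if match_score(new_template, stored) >= threshold:  # see post #4
                                         return record_id  # duplicate: reuse the existing record
                                 records.append((next_id, new_template))
                                 return next_id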

                             I hope this makes sense.

                            Mary


                            • mshmyob
                              Recognized Expert Contributor
                              • Jan 2008
                              • 903

                              #15
                               How big do you think the image files will be?

                               New thought: is this database mostly used for queries? If so, would an OLAP database fit the bill better than a transactional database? By using Analysis Services you would probably eliminate performance problems even with the images in the database (depending on image size).

                              cheers,
                              Last edited by mshmyob; Nov 9 '10, 12:04 AM. Reason: added New thought:

