pull data from a pdf file to store in sql

  • kthequeen
    New Member
    • Dec 2007
    • 8

    pull data from a pdf file to store in sql

    Is it possible to pull data (text contents and file attributes, like the filename) from a PDF file and store it in SQL?
    ...using C#

    I have a web app with 100+ PDF files that I need keyword search capability for. It would produce results with link(s) to the corresponding PDF file. Not sure if this is possible.

    thanks!
  • Shashi Sadasivan
    Recognized Expert Top Contributor
    • Aug 2007
    • 1435

    #2
    It will be possible.
    Though, do you also want to read the text contained inside the PDF file?

    You would have to create a separate program (it could be a console or Windows application, or ASP.NET if you want) that uses the DirectoryInfo class to fetch all the files contained in the directory as FileInfo objects.
    Once you have all the files you want, you can insert them into your database table.
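
A minimal sketch of the directory scan described above, assuming the PDFs sit under a single root folder. The names `PdfFileRecord` and `PdfScanner` are hypothetical, and the database insert itself is left out: each record only carries the attributes a table row would need.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical record: the file attributes one table row would hold.
public sealed class PdfFileRecord
{
    public string FileName { get; set; }
    public string FullPath { get; set; }
    public long SizeBytes { get; set; }
    public DateTime LastWriteUtc { get; set; }
}

public static class PdfScanner
{
    // Returns one record per .pdf file under rootPath, subfolders included.
    public static List<PdfFileRecord> Scan(string rootPath)
    {
        var records = new List<PdfFileRecord>();
        var root = new DirectoryInfo(rootPath);

        foreach (FileInfo file in root.GetFiles("*", SearchOption.AllDirectories))
        {
            // Filter on the extension explicitly so the match is
            // case-insensitive and does not pick up look-alike extensions.
            if (!string.Equals(file.Extension, ".pdf", StringComparison.OrdinalIgnoreCase))
            {
                continue;
            }

            records.Add(new PdfFileRecord
            {
                FileName = file.Name,
                FullPath = file.FullName,
                SizeBytes = file.Length,
                LastWriteUtc = file.LastWriteTimeUtc
            });
        }

        return records;
    }
}

public static class PdfScannerDemo
{
    public static void Main(string[] args)
    {
        // Scan the folder given on the command line, or the current folder.
        string rootPath = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();

        foreach (PdfFileRecord record in PdfScanner.Scan(rootPath))
        {
            Console.WriteLine("{0} ({1} bytes)", record.FullPath, record.SizeBytes);
        }
    }
}
```

Each record could then be written to the table with a parameterized INSERT; storing the full path alongside the filename makes it easy to build the result links later.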

    • kthequeen
      New Member
      • Dec 2007
      • 8

      #3
      Originally posted by Shashi Sadasivan
      It will be possible.
      Though, do you also want to read the text contained inside the PDF file?
      Yes, I would like to read the actual text inside the PDF. I found some info on how to convert it to a text file. Perhaps I can do that and import the result into SQL. I would just need a filename column that corresponds to the text exported from the PDF. What do you think?
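
A rough sketch of the keyword lookup this layout would support, assuming one row per PDF with a filename and its exported text. The names `PdfTextRow` and `KeywordSearch` are hypothetical, and the in-memory list only stands in for the table; against a real database the same lookup would be a query on the text column.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row shape: the filename column plus the exported text.
public sealed class PdfTextRow
{
    public string FileName { get; set; }
    public string Text { get; set; }
}

public static class KeywordSearch
{
    // Returns the filenames of all rows whose text contains the keyword,
    // ignoring case. An empty keyword matches nothing.
    public static List<string> FindFiles(IEnumerable<PdfTextRow> rows, string keyword)
    {
        var matches = new List<string>();
        if (string.IsNullOrEmpty(keyword))
        {
            return matches;
        }

        foreach (PdfTextRow row in rows)
        {
            if (row.Text != null &&
                row.Text.IndexOf(keyword, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                matches.Add(row.FileName);
            }
        }

        return matches;
    }
}

public static class KeywordSearchDemo
{
    public static void Main()
    {
        // Made-up sample rows, for illustration only.
        var rows = new List<PdfTextRow>
        {
            new PdfTextRow { FileName = "manual.pdf", Text = "Installation and setup guide" },
            new PdfTextRow { FileName = "invoice.pdf", Text = "Total amount due" }
        };

        foreach (string fileName in KeywordSearch.FindFiles(rows, "setup"))
        {
            Console.WriteLine(fileName);
        }
    }
}
```

The returned filenames are what the web app would turn into links to the matching PDF files.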

      • Shashi Sadasivan
        Recognized Expert Top Contributor
        • Aug 2007
        • 1435

        #4
        Hi,
        Since you have a lot of PDF files, and there would be a significant amount of text in them, I think that storing all the text in the database and searching within it will take a lot of time.

        Have you looked at any of the desktop search APIs?

        Google provides one, but I haven't looked into it and am not sure how you would integrate it. It would be an easier way out, though (you would have to keep all the PDF files within the same folder, or at least under the same root path).

        • kthequeen
          New Member
          • Dec 2007
          • 8

          #5
          Hi Shashi, thank you for the replies. I'll take a look at those APIs.
          Merry Christmas!

          • diegomaradona21
            New Member
            • Dec 2007
            • 4

            #6
            Originally posted by kthequeen
            Is it possible to pull data (text contents and file attributes, like the filename) from a PDF file and store it in SQL?
            ...using C#

            I have a web app with 100+ PDF files that I need keyword search capability for. It would produce results with link(s) to the corresponding PDF file. Not sure if this is possible.

            thanks!
            You can easily get text from PDF files using the PDFBox library.
            Use Google to find out how to use it in .NET 2.0, because natively it's a Java library.
            You will also need IKVM.GNU

            try this
            how to use pdfbox with c#

            • kthequeen
              New Member
              • Dec 2007
              • 8

              #7
              Diego, thank you very much! Very helpful.
