Read a PDF and print content in console

  • freddieMaize
    New Member
    • Aug 2008
    • 85

    Hi All,

    I'm wondering whether a PDF can be read and its content written to a txt file. For now I'm just printing it with System.out. Below is my attempt:

    Code:
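    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class PdfDump {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream docContents = new ByteArrayOutputStream();
            // try-with-resources closes the stream even if a read fails
            try (FileInputStream fis = new FileInputStream("c:\\zoutput.pdf")) {
                byte[] buffer = new byte[16384];
                int bytesRead;
                // read() returns -1 at end of file
                while ((bytesRead = fis.read(buffer)) > -1) {
                    docContents.write(buffer, 0, bytesRead);
                }
            }
            // Prints the raw file bytes decoded as UTF-8, not the extracted text
            System.out.println(docContents.toString("UTF-8"));
        }
    }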
    Sorry if the question is silly.

    Freddie
  • Oralloy
    Recognized Expert Contributor
    • Jun 2010
    • 988

    #2
    Freddie,

    PDF files are a mixture of text and binary data, depending on whether there's any compression.

    The basic (think "hello world") PDF file is a plain text file - you can find a copy of the file specification document at adobe.com.
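
    To illustrate the plain-text framing, here is a minimal sketch that checks whether some bytes begin with the `%PDF-` marker that normally opens a PDF file. The class and method names (`PdfHeaderCheck`, `looksLikePdf`) are made up for this example, and it assumes the marker sits at the very first byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfHeaderCheck {
    // A PDF normally opens with "%PDF-" followed by a version number such as 1.4.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the data begins with the "%PDF-" marker (hypothetical helper). */
    public static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < MAGIC.length) {
            return false;
        }
        // Compare only the first MAGIC.length bytes against the marker.
        return Arrays.equals(Arrays.copyOf(data, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(sample));                                        // true
        System.out.println(looksLikePdf("Hello".getBytes(StandardCharsets.US_ASCII)));   // false
    }
}
```

    A check like this only says the file claims to be a PDF; it says nothing about whether the page text inside is stored in readable form.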

    What are you trying to accomplish, may I ask?

    • freddieMaize
      New Member
      • Aug 2008
      • 85

      #3
      Sure, you can ask.

      The actual purpose is this: I'm trying to index documents into a search engine, for which I need to read the contents of a PDF (and also other formats like docx, doc, ppt, pptx, and the list goes on). All the content that is read should be put into a String, which is then pushed to the search engine. Currently we are using Apache Tika for this, but I was just wondering whether a simple ByteArrayOutputStream could solve the issue.

      Thanks for responding..

      Freddie

      • Oralloy
        Recognized Expert Contributor
        • Jun 2010
        • 988

        #4
        Well, if the search engine is the one doing the parsing, then you should be fine with a ByteArrayOutputStream.

        I'm not sure why you're converting to UTF-8 in your output, though. Be forewarned that PDF documents may contain binary data, so decoding them as UTF-8 will damage their contents. The binary data is usually images and sounds, which might not be critical to you; however, the document specification also allows the streams that hold the text objects to be compressed, so the text itself may not be readable in the raw bytes.
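
        To show why the UTF-8 step is lossy, here is a minimal sketch that decodes bytes as UTF-8 and encodes them back again. The names (`Utf8RoundTrip`, `roundTrip`, `survives`) and the sample bytes are made up for illustration; `0x78 0x9C` is a common opening for zlib-compressed data, used here only as an example of non-text bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    /** Decodes the bytes as UTF-8, then encodes the resulting String back to bytes. */
    public static byte[] roundTrip(byte[] data) {
        // Malformed sequences are replaced with U+FFFD during decoding.
        String decoded = new String(data, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true if the bytes come back unchanged after the round trip. */
    public static boolean survives(byte[] data) {
        return Arrays.equals(data, roundTrip(data));
    }

    public static void main(String[] args) {
        // Plain ASCII, like an uncompressed text-showing operator in a page.
        byte[] text = "BT (Hello) Tj ET".getBytes(StandardCharsets.US_ASCII);
        // Arbitrary binary bytes that are not valid UTF-8.
        byte[] binary = {(byte) 0x78, (byte) 0x9C, (byte) 0xFF};

        System.out.println(survives(text));    // true
        System.out.println(survives(binary));  // false
    }
}
```

        If the bytes are only being handed on to something else that parses the PDF, keeping them as a `byte[]` (for example via `ByteArrayOutputStream.toByteArray()`) avoids this damage.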

        Good Luck!

        • freddieMaize
          New Member
          • Aug 2008
          • 85

          #5
          Thanks for responding !!

          Well, I need to parse it myself, since that way we can customize the search engine better. It will get complicated if I try to explain what EXACTLY I'm trying to do, and it isn't really necessary here.

          As for UTF-8, there was no specific reason. It was one of the trials that ended in an error :)

          Anyway, I'm trying my best and will be sure to post back if I find a solution. Thanks Oralloy and all.

          Freddie

          • Oralloy
            Recognized Expert Contributor
            • Jun 2010
            • 988

            #6
            Freddie,

            Try to avoid parsing PDF files, if you can. They are not difficult to manipulate; however, they are complex and have more than a few gotchas. Go buy a good tool to do it for you. The money you spend will be money well spent.

            Anyway, good luck with your quest.
