Fw: PDF library for reading PDF files

  • Peter Galfi

    Fw: PDF library for reading PDF files

    Hi!

    I am looking for a library in Python that would read PDF files and I could extract information from the PDF with it. I have searched with google, but only found libraries that can be used to write PDF files.

    Any ideas?

    Peter
  • Harald Massa

    #2
    Re: Fw: PDF library for reading PDF files

    > I am looking for a library in Python that would read PDF files and I
    > could extract information from the PDF with it. I have searched with
    > google, but only found libraries that can be used to write PDF files.

    ReportLab has a library called PageCatcher; it is fully supported with
    Python, but it is not free.

    Harald

    • David Boddie

      #3
      Re: Fw: PDF library for reading PDF files

      "Peter Galfi" <galfip@freestart.hu> wrote in message news:<mailman.464.1074430854.12720.python-list@python.org>...

      > I am looking for a library in Python that would read PDF files and I
      > could extract information from the PDF with it. I have searched with
      > google, but only found libraries that can be used to write PDF files.
      >
      > Any ideas?

      I quickly searched back through Google, but I knew exactly what I was
      looking for: ;-)



      The page referred to is here:



      The module is very much a "work in progress". You can probably get
      some text and bitmap images out of a few documents, but that's
      probably all you can expect unless you want to improve it (and
      submit patches).

      Good luck!

      David

      • Cameron Laird

        #4
        Re: Fw: PDF library for reading PDF files

        In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet@62.153.159.134>,
        Harald Massa <cpl.19.ghum@spamgourmet.com> wrote:
        >> I am looking for a library in Python that would read PDF files and I
        >> could extract information from the PDF with it. I have searched with
        >> google, but only found libraries that can be used to write PDF files.
        >
        >reportlab has a lib called pagecatcher; it is fully supported with python,
        >it is not free.
        >
        >Harald

        ReportLab's libraries are great things--but they do not "extract
        information from the PDF" in the sense I believe the original
        questioner intended. As Andreas suggested, he's probably best
        off using existing stand-alone applications as separate processes,
        controlled from Python.
        --

        Cameron Laird <claird@phaseit.net>
        Business: http://www.Phaseit.net
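The approach suggested above, running an existing stand-alone converter as a separate process controlled from Python, could be sketched roughly as follows. This is only an illustration: it assumes the `pdftotext` command-line tool (shipped with Xpdf and Poppler) is installed and on the PATH, and the function names `build_pdftotext_cmd` and `pdf_to_text` are made up for this example, not taken from any post in the thread.

```python
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext (assumed to be installed)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        cmd.append("-layout")
    # A trailing "-" tells pdftotext to write the text to standard output.
    cmd += [pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext as a child process and return the text it prints."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if the tool reports failure
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `pdf_to_text("report.pdf")` (with a hypothetical file name) would return the extracted text as one string; how well paragraphs survive depends entirely on the converter and on the document.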

        • Robert Kern

          #5
          Re: Fw: PDF library for reading PDF files

          Cameron Laird wrote:
          > In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet@62.153.159.134>,
          > Harald Massa <cpl.19.ghum@spamgourmet.com> wrote:
          >
          >>>I am looking for a library in Python that would read PDF files and I
          >>>could extract information from the PDF with it. I have searched with
          >>>google, but only found libraries that can be used to write PDF files.
          >>
          >>reportlab has a lib called pagecatcher; it is fully supported with python,
          >>it is not free.
          >>
          >>Harald
          >
          >
          > ReportLab's libraries are great things--but they do not "extract
          > information from the PDF" in the sense I believe the original
          > questioner intended.

          No, but ReportLab (the company) has a product separate from reportlab
          (the package) called PageCatcher that does exactly what the OP asked
          for. It is not open source, however, and costs a chunk of change.

          • Cameron Laird

            #6
            Re: Fw: PDF library for reading PDF files

            In article <oxEOb.96911$Vs3.36407@twister.socal.rr.com>,
            Robert Kern <rkern@ucsd.edu> wrote:
            >Cameron Laird wrote:
            >> In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet@62.153.159.134>,
            >> Harald Massa <cpl.19.ghum@spamgourmet.com> wrote:
            >>
            >>>>I am looking for a library in Python that would read PDF files and I
            >>>>could extract information from the PDF with it. I have searched with
            >>>>google, but only found libraries that can be used to write PDF files.
            >>>
            >>>reportlab has a lib called pagecatcher; it is fully supported with python,
            >>>it is not free.
            >>>
            >>>Harald
            >>
            >>
            >> ReportLab's libraries are great things--but they do not "extract
            >> information from the PDF" in the sense I believe the original
            >> questioner intended.
            >
            >No, but ReportLab (the company) has a product separate from reportlab
            >(the package) called PageCatcher that does exactly what the OP asked
            >for. It is not open source, however, and costs a chunk of change.

            Let's take this one step farther. Two posts now have
            quite clearly recommended ReportLab's PageCatcher <URL:
            http://reportlab.com/docs/pagecatcher-ds.pdf >. I
            completely understand and agree that ReportLab supports
            a mix of open-source, no-fee, and for-fee products, and
            that PageCatcher carries a significant license fee. I
            entirely agree that PageCatcher "read[s] PDF files ...
            and ... extract[s] information from the PDF with it."

            HOWEVER, I suspect that what the original questioner
            meant by his words was some sort of PDF-to-text
            "extraction" (true?) and, unless PageCatcher has changed
            a lot since I got my last copy, PDF-to-text is NOT one
            of its functions.
            --

            Cameron Laird <claird@phaseit.net>
            Business: http://www.Phaseit.net

            • Robin Becker

              #7
              Re: Fw: PDF library for reading PDF files

              In article <100nlf2b1qjdae2@corp.supernews.com>, Cameron Laird
              <claird@lairds.com> writes
              ......
              >>No, but ReportLab (the company) has a product separate from reportlab
              >>(the package) called PageCatcher that does exactly what the OP asked
              >>for. It is not open source, however, and costs a chunk of change.
              >
              >Let's take this one step farther. Two posts now have
              >quite clearly recommended ReportLab's PageCatcher <URL:
              >http://reportlab.com/docs/pagecatcher-ds.pdf >. I
              >completely understand and agree that ReportLab supports
              >a mix of open-source, no-fee, and for-fee products, and
              >that PageCatcher carries a significant license fee. I
              >entirely agree that PageCatcher "read[s] PDF files ...
              >and ... extract[s] information from the PDF with it."
              >
              >HOWEVER, I suspect that what the original questioner
              >meant by his words was some sort of PDF-to-text
              >"extraction" (true?) and, unless PageCatcher has changed
              >a lot since I got my last copy, PDF-to-text is NOT one
              >of its functions.
              I suspect Cameron is right. ReportLab does have a product called
              PageCatcher, but its main function is to grab individual pages for
              reuse. I believe it could be extended to go deeper and mess about with
              text streams, but it certainly doesn't do that now, and doing it properly
              would take some effort, as text can be complicated in PDF (or PostScript).
              --
              Robin Becker

              • Andreas Lobinger

                #8
                Re: Fw: PDF library for reading PDF files

                Aloha,

                > Peter Galfi wrote:
                > I am looking for a library in Python that would read PDF files and I
                > could extract information from the PDF with it. I have searched with
                > google, but only found libraries that can be used to write PDF files.
                > Any ideas?

                Use file, split, zlib and a broad knowledge of the PDF spec...

                Accessing certain objects in the .pdf is not that complicated if
                you try, for example, to read the /Info dictionary. Getting text
                from actual page content could be very complicated.

                Can you explain your 'information' further?

                Wishing a happy day
                LOBI

                • Robert Kern

                  #9
                  Re: Fw: PDF library for reading PDF files

                  Cameron Laird wrote:
                  > In article <oxEOb.96911$Vs3.36407@twister.socal.rr.com>,
                  > Robert Kern <rkern@ucsd.edu> wrote:
                  >
                  >>Cameron Laird wrote:
                  >>
                  >>>In article <Xns9474CBDE9B2D7cpl19ghumspamgourmet@62.153.159.134>,
                  >>>Harald Massa <cpl.19.ghum@spamgourmet.com> wrote:
                  >>>
                  >>>
                  >>>>>I am looking for a library in Python that would read PDF files and I
                  >>>>>could extract information from the PDF with it. I have searched with
                  >>>>>google, but only found libraries that can be used to write PDF files.
                  >>>>
                  >>>>reportlab has a lib called pagecatcher; it is fully supported with python,
                  >>>>it is not free.
                  >>>>
                  >>>>Harald
                  >>>
                  >>>
                  >>>ReportLab's libraries are great things--but they do not "extract
                  >>>information from the PDF" in the sense I believe the original
                  >>>questioner intended.
                  >>
                  >>No, but ReportLab (the company) has a product separate from reportlab
                  >>(the package) called PageCatcher that does exactly what the OP asked
                  >>for. It is not open source, however, and costs a chunk of change.
                  >
                  >
                  > Let's take this one step farther. Two posts now have
                  > quite clearly recommended ReportLab's PageCatcher <URL:
                  > http://reportlab.com/docs/pagecatcher-ds.pdf >. I
                  > completely understand and agree that ReportLab supports
                  > a mix of open-source, no-fee, and for-fee products, and
                  > that PageCatcher carries a significant license fee. I
                  > entirely agree that PageCatcher "read[s] PDF files ...
                  > and ... extract[s] information from the PDF with it."
                  >
                  > HOWEVER, I suspect that what the original questioner
                  > meant by his words was some sort of PDF-to-text
                  > "extraction" (true?) and, unless PageCatcher has changed
                  > a lot since I got my last copy, PDF-to-text is NOT one
                  > of its functions.

                  Rereading http://www.reportlab.com/PageCatchIntro.html , you're right.
                  My apologies. I thought you were talking about the open source reportlab
                  package and not PageCatcher specifically.

                  • Peter Galfi

                    #10
                    Re: Fw: PDF library for reading PDF files

                    Thanks. I am studying the PDF spec, but it does not seem that easy,
                    having to implement all the decompressions, etc. The "information" I am
                    trying to extract from the PDF file is the text, specifically in a way
                    that keeps the original paragraphs of the text. So far I have seen one
                    shareware standalone tool that extracts the text (and a lot of other
                    formatting garbage) into an RTF document, keeping the paragraphs as well.
                    I would need only the text.

                    Any suggestions?

                    Peter

                    ----- Original Message -----
                    From: "Andreas Lobinger" <andreas.lobinger@netsurf.de>
                    Newsgroups: comp.lang.python
                    To: <python-list@python.org>
                    Sent: Monday, January 19, 2004 5:02 PM
                    Subject: Re: Fw: PDF library for reading PDF files


                    Aloha,

                    > Peter Galfi wrote:
                    > I am looking for a library in Python that would read PDF files and I
                    > could extract information from the PDF with it. I have searched with
                    > google, but only found libraries that can be used to write PDF files.
                    > Any ideas?

                    Use file, split, zlib and a broad knowledge of the PDF spec...

                    Accessing certain objects in the .pdf is not that complicated if
                    you try, for example, to read the /Info dictionary. Getting text
                    from actual page content could be very complicated.

                    Can you explain your 'information' further?

                    Wishing a happy day
                    LOBI
                    --



                    • Josiah Carlson

                      #11
                      Re: PDF library for reading PDF files

                      > Thanks. I am studying the PDF spec, it just does not seem to be that easy
                      > having to implement all the decompressions, etc. The "information" I am
                      > trying to extract from the PDF file is the text, specifically in a way to
                      > keep the original paragraphs of the text. I have seen so far one shareware
                      > standalone tool that extracts the text (and a lot of other formatting
                      > garbage) into an RTF document keeping the paragraphs as well. I would need
                      > only the text.
                      >
                      > Any suggestions?

                      Peter,

                      Suggestion: extract the document to RTF using that other tool, then use
                      any one of the few dozen RTF parsers to convert them into plaintext.

                      - Josiah

                      • Andreas Lobinger

                        #12
                        Re: Fw: PDF library for reading PDF files

                        Aloha,

                        Peter Galfi wrote:
                        > Thanks. I am studying the PDF spec, it just does not seem to be that easy
                        > having to implement all the decompressions, etc. The "information" I am
                        > trying to extract from the PDF file is the text, specifically in a way to
                        > keep the original paragraphs of the text. I have seen so far one shareware
                        > standalone tool that extracts the text (and a lot of other formatting
                        > garbage) into an RTF document keeping the paragraphs as well. I would need
                        > only the text.

                        As others wrote here, the simplest solution is to use an external
                        PDF-to-text program and postprocess the data. Read comp.text.pdf.

                        There is no simple and consistent way to extract text from a .pdf
                        because there are many ways to set text. The optical impression
                        of a paragraph may not be represented by a similar command structure
                        in the .pdf.

                        Adobe recognized the difficulties for document reuse and introduced
                        tagged .pdf in 1.4. With tagged-pdf it is possible to insert
                        structural information in the .pdf. If you are interested in
                        using this, contact me.

                        Wishing a happy day
                        LOBI
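The "postprocess the data" step mentioned above might, in the simplest case, look like the sketch below. It assumes the external converter separates paragraphs with blank lines and breaks long lines inside a paragraph, which, as the post points out, is not guaranteed for every document. The function name `reflow_paragraphs` is hypothetical.

```python
import re


def reflow_paragraphs(raw_text):
    """Split converter output into paragraphs, one string per paragraph."""
    paragraphs = []
    # Treat one or more blank lines as a paragraph boundary (an assumption
    # about the converter's output, not a property of PDF itself).
    for block in re.split(r"\n\s*\n", raw_text.strip()):
        # Rejoin words that were hyphenated across a line break.
        block = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", block)
        # Collapse the remaining line breaks and runs of whitespace.
        paragraph = " ".join(block.split())
        if paragraph:
            paragraphs.append(paragraph)
    return paragraphs
```

One known weakness of this heuristic: a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen.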

                        • Cameron Laird

                          #13
                          Re: Fw: PDF library for reading PDF files

                          In article <400CF2E3.29506EAE@netsurf.de>,
                          Andreas Lobinger <andreas.lobinger@netsurf.de> wrote:
                          >Aloha,
                          >
                          >Peter Galfi wrote:

                          • Dennis Lee Bieber

                            #14
                            Re: Fw: PDF library for reading PDF files

                            On Tue, 20 Jan 2004 08:59:03 +0100, "Peter Galfi" <galfip@freestart.hu>
                            declaimed the following in comp.lang.python:

                            >
                            > Any suggestions?
                            >
                            Configure a text-only printer, with "print to file" capability,
                            and "print" the PDF file to it... Then read the print-out...


                            --
                            > ============================================================== <
                            > wlfraed@ix.netcom.com | Wulfraed Dennis Lee Bieber KD6MOG <
                            > wulfraed@dm.net | Bestiaria Support Staff <
                            > ============================================================== <
                            > Home Page: <http://www.dm.net/~wulfraed/> <
                            > Overflow Page: <http://wlfraed.home.netcom.com/> <

                            • Jeff Sandys

                              #15
                              Re: Fw: PDF library for reading PDF files

                              Peter Galfi wrote:
                              >
                              ....
                              > The "information" I am trying to extract from the PDF file is the text,
                              > specifically in a way to keep the original paragraphs of the text.
                              ....
                              >
                              > Any suggestions?

                              Ghostscript has an Extract Text capability that I have used
                              successfully on some pdf files (but not on some others):


                              Thanks,
                              Jeff Sandys
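For readers who want to try Ghostscript from Python, one possible sketch follows. It is not necessarily the capability referred to above: it assumes a reasonably recent Ghostscript whose `txtwrite` output device is available and whose executable is called `gs` (on Windows it is usually `gswin64c`). The function names are hypothetical.

```python
import subprocess


def build_gs_text_cmd(pdf_path, out_path, executable="gs"):
    """Build a Ghostscript command line that writes page text to out_path."""
    return [
        executable,
        "-q",                    # suppress the startup banner
        "-dSAFER",               # restrict file access from the document
        "-dNOPAUSE",             # do not wait for a keypress between pages
        "-dBATCH",               # exit after the last input file
        "-sDEVICE=txtwrite",     # text-extraction output device
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def gs_extract_text(pdf_path, out_path, executable="gs"):
    """Run Ghostscript and return the extracted text as a string."""
    subprocess.run(build_gs_text_cmd(pdf_path, out_path, executable), check=True)
    with open(out_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

As noted above, results vary from file to file, so the output is worth checking by eye before relying on it.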
