Analyse of PDF (or EPS?)

  • Johan Holst Nielsen

    Analyse of PDF (or EPS?)

    Hi,

    Are there any Python packages to analyse or extract information from
    a PDF document?

    For example: where the text is placed, what the text says, which fonts are
    used, and any embedded PDFs, fonts, images, etc.

    Please let me know :)

    Regards,
    Johan

  • Peter Hansen

    #2
    Re: Analyse of PDF (or EPS?)

    Johan Holst Nielsen wrote:[color=blue]
    >
    > Is there any Python packages to analyse or get some information out of
    > an PDF document...
    >
    > Like where the text are placed - what text are placed - fonts, embedded
    > PDFs/fonts/images etc.
    >
    > Please let me know :)[/color]

    I believe the not-for-free version of ReportLab has this sort of capability,
    at least in some sense.

    -Peter


    • Johan Holst Nielsen

      #3
      Re: Analyse of PDF (or EPS?)

      Peter Hansen wrote:
      [color=blue]
      > Johan Holst Nielsen wrote:
      >[color=green]
      >>Is there any Python packages to analyse or get some information out of
      >>an PDF document...
      >>
      >>Like where the text are placed - what text are placed - fonts, embedded
      >>PDFs/fonts/images etc.
      >>[/color]
      >
      > I believe the not-for-free version of ReportLab has this sort of capability,
      > at least in some sense.[/color]

      Aah, you think about the product "PageCatcher", right? :)

      I haven't seen it yet :) I will contact ReportLab for further details,
      thanks :)

      Please let me know if anyone else knows of any alternatives ;) (in case I
      cannot use ReportLab's version)

      Regards,
      Johan


      • Johan Holst Nielsen

        #4
        Re: Analyse of PDF (or EPS?)

        Johan Holst Nielsen wrote:
        [color=blue]
        > Peter Hansen wrote:
        >[color=green]
        >> Johan Holst Nielsen wrote:
        >>[color=darkred]
        >>> Is there any Python packages to analyse or get some information out of
        >>> an PDF document...
        >>>
        >>> Like where the text are placed - what text are placed - fonts, embedded
        >>> PDFs/fonts/images etc.
        >>>[/color]
        >>
        >> I believe the not-for-free version of ReportLab has this sort of
        >> capability,
        >> at least in some sense.[/color]
        >
        >
        > Aah, you think about the product "PageCatcher", right? :)[/color]

        Just found the pricing :( I think USD 25,000 is way out of my budget :(
        I hope someone has some alternatives :)

        Regards,
        Johan


        • Johan Holst Nielsen

          #5
          Re: Analyse of PDF (or EPS?)

          Grzegorz Makarewicz wrote:
          [color=blue]
          > Johan Holst Nielsen wrote:
          >[color=green]
          >> Hi,
          >>
          >> Is there any Python packages to analyse or get some information out of
          >> an PDF document...
          >>
          >> Like where the text are placed - what text are placed - fonts,
          >> embedded PDFs/fonts/images etc.
          >>
          >> Please let me know :)
          >>
          >> Regards,
          >> Johan
          >>[/color]
          >
          > http://www.trisoft.com.pl/~mak/wxpdf.zip
          >
          > My first attempt to decode PDF-s with SWIG-ged xpdf, requires sources of
          > python and wxPython - binaries for python22 (windows) are included.[/color]

          Hmmm

          Not Found
          The requested URL /~mak/wxpdf.zip was not found on this server.
          :( Did I get the wrong URL :(

          Regards,
          Johan


          • Johan Holst Nielsen

            #6
            Re: Analyse of PDF (or EPS?)

            David Boddie wrote:[color=blue][color=green][color=darkred]
            >>>Is there any Python packages to analyse or get some information out of
            >>>an PDF document...
            >>>
            >>>Like where the text are placed - what text are placed - fonts, embedded
            >>>PDFs/fonts/images etc.[/color][/color]
            >
            > It depends on the type of images (bitmap vs. vector).[/color]

            Yes, I know, but vector-based images should be extracted just as they
            are, and bitmaps as self-contained files :=)
            [color=blue]
            >[color=green]
            >>IIRC you can get the full specs of pdf and eps at the adobe site.[/color]
            >
            > The full PDF specification is not exactly short, but it's fairly readable.[/color]

            Yep... I tried it... but there is no reason to redo the same work if
            other people have already done it. And time is an issue too ;)
            [color=blue]
            >[color=green]
            >>Some stuff is easy to get at, some may be compressed and/or encrypted,
            >>and not so easy.[/color]
            >
            > Although the FlateDecode compression format is straightforward with existing
            > libraries, some of the other compression techniques may be less accessible.[/color]

            Well, compression and encryption are no problem. It is for an internal
            application, so people will simply HAVE to leave the documents
            unencrypted and unsecured.
            [color=blue][color=green]
            >>Conforming docs are supposed to be structured so that it is relatively easy
            >>to grab chunks of document and do the kinds of things printing business s/w does,
            >>like rotating and scaling and reordering pages, etc.[/color]
            >
            > I have a Python library which is able to identify a lot of the structure in simple
            > documents, including basic text extraction, but I've become pretty disillusioned
            > with it because so much work is required to extract more complex information.
            >
            > Maybe it's time to stick a license on it and upload it somewhere.[/color]

            Well, let me know ;) Maybe I could get a demo or something? That would
            be nice :)

            Regards,
            Johan
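
A note on the FlateDecode point quoted in the post above: FlateDecode streams use the zlib/deflate format, so Python's standard zlib module can decode them. The following is only a minimal sketch and not part of any library mentioned in this thread. The name inflate_object is hypothetical, and the code assumes a single object whose /Length is a direct integer, with no /DecodeParms predictor and no encryption.

```python
import re
import zlib

# Matches a direct integer /Length entry such as "/Length 42".
# Assumption: /Length is not an indirect reference like "/Length 12 0 R".
LENGTH_RE = re.compile(rb"/Length\s+(\d+)")


def inflate_object(obj_bytes):
    """Return the decompressed stream data of one FlateDecode object."""
    match = LENGTH_RE.search(obj_bytes)
    if match is None:
        raise ValueError("no direct /Length entry found")
    length = int(match.group(1))

    # The stream data begins after the "stream" keyword and its end-of-line.
    start = obj_bytes.index(b"stream") + len(b"stream")
    if obj_bytes[start:start + 2] == b"\r\n":
        start += 2
    elif obj_bytes[start:start + 1] == b"\n":
        start += 1

    return zlib.decompress(obj_bytes[start:start + length])


# Build a tiny stand-in object and round-trip it.
payload = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
packed = zlib.compress(payload)
header = b"<< /Length " + str(len(packed)).encode("ascii") + b" /Filter /FlateDecode >>"
sample_object = header + b"\nstream\n" + packed + b"\nendstream"
restored = inflate_object(sample_object)
```

Reading the byte count from /Length avoids searching for "endstream", which could be fooled by compressed bytes that happen to look like line endings.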


            • Johan Holst Nielsen

              #7
              Re: Analyse of PDF (or EPS?)

              Grzegorz Makarewicz wrote:[color=blue]
              > Johan Holst Nielsen wrote:[color=green]
              >> Is there any Python packages to analyse or get some information out of
              >> an PDF document...
              >>
              >> Like where the text are placed - what text are placed - fonts,
              >> embedded PDFs/fonts/images etc.
              >>
              >> Please let me know :)[/color]
              >
              > http://www.trisoft.com.pl/~mak/wxpdf.zip
              >
              > My first attempt to decode PDF-s with SWIG-ged xpdf, requires sources of
              > python and wxPython - binaries for python22 (windows) are included.[/color]

              Not Found
              The requested URL /~mak/wxpdf.zip was not found on this server.

              :( Can you please try to upload it again?

              Regards,
              Johan


              • Grzegorz Makarewicz

                #8
                Re: Analyse of PDF (or EPS?)

                Johan Holst Nielsen wrote:
                [...]
                [color=blue]
                > Not Found
                > The requested URL /~mak/wxpdf.zip was not found on this server.
                >
                > Can you please try to upload it again?
                >
                > Johan
                >[/color]

                Sorry for the missing link, this one works:

                http://www.trisoft.com.pl/mak/wxpdf.zip

                Regards,
                Grzegorz Makarewicz



                • Johan Holst Nielsen

                  #9
                  Re: Analyse of PDF (or EPS?)

                  Grzegorz Makarewicz wrote:[color=blue]
                  > Johan Holst Nielsen wrote:
                  > [...]
                  >[color=green]
                  > > Not Found
                  > > The requested URL /~mak/wxpdf.zip was not found on this server.
                  > >
                  > > Can you please try to upload it again?
                  > >
                  > > Johan
                  > >[/color]
                  >
                  > Sorry for the missing link, this one works:
                  >
                  > http://www.trisoft.com.pl/mak/wxpdf.zip[/color]

                  Thanks Grzegorz, I will look at it next week. If you want a reply
                  about whether I can use it, please send me a message at
                  tcr480 ( a t ) yahoo.dk


                  Regards,
                  Johan


                  • David Boddie

                    #10
                    Re: Analyse of PDF (or EPS?)

                    Johan Holst Nielsen <johan@weknowthewayout.com> wrote in message news:<3fbe00e8$0$95070$edfadb0f@dread11.news.tele.dk>...[color=blue]
                    > David Boddie wrote:[/color]
                    [color=blue][color=green]
                    > > The full PDF specification is not exactly short, but it's fairly readable.[/color]
                    >
                    > Yep... I tried it... but there is no reason to redo the same work if
                    > other people have already done it. And time is an issue too ;)[/color]

                    Time is always an issue. How much of it do you have? ;-)
                    [color=blue][color=green]
                    > > I have a Python library which is able to identify a lot of the structure in simple
                    > > documents, including basic text extraction, but I've become pretty disillusioned
                    > > with it because so much work is required to extract more complex information.
                    > >
                    > > Maybe it's time to stick a license on it and upload it somewhere.[/color]
                    >
                    > Well, let me know ;) Maybe I could get a demo or something? That would
                    > be nice :)[/color]

                    You may be disappointed, but here it is:



                    The core of the library was written in a hurry over two years ago; later refinements
                    make it only slightly more robust. It was never really intended for anything other
                    than exploring the structure of PDF files.

                    Basic use:

                    import pdftools

                    # Path of the PDF file to inspect
                    path = "MyFile.pdf"
                    doc = pdftools.PDFdocument(path)

                    print "Document uses PDF format version", doc.document_version()

                    pages = doc.count_pages()
                    print "Document contains %i pages." % pages

                    # Only read page 123 if the document is long enough
                    if pages > 123:

                        page123 = doc.read_page(123)
                        contents123 = page123.read_contents()

                        print "The objects found in this page:"
                        print
                        print contents123.contents

                    I've not really dealt with the coordinate system very well. Ideally, it would be
                    trivial to extract all the device-independent positioning information but,
                    whenever I start to look at this, I get distracted. :-)

                    Have fun, and don't expect too much,

                    David
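
On the coordinate question raised in the post above: inside a BT ... ET text object, the Tm operator sets the text matrix and Td moves the start of the line by (tx, ty), so the start position of each string shown with Tj can be tracked with a little state. The sketch below is not pdftools code. The name text_positions and its tokenizer are hypothetical, and it assumes an already-decoded content stream that uses only simple literal strings. It ignores cm, TD, T*, TJ arrays and glyph advances, so two Tj operators on the same line report the same position.

```python
import re

# A deliberately small tokenizer: literal strings, names, numbers, operators.
TOKEN_RE = re.compile(
    r"\((?:\\.|[^\\()])*\)"   # (literal string) without nested parentheses
    r"|/[^\s/()<>\[\]]+"      # /Name
    r"|[-+]?\d*\.?\d+"        # number
    r"|[A-Za-z*]+"            # operator keyword
)

IDENTITY = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]


def text_positions(content):
    """Return (x, y, text) for each Tj, in the current user space."""
    found = []
    operands = []
    line_matrix = list(IDENTITY)  # stored as [a, b, c, d, e, f]
    for tok in TOKEN_RE.findall(content):
        if tok[0] == "(":
            operands.append(tok[1:-1])
        elif tok[0] == "/":
            operands.append(tok)
        elif tok[0] in "+-.0123456789":
            operands.append(float(tok))
        else:
            # Anything else is treated as an operator that consumes operands.
            if tok == "BT":
                line_matrix = list(IDENTITY)
            elif tok == "Tm" and len(operands) >= 6:
                line_matrix = operands[-6:]
            elif tok == "Td" and len(operands) >= 2:
                tx, ty = operands[-2:]
                a, b, c, d, e, f = line_matrix
                # Pre-multiply the line matrix by a translation of (tx, ty).
                line_matrix = [a, b, c, d,
                               tx * a + ty * c + e,
                               tx * b + ty * d + f]
            elif tok == "Tj" and operands:
                found.append((line_matrix[4], line_matrix[5], operands[-1]))
            operands = []
    return found


sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj 0 -14 Td (World) Tj ET"
positions = text_positions(sample)
```

For real documents the current transformation matrix set by cm would also have to be applied to these coordinates before they mean anything on the page.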
