Extracting text from pdf

This topic is closed.
  • JustinCase

    Extracting text from pdf

    Hi,

    I have to index the text of a pdf document.

    Does any of you know of a PHP script/extension or a binary that is able
    to extract the text ?

    The pdf extension mentioned in the php.net docs seems to indicate that
    it's for _creation_ of documents only, is that so? Same with all the
    PHP classes I have found.

    Regards,
    Johnny

    --
    Never express yourself more clearly than you are able to think.
    - Niels Bohr
  • Alvaro G Vicario

    #2
    Re: Extracting text from pdf

    *** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
    > Does any of you know of a PHP script/extension or a binary that is able
    > to extract the text ?

    There's a Unix program that might help you: ps2ascii

    --
    -- Álvaro G. Vicario - Burgos, Spain
    -- Thank you for not e-mailing me your questions
    --


    • JustinCase

      #3
      Re: Extracting text from pdf

      On 25-10-2004 Alvaro G Vicario wrote:
      >*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
      >> Does any of you know of a PHP script/extension or a binary that is
      >> able to extract the text ?
      >
      >There's a Unix program that might help you: ps2ascii

      Thanks for the pointer,
      I'll have a look

      /Johnny

      --
      He's turned his life around. He used to be depressed and miserable. Now
      he's miserable and depressed.
      - David Frost


      • JustinCase

        #4
        Re: Extracting text from pdf

        On 25-10-2004 Alvaro G Vicario wrote:
        >*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
        >> Does any of you know of a PHP script/extension or a binary that is
        >> able to extract the text ?
        >
        >There's a Unix program that might help you: ps2ascii

        Does anyone know of any other tool for PDF text extraction ?


        ps2ascii cannot seem to parse all of the pdf file. I tried the pstotext
        tool too, but with the same result.
        I figured that it has something to do with my ghostscript version being
        too old (7.05, newest is 8.14).

        Unfortunately I have no experience in installing/upgrading unix stuff
        (having spent half an evening trying in vain and confusion).



        Regards,
        Johnny

        --
        In the beginning the Universe was created. This has made a lot of
        people very angry and been widely regarded as a bad move.
        - Douglas Adams


        • Darien Kruss

          #5
          Re: Extracting text from pdf



          xpdf will do this
          http://www.foolabs.com/xpdf/

          I use it with the namazu search tool (http://www.namazu.org/) to
          provide search capabilities on websites that span web pages, office
          docs, and PDF files.
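
          A minimal sketch of how the extraction step might be scripted, assuming
          the pdftotext command-line tool that ships with xpdf is installed and on
          the PATH. The function names here are hypothetical and only for
          illustration; the example is in Python, but the same command line could
          be handed to PHP's shell_exec().

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Return the argument list for pdftotext, writing the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # ask pdftotext to keep the physical page layout
    # "-enc UTF-8" selects the output encoding; a text-file of "-" means stdout.
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def extract_text(pdf_path, layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        # Assumption: pdftotext comes from an xpdf install on this machine.
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError if pdftotext reports a failure
    )
    return result.stdout.decode("utf-8", errors="replace")
```

          Calling extract_text("manual.pdf") (a made-up file name) would return
          the document's text, which can then be passed to whatever indexer is
          in use.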


          In article <xn0doykmn5tq0t000@news.tele.dk>, JustinCase <no@spam> wrote:

          > On 25-10-2004 Alvaro G Vicario wrote:
          >
          > >*** JustinCase wrote/escribió (25 Oct 2004 16:09:36 GMT):
          > >> Does any of you know of a PHP script/extension or a binary that is
          > >> able to extract the text ?
          > >
          > >There's a Unix program that might help you: ps2ascii
          >
          > Does anyone know of any other tool for PDF text extraction ?
          >
          > ps2ascii cannot seem to parse all of the pdf file. I tried the pstotext
          > tool too, but with the same result.
          > I figured that it has something to do with my ghostscript version being
          > too old (7.05, newest is 8.14).
          >
          > Unfortunately I have no experience in installing/upgrading unix stuff
          > (having spent half an evening trying in vain and confusion).
          >
          > Regards,
          > Johnny


          • JustinCase

            #6
            Re: Extracting text from pdf

            On 26-10-2004 Darien Kruss wrote:
            >xpdf will do this
            >http://www.foolabs.com/xpdf/
            >
            >I use it with the namazu search tool (http://www.namazu.org/) to
            >provide search capabilities on websites that span web pages, office
            >docs, and PDF files.


            Hi Darien,

            Perfect.

            Funny though. I'd been to the site a few times in my search but had
            somehow concluded that xpdf was not what I wanted. Looking too hard can
            make you miss the obvious, eh !? So many hairs could still be resting
            comfortably on my head. :)

            Thanks,
            Johnny

            --
            The universe is a big place, perhaps the biggest.
            - Kilgore Trout
