pdf2txt

**LB** · Jul 18 '05, 11:24 AM

Re: pdf2txt

> I know that a txt2pdf exists, was checking to see if the opposite would
> as well.

I'm sure that from Acrobat you can save a .pdf as .rtf (that is text...).
Then it will be easy to do anything on it.
I remember also some utilities to "pdf2txt", try a search on google.

LB

**Aurelio Martin** · Jul 18 '05, 11:24 AM

Re: pdf2txt

B P wrote:
> Is there a way via Python or even Perl to capture records from a pdf and
> output a delimited text file? My work has a situation with a trunk
> load of data forms that were scanned as pdfs.
>
> The data needs to be taken from the forms and moved into a database, so
> I figure that comma-delimited format will work fine. The amount of
> man-hours it would take to manually do this is very cost-prohibitive for
> what we have to work with.
>
> I know that a txt2pdf exists, was checking to see if the opposite would
> as well.
>
> BP

You may try XPDF

http://www.foolabs.com/xpdf/

They include source code and some utilities like pdfimages or pdftotext.
Maybe you can call these from Python, or link via a C extension.
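
A minimal sketch of the "call these from Python" route, assuming the xpdf `pdftotext` binary is installed and on the PATH. The helper names `build_pdftotext_command` and `pdf_to_text` are made up for illustration, and the file paths are placeholders. The `-layout` and `-enc` options come from the pdftotext command line; check the man page of your installed version before relying on them.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, layout=False):
    """Return the argument list for the pdftotext command-line tool."""
    # -enc UTF-8 asks for UTF-8 output so Python can decode it predictably.
    command = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout,
        # which can help when form fields sit in fixed columns.
        command.append("-layout")
    command.extend([pdf_path, txt_path])
    return command


def pdf_to_text(pdf_path, txt_path, layout=False):
    """Run pdftotext and return the extracted text (raises if the tool fails)."""
    subprocess.run(build_pdftotext_command(pdf_path, txt_path, layout), check=True)
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

Something like `pdf_to_text("form.pdf", "form.txt", layout=True)` would then give you one string per document to work on. This only helps if the PDF actually contains text; a scanned image inside a PDF will come out empty.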

Hope this helps

Aurelio

**Benjamin Niemann** · Jul 18 '05, 11:24 AM

Re: pdf2txt

B P wrote:
> Is there a way via Python or even Perl to capture records from a pdf and
> output a delimited text file? My work has a situation with a trunk
> load of data forms that were scanned as pdfs.
>
> The data needs to be taken from the forms and moved into a database, so
> I figure that comma-delimited format will work fine. The amount of
> man-hours it would take to manually do this is very cost-prohibitive for
> what we have to work with.
>
> I know that a txt2pdf exists, was checking to see if the opposite would
> as well.
>
> BP

Have a look at pdftotext, part of xpdf
(http://www.foolabs.com/xpdf/home.html). This will convert the pdf into
plaintext format. You will probably have to parse this plaintext to
convert it into something useful.
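
To give an idea of what that parsing step might look like, here is a rough sketch in Python. It assumes, purely for illustration, that each form comes out of the conversion as lines of the shape `Label: value`; the field names in `FIELDS` are hypothetical and would have to be replaced with whatever the real forms contain. Real output will likely be messier than this.

```python
import csv
import io

# Hypothetical form labels; replace with the labels on the actual forms.
FIELDS = ["Name", "Date", "Amount"]


def parse_record(text):
    """Collect 'Label: value' lines for the known labels into a dict."""
    record = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        label, separator, value = line.partition(":")
        if separator and label.strip() in FIELDS:
            record[label.strip()] = value.strip()
    return record


def records_to_csv(records):
    """Render a list of record dicts as comma-delimited text with a header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Using the `csv` module rather than joining strings by hand means values that themselves contain commas or quotes get escaped properly, which matters if the file is going to be loaded into a database.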

**Marco Aschwanden** · Jul 18 '05, 11:24 AM

Re: pdf2txt

For me 'ps2ascii' did the job...

**Steve Holden** · Jul 18 '05, 11:24 AM

Re: pdf2txt

LB wrote:
>>I know that a txt2pdf exists, was checking to see if the opposite would
>>as well.
>
>
> I'm sure that from Acrobat you can save a .pdf as .rtf (that is text...).
> Then it will be easy to do anything on it.
> I remember also some utilities to "pdf2txt", try a search on google.
>
> LB

Unfortunately the text you get from Acrobat, or most other
transformations on PDF, won't guarantee any particular order of the
elements. This will make parsing difficult, but if all your documents are
similar you may get enough consistency from a text (not, IIRC, rich text)
file exported from Acrobat.

For extra marks you can use Acrobat's automation interfaces to actually
convert the PDFs. Good luck!

regards
Steve

**Cameron Laird** · Jul 18 '05, 11:25 AM

Re: pdf2txt

In article <Z3Atc.28409$zO3.22415@newsread2.news.atl.earthlink.net>,
B P <nature_boy@mindspring.com> wrote:
>Is there a way via Python or even Perl to capture records from a pdf and
> output a delimited text file? My work has a situation with a trunk

**Tim Roberts** · Jul 18 '05, 11:26 AM

Re: pdf2txt

B P <nature_boyMYPANTS@mindspring.com> wrote:
>
>Is there a way via Python or even Perl to capture records from a pdf and
> output a delimited text file? My work has a situation with a trunk
>load of data forms that were scanned as pdfs.

SCANNED as PDFs? Do you mean these were paper forms, filled in using
printed handwriting, then scanned into a TIFF and wrapped up in a PDF?

If so, your job is next to impossible. You can extract the original
bitmapped image out of the PDF, and from that you MIGHT be able to use an
OCR program to extract the text, but unless the forms were specifically
designed for machine reading, that process tends to be error-prone. It
might be more efficient to have human beings translate them.
--
- Tim Roberts, timr@probo.com
Providenza & Boekelheide, Inc.
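
For anyone who wants to try the extract-then-OCR route anyway, a rough sketch in Python follows. It assumes the xpdf `pdfimages` tool mentioned earlier in the thread is on the PATH. The OCR step uses the `tesseract` command-line program, which is an assumption on my part: the thread does not name any OCR program, and any other one could be substituted. The function names and paths are made up for illustration, and the accuracy caveats above still apply.

```python
import glob
import os
import subprocess


def build_pdfimages_command(pdf_path, image_root):
    """pdfimages writes one file per embedded image, named after image_root."""
    return ["pdfimages", pdf_path, image_root]


def build_ocr_command(image_path):
    """tesseract (assumed OCR tool) writes its result to <output base>.txt."""
    output_base, _extension = os.path.splitext(image_path)
    return ["tesseract", image_path, output_base]


def ocr_scanned_pdf(pdf_path, image_root):
    """Extract the page bitmaps, OCR each one, and return the texts in order."""
    subprocess.run(build_pdfimages_command(pdf_path, image_root), check=True)
    texts = []
    for image_path in sorted(glob.glob(image_root + "-*")):
        # pdfimages emits PBM/PPM files by default; skip anything else,
        # such as .txt files left over from an earlier run.
        if not image_path.endswith((".pbm", ".ppm")):
            continue
        subprocess.run(build_ocr_command(image_path), check=True)
        text_path = os.path.splitext(image_path)[0] + ".txt"
        with open(text_path, encoding="utf-8", errors="replace") as handle:
            texts.append(handle.read())
    return texts
```

Expect to proofread whatever comes out, especially for handwritten entries; spot-checking a sample of forms by hand first should tell you whether the error rate is tolerable.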
