pdf2txt

  • B P

    pdf2txt

    Is there a way via Python or even Perl to capture records from a pdf and
    output a delimited text file? My work has a situation with a trunk
    load of data forms that were scanned as pdfs.

    The data needs to be taken from the forms and moved into a database, so
    I figure that comma-delimited format will work fine. The amount of
    man-hours it would take to manually do this is very cost-prohibitive for
    what we have to work with.

    I know that a txt2pdf exists, was checking to see if the opposite would
    as well.

    BP
  • LB

    #2
    Re: pdf2txt

    > I know that a txt2pdf exists, was checking to see if the opposite would
    > as well.

    I'm fairly sure that Acrobat can save a .pdf as .rtf (which is text-based).
    After that it should be easy to process however you like.
    I also remember some "pdf2txt" utilities; try a search on Google.

    LB



    • Aurelio Martin

      #3
      Re: pdf2txt


      B P wrote:
      > Is there a way via Python or even Perl to capture records from a pdf and
      > output a delimited text file? My work has a situation with a trunk
      > load of data forms that were scanned as pdfs.
      >
      > The data needs to be taken from the forms and moved into a database, so
      > I figure that comma-delimited format will work fine. The amount of
      > man-hours it would take to manually do this is very cost-prohibitive for
      > what we have to work with.
      >
      > I know that a txt2pdf exists, was checking to see if the opposite would
      > as well.
      >
      > BP

      You may try XPDF



      They include source code and some utilities like pdfimages or pdftotext.
      Maybe you can call these from Python, or link via a C extension.
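Calling pdftotext from Python can be as simple as running it as a subprocess. Here is a minimal sketch, assuming the xpdf `pdftotext` binary is installed and on the PATH. The helper names `build_pdftotext_command` and `pdf_to_text` are made up for illustration; `-layout` asks pdftotext to try to keep the physical page layout, `-enc UTF-8` selects the output encoding, and a text-file argument of `-` sends the result to stdout.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for xpdf's pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if layout:
        # Try to preserve the physical layout of the page in the text output.
        cmd.append("-layout")
    # Ask for UTF-8 output; "-" as the text file means "write to stdout".
    cmd += ["-enc", "UTF-8", pdf_path, "-"]
    return cmd


def pdf_to_text(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Something like `text = pdf_to_text("form.pdf")` would then hand you the text for further parsing. Note that this only helps if the PDF contains real text; a scanned image wrapped in a PDF will come back empty or nearly so.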

      Hope this helps

      Aurelio


      • Benjamin Niemann

        #4
        Re: pdf2txt

        B P wrote:
        > Is there a way via Python or even Perl to capture records from a pdf and
        > output a delimited text file? My work has a situation with a trunk
        > load of data forms that were scanned as pdfs.
        >
        > The data needs to be taken from the forms and moved into a database, so
        > I figure that comma-delimited format will work fine. The amount of
        > man-hours it would take to manually do this is very cost-prohibitive for
        > what we have to work with.
        >
        > I know that a txt2pdf exists, was checking to see if the opposite would
        > as well.
        >
        > BP
        Have a look at pdftotext, part of xpdf
        (http://www.foolabs.com/xpdf/home.html). This will convert the pdf into
        plaintext format. You will probably have to parse this plaintext to
        convert it into something useful.
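The parsing step depends entirely on what the forms look like, so the following is only a sketch under assumptions: it supposes the extracted text has lines of the form `Label: value`, and the labels in `FIELDS` are invented placeholders, not anything from the original forms. It looks fields up by label rather than by position, so it does not care in which order the lines come out, and it uses the standard `csv` module so that values containing commas are quoted properly.

```python
import csv
import io
import re

# Hypothetical field labels; replace with the labels printed on the real forms.
FIELDS = ["Name", "Date", "Amount"]

# Matches lines such as "Name:   Jane Doe" and captures the label and the value.
LINE_RE = re.compile(r"^\s*(?P<label>[^:]+?)\s*:\s*(?P<value>.*?)\s*$")


def parse_record(text):
    """Collect the known 'Label: value' lines from one form's text, in any order."""
    record = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("label") in FIELDS:
            record[match.group("label")] = match.group("value")
    return record


def records_to_csv(texts):
    """Return comma-delimited text: a header row, then one row per form."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for text in texts:
        # Fields that were not found in a form are written as empty cells.
        writer.writerow(parse_record(text))
    return buffer.getvalue()
```

Feeding it the text of each converted PDF, e.g. `records_to_csv([text1, text2])`, gives a string you can write straight to a `.csv` file for the database import.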


        • Marco Aschwanden

          #5
          Re: pdf2txt

          For me 'ps2ascii' did the job...



          • Steve Holden

            #6
            Re: pdf2txt

            LB wrote:
            >>I know that a txt2pdf exists, was checking to see if the opposite would
            >>as well.
            >
            >
            > I'm sure that from Acrobat you can save a .pdf as .rtf (that is text...).
            > Then it will be easy to do anything on it.
            > I remember also some utilities to "pdf2txt", try a search on google.
            >
            > LB
            Unfortunately the text you get from Acrobat, or from most other
            transformations on PDF, won't guarantee any particular order of the
            elements. This will make parsing difficult, but if all your documents are
            similar you may get enough consistency from a text (not, IIRC, rich text)
            file from Acrobat.

            For extra marks you can use Acrobat's automation interfaces to actually
            convert the PDFs. Good luck!

            regards
            Steve


            • Cameron Laird

              #7
              Re: pdf2txt

              In article <Z3Atc.28409$zO3.22415@newsread2.news.atl.earthlink.net>,
              B P <nature_boy@mindspring.com> wrote:
              >Is there a way via Python or even Perl to capture records from a pdf and
              > output a delimited text file? My work has a situation with a trunk


              • Tim Roberts

                #8
                Re: pdf2txt

                B P <nature_boyMYPANTS@mindspring.com> wrote:
                >
                >Is there a way via Python or even Perl to capture records from a pdf and
                > output a delimited text file? My work has a situation with a trunk
                >load of data forms that were scanned as pdfs.

                SCANNED as PDFs? Do you mean these were paper forms, filled in using
                printed handwriting, then scanned into a TIFF and wrapped up in a PDF?

                If so, your job is next to impossible. You can extract the original
                bitmapped image out of the PDF, and from that you MIGHT be able to use an
                OCR program to extract the text, but unless the forms were specifically
                designed for machine reading, that process tends to be error-prone. It
                might be more efficient to have human beings translate them.
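For anyone who wants to try the image-plus-OCR route anyway, here is a rough sketch. It assumes xpdf's `pdfimages` and the Tesseract command-line program are both installed; Tesseract is just one possible OCR program and is not something named in this thread. It also assumes that Tesseract can read the PBM/PPM files that pdfimages writes by default, which may not hold for every build. The function names are invented for illustration.

```python
import pathlib
import subprocess

# Image formats pdfimages may write (PBM/PPM/PGM by default, JPEG with -j).
IMAGE_SUFFIXES = {".pbm", ".ppm", ".pgm", ".jpg"}


def build_pdfimages_command(pdf_path, image_root):
    """Argument list for pdfimages, which writes files named image_root-NNN.*"""
    return ["pdfimages", pdf_path, image_root]


def build_tesseract_command(image_path, output_base):
    """Argument list for the Tesseract CLI, which writes output_base + '.txt'."""
    return ["tesseract", image_path, output_base]


def ocr_scanned_pdf(pdf_path, work_dir):
    """Extract the embedded images from a PDF, OCR each one, return the texts."""
    work = pathlib.Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_pdfimages_command(pdf_path, str(work / "page")), check=True
    )

    texts = []
    for image in sorted(work.glob("page-*")):
        if image.suffix not in IMAGE_SUFFIXES:
            continue  # skip .txt files left over from an earlier run
        output_base = str(image.with_suffix(""))
        subprocess.run(
            build_tesseract_command(str(image), output_base), check=True
        )
        texts.append(pathlib.Path(output_base + ".txt").read_text(encoding="utf-8"))
    return texts
```

Expect the results on handwritten forms to need a human check for the reasons given above; this only automates the mechanical part.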
                --
                - Tim Roberts, timr@probo.com
                Providenza & Boekelheide, Inc.
