PDF2Txt

  • kamalatanvi
    New Member
    • Oct 2007
    • 7

    PDF2Txt

    How do I use the File::Extract::PDF module to extract the text from a PDF? I need guidance on writing the program.

    thank you
  • Kelicula
    Recognized Expert New Member
    • Jul 2007
    • 176

    #2
    I do not have that module loaded, and there is not a lot of documentation on it.
    But from taking a look at the source, it seems this would print each line in the entire file:
    [code=perl]

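    #!/usr/bin/perl

    use strict;
    use warnings;

    use File::Extract::PDF;

    # Untested guess at the interface, based on reading the module source
    my $target = File::Extract::PDF->new;

    $target->extract(FH, "pdfdocument.pdf") or die;

    # Each line read from FH already ends with a newline
    while (<FH>) {
        print $_;
    }

    close(FH);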

    [/code]

    Unfortunately I can't test it.
    I am only hoping to get the ball rolling, and hope to learn from this myself.

    If (or when) this doesn't work, post any errors you may get.

    goodday


    • numberwhun
      Recognized Expert Moderator Specialist
      • May 2007
      • 3467

      #3
      In addition, if you want to write the lines to a text file instead of printing them, use the open() function to open the text file and then add its file handle to the print statement, like so:

      [code=perl]

      #!/usr/bin/perl

      use strict;
      use warnings;

      use File::Extract::PDF;

      # Three-argument open with an error check
      open(NEWFILE, '>', './newfile.txt') or die "Cannot open newfile.txt: $!";
      my $target = File::Extract::PDF->new;

      $target->extract(FH, "pdfdocument.pdf") or die;

      while (<FH>) {
          print NEWFILE $_;
      }

      close(FH);
      close(NEWFILE);
      [/code]

      Regards,

      Jeff


      • kamalatanvi
        New Member
        • Oct 2007
        • 7

        #4
        Thank you for your reply.
        When I execute this code I get the following error. How do I fix it?
        Bareword "FH" not allowed


        Waiting for your reply.
        Regards,
        kamalatanvi


        • numberwhun
          Recognized Expert Moderator Specialist
          • May 2007
          • 3467

          #5
          I would have to say that this module (at version 0.06, which is well below version 1.0) probably has many issues, as it looks to be fairly new. It may be that the extract function is not completely debugged yet.

          You have a couple of options here.

          1. I would go through the module code and ensure that the way you are using it is completely correct.
          2. If it is, you could always email the author and see what their input is.
          3. You could always implement your own solution to this (a much longer route).

          This is generally the problem with modules that are so very new. They tend to be "not ready for primetime" but are available on CPAN. If you check, there is NO documentation on CPAN for this module either.
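
On the error message itself: under `use strict`, a bareword such as `FH` is not accepted as an ordinary argument to a sub or method, which is what produces `Bareword "FH" not allowed`. Whether File::Extract::PDF's extract method takes a filehandle at all is not confirmed here. The sketch below only illustrates the general Perl point, using a hypothetical helper `read_lines` (not part of any module) and an in-memory file so that it runs without a PDF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Extract): reads every line from
# the filehandle it is given and returns the lines without newlines.
sub read_lines {
    my ($fh) = @_;
    my @lines = <$fh>;
    chomp @lines;
    return @lines;
}

# In-memory "file" so the sketch runs without any PDF or CPAN module.
my $text = "first line\nsecond line\n";

# A lexical filehandle is an ordinary scalar, so strict accepts it as an argument.
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @lines = read_lines($fh);
close($fh);

# A bareword handle has to be passed as a glob reference (\*FH), not as FH.
open(FH, '<', \$text) or die "Cannot open in-memory file: $!";
my @again = read_lines(\*FH);
close(FH);

print "$_\n" for @lines;
```

Passing `$fh` or `\*FH` compiles under strict; writing `read_lines(FH)` would trigger the same bareword error reported above.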

          Regards,

          Jeff


          • aleluis
            New Member
            • Oct 2007
            • 4

            #6
            Hi, I'm quite new to the Perl world, but I think I can help, though using the CAM::PDF module instead. I found I could extract text from PDF pages with the following statements:

            Code:
            .........
            use CAM::PDF;
            use CAM::PDF::PageText;

            .........

            my $pdf = CAM::PDF->new($filename);

            print ARCHIVO (CAM::PDF::PageText->render($pdf->getPageContentTree($numpage)));

            ........
            This prints the text of the given PDF page as plain text to the filehandle ARCHIVO, which is associated with a *.txt file in my program. Since the method returns a string, you can also process the text further before printing it. Hope this helps.
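
For a complete script along the same lines, here is a minimal sketch that loops over every page and writes the text to a file. It assumes CAM::PDF is installed and uses its `numPages` and `getPageText` methods instead of calling CAM::PDF::PageText directly. The sub name `extract_pdf_text` and the command-line file names are placeholders chosen for illustration, and how clean the output is will depend on how the PDF was produced.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use CAM::PDF;

# Hypothetical helper: writes the text of every page in $pdf_file to $txt_file.
sub extract_pdf_text {
    my ($pdf_file, $txt_file) = @_;

    my $pdf = CAM::PDF->new($pdf_file)
        or die "Cannot open $pdf_file: $CAM::PDF::errstr\n";

    # Lexical filehandle with three-argument open
    open(my $out, '>', $txt_file) or die "Cannot write $txt_file: $!";

    # Page numbers start at 1
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;    # skip pages that yield no text
        print {$out} $text, "\n";
    }

    close($out) or die "Cannot close $txt_file: $!";
    return;
}

# Usage: perl pdf2txt.pl input.pdf output.txt
if (@ARGV == 2) {
    extract_pdf_text(@ARGV);
}
```

Using a lexical filehandle here also avoids the bareword problem raised earlier in the thread.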
