My project involves reading text from a batch of PDF form files, for which I'm using the open-source PyPDF2 library. Extracting the text itself works without problems:
[code=python]
from PyPDF2 import PdfReader

reader = PdfReader("data/test.pdf")
cnt = len(reader.pages)
print("reading pdf (%d pages)" % cnt)
# extract the text of the last page, one entry per line
page = reader.pages[cnt - 1]
lines = page.extract_text().splitlines()
print("%d lines extracted..." % len(lines))
[/code]
However, this text doesn't include the checked state of the radio buttons and checkboxes. I just get the plain label text (for example "Yes No") instead of the selected values.
I also tried the reader.get_fields() and reader.get_form_text_fields() methods described in the PyPDF2 documentation, but they return empty values. I also tried reading the annotations, but there is no "/Annots" entry on the page. When I open the PDF in Notepad++ to look at its metadata, this is what I get:
[code=bash]
%PDF-1.4
%²³´µ
%Generated by ExpertPdf v9.2.2
[/code]
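For reference, the two checks described above (form fields and page annotations) can be condensed into a small helper, roughly as sketched below. The name describe_form_support is purely illustrative and not part of PyPDF2; the sketch assumes only that the reader object offers get_fields() and a pages sequence whose entries behave like dictionaries, as PyPDF2's PdfReader does.

```python
def describe_form_support(reader):
    """Summarize form-related structures exposed by a PdfReader-like object.

    Hypothetical helper (not a PyPDF2 function). It only relies on
    reader.get_fields() and on reader.pages holding dict-like pages.
    """
    # get_fields() returns None when the document has no form fields
    fields = reader.get_fields() or {}
    # collect the indexes of pages that carry an "/Annots" entry
    pages_with_annots = [
        index for index, page in enumerate(reader.pages) if "/Annots" in page
    ]
    return {
        "field_count": len(fields),
        "pages_with_annots": pages_with_annots,
    }


# Usage with PyPDF2 (assumes the package is installed and the file exists):
#   from PyPDF2 import PdfReader
#   print(describe_form_support(PdfReader("data/test.pdf")))
```

For a file like the one described here, this would be expected to report zero fields and no pages with annotations, which matches the empty results mentioned above.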
It appears to me that these checkboxes aren't regular PDF form fields but something closer to rendered HTML elements. Is there any way to extract their state using Python?