
Journey to the Center of the PDF

When you think about PDFs, you probably imagine static documents that you read, save, or print. But PDFs are more than that. In many cases they’re filled with valuable, process-relevant information waiting to be unlocked. Think of form fields, drop-downs, or even digital signatures: these elements often hold the key to digitising and automating workflows.

In this post, I share my experience of uncovering document form data using Python. It was a fun challenge with practical implications. There’s untapped potential here, and I wanted to see what’s possible.

How PDFs Work (in a Nutshell)

To extract data from PDF forms, it's helpful to understand how PDFs are structured. A PDF file is more than a simple flat document—it’s a self-contained system with a layered data model. A more detailed article about the structure of the PDF file format can be found here. If a simplified overview is enough, here you go:

The PDF Data Model

PDF Data Model Overview
  1. Catalog:
    The top-level structure, which acts as an index for the document. It references the document’s pages, metadata, and other key elements.
  2. Metadata:
    A reference to the metadata stream for the document, which can include information like title, author, and creation date.
  3. Outlines:
    Points to the document’s outline (table of contents or bookmarks).
  4. Pages:
    A reference to the root of the document’s page tree, which contains information about all pages in the PDF.
  5. AcroForm:
    Contains the interactive form data of the PDF, such as form fields and their properties.

To check out these elements for yourself, you can print them like this:

python
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
pdf_path = 'sample.pdf'
with open(pdf_path, 'rb') as file:
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    # The catalog is a dictionary; print its keys and values
    print(doc.catalog.items())

Types of PDF Forms

There are two primary types of PDF forms:

  1. AcroForms:
    The original PDF form standard, defined by Adobe. These are widely supported and store field data as part of the PDF structure.
  2. XFA Forms (XML Forms Architecture):
    A newer format introduced by Adobe, designed for more complex, interactive forms. These embed XML-based form templates in the PDF.

Compatibility Notes

  • AcroForms: Compatible with nearly all PDF libraries and viewers, making them the preferred choice for most automation tasks.
  • XFA Forms: While powerful, they can be tricky due to limited support in popular tools like PDFMiner or PyMuPDF.
  • Hybrid PDFs: Some PDFs include both AcroForm and XFA forms. It’s essential to check which format is active to avoid errors.
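
As a rough illustration of that check: in the PDF specification, XFA data sits under an XFA entry of the AcroForm dictionary, while classic fields are listed under Fields. The sketch below works on a plain dict, which is what I assume you get after resolving the AcroForm entry (and its values) with pdfminer. The helper name classify_form is hypothetical, and the "hybrid" rule is a heuristic of mine, not an official definition.

```python
def classify_form(acroform):
    """Roughly classify a resolved AcroForm dictionary (or None)."""
    if not acroform:
        return "none"
    # Assumption: values are already resolved, so an empty list means "no fields"
    has_xfa = acroform.get("XFA") is not None
    has_fields = bool(acroform.get("Fields"))
    if has_xfa and has_fields:
        return "hybrid"
    if has_xfa:
        return "xfa"
    if has_fields:
        return "acroform"
    return "empty"


# Toy dictionaries standing in for resolved AcroForm entries
print(classify_form({"Fields": ["field-ref"]}))
print(classify_form({"Fields": ["field-ref"], "XFA": ["template", "stream-ref"]}))
```

If the result is "xfa", the field-based approach in this post will probably find little or nothing, so it is worth checking before running the rest of the extraction.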

Where Form Data is Stored

The AcroForm dictionary in the document catalog contains information about all form elements.

python
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1
pdf_path = 'sample.pdf'
with open(pdf_path, 'rb') as file:
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    # AcroForm may be an indirect reference, so resolve it before reading 'Fields'
    fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])

Each form field is an object with attributes such as:

  • Field Name (T): A unique name for the field (e.g., "FirstName").
  • Field Value (V): The data entered or selected in the field.
  • Field Type (FT): Specifies whether the field is a text input, checkbox, radio button, dropdown, etc.
  • Appearance (AP): Defines how the field looks on the page, such as its border and background.
  • Options (Opt): For dropdowns or list boxes, this includes all possible values.
  • Coordinates (Rect): The exact location of the field on the page.

python
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1
pdf_path = 'sample.pdf'
with open(pdf_path, 'rb') as file:
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])
    for field in fields:
        field_obj = resolve1(field)
        field_name = field_obj.get('T')  # Field Name (raw bytes)
        field_value = resolve1(field_obj.get('V')) if field_obj.get('V') else None  # Field Value
        field_type = field_obj.get('FT')  # Field Type
        appearance = field_obj.get('AP')  # Appearance Dictionary
        options = field_obj.get('Opt')  # Dropdown/List Box Options
        rect = field_obj.get('Rect')  # Coordinates (location on the page)
        print("-----------------------------")
        print(f'Field Name: {field_name}')
        print(f'Field Value: {field_value}')
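
One detail worth spelling out: FT on its own only distinguishes four families (Tx, Btn, Ch, Sig). To tell a checkbox from a radio button, or a dropdown from a list box, you also need the field flags (Ff). Here is a minimal sketch of that mapping, using the flag bit positions from the PDF specification. The helper name describe_field is hypothetical, and I assume you pass in the name of the FT literal (for example field_type.name) as a string plus the integer value of Ff; both entries can also be inherited from a parent field, which this sketch ignores.

```python
# Field-flag bits from the PDF specification (bit positions are 1-based)
FLAG_RADIO = 1 << 15       # bit 16: button field is a radio-button group
FLAG_PUSHBUTTON = 1 << 16  # bit 17: button field is a push button
FLAG_COMBO = 1 << 17       # bit 18: choice field is a drop-down (combo box)


def describe_field(ft, flags=0):
    """Map an FT name ('Tx', 'Btn', 'Ch', 'Sig') plus Ff flags to a readable label."""
    if ft == "Tx":
        return "text"
    if ft == "Sig":
        return "signature"
    if ft == "Ch":
        return "dropdown" if flags & FLAG_COMBO else "list box"
    if ft == "Btn":
        if flags & FLAG_PUSHBUTTON:
            return "push button"
        return "radio button" if flags & FLAG_RADIO else "checkbox"
    return "unknown"


print(describe_field("Btn"))              # no flags set
print(describe_field("Ch", FLAG_COMBO))   # combo flag set
```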

Choosing the Right Tools

There are several libraries available for working with PDFs in Python, each with its strengths. Here’s why I chose PDFMiner.Six:

  • Granular control: PDFMiner allows low-level access to PDF metadata. This was helpful as some packages didn't provide every piece of metadata I wanted.
  • Commercial-friendly: Unlike PyMuPDF, whose licensing terms weren’t clear for my use case, PDFMiner had no such ambiguity.
  • Flexible: PDFMiner excels in scenarios requiring extensive customization.

That said, I also used PyMuPDF for debugging and visualisation. Its ability to render pages and highlight detected elements made the development process much easier.

Extracting Form Fields: The Process

Here’s a simplified breakdown of how the code works:

  1. Parse the PDF document: A PDF document is parsed using a PDFParser object, which reads the byte stream. PDFDocument interprets the parsed data into a structure we can navigate.
  2. Get form fields from the AcroForm section: The catalog is the root object of a PDF document, containing metadata and other high-level structures. AcroForm is a dictionary that defines interactive form properties in the PDF.
  3. Iterate through pages: Extract pages one by one and process their annotations.
  4. Process annotations: Annotations are additional content or metadata associated with specific areas of the page. For forms, annotations often correspond to form fields (like text boxes or checkboxes).
  5. Resolve annotation references: Annotations are often indirect objects. The resolve1(<object>) function retrieves their actual content.
  6. Check if the annotation is of type 'Widget': Widget annotations represent form field widgets in the PDF.
  7. Determine the field object: Fields can be defined in parent objects; if a parent exists, resolve it.
  8. Extract the field name & value:
    1. 'T' (Title): A string representing the name of the form field.
    2. 'V' (Value): The current value of the field (e.g., entered text or selected option).
  9. Decode extracted details: Decode the field name from its raw PDF byte string into a regular Python string, and resolve the value.

python
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.pdftypes import resolve1
from pdfminer.psparser import PSLiteral
from pdfminer.utils import decode_text
pdf_path = 'sample.pdf'
pages_data = {}
with open(pdf_path, 'rb') as file:
    # Step 1: Parse the PDF document.
    parser = PDFParser(file)
    doc = PDFDocument(parser)
    if 'AcroForm' not in doc.catalog:
        raise SystemExit("No AcroForm found in the PDF document.")
    # Step 2: Extract form fields from the AcroForm section of the document catalog.
    field_id = 1
    fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])
    # Step 3: Iterate through pages.
    for page_num, page in enumerate(PDFPage.create_pages(doc)):
        page_number = page_num + 1
        # Initialize a dictionary to store field data for this page
        pages_data[page_number] = {
            'page_number': page_number,
            'fields': []
        }
        # Step 4: Process annotations on the page.
        if page.annots:
            # Step 5: Resolve annotation references.
            annots = resolve1(page.annots)
            if isinstance(annots, list):
                for annot in annots:
                    # Step 6: Check if the annotation is of type 'Widget'.
                    annot_obj = resolve1(annot)
                    subtype = annot_obj.get('Subtype')
                    # Step 7: Determine the field object.
                    if isinstance(subtype, PSLiteral) and subtype.name == 'Widget':
                        parent = annot_obj.get('Parent')
                        field_obj = resolve1(parent) if parent else annot_obj
                        # Step 8: Extract the field name ('T') and field value ('V')
                        field_name = resolve1(field_obj.get('T'))
                        field_value = resolve1(field_obj.get('V'))
                        rect = resolve1(annot_obj.get('Rect'))
                        if not rect:
                            continue  # Skip widgets without a position
                        # Step 9: Decode and store extracted details.
                        field = {
                            'field_id': field_id,
                            # decode_text fails on None, so guard unnamed fields
                            'form_field': decode_text(field_name) if field_name else None,
                            'value': field_value,
                            'rect': {
                                'x1': int(rect[0]),
                                'y1': int(rect[1]),
                                'x2': int(rect[2]),
                                'y2': int(rect[3]),
                            }
                        }
                        pages_data[page_number]['fields'].append(field)
                        field_id += 1
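
The code above looks at most one level up the Parent chain. In the PDF specification, fields can be nested more deeply: the fully qualified field name is built by joining the partial names (T) of all ancestors with periods, and entries such as V can be inherited from a parent. The following is a minimal sketch of that idea on plain dicts with string names. The helper names (ancestors, full_name, inherited) are hypothetical, and with pdfminer you would, I assume, need to wrap each Parent lookup in resolve1 and decode T with decode_text.

```python
def ancestors(field, max_depth=32):
    """Yield the field itself, then each parent, up to the root."""
    node = field
    for _ in range(max_depth):  # guard against cyclic Parent links
        if node is None:
            return
        yield node
        node = node.get("Parent")


def full_name(field):
    """Join the partial names ('T') from the root down to this field."""
    parts = [node["T"] for node in ancestors(field) if node.get("T") is not None]
    return ".".join(reversed(parts))


def inherited(field, key):
    """Return the nearest value for key, looking at the field first, then its parents."""
    for node in ancestors(field):
        if node.get(key) is not None:
            return node[key]
    return None


# Toy hierarchy: a widget without its own name or value, nested two levels deep
root = {"T": "applicant"}
address = {"T": "address", "V": "1 Main Street", "Parent": root}
widget = {"Subtype": "Widget", "Parent": address}
print(full_name(widget))
print(inherited(widget, "V"))
```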

For visualisation, I used PyMuPDF to overlay these details on the document and verify the results.

python
import fitz  # PyMuPDF
def draw_boxes_on_pdf(input_pdf_path: str, output_pdf_path: str, pages_data: dict):
    # Open the PDF
    doc = fitz.open(input_pdf_path)
    # Process each page
    for page_number, page_data in pages_data.items():
        page = doc[page_number - 1]  # Convert to 0-based page numbers
        page_height = page.rect.height
        # For each field on this page, draw a rectangle and add a number
        for field in page_data['fields']:
            rect = field['rect']
            # Convert coordinates (flip vertically)
            box = fitz.Rect(
                rect['x1'],  # left
                page_height - rect['y2'],  # top (flipped)
                rect['x2'],  # right
                page_height - rect['y1']  # bottom (flipped)
            )
            # Draw red rectangle
            page.draw_rect(box, color=(1, 0, 0), width=1)
            # Calculate center of box for text placement
            center_x = (rect['x1'] + rect['x2']) / 2
            center_y = page_height - ((rect['y1'] + rect['y2']) / 2)
            # Add field ID number in the center
            page.insert_text(
                point=(center_x, center_y),
                text=str(field['field_id']),  # Use field's ID
                color=(1, 0, 0),
                fontsize=12
            )
    # Save the modified PDF
    doc.save(output_pdf_path)
    doc.close()
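
The vertical flip in this function is needed because the Rect values stored in the PDF use a coordinate system with the origin at the bottom-left corner and y growing upwards, while PyMuPDF draws with the origin at the top-left and y growing downwards. Here is the conversion on its own as a small sketch. The function name pdf_rect_to_top_left is hypothetical, and I assume an unrotated page whose lower-left corner sits at (0, 0).

```python
def pdf_rect_to_top_left(rect, page_height):
    """Convert (x1, y1, x2, y2) from a bottom-left origin to a top-left origin."""
    x1, y1, x2, y2 = rect
    # Rect corners are not guaranteed to be ordered, so normalise them first
    left, right = sorted((x1, x2))
    bottom, top = sorted((y1, y2))
    return (left, page_height - top, right, page_height - bottom)


# A field near the top of a US Letter page (792 points high)
print(pdf_rect_to_top_left((50, 700, 200, 720), page_height=792))
# → (50, 72, 200, 92)
```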

Example

Example Form Extraction

Note: You can extract more information than is visualised in this example.

What’s Next?

This project is just the beginning. Here’s where I see it going:

  1. Form Extraction Pipeline: Build an end-to-end solution for extracting and processing forms.
  2. Template Creation: Use extracted fields as templates for extracting fields from scanned documents and for training AI models for Key Information Extraction.
  3. Synthetic Data Generation: Generate realistic training data for machine learning models.
  4. Integration with SAS Viya: Develop a Custom Step for SAS Studio to make this functionality available in a drag & drop fashion.

Conclusion

After figuring out how to interact with the (convoluted) PDF data format, programmatically extracting form data from PDFs with Python is rather straightforward. It is also more than just a one-time technical exercise. I see it as another building block for making documents (more) usable for AI and automation use cases. As outlined above, I have a lot of ideas for where to take this project next, and I’m sure I'll get even more along the way.

Wanna check out the whole project? Follow this link to the GitHub repo.