
When you think about PDFs, you probably imagine static documents that you read, save, or print. But PDFs are more than that. In many cases they’re filled with valuable, process-relevant information waiting to be unlocked. Think of form fields, drop-downs, or even digital signatures: these elements often hold the key to digitising and automating workflows.
In this post, I share my experience of uncovering document form data using Python. It was a fun challenge with practical implications. There’s untapped potential here, and I wanted to see what’s possible.
How PDFs Work (in a Nutshell)
To extract data from PDF forms, it's helpful to understand how PDFs are structured. A PDF file is more than a simple flat document—it’s a self-contained system with a layered data model. A more detailed article about the structure of the PDF file format can be found here. If a simplified overview is enough, here you go:
The PDF Data Model

- Catalog: The top-level structure, which acts as an index for the document. It references the document’s pages, metadata, and other key elements.
- Metadata: A reference to the document’s metadata stream, which can include information like title, author, and creation date.
- Outlines: Points to the document’s outline (table of contents or bookmarks).
- Pages: A reference to the root of the document’s page tree, which contains information about all pages in the PDF.
- AcroForm: Contains the interactive form data of the PDF, such as form fields and their properties.
To check out these elements for yourself, you can print them like this:
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser

    pdf_path = 'sample.pdf'

    with open(pdf_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        print(doc.catalog.items())
Types of PDF Forms
There are two primary types of PDF forms:
- AcroForms: The original PDF form standard, defined by Adobe. These are widely supported and store field data as part of the PDF structure.
- XFA Forms (XML Forms Architecture): A newer format introduced by Adobe, designed for more complex, interactive forms. These embed XML-based form templates in the PDF.
Compatibility Notes
- AcroForms: Compatible with nearly all PDF libraries and viewers, making them the preferred choice for most automation tasks.
- XFA Forms: While powerful, they can be tricky due to limited support in popular tools like PDFMiner or PyMuPDF.
- Hybrid PDFs: Some PDFs include both AcroForm and XFA forms. It’s essential to check which format is active to avoid errors.
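As a rough illustration of that check, here is a minimal sketch that classifies a form from its AcroForm dictionary. It assumes the dictionary has already been resolved into a plain Python dict (for example with pdfminer's resolve1) and relies on the convention that XFA data is referenced from an XFA entry in that dictionary. The function name detect_form_type and its return labels are my own and not part of any library.

```python
def detect_form_type(acroform):
    """Classify a resolved AcroForm dictionary.

    Returns one of 'none', 'acroform', 'xfa', or 'hybrid'.
    This is a heuristic: it only looks at which entries are present.
    """
    if not acroform:
        return 'none'

    has_xfa = acroform.get('XFA') is not None   # XFA template/data present
    has_fields = bool(acroform.get('Fields'))   # classic AcroForm fields present

    if has_xfa and has_fields:
        return 'hybrid'
    if has_xfa:
        return 'xfa'
    if has_fields:
        return 'acroform'
    return 'none'
```

With pdfminer, the input could be obtained as resolve1(doc.catalog.get('AcroForm')). Note that this only tells you which structures exist in the file, not which one a given viewer will actually render.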
Where Form Data is Stored
The AcroForm dictionary in the document catalog contains information about all form elements.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    pdf_path = 'sample.pdf'

    with open(pdf_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])
Each form field is an object with attributes such as:
- Field Name (T): A unique name for the field (e.g., "FirstName").
- Field Value (V): The data entered or selected in the field.
- Field Type (FT): Specifies whether the field is a text input, checkbox, radio button, dropdown, etc.
- Appearance (AP): Defines how the field looks on the page, such as its border and background.
- Options (Opt): For dropdowns or list boxes, this includes all possible values.
- Coordinates (Rect): The exact location of the field on the page.
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1

    pdf_path = 'sample.pdf'

    with open(pdf_path, 'rb') as file:
        parser = PDFParser(file)
        doc = PDFDocument(parser)
        fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])

        for field in fields:
            # Fields are usually indirect references, so resolve them first
            field_obj = resolve1(field)
            field_name = field_obj.get('T')  # Field Name
            field_value = resolve1(field_obj.get('V')) if field_obj.get('V') else None  # Field Value
            field_type = field_obj.get('FT')  # Field Type
            appearance = field_obj.get('AP')  # Appearance Dictionary
            options = field_obj.get('Opt')  # Dropdown/List Box Options
            rect = field_obj.get('Rect')  # Coordinates (location on the page)

            print("-----------------------------")
            print(f'Field Name: {field_name}')
            print(f'Field Value: {field_value}')
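The raw field type is terse: the PDF specification uses the names Tx, Btn, Ch, and Sig, and distinguishes checkboxes, radio buttons, and dropdowns through bits in the field flags entry (Ff). The sketch below turns that pair into a readable label. The helper name describe_field_type and the label strings are my own choices, and the function assumes the type name has already been converted to a plain string.

```python
# Bit positions in the field flags entry ('Ff') are 1-based in the PDF specification.
FLAG_RADIO = 1 << 15       # bit 16: button field is a set of radio buttons
FLAG_PUSHBUTTON = 1 << 16  # bit 17: button field is a push button
FLAG_COMBO = 1 << 17       # bit 18: choice field is a combo box (dropdown)


def describe_field_type(ft, flags=0):
    """Map an 'FT' name (plain string) and 'Ff' flags to a readable label."""
    if ft == 'Tx':
        return 'text'
    if ft == 'Sig':
        return 'signature'
    if ft == 'Ch':
        return 'dropdown' if flags & FLAG_COMBO else 'list box'
    if ft == 'Btn':
        if flags & FLAG_PUSHBUTTON:
            return 'push button'
        if flags & FLAG_RADIO:
            return 'radio button'
        return 'checkbox'
    return 'unknown'
```

In the loop above, this could be called as describe_field_type(field_type.name, field_obj.get('Ff', 0)), assuming the type is present and its name is a str. Both FT and Ff can be inherited from a parent field, so a missing value on the field itself does not necessarily mean the field has no type.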
Choosing the Right Tools
There are several libraries available for working with PDFs in Python, each with its strengths. Here’s why I chose pdfminer.six (referred to as PDFMiner below):
- Granular control: PDFMiner allows low-level access to PDF metadata. This was helpful as some packages didn't provide every piece of metadata I wanted.
- Commercial-friendly: PDFMiner is released under the MIT license. PyMuPDF is dual-licensed (AGPL or a commercial license), and it wasn’t clear to me which terms would apply to my use case.
- Flexible: PDFMiner excels in scenarios requiring extensive customization.
That said, I also used PyMuPDF for debugging and visualisation. Its ability to render pages and highlight detected elements made the development process much easier.
Extracting Form Fields: The Process
Here’s a simplified breakdown of how the code works:
- Parse the PDF document: A PDF document is parsed using a PDFParser object, which reads the byte stream. PDFDocument interprets the parsed data into a structure we can navigate.
- Get form fields from the AcroForm section: The catalog is the root object of a PDF document, containing metadata and other high-level structures. AcroForm is a dictionary that defines interactive form properties in the PDF.
- Iterate through pages: Extract pages one by one and process their annotations.
- Process annotations: Annotations are additional content or metadata associated with specific areas of the page. For forms, annotations often correspond to form fields (like text boxes or checkboxes).
- Resolve annotation references: Annotations are often indirect objects. The resolve1(<object>) function follows the reference and returns their actual content.
- Check if the annotation is of type 'Widget': Widget annotations represent form field widgets in the PDF.
- Determine the field object: Fields can be defined in parent objects; if a parent exists, resolve it.
- Extract the field name & value:
  - 'T' (Title): A string representing the name of the form field.
  - 'V' (Value): The current value of the field (e.g., entered text or selected option).
- Decode extracted details: Decode the field name into a readable string and resolve the field value.
    import sys

    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdftypes import resolve1
    from pdfminer.psparser import PSLiteral
    from pdfminer.utils import decode_text

    pdf_path = 'sample.pdf'
    pages_data = {}

    with open(pdf_path, 'rb') as file:
        # Step 1: Parse the PDF document.
        parser = PDFParser(file)
        doc = PDFDocument(parser)

        if 'AcroForm' not in doc.catalog:
            print("No AcroForm found in the PDF document.")
            sys.exit()

        # Step 2: Extract form fields from the AcroForm section of the document catalog.
        field_id = 1
        fields = resolve1(doc.catalog['AcroForm']).get('Fields', [])

        # Step 3: Iterate through pages.
        for page_num, page in enumerate(PDFPage.create_pages(doc)):
            page_number = page_num + 1

            # Initialize a dictionary to store field data for this page
            pages_data[page_number] = {
                'page_number': page_number,
                'fields': []
            }

            # Step 4: Process annotations on the page.
            if page.annots:
                # Step 5: Resolve annotation references.
                annots = resolve1(page.annots)
                if isinstance(annots, list):
                    for annot in annots:
                        annot_obj = resolve1(annot)

                        # Step 6: Check if the annotation is of type 'Widget'.
                        subtype = annot_obj.get('Subtype')
                        if isinstance(subtype, PSLiteral) and subtype.name == 'Widget':
                            # Step 7: Determine the field object.
                            parent = annot_obj.get('Parent')
                            field_obj = resolve1(parent) if parent else annot_obj

                            # Step 8: Extract the field name ('T') and field value ('V')
                            field_name = field_obj.get('T') or None
                            field_value = field_obj.get('V') or None
                            rect = resolve1(annot_obj.get('Rect'))

                            # Step 9: Decode and store extracted details.
                            field = {
                                'field_id': field_id,
                                # decode_text expects bytes, so skip it for unnamed fields
                                'form_field': decode_text(field_name) if field_name else None,
                                'value': resolve1(field_value),
                                'rect': {
                                    'x1': int(rect[0]),
                                    'y1': int(rect[1]),
                                    'x2': int(rect[2]),
                                    'y2': int(rect[3]),
                                }
                            }
                            pages_data[page_number]['fields'].append(field)
                            field_id += 1
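The code above looks one level up when a widget has a parent. Forms can nest fields more deeply than that, and the PDF specification builds a fully qualified field name by joining the partial names ('T') along the Parent chain with periods. The following is a minimal sketch of that walk on plain dictionaries. The function name full_field_name and the resolve, decode, and max_depth parameters are hypothetical helpers of mine; the default decoder simply assumes UTF-8, which is a simplification.

```python
def full_field_name(field, resolve=lambda obj: obj,
                    decode=lambda raw: raw.decode('utf-8', errors='replace'),
                    max_depth=32):
    """Join the partial names ('T') along the 'Parent' chain with periods."""
    parts = []
    node = resolve(field)
    for _ in range(max_depth):  # cap the walk in case of a cyclic Parent chain
        if not isinstance(node, dict):
            break
        name = node.get('T')
        if isinstance(name, bytes):
            name = decode(name)
        if name:
            parts.append(name)
        node = resolve(node.get('Parent'))
    # Names were collected leaf-first, so reverse them before joining
    return '.'.join(reversed(parts))


# Example with hand-built dictionaries standing in for resolved PDF objects
root = {'T': b'applicant'}
group = {'T': b'address', 'Parent': root}
leaf = {'T': b'city', 'Parent': group}
print(full_field_name(leaf))  # → applicant.address.city
```

When working with pdfminer objects, passing resolve=resolve1 and decode=decode_text should make the same walk work on indirect references and PDF-encoded names.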
For visualization, PyMuPDF helped overlay these details on the document for verification.
    import fitz  # PyMuPDF


    def draw_boxes_on_pdf(input_pdf_path: str, output_pdf_path: str, pages_data: dict):
        # Open the PDF
        doc = fitz.open(input_pdf_path)

        # Process each page
        for page_number, page_data in pages_data.items():
            page = doc[page_number - 1]  # Convert to 0-based page numbers
            page_height = page.rect.height

            # For each field on this page, draw a rectangle and add a number
            for field in page_data['fields']:
                rect = field['rect']

                # Convert coordinates (flip vertically)
                box = fitz.Rect(
                    rect['x1'],                # left
                    page_height - rect['y2'],  # top (flipped)
                    rect['x2'],                # right
                    page_height - rect['y1']   # bottom (flipped)
                )

                # Draw red rectangle
                page.draw_rect(box, color=(1, 0, 0), width=1)

                # Calculate center of box for text placement
                center_x = (rect['x1'] + rect['x2']) / 2
                center_y = page_height - ((rect['y1'] + rect['y2']) / 2)

                # Add field ID number in the center
                page.insert_text(
                    point=(center_x, center_y),
                    text=str(field['field_id']),  # Use field's ID
                    color=(1, 0, 0),
                    fontsize=12
                )

        # Save the modified PDF
        doc.save(output_pdf_path)
        doc.close()
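The vertical flip is needed because the Rect values in the PDF use a coordinate system with its origin at the bottom-left of the page, while PyMuPDF draws with the origin at the top-left. Here is a small worked sketch of that conversion in isolation. The function name pdf_rect_to_top_left is my own, and it assumes an unrotated page whose visible area starts at (0, 0).

```python
def pdf_rect_to_top_left(rect, page_height):
    """Convert (x1, y1, x2, y2) from a bottom-left origin to a top-left origin.

    Corners are normalised first, since a PDF Rect may list them in any order.
    """
    x1, y1, x2, y2 = rect
    left, right = min(x1, x2), max(x1, x2)
    bottom, top = min(y1, y2), max(y1, y2)
    # The higher edge in PDF space becomes the smaller y value after flipping
    return (left, page_height - top, right, page_height - bottom)


# A field 72 points below the top edge of a 792-point-high (US Letter) page
print(pdf_rect_to_top_left((100, 700, 300, 720), 792))  # → (100, 72, 300, 92)
```

Pages with a rotation or an offset crop box would need an additional transformation, which this sketch does not attempt.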
Example

Note: You can extract more information than is visualised in this example.
What’s Next?
This project is just the beginning. Here’s where I see it going:
- Form Extraction Pipeline: Build an end-to-end solution for extracting and processing forms.
- Template Creation: Use extracted fields as templates for extracting fields from scanned documents and for training AI for Key Information Extraction.
- Synthetic Data Generation: Generate realistic training data for machine learning models.
- Integration with SAS Viya: Develop a Custom Step for SAS Studio that makes this functionality available in a drag-and-drop fashion.
Conclusion
After figuring out how to interact with the (convoluted) PDF data format, programmatically extracting form data from PDFs with Python is rather straightforward. It is also more than just a one-time technical exercise. I see it as another building block to make documents (more) usable for AI and for automation use cases. As outlined above, I have a lot of ideas for where I can take this project next, and I’m sure I’ll get even more ideas along the way.
Wanna check out the whole project? Follow this link to the GitHub repo.