How to Parse PDF Files in Go
You are building a document ingestion service. A user uploads a PDF contract, and your backend needs to extract the text to index it in a search engine. You open the file in your editor and see %PDF-1.4 followed by binary garbage. You try to read it line by line and get noise. PDFs are not text files. They are structured binary archives that describe layout, fonts, and graphics. Go's standard library handles HTTP, JSON, and XML, but it leaves PDF parsing to third-party libraries. You need a library that understands the PDF specification, decodes the binary structure, and gives you a clean API to access pages and text.
PDFs are layout engines, not text streams
PDF stands for Portable Document Format. The design goal is visual consistency across devices. The format achieves this by describing exactly where every glyph and vector shape goes. The file contains a tree of objects. Some objects are dictionaries with metadata. Some are streams of compressed data. Some are references to other objects.
The parser starts at the end of the file, where the startxref entry gives the byte offset of the cross-reference table and the trailer dictionary identifies the root object. The cross-reference table maps object numbers to byte offsets. The parser uses it to build an in-memory graph of objects. When you ask for text, the library decodes the page's content stream, which contains drawing commands. The library interprets commands like Tj (show text) and looks up the font mapping, which translates byte sequences to Unicode characters.
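To make the first step concrete, here is a minimal sketch that finds the cross-reference offset using only the standard library. The function name findStartXref and the 1024-byte search window are assumptions made for illustration, and the sample is a hand-written fragment of a PDF tail, not a complete file. A real parser also has to handle cross-reference streams and incremental updates.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"strconv"
)

// findStartXref returns the byte offset written after the last
// "startxref" keyword. It only inspects the tail of the data.
func findStartXref(data []byte) (int64, error) {
	const keyword = "startxref"
	// Assumed search window: the keyword sits near the end of the file.
	tail := data
	if len(tail) > 1024 {
		tail = tail[len(tail)-1024:]
	}
	idx := bytes.LastIndex(tail, []byte(keyword))
	if idx < 0 {
		return 0, errors.New("startxref keyword not found")
	}
	// The offset is the first whitespace-separated token after the keyword.
	fields := bytes.Fields(tail[idx+len(keyword):])
	if len(fields) == 0 {
		return 0, errors.New("no offset after startxref")
	}
	return strconv.ParseInt(string(fields[0]), 10, 64)
}

func main() {
	// Hand-written fragment of a PDF tail, for illustration only.
	sample := []byte("trailer\n<< /Size 6 /Root 1 0 R >>\nstartxref\n416\n%%EOF\n")
	offset, err := findStartXref(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("xref offset:", offset) // prints "xref offset: 416"
}
```

A library does this for you. The sketch only shows why a PDF cannot be read front to back like a text file.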
This process is complex. A simple text file is a stream of characters. A PDF is a drawing program that happens to draw text. Text extraction is lossy. The order of commands in the file does not always match the visual reading order. Columns and tables can get scrambled. Scanned PDFs contain images with no text layer at all. The parser cannot extract text that does not exist.
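The reading-order problem is easier to see with positions attached. The sketch below assumes a hypothetical fragment type holding the coordinates an extractor might report, and sorts fragments top to bottom, then left to right. It treats fragments with identical Y values as one line, which is a simplification.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// fragment is a hypothetical positioned piece of text.
// PDF y coordinates grow upward from the bottom of the page.
type fragment struct {
	X, Y float64
	Text string
}

// readingOrder sorts fragments top to bottom, then left to right,
// and joins fragments that share a Y value into one line.
func readingOrder(frags []fragment) string {
	sorted := append([]fragment(nil), frags...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].Y != sorted[j].Y {
			return sorted[i].Y > sorted[j].Y // higher on the page comes first
		}
		return sorted[i].X < sorted[j].X
	})
	var b strings.Builder
	for i, f := range sorted {
		if i > 0 {
			if f.Y == sorted[i-1].Y {
				b.WriteByte(' ')
			} else {
				b.WriteByte('\n')
			}
		}
		b.WriteString(f.Text)
	}
	return b.String()
}

func main() {
	// File order differs from visual order, as it can in a real content stream.
	frags := []fragment{
		{X: 300, Y: 700, Text: "Amount"},
		{X: 72, Y: 680, Text: "Widget"},
		{X: 72, Y: 700, Text: "Item"},
		{X: 300, Y: 680, Text: "42.00"},
	}
	fmt.Println(readingOrder(frags))
	// Prints:
	// Item Amount
	// Widget 42.00
}
```

This naive sort interleaves multi-column layouts, because it reads straight across the page. Handling columns requires layout analysis that goes beyond a sort.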
PDFs are layout engines, not text files. Parse the structure; don't expect a stream of characters.
Minimal example: open and count pages
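Here's the simplest way to open a PDF and count its pages. The code uses unidoc/unipdf, a popular library for reading PDFs. Install it with go get github.com/unidoc/unipdf/v3. The library is commercially licensed, and recent releases may require a license key to be configured before the calls below succeed.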
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/unidoc/unipdf/v3/model"
)

func main() {
	// Open the file; the reader works on an io.ReadSeeker.
	f, err := os.Open("document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	// Defer close to release the file handle.
	defer f.Close()

	// NewPdfReader parses the header, trailer, and cross-reference data.
	r, err := model.NewPdfReader(f)
	if err != nil {
		log.Fatal(err)
	}

	// GetNumPages reports how many pages the document contains.
	pages, err := r.GetNumPages()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("PDF has %d pages\n", pages)
}
The model.NewPdfReader function takes the open file and validates the header. It expects %PDF- followed by a version number. If the header is missing or corrupted, the function returns an error. The reader then scans the end of the file for the trailer dictionary and the pointer to the cross-reference table, and parses the table to locate all objects. The reader does not own the file, so closing it is your job.
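The header check described above can be sketched with the standard library alone, which is handy for rejecting non-PDF uploads before handing them to a parser. The function name pdfVersion is an assumption for illustration. The sketch is stricter than many real parsers, which tolerate a few junk bytes before the header.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// pdfVersion returns the version from a header such as "%PDF-1.4".
// It requires the magic bytes at offset 0.
func pdfVersion(data []byte) (string, error) {
	const magic = "%PDF-"
	if !bytes.HasPrefix(data, []byte(magic)) {
		return "", errors.New("missing PDF header")
	}
	rest := data[len(magic):]
	// The version runs up to the first line break.
	end := bytes.IndexAny(rest, "\r\n")
	if end < 0 {
		end = len(rest)
	}
	version := string(bytes.TrimSpace(rest[:end]))
	if version == "" {
		return "", errors.New("empty PDF version")
	}
	return version, nil
}

func main() {
	version, err := pdfVersion([]byte("%PDF-1.4\n1 0 obj\n"))
	if err != nil {
		fmt.Println("not a PDF:", err)
		return
	}
	fmt.Println("PDF version:", version) // prints "PDF version: 1.4"
}
```

In an upload handler you would read only the first few bytes of the request body for this check, then hand the full file to the library.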
Objects are loaded lazily: the library does not decode every stream up front and waits until you request data. GetNumPages works from the page tree, which can be nested. The library walks the tree and counts leaf nodes, so the count reflects the actual number of pages regardless of how the tree is structured.
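Counting leaves in a nested page tree is a short recursion. The sketch below uses a toy node type invented for illustration, not the library's own data structures, to show why nesting depth does not affect the result.

```go
package main

import "fmt"

// node is a toy stand-in for a page tree object: an intermediate
// "Pages" node has Kids, a leaf "Page" node has none.
type node struct {
	Kind string // "Pages" or "Page"
	Kids []*node
}

// countPages walks the tree depth-first and counts leaf pages.
func countPages(n *node) int {
	if n == nil {
		return 0
	}
	if n.Kind == "Page" {
		return 1
	}
	total := 0
	for _, kid := range n.Kids {
		total += countPages(kid)
	}
	return total
}

func main() {
	// One page at the top level, two more inside a nested Pages node.
	root := &node{Kind: "Pages", Kids: []*node{
		{Kind: "Page"},
		{Kind: "Pages", Kids: []*node{{Kind: "Page"}, {Kind: "Page"}}},
	}}
	fmt.Println(countPages(root)) // prints 3
}
```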
Always defer close. File handles are finite resources.
Realistic example: extract text page by page
Real code extracts text page by page, handling errors per page so one bad page doesn't kill the whole file. The extractor package provides text extraction.
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/unidoc/unipdf/v3/extractor"
	"github.com/unidoc/unipdf/v3/model"
)

// ExtractText reads a PDF and returns the text content of all pages.
// Pages that cannot be read are returned as empty strings.
func ExtractText(path string) ([]string, error) {
	// Open file and wrap error for context.
	f, err := os.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open pdf: %w", err)
	}
	defer f.Close()

	r, err := model.NewPdfReader(f)
	if err != nil {
		return nil, fmt.Errorf("parse pdf: %w", err)
	}
	// Get page count to drive the loop.
	numPages, err := r.GetNumPages()
	if err != nil {
		return nil, fmt.Errorf("count pages: %w", err)
	}

	texts := make([]string, 0, numPages)
	// PDF pages are 1-indexed.
	for i := 1; i <= numPages; i++ {
		text, err := extractPage(r, i)
		if err != nil {
			// Log the failure, record an empty string, and keep going.
			log.Printf("page %d: %v", i, err)
			text = ""
		}
		texts = append(texts, text)
	}
	return texts, nil
}

// extractPage returns the text of a single page.
func extractPage(r *model.PdfReader, pageNum int) (string, error) {
	page, err := r.GetPage(pageNum)
	if err != nil {
		return "", fmt.Errorf("get page: %w", err)
	}
	// The extractor decodes the content stream and resolves font mappings.
	ex, err := extractor.New(page)
	if err != nil {
		return "", fmt.Errorf("new extractor: %w", err)
	}
	return ex.ExtractText()
}

func main() {
	texts, err := ExtractText("document.pdf")
	if err != nil {
		log.Fatal(err)
	}
	for i, text := range texts {
		fmt.Printf("--- page %d ---\n%s\n", i+1, text)
	}
}
The loop starts at index 1. PDF pages are 1-indexed. If you pass 0, the library returns an error. GetPage returns the page object, not the text. The page object contains the content stream, font resources, and media box dimensions. extractor.New wraps the page, and its ExtractText method decodes the stream and interprets the drawing commands. It looks up the font mapping to convert bytes to Unicode.
The code checks errors at every step. Failing to open or parse the file aborts the call, because there is nothing to extract. Inside the loop, GetPage can fail if the page object is corrupted, and ExtractText can fail if the content stream is invalid or the font mapping is missing. In either case the code logs the error, appends an empty string, and continues. This strategy ensures that a single broken page does not prevent extraction from the rest of the document, and the result always has one entry per page.
One bad page shouldn't crash the parser. Handle errors per page and keep going.
Pitfalls and compiler errors
PDF parsing is fragile. You will encounter edge cases.
Scanned PDFs contain images, not text, so extraction returns empty strings and you need an OCR step to handle them. Encrypted PDFs require a password: with unipdf you check r.IsEncrypted() and call r.Decrypt with the password before reading pages, otherwise page access fails. Large PDFs consume memory because the library keeps parsed objects in memory. A 500MB PDF might use gigabytes of RAM. Process pages one at a time, and avoid holding the text of every page at once when the document is large.
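Font mapping errors occur with custom fonts, especially when the font carries no usable Unicode mapping. The library might then map glyphs to the wrong Unicode code points. Reading order is the other common source of damage: as noted earlier, the order of text in the file does not always match the visual order, so columns and tables can come out scrambled.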
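The compiler catches some mistakes before the program runs. If you use a package without importing it, the build fails with undefined: followed by the package name. If you import a package and never use it, the build fails with imported and not used. The remaining errors appear at run time and come from the library. Asking for page 0 returns an error because pages are 1-indexed. Opening a corrupted file fails while the header or cross-reference table is parsed. Treat the exact error text as an implementation detail and check for a non-nil err rather than matching strings.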
Scanned PDFs have no text. Check for empty results before indexing.
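A check like this can run between extraction and indexing. It is a minimal sketch: the emptyPages helper is a hypothetical name, and what you do with the result (skip the document, queue it for OCR, flag it for review) depends on your pipeline.

```go
package main

import (
	"fmt"
	"strings"
)

// emptyPages returns the 1-based numbers of pages whose extracted
// text is blank after trimming whitespace.
func emptyPages(texts []string) []int {
	var empty []int
	for i, t := range texts {
		if strings.TrimSpace(t) == "" {
			empty = append(empty, i+1)
		}
	}
	return empty
}

func main() {
	texts := []string{"Contract terms", "  \n", "", "Signatures"}
	empty := emptyPages(texts)
	fmt.Printf("%d of %d pages have no text: %v\n", len(empty), len(texts), empty)
	// prints "2 of 4 pages have no text: [2 3]"
}
```

If every page comes back empty, the file is probably a scan. A few empty pages in an otherwise textual document are more likely blank pages or full-page images.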
When to use PDF libraries
Use unidoc/unipdf when you need reliable text extraction and metadata reading, and you can handle the commercial license requirements for production. Use pdfcpu when you need an open-source solution for reading and modifying PDF structure, such as merging or splitting files. Use a command-line tool like pdftotext when you just need a quick extraction in a script and don't want to manage Go dependencies. Use plain binary parsing when you are building a custom parser for a specific subset of PDF features, though this is rarely worth the effort.
Pick the library that matches your license budget and feature needs. Don't roll your own PDF parser.