pypdftotext — Python PDF Text Extraction with OCR

Extract text from PDFs in Python using pypdf and Azure Document Intelligence OCR. Handles embedded text and scanned PDFs with automatic OCR fallback, batch processing, and S3 support.

pypdftotext is a Python library for PDF text extraction that uses pypdf's layout mode for embedded text and Azure Document Intelligence (Form Recognizer) for OCR when pages have little or no embedded text. Use it to extract text from PDF files, process scanned PDFs, run batch OCR, and read PDFs from AWS S3.

Features

Embedded text extraction — Fast extraction via pypdf layout mode
Automatic OCR fallback — Azure Document Intelligence when embedded text is missing or sparse
Scanned PDF support — OCR for image-based and scanned PDFs
Batch PDF processing — Process multiple PDFs with parallel OCR
Thread-safe API — Use PdfExtract in multi-threaded workflows
S3 support — Read PDFs directly from AWS S3 URIs (s3://bucket/key.pdf)
Image compression — Optional preprocessing to reduce file size and improve OCR
Handwritten text detection — Confidence scoring for handwritten content
Page splitting & clipping — Create child PDFs and extract page ranges
Flexible configuration — Env vars, constants, and per-instance config with inheritance

Installation

Basic Installation

pip install pypdftotext

Optional Dependencies

# S3 support (read PDFs from AWS S3)
pip install "pypdftotext[s3]"

# Image compression for scanned PDFs
pip install "pypdftotext[image]"

# All optional features (S3 + image)
pip install "pypdftotext[full]"

# Development (full + type stubs, pytest, coverage)
pip install "pypdftotext[dev]"

Requirements

Python 3.10, 3.11, or 3.12
pypdf 6.0
azure-ai-documentintelligence ≥ 1.0.0
tqdm (progress bars)
boto3 (optional, for S3)
pillow (optional, for image compression)

Quick Start

Enable Azure OCR (optional)

Without Azure OCR configured, pypdftotext returns only embedded text from the PDF (via pypdf layout mode). To support scanned PDFs and image-based pages, set up Azure Document Intelligence.

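Prerequisites: an Azure Document Intelligence resource, which provides the endpoint and subscription key used below.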

Configuration: Set endpoint and key via environment variables or the constants module (same pattern applies to AWS credentials for S3).

export AZURE_DOCINTEL_ENDPOINT="https://your-resource.cognitiveservices.azure.com/"
export AZURE_DOCINTEL_SUBSCRIPTION_KEY="your-subscription-key"

Or set the same values in Python via the constants module:

from pypdftotext import constants
constants.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
constants.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"

You can also set these (and other options) on each PdfExtract instance via its config attribute. See Configuration.
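Before constructing an extractor, it can help to confirm that both variables are actually set. The helper below is a minimal sketch using only the standard library; `azure_ocr_configured` is a hypothetical name and is not part of pypdftotext.

```python
import os
from collections.abc import Mapping

# Variable names as documented above.
REQUIRED_VARS = ("AZURE_DOCINTEL_ENDPOINT", "AZURE_DOCINTEL_SUBSCRIPTION_KEY")


def azure_ocr_configured(env: Mapping[str, str] = os.environ) -> bool:
    """Return True when every required variable is present and non-empty.

    Hypothetical helper for illustration; not a pypdftotext function.
    """
    return all(env.get(name, "").strip() for name in REQUIRED_VARS)


if __name__ == "__main__":
    if not azure_ocr_configured():
        print("Azure OCR is not configured; only embedded text will be returned.")
```

Passing a plain dict instead of `os.environ` makes the check easy to exercise in tests.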

Basic Usage

Create an extractor and get text:

from pypdftotext import PdfExtract

extract = PdfExtract("document.pdf")

# Full text
text = extract.text
print(text)

# Per-page text
for i, page_text in enumerate(extract.text_pages):
    print(f"Page {i + 1}: {page_text[:100]}...")

Optional: customize config per instance

extract.config.AZURE_DOCINTEL_ENDPOINT = "https://your-resource.cognitiveservices.azure.com/"
extract.config.AZURE_DOCINTEL_SUBSCRIPTION_KEY = "your-subscription-key"
extract.config.PRESERVE_VERTICAL_WHITESPACE = True

Compress images in scanned PDFs (requires pypdftotext[image]). Do this before accessing text or text_pages so OCR uses the compressed PDF.

extract.compress_images(
    white_point=200,       # Remove scanner artifacts (pixels 201–255 → white)
    aspect_tolerance=0.01,
    max_overscale=1.5,
)
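The white_point comment above describes a threshold: any pixel brighter than the white point becomes pure white. The sketch below shows that mapping on plain 8-bit grayscale values. It illustrates the idea only and is not how compress_images is implemented; `apply_white_point` is a hypothetical name.

```python
def apply_white_point(pixels: list[int], white_point: int = 200) -> list[int]:
    """Map every 8-bit grayscale value above white_point to 255 (white)."""
    return [255 if value > white_point else value for value in pixels]


row = [0, 120, 200, 201, 240, 255]
print(apply_white_point(row))  # [0, 120, 200, 255, 255, 255]
```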

Save corrected/compressed PDF:

from pathlib import Path
Path("compressed_corrected_document.pdf").write_bytes(extract.body)

Split and clip pages:

# First 10 pages as new PdfExtract (keeps config/metadata)
extract_child = extract.child((0, 9))

# PDF bytes for pages 1, 3, 5 (0-indexed: 0, 2, 4)
clipped_bytes = extract_child.clip_pages([0, 2, 4])
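Both calls above use 0-based indices, and the tuple passed to child appears to be an inclusive (first, last) range, since (0, 9) is described as the first 10 pages. The sketch below converts human-facing 1-based page numbers into those forms. Both helper names are hypothetical and not part of the library.

```python
def to_zero_based(pages: list[int]) -> list[int]:
    """Convert 1-based page numbers (as shown in a PDF viewer) to 0-based indices."""
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return [p - 1 for p in pages]


def first_n_pages(n: int) -> tuple[int, int]:
    """Inclusive 0-based (first, last) range covering the first n pages."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (0, n - 1)


print(first_n_pages(10))         # (0, 9)
print(to_zero_based([1, 3, 5]))  # [0, 2, 4]
```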

Configuration

PyPdfToTextConfig and PyPdfToTextConfigOverrides control behavior. Each new config is built in this order:

1. Load values from environment variables.
2. Inherit from the global constants (unless disabled).
3. Optionally apply a custom base config.
4. Apply overrides (overrides win over base and constants).

Disable inheritance from constants:

from pypdftotext import PyPdfToTextConfig, constants

# Globally:
constants.INHERIT_CONSTANTS = False

# Or for a single config:
config = PyPdfToTextConfig(overrides={"INHERIT_CONSTANTS": False})

OCR triggering: OCR runs when the share of “low-text” pages ≥ TRIGGER_OCR_PAGE_RATIO (default 0.99). A page is low-text if it has ≤ MIN_LINES_OCR_TRIGGER lines (default 1).

Example: trigger OCR when at least 50% of pages have 5 or fewer lines:

from pypdftotext import PdfExtract, PyPdfToTextConfig

config = PyPdfToTextConfig(
    MIN_LINES_OCR_TRIGGER=5,
    TRIGGER_OCR_PAGE_RATIO=0.5,
)
extract = PdfExtract("document.pdf", config=config)
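
The triggering rule described in this section can be sketched in plain Python. This is an illustration, not the library's implementation: `should_trigger_ocr` is a hypothetical name, and counting only non-blank lines is an assumption.

```python
def should_trigger_ocr(
    page_texts: list[str],
    min_lines: int = 1,        # mirrors MIN_LINES_OCR_TRIGGER
    page_ratio: float = 0.99,  # mirrors TRIGGER_OCR_PAGE_RATIO
) -> bool:
    """Return True when the share of low-text pages reaches page_ratio."""
    if not page_texts:
        return False

    def line_count(text: str) -> int:
        # Assumption: blank lines do not count as text lines.
        return sum(1 for line in text.splitlines() if line.strip())

    low_text = sum(1 for text in page_texts if line_count(text) <= min_lines)
    return low_text / len(page_texts) >= page_ratio


pages = ["", "Title only", "line 1\nline 2\nline 3"]
print(should_trigger_ocr(pages))                  # False: 2 of 3 pages are low-text
print(should_trigger_ocr(pages, page_ratio=0.5))  # True
```

With the defaults, a single text-rich page in a short document is enough to keep OCR from running, which is why lowering the ratio matters for mixed documents.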

Batch Processing

Process multiple PDFs with parallel OCR:

from pypdftotext.batch import PdfExtractBatch

pdfs = ["file1.pdf", "file2.pdf", "file3.pdf"]
# or by name: {"report": "report.pdf", "invoice": "invoice.pdf"}

batch = PdfExtractBatch(pdfs)
results = batch.extract_all()  # dict[str, PdfExtract]

for name, pdf_extract in results.items():
    print(f"{name}: {len(pdf_extract.text)} characters")

Embedded text is extracted first; OCR runs in parallel for PDFs that need it.
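
The two-phase flow described above (embedded text first, then OCR in parallel only where needed) follows a common pattern that can be sketched with the standard library. Everything below is a stand-in: `extract_embedded`, `run_ocr`, and `process_batch` are hypothetical, and nothing here reflects how PdfExtractBatch is actually implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_embedded(name: str) -> str:
    """Stand-in for fast local extraction; 'scan' files have no embedded text."""
    return "" if "scan" in name else f"embedded text of {name}"


def run_ocr(name: str) -> str:
    """Stand-in for a slow, network-bound OCR request."""
    return f"ocr text of {name}"


def process_batch(names: list[str], max_workers: int = 4) -> dict[str, str]:
    # Phase 1: embedded text for every file, sequentially.
    results = {name: extract_embedded(name) for name in names}

    # Phase 2: OCR in parallel, only for files that came back empty.
    needs_ocr = [name for name, text in results.items() if not text.strip()]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for name, text in zip(needs_ocr, pool.map(run_ocr, needs_ocr)):
            results[name] = text
    return results


print(process_batch(["report.pdf", "scan1.pdf", "scan2.pdf"]))
```

Threads fit this pattern because the OCR step mostly waits on network I/O rather than using the CPU.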

S3 and Optional Features

S3: Pass an S3 URI as the pdf argument (e.g. s3://bucket/path/file.pdf). Configure AWS credentials via env vars or programmatically (same style as Azure above). Requires pypdftotext[s3].
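
An S3 URI has the form s3://bucket/key. To validate input before handing it to the extractor, a URI can be split with the standard library as sketched below. `split_s3_uri` is a hypothetical helper, and how pypdftotext parses URIs internally is not stated here.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/file.pdf' into (bucket, key).

    Keys containing '?' or '#' would need extra handling, because urlparse
    treats those characters as query and fragment separators.
    """
    parsed = urlparse(uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a valid S3 object URI: {uri!r}")
    return parsed.netloc, key


print(split_s3_uri("s3://bucket/path/file.pdf"))  # ('bucket', 'path/file.pdf')
```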

Image compression: Use extract.compress_images(...) before reading text when you need smaller files or better OCR on scanned PDFs. Requires pypdftotext[image].

Implementation Details

Page indices are 0-based.
OCR is triggered by the ratio of low-text pages and line-count threshold (see Configuration).
Corruption detection: Pages over 25,000 characters are treated as corrupted and return empty text.
Progress: tqdm is used for progress bars; disable or position via config for scripts/logging.
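
The corruption rule above amounts to a length guard on each page's extracted text. A minimal sketch, assuming the limit applies to the raw character count of a single page (`guard_page_text` and the constant name are hypothetical, not library names):

```python
MAX_PAGE_CHARS = 25_000  # threshold quoted above; the constant name is hypothetical


def guard_page_text(page_text: str, limit: int = MAX_PAGE_CHARS) -> str:
    """Return the text unchanged, or an empty string if it exceeds the limit."""
    return "" if len(page_text) > limit else page_text


print(len(guard_page_text("x" * 25_000)))  # 25000 (at the limit, kept)
print(len(guard_page_text("x" * 25_001)))  # 0 (over the limit, dropped)
```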

Author & Contact

KuchikiRenji

GitHub: github.com/KuchikiRenji
Email: KuchikiRenji@outlook.com
Discord: kuchiki_renji

License

This project is licensed under the MIT License — see the LICENSE file for details.

Acknowledgments

Built on:

pypdf — PDF parsing and layout text extraction
Azure Document Intelligence — OCR and document understanding
