Open Source Python API to Add OCR to PDF Files
Free Python OCR API That Automates the OCR Process and Converts Scanned Image PDFs into Fully Searchable Documents.
What is OCRmyPDF?
OCRmyPDF is a versatile, open-source Python library and command-line tool, purpose-built to add Optical Character Recognition to existing PDF files. It analyzes each page to determine a suitable colorspace and resolution, so that no original content is lost. It accepts scanned PDFs and, by converting them first, single image files as input, and it operates on an "image plus text" principle to produce high-quality, searchable outputs while preserving the document's original structure and formatting.
Powered by the robust Tesseract OCR engine, which supports over 100 languages, OCRmyPDF delivers precise text recognition even from low-quality or distorted images. It enhances document quality with image processing features like deskew and employs PDF optimization techniques such as compression to reduce file size without sacrificing integrity. The library enables the straightforward generation of searchable PDF/A files and offers comprehensive features like text layer control and automated batch processing. This makes it an invaluable, efficient tool for businesses, researchers, and archivists managing large volumes of scanned documents.
Getting Started with OCRmyPDF
The recommended way to install OCRmyPDF is with pip. Use the following command for a smooth installation.
Install OCRmyPDF via pip
pip install ocrmypdf
You can also install it manually by downloading the latest release files directly from the GitHub repository.
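Installing the Python package with pip does not install the external programs OCRmyPDF relies on, such as Tesseract and Ghostscript. The following is a minimal sketch of a pre-flight check, assuming the executables are named ocrmypdf, tesseract, and gs (the Ghostscript name differs on Windows); the helper name find_missing_tools is hypothetical and not part of OCRmyPDF.

```python
import shutil


def find_missing_tools(tools=("ocrmypdf", "tesseract", "gs")):
    """Return the names in `tools` that cannot be found on the PATH.

    The default tool names are assumptions for a typical Linux/macOS setup.
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

An empty list means every listed executable was found; any names returned should be installed through your operating system's package manager before running OCRmyPDF.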
PDF optimization using Python API
The open source OCRmyPDF library provides several useful features for managing the size and quality of PDF documents from within Python applications. It applies PDF optimization techniques such as compression and downsampling to reduce file size while preserving as much quality as possible, so the resulting OCR-enabled PDF files are both efficient to store and quick to load. OCRmyPDF provides several optimization options that you can customize based on your requirements. Commonly used options include removing temporary files, applying JBIG2 compression, skipping the OCR step, and switching from lossless to lossy compression to maximize file size reduction.
How to Optimize PDF Files using Python API?
import subprocess

def optimize_pdf_with_ocrmypdf(input_pdf_path, output_pdf_path):
    try:
        # OCRmyPDF command with optimization options.
        # --optimize 0 disables optimization, 1 applies lossless optimizations,
        # 2 and 3 allow lossy optimizations and may need extra tools such as pngquant.
        command = ['ocrmypdf', '-l', 'eng', '--pdf-renderer', 'hocr',
                   '--optimize', '1', input_pdf_path, output_pdf_path]
        # Execute the OCRmyPDF command; check=True raises on a non-zero exit code
        subprocess.run(command, check=True)
        print("PDF optimization complete!")
    except subprocess.CalledProcessError as e:
        print(f"OCRmyPDF error: {e}")

# Example usage
input_pdf_path = 'input.pdf'
output_pdf_path = 'output.pdf'
optimize_pdf_with_ocrmypdf(input_pdf_path, output_pdf_path)
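The example above drives the command-line tool through subprocess. OCRmyPDF also exposes a Python function, ocrmypdf.ocr, that accepts the same settings as keyword arguments. The sketch below is one possible way to wrap it; the helper names optimization_kwargs and optimize_pdf are hypothetical, and the import is done lazily so the argument helper can be used without OCRmyPDF installed.

```python
def optimization_kwargs(level=1, language="eng"):
    """Build keyword arguments for ocrmypdf.ocr (hypothetical helper).

    level: 0 disables optimization, 1 is lossless, 2 and 3 allow lossy
    optimizations and may need extra tools installed.
    """
    if level not in (0, 1, 2, 3):
        raise ValueError("optimization level must be 0, 1, 2, or 3")
    return {"language": [language], "optimize": level}


def optimize_pdf(input_pdf_path, output_pdf_path, level=1, language="eng"):
    """Run OCR and optimization through the Python API instead of the CLI."""
    # Imported here so optimization_kwargs works even without OCRmyPDF installed.
    import ocrmypdf

    return ocrmypdf.ocr(
        input_pdf_path,
        output_pdf_path,
        **optimization_kwargs(level, language),
    )
```

Calling optimize_pdf('input.pdf', 'output.pdf', level=2) would request lossy optimization. Because OCRmyPDF uses multiple worker processes, it is generally safer to make this call from inside an if __name__ == "__main__": block.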
PDF Text Layer Integration via Python API
OCRmyPDF, an open-source library, provides a powerful way to integrate text layers into PDF files, improving document accessibility and searchability. The library adds a layer of OCR-generated text to each page of the PDF document while preserving the original layout. With this text layer in place, the PDF supports full-text search, copy and paste, and text extraction, which makes the document more usable and efficient to work with while its appearance stays unchanged.
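How the text layer is added matters most when some pages already contain text. The sketch below builds an ocrmypdf command line for three text-layer modes using the --skip-text, --redo-ocr, and --force-ocr flags, and optionally writes the recognized text to a separate file with --sidecar. The function names, the mode keywords, and the file names are assumptions made for illustration.

```python
import subprocess

# Maps an illustrative mode keyword to the corresponding ocrmypdf flag.
TEXT_LAYER_FLAGS = {
    "skip": "--skip-text",   # leave pages that already contain text untouched
    "redo": "--redo-ocr",    # replace a text layer produced by earlier OCR
    "force": "--force-ocr",  # rasterize every page and OCR it again
}


def build_text_layer_command(input_pdf, output_pdf, mode="skip",
                             sidecar_txt=None, language="eng"):
    """Return the ocrmypdf argument list for the chosen text-layer mode."""
    if mode not in TEXT_LAYER_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    command = ["ocrmypdf", "-l", language, TEXT_LAYER_FLAGS[mode]]
    if sidecar_txt:
        # Also write the recognized text to a plain-text file.
        command += ["--sidecar", sidecar_txt]
    command += [input_pdf, output_pdf]
    return command


def add_text_layer(input_pdf, output_pdf, **options):
    """Run ocrmypdf with the built command; raises if the tool reports failure."""
    subprocess.run(build_text_layer_command(input_pdf, output_pdf, **options),
                   check=True)
```

For example, add_text_layer('scan.pdf', 'searchable.pdf', mode='skip', sidecar_txt='scan.txt') would add a text layer only to pages without text and save the recognized text alongside the PDF.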