Render PDF Files, Extract Text & Images via Python Library
A free Python API for reading, writing, editing, and rendering PDF files. Extract text and images, edit PDF pages, and merge, split, or convert PDFs with ease.
What is the PyMuPDF Library?
PyMuPDF is a lightweight, open source Python API that adds Python bindings and abstractions to MuPDF. The API is small yet fast, and it supports a number of popular document formats, including PDF, XPS, OpenXPS, CBZ, EPUB, and FB2 (eBook) formats. About 10 popular image formats can also be opened and handled like documents. PyMuPDF is reliable and known for its rendering quality. Because the library is so lightweight, it is a good choice for platforms where resources are usually limited, such as smartphones.
The PyMuPDF API supports numerous basic and advanced features for PDF rendering and conversion, such as converting PDF to PNG, accessing and viewing metadata, working with outlines, rendering a page into a raster or vector (SVG) image, searching PDF text, extracting text and images from PDF pages, displaying images in GUIs, modifying PDF pages, creating new PDF pages, deleting unwanted PDF pages, embedding data, and so on. PyMuPDF runs on numerous platforms, including macOS, Linux, and Windows.
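As an illustration of the PDF-to-PNG conversion mentioned above, here is a minimal sketch. The function names, the output file naming, and the default of 150 dpi are assumptions made for this example. It relies on Page.get_pixmap() accepting a dpi argument, which recent PyMuPDF releases do.

```python
import os


def render_pages(doc, out_dir, dpi=150):
    """Render every page of `doc` to <out_dir>/page-<n>.png and return the paths.

    `doc` is expected to behave like a PyMuPDF Document: iterating it yields
    pages that offer `number` and `get_pixmap()`.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the given resolution
        path = os.path.join(out_dir, f"page-{page.number + 1}.png")
        pix.save(path)  # output format is derived from the .png extension
        paths.append(path)
    return paths


def render_pdf(pdf_path, out_dir, dpi=150):
    """Open `pdf_path` (a file name supplied by the caller) and render all pages."""
    import fitz  # PyMuPDF; imported lazily so render_pages() has no hard dependency

    with fitz.open(pdf_path) as doc:
        return render_pages(doc, out_dir, dpi)
```

Calling render_pdf("mydocument.pdf", "png-output") would write one PNG per page. Both names here are placeholders.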
Getting Started with PyMuPDF
PyMuPDF can be installed using pip. The following commands install from a Python wheel if one is available for your platform.
Install PyMuPDF via pip
python -m pip install --upgrade pip
python -m pip install --upgrade pymupdf
Clone PyMuPDF via git Repository
git clone https://github.com/pymupdf/PyMuPDF.git
It is also possible to install it manually: download the latest release files directly from the GitHub repository.
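Whichever installation route is used, it can help to confirm that the package is visible to the interpreter. The sketch below uses only the standard library. The helper name installed_version is an assumption for this example, and "PyMuPDF" is the distribution name as published on PyPI.

```python
from importlib import metadata


def installed_version(distribution="PyMuPDF"):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("PyMuPDF is not installed in this environment")
else:
    print(f"PyMuPDF {version} is installed")
```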
Searching for Text in PDF Files via Python
PDF is one of the world's favorite file formats for sharing documents across the internet because it retains all the text formatting and graphics inside it. However, searching for text inside PDF files is not as easy as in other document types. The free PyMuPDF library allows software developers to add text-searching capabilities to their Python applications. It can report where on a page a given text string appears.
Search Where on the PDF Page Text String Appears via Python
# "page" is a page of an already opened document; the result is a list of rectangles, one per hit
areas = page.search_for("mupdf")
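The one-liner above assumes that a page object already exists. A slightly fuller sketch that searches a whole document could look like this. The function names find_text and find_text_in_file are assumptions for illustration, and the result layout (page number plus rectangle coordinates) is just one possible choice.

```python
def find_text(doc, needle):
    """Return (page_number, (x0, y0, x1, y1)) for every hit of `needle` in `doc`.

    `doc` is expected to behave like a PyMuPDF Document: iterating it yields
    pages, and each page offers `number` and `search_for()`.
    """
    hits = []
    for page in doc:
        for rect in page.search_for(needle):
            # A rectangle unpacks into its four coordinates.
            hits.append((page.number, tuple(rect)))
    return hits


def find_text_in_file(path, needle):
    """Open `path` (a file name supplied by the caller) and search every page."""
    import fitz  # PyMuPDF; imported lazily so find_text() has no hard dependency

    with fitz.open(path) as doc:
        return find_text(doc, needle)
```

Page numbers in the result are 0-based, matching the way pages are indexed in the document.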
Extracting PDF Text and Images via Python API
The open source PyMuPDF library includes several important features for working with PDF text and images, and provides various functions for extracting both from PDF documents. By default, it extracts plain text with line breaks: no formatting, no text position details, and no images. It also supports generating a list of text blocks, generating a list of words, creating a full visual version of the page including any images, and more.
How to Extract Text from PDF via Python API?
from operator import itemgetter
from itertools import groupby
import fitz  # PyMuPDF
doc = fitz.open('mydocument.pdf')
for page in doc:
    # Each word is a tuple: (x0, y0, x1, y1, text, block_no, line_no, word_no)
    text_words = page.get_text("words")
    # Order the words by y1 (bottom edge), then by x0 (left edge)
    sorted_words = sorted(text_words, key=itemgetter(3, 0))
    # At this point you already have an ordered list. If you need to
    # group the content by lines, use groupby with y1 as a key
    lines = groupby(sorted_words, key=itemgetter(3))
    for y1, line_words in lines:
        print(" ".join(word[4] for word in line_words))
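Grouping by the exact y1 value, as above, can split one visual line into several when the bottom edges of its words differ by a fraction of a point. A possible refinement is to group with a small tolerance. The function name words_to_lines and the default tolerance of 2 points are assumptions for this sketch. The input is the word-tuple layout (x0, y0, x1, y1, text, ...) produced by the word extraction.

```python
def words_to_lines(words, tolerance=2.0):
    """Group word tuples into text lines, top to bottom, left to right.

    Each word is a tuple starting with (x0, y0, x1, y1, text). A word joins
    the current line when its bottom edge (y1) lies within `tolerance` points
    of the bottom edge of that line's first word.
    """
    lines = []  # list of lists of word tuples
    for word in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(word[3] - lines[-1][0][3]) <= tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order by x0 before joining the text.
    return [
        " ".join(w[4] for w in sorted(line, key=lambda w: w[0]))
        for line in lines
    ]


sample = [
    (50, 10, 80, 20.4, "world", 0, 0, 1),
    (10, 10, 40, 20.0, "Hello", 0, 0, 0),
    (10, 30, 40, 40.0, "Next", 0, 1, 0),
]
print(words_to_lines(sample))  # ['Hello world', 'Next']
```

With a real page, the same function can be fed the word list directly, for example words_to_lines(page.get_text("words")).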
Join and Split PDF Documents in Python Apps
Combining different PDF files is a very useful feature that gives users the ability to have one PDF rather than having a dozen separate PDFs. The free and open-source cross-platform PyMuPDF library gives software programmers the power to merge different files or copy pages between different PDF documents with ease. It also gives users the power to split large PDF documents into smaller files with just a couple of lines of Python code. It is also possible to select some specific pages of a PDF document and create a new document out of it.
How to Create New Document From First & Last 10 Pages via Python?
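import fitz  # PyMuPDF
doc1 = fitz.open("source.pdf")  # placeholder name for an existing PDF with at least 20 pages
doc2 = fitz.open()  # new empty PDF
doc2.insert_pdf(doc1, to_page=9)  # first 10 pages
doc2.insert_pdf(doc1, from_page=len(doc1) - 10)  # last 10 pages
doc2.save("first-and-last-10.pdf")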
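Splitting a large PDF into smaller files follows the same pattern. The sketch below is one way to do it, not the library's own recipe: the names chunk_ranges and split_pdf and the "part-<n>.pdf" naming scheme are assumptions. It uses the same insert_pdf call as above, with 0-based, inclusive page numbers.

```python
def chunk_ranges(page_count, chunk_size):
    """Return inclusive, 0-based (first, last) page ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write consecutive chunks of `path` to files named <prefix>-<n>.pdf."""
    import fitz  # PyMuPDF; imported lazily so chunk_ranges() has no hard dependency

    written = []
    with fitz.open(path) as src:
        ranges = chunk_ranges(len(src), chunk_size)
        for n, (first, last) in enumerate(ranges, start=1):
            part = fitz.open()  # new empty PDF
            part.insert_pdf(src, from_page=first, to_page=last)
            name = f"{prefix}-{n}.pdf"
            part.save(name)
            part.close()
            written.append(name)
    return written


print(chunk_ranges(25, 10))  # [(0, 9), (10, 19), (20, 24)]
```

Keeping the range calculation separate from the file handling makes the page arithmetic easy to check on its own.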
Read & Export PDF Metadata to CSV via Python
The open source PyMuPDF library provides complete functionality for accessing and reading the metadata of PDF files without any external dependencies. It supports various metadata keys, such as creation date, author, title, creator application, subject, encryption method, file format, and so on. It is also possible to export metadata to CSV format.
How to Update PDF Metadata from a CSV File via Python API?
import csv
import fitz  # PyMuPDF
import argparse
parser = argparse.ArgumentParser(description="Enter CSV delimiter [;], CSV filename and document filename")
parser.add_argument('-d', help='CSV delimiter [;]', default=';')
parser.add_argument('-x', help='delete XML info [n]', default='n')
parser.add_argument('-csv', help='CSV filename')
parser.add_argument('-pdf', help='PDF filename')
args = parser.parse_args()
delim = args.d  # requested CSV delimiter character
assert args.csv, "missing CSV filename"
assert args.pdf, "missing PDF filename"
print("delimiter", args.d)
print("xml delete", args.x)
print("csv file", args.csv)
print("pdf file", args.pdf)
print("----------------------------------------")
doc = fitz.open(args.pdf)
oldmeta = doc.metadata
print("old metadata:")
for k, v in oldmeta.items():
    print(k, ":", v)
# Each CSV row is a key/value pair that overrides the matching metadata entry
with open(args.csv, newline="") as tocfile:
    tocreader = csv.reader(tocfile, delimiter=delim)
    for row in tocreader:
        assert len(row) == 2, "each row must contain 2 entries"
        oldmeta[row[0]] = row[1]
print("----------------------------------------")
print("\nnew metadata:")
for k, v in oldmeta.items():
    print(k, ":", v)
doc.set_metadata(oldmeta)
doc.saveIncr()  # save the changes incrementally to the same file
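The script above reads key/value pairs from a CSV file and writes them into the PDF. For the opposite direction, exporting existing metadata to CSV, a minimal sketch might look like the following. The function names, the one-row-per-key layout, and the ";" default delimiter are assumptions chosen to mirror the script above.

```python
import csv
import io


def write_metadata_csv(metadata, stream, delimiter=";"):
    """Write one key/value row per metadata entry to the text stream."""
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    for key, value in metadata.items():
        # Missing values are written as empty fields.
        writer.writerow([key, "" if value is None else value])


def export_pdf_metadata(pdf_path, csv_path, delimiter=";"):
    """Read the metadata dictionary of `pdf_path` and save it to `csv_path`."""
    import fitz  # PyMuPDF; imported lazily so write_metadata_csv() has no hard dependency

    with fitz.open(pdf_path) as doc:
        with open(csv_path, "w", newline="", encoding="utf-8") as out:
            write_metadata_csv(doc.metadata, out, delimiter)


# Small demonstration with a hand-made dictionary instead of a real PDF.
sample = {"title": "Report", "author": None}
buffer = io.StringIO()
write_metadata_csv(sample, buffer)
print(buffer.getvalue())
```

A file produced this way has the two-column layout that the update script expects, so the two can be used as a round trip.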