Develop Apps to Work with PDFs via Python Library
Open source Python API for splitting, merging, cropping, and transforming the pages of PDF files, and for adding custom data and passwords to PDFs.
What is PyPDF2 Library?
PyPDF2 is an open source, pure-Python library for working with PDF files inside Python applications without any external dependencies. It supports many important PDF features, such as merging multiple PDF files, extracting the content of a PDF file, rotating pages by an angle, scaling and transforming pages, extracting images from pages, and more.
PyPDF2 is easy to use, and its source code is well documented and easy to understand. The library lets developers read PDF file details such as the number of pages, author, creator, and creation and last-modified times. It also supports encrypting and decrypting PDF files with just a couple of lines of Python code.
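As a rough illustration of the encryption support mentioned above, the sketch below password-protects a copy of a PDF and opens it again. The function names encrypt_pdf and open_protected and the file names in the comments are hypothetical; the sketch assumes a PyPDF2 release that provides PdfReader and PdfWriter, as the other examples on this page do.

```python
def encrypt_pdf(src, dst, password):
    """Copy every page of src into dst and protect dst with a password."""
    # Imported inside the function so this module still loads where
    # PyPDF2 is not installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    # The password is passed positionally; it is the password a user
    # must supply to open the file.
    writer.encrypt(password)
    with open(dst, "wb") as fh:
        writer.write(fh)


def open_protected(path, password):
    """Return a PdfReader for path, decrypting it first if it is encrypted."""
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    if reader.is_encrypted:
        reader.decrypt(password)
    return reader


# Example calls (hypothetical file names and password):
# encrypt_pdf("example.pdf", "example-encrypted.pdf", "my-secret-password")
# reader = open_protected("example-encrypted.pdf", "my-secret-password")
# print(len(reader.pages))
```

Copying the pages into a new writer leaves the original file untouched, which is usually the safer choice when experimenting with passwords.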
Getting Started with PyPDF2
PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.
Install PyPDF2 via pip
python -m pip install pypdf2
Extract Text from PDF via Python
The PyPDF2 library can extract text from PDF files programmatically. Retrieving data from a PDF is not easy, because the format stores information in a way that makes it hard to recover. PyPDF2 simplifies the job with easy-to-use built-in functions: call the extract_text() method on a page object to get the text content of that page (older releases spelled this method extractText()).
How to Extract Text from PDF via Python API?
# Extract text from the first page of a PDF
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
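Building on the single-page example above, the following sketch collects the text of every page and reports which pages mention a keyword. The helper names extract_all_text and pages_containing, and the file name in the comments, are hypothetical and not part of PyPDF2. Text extraction quality depends on how the PDF was produced, so treat the result as a best effort.

```python
def pages_containing(page_texts, keyword):
    """Return 1-based page numbers whose text contains keyword (case-insensitive)."""
    needle = keyword.lower()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # A page with no extractable text may yield an empty value.
        if needle in (text or "").lower()
    ]


def extract_all_text(path):
    """Return a list holding the extracted text of each page in the PDF at path."""
    # Imported inside the function so the helper above can be used
    # without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return [page.extract_text() for page in reader.pages]


# Example calls (hypothetical file name):
# texts = extract_all_text("example.pdf")
# print(pages_containing(texts, "invoice"))
```

Keeping the search separate from the extraction makes it possible to extract once and run several searches over the same list of page texts.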
Reading PDF Files via Python Library
PyPDF2 can also read other content stored in a PDF, not just the page text. Each page object behaves like a dictionary of PDF entries, so you can check for keys such as /Annots and walk through the annotations attached to a page. The example below reads a PDF page by page and prints the contents of every text annotation it finds.
How to Read PDF File via Python Library?
# Read the text annotations from every page of a PDF
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
for page in reader.pages:
    # Pages without annotations have no "/Annots" entry
    if "/Annots" in page:
        for annot in page["/Annots"]:
            subtype = annot.get_object()["/Subtype"]
            if subtype == "/Text":
                print(annot.get_object()["/Contents"])
Merge or Split PDF Documents via Python
Have you ever needed to merge two or more PDF files into a single document? Organizations often need to do exactly that. The PyPDF2 library can combine PDF files with just a couple of lines of Python code. Developers can also split large PDF documents into smaller ones according to their needs, for example by extracting a specific part of a PDF book or dividing it into multiple PDFs.
How to Merge PDF Files via Python Library?
# Merge several PDF files into one
from PyPDF2 import PdfMerger
merger = PdfMerger()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    merger.append(pdf)
merger.write("merged-pdf.pdf")
merger.close()
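This section also mentions splitting. One possible approach, shown here as a minimal sketch, is to copy consecutive groups of pages into separate PdfWriter objects. The names page_chunks and split_pdf, the "part" prefix, and the file name in the comments are hypothetical choices for illustration.

```python
def page_chunks(total_pages, chunk_size):
    """Return (start, stop) index pairs covering total_pages in chunk_size steps."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write consecutive chunk_size-page slices of path to prefix-N.pdf files."""
    # Imported inside the function so page_chunks can be used
    # without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    outputs = []
    chunks = page_chunks(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(chunks, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        out_name = f"{prefix}-{number}.pdf"
        with open(out_name, "wb") as fh:
            writer.write(fh)
        outputs.append(out_name)
    return outputs


# Example call (hypothetical file name): split into 10-page files
# print(split_pdf("book.pdf", 10))
```

The last chunk is simply shorter when the page count is not a multiple of the chunk size, so no pages are dropped.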
Extract Metadata from PDF Files
The PyPDF2 library can extract metadata from PDF documents with a couple of Python commands. You can get information such as the author, the creator application, the producer, the subject, and the document title, along with the number of pages, and use it according to your needs. Any of these fields may be missing from a given file.
How to Extract Metadata from PDF via Python?
# Read PDF metadata
from PyPDF2 import PdfReader
reader = PdfReader("example.pdf")
meta = reader.metadata
print(len(reader.pages))
# All of the following could be None!
print(meta.author)
print(meta.creator)
print(meta.producer)
print(meta.subject)
print(meta.title)
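Because any of these fields can be None, and the metadata object itself may be absent, it can help to normalize the values before using them. The sketch below is one way to do that; metadata_summary, read_metadata, FIELDS, and the "(not set)" placeholder are hypothetical names chosen for this example.

```python
FIELDS = ("author", "creator", "producer", "subject", "title")


def metadata_summary(meta, page_count, missing="(not set)"):
    """Build a plain dict from a metadata object, filling gaps with missing."""
    summary = {"pages": page_count}
    for field in FIELDS:
        # meta itself may be None, and so may any single field.
        value = getattr(meta, field, None) if meta is not None else None
        summary[field] = value if value is not None else missing
    return summary


def read_metadata(path):
    """Return a metadata summary dict for the PDF at path."""
    # Imported inside the function so metadata_summary can be used
    # without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    return metadata_summary(reader.metadata, len(reader.pages))


# Example call (hypothetical file name):
# for key, value in read_metadata("example.pdf").items():
#     print(f"{key}: {value}")
```

Returning a plain dictionary keeps the rest of an application independent of the library's metadata class.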