Develop Apps to Work with PDFs via Python Library

An open source Python API for splitting, merging, cropping, and transforming the pages of PDF files, and for adding custom data and passwords to PDFs.

PyPDF2 is an open source, pure-Python library for working with PDF files inside Python applications without any external dependencies. It supports many important PDF features, such as merging multiple PDF files, extracting the content of a PDF file, rotating pages by an angle, scaling pages, transforming pages, extracting images from pages, and more.

PyPDF2 is easy to use, and its source code is well documented and easy to understand. The library lets developers read PDF metadata such as the number of pages, the author, the creator, and the creation and last-updated times. It also supports encrypting and decrypting PDF files with just a couple of lines of Python code.


Getting Started with PyPDF2

PyPDF2 doesn’t come as a part of the Python Standard Library, so you will need to install it yourself. The preferred way to do so is to use pip.

Install PyPDF2 via pip

 python -m pip install pypdf2  
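
To confirm that the installation succeeded, one option is to ask Python for the installed version of the distribution. This is a minimal sketch using the standard library's importlib.metadata module; the helper name installed_version is an illustrative choice and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string if PyPDF2 is installed, otherwise None
print(installed_version("PyPDF2"))
```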

Extract Text from PDF via Python

PyPDF2 can extract text from PDF files programmatically. Retrieving data from a PDF is not easy, because of the way the format stores information. PyPDF2 makes the developer's job easier by providing easy-to-use built-in functions for retrieving information: call the extract_text() method (named extractText() in older releases) on a page object to get the text content of that page.

Extract Text from PDF via Python

# Extract text from the first page of a PDF
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]
print(page.extract_text())
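
The example above reads only the first page. To collect the text of a whole document, the same call can be repeated for every page. The sketch below assumes a hypothetical helper, extract_all_text, which accepts any object that exposes a pages sequence whose items have an extract_text() method, as a PdfReader does.

```python
def extract_all_text(reader, separator="\n"):
    """Join the extracted text of every page in a reader-like object."""
    # Treat an empty or missing result as an empty string.
    return separator.join(page.extract_text() or "" for page in reader.pages)


# Intended use with PyPDF2 (requires the library and a real file):
#   from PyPDF2 import PdfReader
#   print(extract_all_text(PdfReader("example.pdf")))
```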

Reading PDF Files via Python

PyPDF2 can also read other content stored in a PDF, such as annotations. A page object behaves like a dictionary: if it contains an "/Annots" entry, you can loop over the annotations, resolve each one with get_object(), and check its "/Subtype". The example below prints the contents of every text annotation in the document.

Reading PDF File via Python

# Read the text annotations from a PDF
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")

for page in reader.pages:
    # Pages without annotations have no "/Annots" entry
    if "/Annots" in page:
        for annot in page["/Annots"]:
            obj = annot.get_object()
            if obj["/Subtype"] == "/Text":
                print(obj["/Contents"])
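
Not every annotation carries a "/Contents" entry, so indexing it directly can raise a KeyError. A slightly more defensive variant, sketched here with a hypothetical helper named text_annotation_contents, collects the contents into a list and skips annotations that have none.

```python
def text_annotation_contents(pages):
    """Collect the "/Contents" of every "/Text" annotation on the given pages."""
    contents = []
    for page in pages:
        if "/Annots" not in page:
            continue  # this page has no annotations
        for ref in page["/Annots"]:
            annot = ref.get_object()  # resolve the reference to the annotation
            if annot.get("/Subtype") == "/Text" and "/Contents" in annot:
                contents.append(annot["/Contents"])
    return contents


# Intended use with PyPDF2 (requires the library and a real file):
#   from PyPDF2 import PdfReader
#   print(text_annotation_contents(PdfReader("example.pdf").pages))
```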

Merge or Split PDF Documents

Have you ever needed to merge two or more PDF files into a single document? Organizations often need to combine multiple PDF files into one. PyPDF2 can do this with just a couple of lines of Python code. Developers can also split large PDF documents into smaller ones as needed, for example by extracting a specific part of a PDF book or dividing it into multiple PDFs.

Merge PDF Files via Python

# Merge several PDF files into one
from PyPDF2 import PdfMerger

merger = PdfMerger()

# Append each input file in order
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    merger.append(pdf)

merger.write("merged-pdf.pdf")
merger.close()
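
This section describes splitting as well as merging. One way to split a document is to copy runs of pages from a PdfReader into separate PdfWriter objects. The following is a sketch under assumptions: the helper names chunk_ranges and split_pdf, the chunk size, and the output file naming are illustrative choices, not something PyPDF2 prescribes.

```python
def chunk_ranges(num_pages, chunk_size):
    """Return (start, stop) page-index pairs covering num_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, num_pages))
        for start in range(0, num_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write each chunk of pages to its own file and return the file names."""
    # Imported here so chunk_ranges can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(path)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    outputs = []
    for number, (start, stop) in enumerate(ranges, 1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        name = f"{prefix}-{number}.pdf"  # hypothetical naming scheme
        with open(name, "wb") as handle:
            writer.write(handle)
        outputs.append(name)
    return outputs
```

For example, calling split_pdf("book.pdf", 10) on a hypothetical book.pdf would write part-1.pdf, part-2.pdf, and so on, each holding up to ten pages.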

Extract Metadata from PDF Files

PyPDF2 can extract metadata from PDF documents with a couple of Python commands. You can get the author, the creator application, the number of pages, the document title, creation dates, and so on, and then use this information as needed.

 

Extract Metadata from PDF via Python

# Read PDF metadata
from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")

meta = reader.metadata

# Number of pages in the document
print(len(reader.pages))

# All of the following could be None!
print(meta.author)
print(meta.creator)
print(meta.producer)
print(meta.subject)
print(meta.title)
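
Because the metadata fields may be missing, it can be convenient to gather them into a plain dictionary with a fallback value. The sketch below assumes a hypothetical helper, summarize_metadata, that works on any reader-like object with pages and metadata attributes; the field list mirrors the attributes printed above, and the "(not set)" placeholder is an arbitrary choice.

```python
FIELDS = ("author", "creator", "producer", "subject", "title")


def summarize_metadata(reader, default="(not set)"):
    """Return the page count and common metadata fields as a dictionary."""
    meta = reader.metadata  # assumed to be None when the file has no metadata
    summary = {"pages": len(reader.pages)}
    for field in FIELDS:
        value = getattr(meta, field, None) if meta is not None else None
        summary[field] = default if value is None else value
    return summary


# Intended use with PyPDF2 (requires the library and a real file):
#   from PyPDF2 import PdfReader
#   print(summarize_metadata(PdfReader("example.pdf")))
```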