Create & Convert PDF to Docx via Open Source Python Library

Free Python API for Creating and Converting PDF Documents to DOCX, and for Parsing and Re-creating Page Layouts and Paragraphs.

What is pdf2docx Library?

There are many Python libraries for PDF document creation and processing, and Python is widely used for this kind of work because it keeps development quick and simple. pdf2docx is one such open source Python library; it enables developers to create and convert PDF documents to the Word DOCX file format with ease. The library is simple to use and also includes a basic GUI that gives users easy access to its features.

The pdf2docx library includes a range of features for handling PDF operations: accessing PDF documents, converting PDF to other file formats, parsing and re-creating page layout (with page margin support), extracting meta-information, extracting text from PDF files, parsing and re-creating paragraphs, inserting text into PDF, list styles support, parsing and re-creating images (including transparent images), parsing and re-creating tables (including merged cells, tables with partly hidden borders, and nested tables), parsing pages with multi-processing, and more.

At A Glance

An overview of pdf2docx features.

Features Overview

Create PDF
Convert PDF to DOCX
Re-create page layout
List styles support
Re-create table
Extract text from PDF
Parse & Re-create table
Multi-processing support
Nested tables support
Font embedding
Convert specified pages
Transparent image
Convert encrypted PDF

pdf2docx

pdf2docx supports PDF file format as well as industry-standard formats for export.

Reader

Writer

TXT, HTML


Platform Independence

pdf2docx is tested with Python 3.8 and higher.



Getting Started with pdf2docx

pdf2docx is easy to install. The preferred way is to use pip; run the following command for a quick installation.

Install pdf2docx via pip

 pip install pdf2docx

It is also possible to install it manually: download the latest release files directly from the GitHub repository.
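After installing, it can be useful to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of pdf2docx.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("pdf2docx")
print(f"pdf2docx {found}" if found else "pdf2docx is not installed")
```

importlib.metadata is available from Python 3.8, which matches the minimum version listed above.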

Convert PDF File to Docx via Python API

The open source pdf2docx library converts PDF files to the Docx file format with just a couple of lines of Python code. You can convert all pages of a document or select specific pages and convert only those. The library can also open and convert password-protected PDF documents from inside Python applications. Multi-processing is supported as well, but it only works for a continuous range of PDF pages specified by start and end.

How to Convert All Pages of a PDF via Python API?

from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'
docx_file = '/path/to/sample.docx'

# convert pdf to docx
cv = Converter(pdf_file)
cv.convert(docx_file)      # all pages by default
cv.close()
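If the conversion raises an exception, the close() call above is never reached. The sketch below wraps the same calls in try/finally and also passes a password for protected files. The helper names default_docx_path and convert_pdf are hypothetical, and passing the password as the second Converter argument is an assumption based on the library's documented usage.

```python
from pathlib import Path


def default_docx_path(pdf_file):
    """Derive the output path by swapping the .pdf suffix for .docx."""
    return Path(pdf_file).with_suffix(".docx")


def convert_pdf(pdf_file, docx_file=None, password=None):
    """Convert every page of pdf_file and always release the file handle."""
    # Imported lazily so default_docx_path() works even without pdf2docx installed.
    from pdf2docx import Converter

    target = Path(docx_file) if docx_file else default_docx_path(pdf_file)
    # Assumption: the password is accepted as the second positional argument.
    cv = Converter(str(pdf_file), password)
    try:
        cv.convert(str(target))  # all pages by default
    finally:
        cv.close()  # runs even if convert() fails
    return target
```

For example, convert_pdf('/path/to/sample.pdf') would write the result next to the source file as sample.docx.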

Convert Specified PDF Pages to Docx via Python


# cv is a Converter instance created as in the previous example;
# page indices are zero-based

# convert from the second page (start=1) to the end (default)
cv.convert(docx_file, start=1)

# convert from the first page (default) up to, but not including, index 3,
# i.e. the first three pages
cv.convert(docx_file, end=3)

# alternatively, list individual pages:
# convert the first, third and fifth pages
cv.convert(docx_file, pages=[0, 2, 4])
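Because start is zero-based and end is excluded, it is easy to be off by one when working from the page numbers shown in a PDF viewer. The sketch below maps a 1-based, inclusive page range to start and end, then uses it for a multi-processing conversion, which requires a continuous range. The helper names are hypothetical; multi_processing=True is assumed from the library's documented usage.

```python
def page_range_kwargs(first_page, last_page):
    """Map a 1-based inclusive page range to zero-based start / excluded end."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("expected 1 <= first_page <= last_page")
    return {"start": first_page - 1, "end": last_page}


def convert_range_parallel(pdf_file, docx_file, first_page, last_page):
    """Convert a continuous page range using multiple processes."""
    from pdf2docx import Converter  # imported lazily; needs pdf2docx installed

    cv = Converter(pdf_file)
    try:
        # Assumption: multi_processing is a keyword argument of convert().
        cv.convert(docx_file, multi_processing=True,
                   **page_range_kwargs(first_page, last_page))
    finally:
        cv.close()


print(page_range_kwargs(2, 5))  # pages 2 to 5 as shown in a viewer
```

On platforms where Python starts worker processes by spawning, such as Windows, a call like convert_range_parallel(...) generally needs to sit under an if __name__ == '__main__': guard.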

Extract Table from PDF via Python API

Sometimes we need to extract specific data from a PDF file. The free pdf2docx library allows users to extract tables from PDF files without any additional tools. To do so, call the extract_tables() method of the Converter class. The following examples show how to extract tables, first from a single page and then from a whole PDF file.

How to Extract PDF Table via Python API?

from pdf2docx import Converter

pdf_file = '/path/to/sample.pdf'

cv = Converter(pdf_file)
tables = cv.extract_tables(start=0, end=1)
cv.close()

for table in tables:
    print(table)
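Printing the raw lists is fine for inspection, but extracted tables are usually saved or passed on. The sketch below turns one table into CSV text with the standard library. It assumes each table is a list of rows and each row a list of cell strings, with None possibly standing in for cells swallowed by a merge; the helper name table_to_csv is hypothetical.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty or merged cells may arrive as None.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hand-written sample data in the assumed shape, not real library output.
sample = [["Name", "Qty"], ["Widget", None]]
print(table_to_csv(sample))
```

Each table returned by extract_tables() could then be passed through table_to_csv() and written to its own file.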

How to Extract All Tables from PDF via Python API?

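from pdf2docx import Converter
cv = Converter('/path/to/sample.pdf')
# start and end are zero-based integer page indices; omitting them covers every page
extracted_tables = cv.extract_tables()
cv.close()
for table in extracted_tables:
    print(table)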