Python API to Convert Word DOCX Content into Web-Ready HTML

Open source Python library that allows software developers to read and convert Microsoft Word DOCX content into web-ready HTML inside Python apps.

What is Python-Mammoth?

Document conversion is a common requirement for developers building applications that work with text. Whether the project is an e-learning platform, a document automation tool, or a content management system (CMS), a smooth transition between file formats ensures compatibility and saves time. One powerful library in this space is Python-Mammoth, an open source Python library designed specifically for converting Microsoft Word (DOCX) documents into clean, semantic HTML. Its features include semantic HTML output, extraction of images from DOCX files, custom style mappings, warnings about unsupported elements or potential formatting issues, and easy integration with Python-based applications.

Developed by Michael Williamson, Python-Mammoth focuses on extracting the essential content from DOCX documents and converting it into well-structured HTML. Its primary goal is to produce output free of unnecessary inline styles and cluttered markup. Unlike many other document conversion tools, it prioritizes simplicity and accuracy, preserving document semantics such as headings, paragraphs, and lists rather than aiming for a pixel-perfect representation. The library can also be used to generate clean, consistent HTML reports from Word templates. Its focus on simplicity, clean output, and extensibility makes it an excellent choice for developers seeking a document conversion solution.

At A Glance

An overview of Python-Mammoth features.

Features Overview

Convert DOCX to HTML
Add Paragraphs
Add Table
Extract image
Add Heading
Page Break Support
Set Colors
Text Alignment
Bookmarks Support

Python-Mammoth

Python-Mammoth supports the document file formats listed below.

Reader

DOCX

Writer

HTML, Markdown, TXT

Platform Independence

Python-Mammoth requires Python 2.6 or above.

Python 2.6, 2.7, 3.3, or 3.4
lxml >= 2.3.2

Getting Started with Python-Mammoth

Python-Mammoth is hosted on PyPI, so it is very simple to install. It can be installed with pip using the following command.

Install Python-Mammoth via pip command

 pip install mammoth

Word DOCX to HTML Conversion via Python

The open source Python-Mammoth library makes it easy for software developers to load Microsoft Word DOCX files and convert them into HTML inside Python applications. One of the standout features of the library is its ability to produce clean, semantic HTML output. It avoids embedding unnecessary inline styles or proprietary tags, so the final HTML remains lightweight and easy to style with CSS. The following example shows how DOCX content is converted into HTML, ready to be displayed or styled further.

How to Convert DOCX Content into HTML via Python API?

import mammoth

with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file)
    html = result.value # The generated HTML
    messages = result.messages # Any messages, such as warnings during conversion
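
Mammoth returns an HTML fragment rather than a complete page, and each entry in result.messages carries a type and a message. The sketch below shows one possible way to turn those two results into something usable: a standalone page and a readable list of messages. The helper names wrap_fragment and format_messages, and the FakeMessage stand-in used for the demonstration, are illustrative assumptions and not part of the Mammoth API.

```python
from collections import namedtuple
from html import escape


def wrap_fragment(fragment, title="Converted document"):
    """Wrap an HTML fragment in a minimal standalone HTML5 page."""
    page = (
        "<!DOCTYPE html>\n"
        "<html>\n"
        '<head><meta charset="utf-8"><title>{0}</title></head>\n'
        "<body>\n{1}\n</body>\n"
        "</html>\n"
    )
    # Only the title is escaped; the fragment is already HTML.
    return page.format(escape(title), fragment)


def format_messages(messages):
    """Return one readable line per conversion message."""
    return ["{0}: {1}".format(m.type, m.message) for m in messages]


# Hypothetical stand-in for Mammoth's message objects, used only for this demo.
FakeMessage = namedtuple("FakeMessage", ["type", "message"])

demo_page = wrap_fragment("<p>Hello</p>", title="Q&A notes")
demo_log = format_messages([FakeMessage("warning", "Unrecognised paragraph style")])

print(demo_page)
print(demo_log)
```

With the conversion shown earlier, wrap_fragment(result.value) and format_messages(result.messages) would be called in place of the demo values.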

Custom Style Mapping Support

The Python-Mammoth library provides a range of customization options, allowing software developers to fine-tune the conversion process to suit their specific needs. Developers can define custom style mappings to control how DOCX styles are converted into specific HTML elements, which gives greater flexibility in rendering document content. Here is an example that shows how the Heading 1 style in DOCX is explicitly mapped to an HTML h1 tag inside Python applications.

How to Map Heading 1 Style in DOCX to an HTML H1 Tag inside Python Apps?

import mammoth

# Map Word's "Heading 1" paragraph style to a fresh <h1> element
style_map = "p[style-name='Heading 1'] => h1:fresh"
with open("document.docx", "rb") as docx_file:
    result = mammoth.convert_to_html(docx_file, style_map=style_map)
    html = result.value
print(html)
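
A style map can hold several mappings, one per line. When a document uses many custom styles, it can be convenient to generate that string from a dictionary. The following is a minimal sketch of that idea; the helper build_style_map and the style names "Section Title" and "Subsection Title" are hypothetical, chosen only for illustration.

```python
def build_style_map(mappings):
    """Build a style-map string from {Word paragraph style name: HTML element}."""
    lines = []
    for style_name, element in mappings.items():
        # Assumption: quoting is kept simple, so names containing a
        # single quote are rejected rather than escaped.
        if "'" in style_name:
            raise ValueError("unsupported style name: {0!r}".format(style_name))
        # ":fresh" asks for a new element for each matching paragraph.
        lines.append("p[style-name='{0}'] => {1}:fresh".format(style_name, element))
    return "\n".join(lines)


style_map = build_style_map({
    "Section Title": "h1",
    "Subsection Title": "h2",
})
print(style_map)
```

The resulting string can be passed as the style_map argument of mammoth.convert_to_html, exactly like the single-line mapping above.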

Convert DOCX Images to HTML via Python

The open source Python-Mammoth library makes it easy for software developers to extract images from Microsoft Word DOCX files and include them in the resulting HTML. By default, images are embedded directly in the HTML as base64 data URIs, but developers can customize how images are handled. Here is an example that shows how images from the DOCX file are preserved in the HTML output using Python commands.

How to Convert Images from DOCX File to HTML Output via Python API?

import mammoth

with open("document.docx", "rb") as docx_file:
    # By default, Mammoth embeds each image inline as a base64 data URI
    result = mammoth.convert_to_html(docx_file)
    html = result.value

print(html)
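
Image handling can be customized by wrapping a function with mammoth.images.img_element and passing the result as the convert_image argument. The function receives an image object exposing open() and content_type, and returns the attributes for the generated img element. Below is a sketch of one such handler that saves each image to a directory and links to it by file name instead of inlining it. The names make_image_saver and FakeImage are assumptions for illustration; FakeImage only imitates the two image members used here so the handler can be tried without a DOCX file.

```python
import io
import itertools
import os
import tempfile


def make_image_saver(output_dir):
    """Return a handler that writes each image to output_dir and links to it."""
    numbers = itertools.count(1)

    def save_image(image):
        # "image/png" -> "png"; assumes a simple "type/subtype" content type.
        extension = image.content_type.partition("/")[2] or "bin"
        filename = "image-{0}.{1}".format(next(numbers), extension)
        with image.open() as image_bytes:
            data = image_bytes.read()
        with open(os.path.join(output_dir, filename), "wb") as out:
            out.write(data)
        # The returned dict becomes the attributes of the <img> element.
        return {"src": filename}

    return save_image


class FakeImage:
    """Hypothetical stand-in for the image object Mammoth passes to handlers."""
    content_type = "image/png"

    def open(self):
        return io.BytesIO(b"not a real png")


demo_dir = tempfile.mkdtemp()
handler = make_image_saver(demo_dir)
attrs = handler(FakeImage())
print(attrs)
```

In a real conversion, the handler would be supplied as convert_image=mammoth.images.img_element(make_image_saver("images")), assuming the "images" directory already exists.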

Layout Analysis

The open source Python-Mammoth library reads the structure of a Word DOCX document and recognizes elements such as tables, images, and text blocks, carrying that structure over into the generated HTML. This is useful for applications that need the document's structure preserved rather than its exact visual appearance.