Open Source PHP Library for Parsing & Extracting PDF Data

Free PHP API allows Developers to Load, Read & Parse PDF Files, Extract PDF Elements (Text, Images, Metadata) & Other Data from PDFs inside PHP Apps.

What is PDFParser Library?

PDFParser is an open source PHP library that lets software developers parse PDF files and extract their elements inside their own PHP applications. It is a standalone library built on top of the TCPDF parser, and it provides tools for extracting text, metadata, and other data from a PDF file. This makes it useful for projects that need to read PDF documents without handling the low-level details of the format.

Portable Document Format (PDF) remains one of the world's most popular document formats. The API supports several important features for PDF parsing, such as loading and parsing PDF objects and headers, extracting metadata, extracting text from ordered pages, reading compressed PDFs, and decoding hexadecimal and octal content encodings. The library also includes error handling, so developers can identify and address issues that occur during parsing.

At A Glance

An overview of PDFParser features.

Features Overview

Load PDF objects
Parse objects
Parse headers
Extract metadata
Extract text
Compressed PDF
Charset encoding
Hexadecimal encoding
Octal encoding
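
To illustrate what hexadecimal and octal encoding mean in practice, here is a minimal sketch of how such content can be decoded. Both helper functions are hypothetical and written for illustration only; they are not part of PDFParser's API, and they assume PHP 7.1 or later. The octal helper handles only `\ddd` escapes, not the other escape sequences a PDF string may contain.

```php
<?php
// Illustrative helpers only; PDFParser performs this decoding internally.

// Decode a PDF hexadecimal string such as "<48656C6C6F>".
function decodePdfHexString(string $raw): string
{
    // Drop the angle brackets and any whitespace between the digits.
    $hex = preg_replace('/[^0-9A-Fa-f]/', '', $raw);
    // In the PDF format a missing final digit is treated as 0.
    if (strlen($hex) % 2 === 1) {
        $hex .= '0';
    }
    return (string) hex2bin($hex);
}

// Replace octal escapes (a backslash followed by one to three octal digits).
function decodePdfOctalEscapes(string $raw): string
{
    return preg_replace_callback(
        '/\\\\([0-7]{1,3})/',
        function (array $m): string {
            // Keep the result within a single byte.
            return chr(octdec($m[1]) & 0xFF);
        },
        $raw
    );
}

echo decodePdfHexString('<48656C6C6F>'), "\n";
echo decodePdfOctalEscapes('\\110\\151'), "\n";
```

When text is extracted through PDFParser, none of this needs to be called by hand; the sketch only shows the kind of work the library takes care of.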

PDFParser

PDFParser supports PDF file format as well as industry-standard formats for export.

Reader

Writer

TXT, HTML

Platform Independence

PDFParser only requires the PHP runtime.

PHP 5.3 and above.

Getting Started with PDFParser

PDFParser is installed with Composer, which downloads the library automatically. Add the package to the require section of your composer.json file.

Add PDFParser to composer.json

  {
    "require": {
      "smalot/pdfparser": "*"
    }
  }

Then run composer update smalot/pdfparser from the project root so that Composer downloads the package.

Parse PDF File & Extract Text from Each Page via PHP API

PDFParser enables developers to parse PDF documents inside their own PHP applications. First, create a parser object and load the PDF. The parser returns a document object, which can be stored in a variable and used to handle the PDF page by page. Once the document is parsed, you can extract text from the entire PDF or from each page separately.

How to Parse PDF File via PHP API?

  // Include the Composer autoloader if not already done.
  include 'vendor/autoload.php';

  // $base64PDF is assumed to hold a Base64-encoded PDF string.
  // Decode it, parse the content, and build the necessary objects.
  $parser = new \Smalot\PdfParser\Parser();
  $pdf    = $parser->parseContent(base64_decode($base64PDF));

  // Extract the text of the whole document.
  $text = $pdf->getText();
  echo $text;
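
Parsing can fail on malformed or unsupported files, so it may help to wrap the call in a try/catch block. The following is a minimal sketch rather than a recommended pattern. The helper name extractTextOrNull is hypothetical, and the parser argument is left untyped so the sketch runs without the library installed; in real code it would be a \Smalot\PdfParser\Parser instance, whose parseFile() method reads a PDF from disk. The sketch assumes parsing failures surface as \Exception and requires PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the document text, or null if parsing fails.
// $parser is expected to behave like \Smalot\PdfParser\Parser.
function extractTextOrNull($parser, string $path): ?string
{
    try {
        // parseFile() loads the PDF from disk and returns a document object.
        return $parser->parseFile($path)->getText();
    } catch (\Exception $e) {
        // Assumption: parsing problems are reported as exceptions.
        error_log('PDF parsing failed for ' . $path . ': ' . $e->getMessage());
        return null;
    }
}
```

With the library installed, the helper could be called as extractTextOrNull(new \Smalot\PdfParser\Parser(), 'document.pdf'), where document.pdf is a placeholder file name.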

Extract Metadata from PDF Document

Metadata holds important information about a PDF document and its contents, such as the author, copyright information, creator, and creation date. PDFParser lets developers extract this metadata. Once the document is parsed, getDetails() returns the available details as an array.

How to Extract Metadata from PDF via PHP API?

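  // Extract metadata from the parsed PDF ($pdf comes from the parsing step).
  $metaData = $pdf->getDetails();
  print_r($metaData);

  // Example output:
  // Array
  // (
  //     [Producer] => Adobe Acrobat
  //     [CreatedOn] => 2022-01-28T16:36:11+00:00
  //     [Pages] => 35
  // )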
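
Since the details come back as a plain array, they can be formatted with ordinary PHP. The sketch below turns such an array into readable "Key: value" lines. The helper name formatDetails is hypothetical, and the sketch assumes values are either scalars or flat lists of scalars; the sample data repeats the example output above rather than coming from a real file.

```php
<?php
// Hypothetical helper: turn a details array into "Key: value" lines.
function formatDetails(array $details): string
{
    $lines = array();
    foreach ($details as $key => $value) {
        // Assumption: an entry may hold a flat list of values; join them.
        if (is_array($value)) {
            $value = implode(', ', $value);
        }
        $lines[] = $key . ': ' . $value;
    }
    return implode("\n", $lines);
}

// Sample data copied from the example output, not read from a real PDF.
$sample = array(
    'Producer'  => 'Adobe Acrobat',
    'CreatedOn' => '2022-01-28T16:36:11+00:00',
    'Pages'     => 35,
);
echo formatDetails($sample), "\n";
```

In an application, the array returned by $pdf->getDetails() would be passed in place of $sample.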

Extract Text from a Specific PDF Page

PDFParser lets developers extract text from specific pages with only a few lines of code. Each page of the document can be handled separately: getPages() returns an array of pages in the same order as the PDF document, so developers can iterate through it and retrieve text from the page of their choice.

How to Extract Text from PDF via PHP?

  // Extract Text from PDF via PHP
  $text = $pdf->getText();

  // Or extract the text of a specific page (index 0 is the first page).
  // Note: using [0] directly on the getPages() call requires PHP 5.4 or later.
  $text = $pdf->getPages()[0]->getText();
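
Iterating over the pages array can be sketched as follows. The helper name textByPage is hypothetical; it relies only on the getPages() and getText() calls shown above and leaves its argument untyped so the sketch runs without the library installed. It returns the text keyed by 1-based page number, which is often easier to report than a 0-based index.

```php
<?php
// Hypothetical helper: map 1-based page numbers to the text of each page.
// $pdf is expected to be the document object returned by the parser.
function textByPage($pdf): array
{
    $result = array();
    // array_values() guards against non-sequential keys in the pages array.
    foreach (array_values($pdf->getPages()) as $index => $page) {
        $result[$index + 1] = $page->getText();
    }
    return $result;
}
```

For example, textByPage($pdf)[2] would hold the text of the second page, provided the document has at least two pages.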