Open Source PHP Library for Parsing PDF Files

A Free PHP API That Lets Developers Parse PDF Files and Extract Data & Elements from PDFs.

 

PDFParser is an open source PHP library that lets software developers parse PDF files and extract their elements from within their own PHP applications. It is a standalone library, built on top of the TCPDF parser, that provides various tools for extracting data from a PDF file.

Portable Document Format (PDF) is one of the world's most popular document formats. The API supports several important features for PDF parsing, such as loading and parsing PDF objects and headers, extracting metadata, extracting text from ordered pages, support for compressed PDFs, and support for hexadecimal and octal content encoding, among others.

Getting Started with PDFParser

The PDFParser library is installed with Composer, which downloads it automatically from the command line. First, add PDFParser to the require section of your composer.json file.

Add PDFParser to composer.json

{
    "require": {
        "smalot/pdfparser": "*"
    }
}

Then use Composer to download the package by running the following command:

Install PDFParser via composer

$ composer update smalot/pdfparser

You can also install it manually: download it from the GitHub repository, unzip it, and run the following Composer command from the unzipped directory.

Install PDFParser manually via composer

$ composer update

This downloads any dependencies (the Atoum library) and generates the 'autoload.php' file.

Parse PDF File & Extract Text from Each Page via PHP API

PDFParser enables programmers to parse PDF documents inside their own PHP applications. First, create the parser object and use it to load the PDF file. The parsed document is returned as an object that can be stored in a variable and lets you handle the PDF page by page. Once the document is parsed, you can extract text from the entire PDF or from each page separately.
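The steps above can be sketched roughly as follows. This is a minimal example, not a definitive implementation: it assumes PDFParser was installed through Composer so that 'vendor/autoload.php' exists, and 'document.pdf' is a placeholder path that you would replace with your own file.

```php
<?php
// Load Composer's autoloader (assumes a Composer-based install).
require 'vendor/autoload.php';

// Build the parser object.
$parser = new \Smalot\PdfParser\Parser();

// Load and parse the file; 'document.pdf' is a placeholder path.
$pdf = $parser->parseFile('document.pdf');

// Extract the text of the entire document in one call.
echo $pdf->getText() . "\n";

// Or handle the document page by page.
foreach ($pdf->getPages() as $index => $page) {
    echo 'Page ' . ($index + 1) . ":\n";
    echo $page->getText() . "\n";
}
```

Calling getText() on the document is convenient when only the full text is needed, while looping over getPages() keeps the page boundaries available for further processing.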

Extract Metadata from PDF Document

Metadata holds important information about a PDF document and its contents, such as the author, copyright information, creator, creation date, and more. PDFParser lets developers extract this metadata: once the document is parsed, you can retrieve all of these details from the PDF file.
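As an illustration, metadata retrieval might look like the sketch below. It assumes a Composer-based install and uses the placeholder path 'document.pdf'. Which properties are returned depends on what the PDF itself contains, so the loop makes no assumption about specific keys.

```php
<?php
// Load Composer's autoloader (assumes a Composer-based install).
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();

// 'document.pdf' is a placeholder path.
$pdf = $parser->parseFile('document.pdf');

// getDetails() returns the document's metadata as an array.
$details = $pdf->getDetails();

foreach ($details as $property => $value) {
    // Some properties hold several values; join them for display.
    if (is_array($value)) {
        $value = implode(', ', $value);
    }
    echo $property . ' => ' . $value . "\n";
}
```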

Extract Text from a Specific PDF Page

PDFParser lets developers extract text from specific pages with only a small amount of code. The API handles each page of the PDF document separately: developers can iterate through the array of pages and retrieve text from the page of their choice. The pages in the array appear in the same order as in the PDF document.
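A short sketch of picking one page out of the array is shown below. The file path 'document.pdf' and the variable $pageNumber are hypothetical values chosen for illustration. Because the array follows the document's page order, the example assumes that page N sits at index N - 1.

```php
<?php
// Load Composer's autoloader (assumes a Composer-based install).
require 'vendor/autoload.php';

$parser = new \Smalot\PdfParser\Parser();

// 'document.pdf' is a placeholder path.
$pdf = $parser->parseFile('document.pdf');

// The array of pages, in the same order as in the PDF document.
$pages = $pdf->getPages();

// Hypothetical choice: the second page of the document.
$pageNumber = 2;
$index = $pageNumber - 1; // the array is zero-based

if (isset($pages[$index])) {
    echo $pages[$index]->getText() . "\n";
} else {
    echo 'The document has only ' . count($pages) . " page(s).\n";
}
```

Checking the index before use avoids an error when the requested page number is larger than the number of pages in the document.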