Free PHP API to Extract Text & Metadata from PDF and Images

Open Source PHP Optical Character Recognition Library that Lets You Extract Text, Metadata and HTML from PDF, DOCX, Images (JPEG, PNG) & Other Documents in Multiple Languages inside PHP Apps.

What is PHP-Apache-Tika?

For PHP developers, extracting text from diverse file formats like PDFs, Word documents, and spreadsheets is essential for applications in document management, content analysis, and search. The PHP-Apache-Tika library simplifies this by connecting your projects to Apache Tika, a Java toolkit for content detection and analysis. Tika runs as its own server, accessible via HTTP, so PHP applications can use its text and metadata extraction without building complex parsers from scratch.

This library offers extensive features, including OCR, HTML and metadata extraction, and support for local and remote resources. You send a document to the Tika server and get back clean, standardized text and metadata. Because parsing is delegated to Apache Tika, your application benefits from Tika's ongoing updates and broad file format support. Whether you are building a sophisticated document management system or a simple analysis tool, it provides a reliable, flexible solution for data processing.
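
To make the "send a document over HTTP, get text back" idea concrete, here is a minimal sketch of that exchange without the library. It assumes a Tika server on localhost:9998 that accepts a PUT to its /tika endpoint; the function names tikaRequestOptions and tikaExtract are hypothetical and are not part of PHP-Apache-Tika, which handles this plumbing for you.

```php
<?php
// Hypothetical helper: build stream-context options for a PUT request
// that uploads raw document bytes and asks for a given response type.
function tikaRequestOptions(string $body, string $accept = 'text/plain'): array
{
    return [
        'http' => [
            'method'  => 'PUT',
            'header'  => [
                "Accept: {$accept}",
                // Generic type so the server detects the real format itself
                'Content-Type: application/octet-stream',
            ],
            'content' => $body,
        ],
    ];
}

// Hypothetical helper: send one local file to the server and return the response body.
function tikaExtract(string $serverUrl, string $path): string
{
    $body = file_get_contents($path);
    if ($body === false) {
        throw new RuntimeException("Cannot read {$path}");
    }

    $context  = stream_context_create(tikaRequestOptions($body));
    $response = file_get_contents(rtrim($serverUrl, '/') . '/tika', false, $context);
    if ($response === false) {
        throw new RuntimeException("No usable response from {$serverUrl}");
    }

    return $response;
}

// Run as: php raw-tika.php /path/to/your/document.pdf
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    // Assumption: a Tika server is listening on localhost:9998
    echo tikaExtract('http://localhost:9998', $argv[1]);
}
```

In practice the library is the better choice, since it also covers metadata, HTML output, and error handling; the sketch only shows the shape of the request.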

At A Glance

An overview of PHP-Apache-Tika features.

Features Overview

Perform OCR
Add OCR Capabilities
Recognize text in Many Languages
Convert Images of Text
Recognized Font Text
Search PDF
Other Languages
Create OCR Apps
Save to Browser
Extract Text
Multi-threading Support

PHP-Apache-Tika

PHP-Apache-Tika supports the popular file formats listed below.

Reader

PNG, JPEG, BMP, TIFF, TGA, DICOM

Writer

PNG, JPEG, BMP, TIFF

Platform Independence

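PHP-Apache-Tika itself only requires the PHP runtime. Apache Tika, which it connects to, is a Java application and must be available separately.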

PHP 5.1 and above.

Getting Started with PHP-Apache-Tika

The recommended way to install PHP-Apache-Tika is with Composer. Use the following command to install it.

Install PHP-Apache-Tika via Composer

composer require vaites/php-apache-tika

Install PHP-Apache-Tika via GitHub

git clone https://github.com/vaites/php-apache-tika.git

You can also download the source code from the GitHub repository.

Text and HTML Extraction via PHP

One of the primary features of the PHP-Apache-Tika library is its ability to extract text from various document formats, which is particularly useful when implementing search or content analysis tools. The library returns plain text, making content easier to index, search, or analyze. The following code snippet shows how the client sends a document to the Tika server and retrieves its plain text content, ready for further processing or indexing.

How to Extract Text from a Document inside PHP Apps?

<?php
require_once 'vendor/autoload.php';

use Vaites\ApacheTika\Client;

// Define the path to the document (e.g., PDF, DOCX, etc.)
$filePath = '/path/to/your/document.pdf';

try {
    // Connect to a running Tika server (host and port)
    $client = Client::make('localhost', 9998);

    // Extract plain text content from the document
    $extractedText = $client->getText($filePath);
    echo "Extracted Text:\n" . $extractedText;
} catch (\Exception $e) {
    echo "Error extracting text: " . $e->getMessage();
}
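
HTML extraction works along the same lines. The sketch below assumes the Client::make() factory and getHTML() method described in the library's README, plus a Tika server on localhost:9998. The summarizeHtml helper is hypothetical and only illustrates one thing you might do with the returned markup.

```php
<?php
// Hypothetical helper: condense HTML output into a small summary.
function summarizeHtml(string $html): array
{
    // Count opening <p> tags (matches "<p>" and "<p ...>", but not "<pre>")
    $paragraphs = preg_match_all('/<p[\s>]/i', $html);

    // Drop the markup and collapse runs of whitespace for a text preview
    $text = trim(preg_replace('/\s+/', ' ', strip_tags($html)));

    return ['paragraphs' => $paragraphs, 'preview' => substr($text, 0, 80)];
}

// Run as: php summarize-html.php /path/to/your/document.docx
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    require_once 'vendor/autoload.php';

    try {
        // Assumption: a Tika server is listening on localhost:9998
        $client = \Vaites\ApacheTika\Client::make('localhost', 9998);

        // getHTML() returns the document rendered as an HTML string
        print_r(summarizeHtml($client->getHTML($argv[1])));
    } catch (\Exception $e) {
        echo "Error extracting HTML: " . $e->getMessage();
    }
}
```

Keeping the HTML rather than plain text preserves structure such as paragraphs and headings, which can help when you want to display or segment the content later.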

Metadata Extraction via PHP Library

Beyond text, documents often contain valuable metadata such as author information, creation dates, and file types. The PHP-Apache-Tika library can extract this metadata, allowing you to build richer applications. This example demonstrates how to retrieve metadata from a document. The result can include various details depending on the file type and its contents.

How to Extract Metadata using PHP Library?

<?php
require_once 'vendor/autoload.php';

use Vaites\ApacheTika\Client;

// Specify the document file path
$filePath = '/path/to/your/document.pdf';

try {
    // Connect to a running Tika server (host and port)
    $client = Client::make('localhost', 9998);

    // Extract metadata from the document
    $metadata = $client->getMetadata($filePath);
    echo "Extracted Metadata:\n";
    print_r($metadata);
} catch (\Exception $e) {
    echo "Error extracting metadata: " . $e->getMessage();
}
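
Because the available details vary by file type, it can help to read individual fields defensively. The following is a minimal sketch under a few assumptions: the metadata is returned as an object with public properties, and the property names used here (title, author, mime, created) are guesses that you should check against the print_r() output above. The pickMetadataFields helper is hypothetical.

```php
<?php
// Hypothetical helper: copy selected properties from a metadata object,
// using null for anything the document does not provide.
function pickMetadataFields($metadata, array $fields): array
{
    $picked = [];
    foreach ($fields as $field) {
        $picked[$field] = isset($metadata->$field) ? $metadata->$field : null;
    }

    return $picked;
}

// Run as: php pick-metadata.php /path/to/your/document.pdf
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    require_once 'vendor/autoload.php';

    try {
        // Assumption: a Tika server is listening on localhost:9998
        $client   = \Vaites\ApacheTika\Client::make('localhost', 9998);
        $metadata = $client->getMetadata($argv[1]);

        // Assumed property names; adjust to what your documents expose
        print_r(pickMetadataFields($metadata, ['title', 'author', 'mime', 'created']));
    } catch (\Exception $e) {
        echo "Error extracting metadata: " . $e->getMessage();
    }
}
```

Returning null for missing fields keeps downstream code, such as an indexer, from failing on documents that lack a title or author.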

Handling Multiple File Formats

The power of Apache Tika lies in its support for multiple file formats. Whether you’re dealing with PDFs, DOCs, images, or even less common file types, this library helps ensure you can extract the necessary data without worrying about format-specific quirks. Imagine you’re developing a document management system where users can upload different file types. You might use the library to determine both the content and metadata for each file: