Free PHP API to Extract Text & Metadata from PDF and Images

Open source PHP library with OCR support that lets you extract text, metadata, and HTML from PDF, DOCX, images (JPEG, PNG), and other documents in multiple languages inside PHP apps.

In software development, extracting text from different types of files is a frequent task, and it can be tricky. Whether you’re creating a system to manage documents, a tool to analyze content, or a search engine, you need to be able to extract text from PDFs, Word documents, spreadsheets, and other file formats. This is where the PHP-Apache-Tika library becomes valuable. Apache Tika is a content analysis toolkit that extracts metadata and text from many file types, including PDFs, Microsoft Office files, and images. Tika is written in Java and is often run as a standalone server that is accessible over HTTP. This setup lets other programming languages, such as PHP, use Tika’s parsers without implementing format-specific parsing from the ground up.
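To make the HTTP setup concrete, here is a minimal sketch of talking to a Tika server directly from PHP with cURL, without the PHP-Apache-Tika library. It assumes a Tika server is listening on localhost:9998 and uses the server's `PUT /tika` endpoint with an `Accept: text/plain` header. The function names `tikaRequestOptions()` and `tikaExtractText()` are hypothetical helpers written for this illustration.

```php
<?php
// Sketch only: plain cURL against a Tika server (requires PHP's curl extension).
// Assumption: the server runs at http://localhost:9998.

/**
 * Build the cURL options for a "PUT /tika" request.
 * Hypothetical helper; not part of PHP-Apache-Tika.
 */
function tikaRequestOptions(string $baseUrl, string $body, string $accept = 'text/plain'): array
{
    return [
        CURLOPT_URL            => rtrim($baseUrl, '/') . '/tika',
        CURLOPT_CUSTOMREQUEST  => 'PUT',
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_HTTPHEADER     => [
            'Accept: ' . $accept,
            // Generic type so that Tika detects the real format itself
            'Content-Type: application/octet-stream',
        ],
        CURLOPT_RETURNTRANSFER => true,
    ];
}

/**
 * Send a local file to the Tika server and return the extracted text.
 * Hypothetical helper; throws RuntimeException on any failure.
 */
function tikaExtractText(string $baseUrl, string $filePath): string
{
    $body = file_get_contents($filePath);
    if ($body === false) {
        throw new RuntimeException("Cannot read file: $filePath");
    }

    $handle = curl_init();
    curl_setopt_array($handle, tikaRequestOptions($baseUrl, $body));
    $response = curl_exec($handle);
    $status   = curl_getinfo($handle, CURLINFO_RESPONSE_CODE);
    $error    = curl_error($handle);

    if ($response === false || $status !== 200) {
        throw new RuntimeException("Tika request failed (HTTP $status): $error");
    }

    return $response;
}

// Usage (requires a running Tika server):
// echo tikaExtractText('http://localhost:9998', '/path/to/your/document.pdf');
```

A client library wraps this kind of request handling, along with error handling and response parsing, so that application code does not have to manage HTTP details itself.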

The PHP-Apache-Tika library bridges PHP applications with the Apache Tika server. It supports text and HTML extraction, metadata extraction, error handling, OCR, standardized metadata for documents, and both local and remote resources. Instead of building your own parsers or converters, you can rely on this library to send documents to the Tika server and receive the extracted text or metadata in return. This simplifies development and lets your application benefit from Tika’s ongoing improvements and broad format support. Whether you’re developing a complex document management system or a lightweight content analysis tool, the PHP-Apache-Tika library provides a flexible solution.

Getting Started with PHP-Apache-Tika

The recommended way to install PHP-Apache-Tika is with Composer, using the following command.

Install PHP-Apache-Tika via Composer

composer require vaites/php-apache-tika

Install PHP-Apache-Tika via GitHub

git clone https://github.com/vaites/php-apache-tika.git

You can also download the source code from the GitHub repository.

Text and HTML Extraction via PHP

One of the primary features of the PHP-Apache-Tika library is its ability to extract text from various document formats. This is particularly useful when implementing search or content analysis tools, because plain text is easy to index, search, and analyze. The following code snippet shows how the client sends a document to the Tika server and retrieves its plain text content, ready for further processing or indexing.

How to Extract Text from a Document inside PHP Apps?

<?php

require_once 'vendor/autoload.php';

use Vaites\ApacheTika\Client;

// Define the path to the document (e.g., PDF, DOCX, etc.)
$filePath = '/path/to/your/document.pdf';

try {
    // Create a client for a Tika server running on localhost:9998
    $client = Client::make('localhost', 9998);

    // Send the document to Tika and get back its plain text content
    $extractedText = $client->getText($filePath);
    echo "Extracted Text:\n" . $extractedText;
} catch (\Exception $e) {
    echo "Error extracting text: " . $e->getMessage();
}
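
HTML extraction works along the same lines. The sketch below assumes the `Client::make()` factory and `getHTML()` method described in the library's README, plus a Tika server on localhost:9998. The helpers `htmlTitle()` and `documentTitle()` are hypothetical and only illustrate one reason to prefer HTML over plain text: the output keeps structure, such as a `<title>` element, that you can query with PHP's DOM extension. Whether a title is present depends on the document.

```php
<?php
// Sketch only: fetch a document as HTML and read its <title>.
// Assumptions: Composer autoloader in ./vendor, Tika server on localhost:9998.

use Vaites\ApacheTika\Client;

$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    require_once $autoload;
}

/**
 * Hypothetical helper: return the trimmed <title> of an HTML string,
 * or null when there is no non-empty title.
 */
function htmlTitle(string $html): ?string
{
    if ($html === '') {
        return null;
    }

    $dom = new DOMDocument();
    // Silence libxml notices about markup it considers malformed
    if (!$dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return null;
    }

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length === 0) {
        return null;
    }

    $title = trim($titles->item(0)->textContent);

    return $title === '' ? null : $title;
}

/**
 * Hypothetical helper: ask Tika for the HTML version of a file
 * and return its title, if any.
 */
function documentTitle(Client $client, string $filePath): ?string
{
    return htmlTitle($client->getHTML($filePath));
}

// Usage (requires a running Tika server):
// $client = Client::make('localhost', 9998);
// echo documentTitle($client, '/path/to/your/document.pdf') ?? '(no title)';
```

Note that `DOMDocument::loadHTML()` may mis-decode non-ASCII characters when the markup does not declare its encoding, so treat this as a starting point rather than a complete solution.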

Metadata Extraction via PHP Library

Beyond text, documents often contain valuable metadata such as author information, creation dates, and file types. The PHP-Apache-Tika library can extract this metadata, allowing you to build richer applications. This example demonstrates how to retrieve metadata from a document. The result can include various details depending on the file type and its contents.

How to Extract Metadata using PHP Library?

<?php

require_once 'vendor/autoload.php';

use Vaites\ApacheTika\Client;

// Specify the document file path
$filePath = '/path/to/your/document.pdf';

try {
    // Create a client for a Tika server running on localhost:9998
    $client = Client::make('localhost', 9998);

    // Extract metadata from the document
    $metadata = $client->getMetadata($filePath);
    echo "Extracted Metadata:\n";
    print_r($metadata);
} catch (\Exception $e) {
    echo "Error extracting metadata: " . $e->getMessage();
}

Handling Multiple File Formats

The power of Apache Tika lies in its support for multiple file formats. Whether you’re dealing with PDFs, DOCs, images, or even less common file types, this library helps ensure you can extract the necessary data without worrying about format-specific quirks. Imagine you’re developing a document management system where users can upload different file types. You might use the library to determine both the content and metadata for each file: