Free PHP API to Extract Text & Metadata from PDF and Images
Open Source PHP Library That Lets You Extract Text, Metadata and HTML from PDF, DOCX, Images (JPEG, PNG) & Other Documents in Multiple Languages inside PHP Apps, with Optical Character Recognition (OCR) Support.
What is PHP-Apache-Tika?
For PHP developers, extracting text from diverse file formats like PDFs, Word documents, and spreadsheets is essential for applications in document management, content analysis, and search. The PHP-Apache-Tika library simplifies this by connecting your projects to Apache Tika, a Java toolkit for content analysis. Tika can run as a standalone server reachable over HTTP, so PHP applications can use its text and metadata extraction without building format-specific parsers from scratch.
The library's features include optical character recognition (OCR), HTML and metadata extraction, and support for both local and remote resources. Your application sends a document to the Tika server and gets back clean, standardized text and metadata. Because parsing is delegated to Apache Tika, your application inherits Tika's broad file format support and benefits from parser improvements as Tika itself is upgraded. Whether you are building a sophisticated document management system or a simple analysis tool, it offers a flexible way to process documents.
Getting Started with PHP-Apache-Tika
The recommended way to install PHP-Apache-Tika is with Composer. Use the following command to install it.
Install PHP-Apache-Tika via Composer
composer require vaites/php-apache-tika
Install PHP-Apache-Tika via GitHub
git clone https://github.com/vaites/php-apache-tika.git
You can also download the source code directly from the GitHub repository.
Text and HTML Extraction via PHP
One of the primary features of the PHP-Apache-Tika library is its ability to extract text from various document formats. This is particularly useful when implementing search functionality or content analysis tools, because plain text is easy to index, search, and analyze. The following code snippet shows how the client sends a document to the Tika server and retrieves its plain text content, ready for further processing or indexing.
How to Extract Text from a Document inside PHP Apps?
<?php
require_once 'vendor/autoload.php';

// Class and method names follow the vaites/php-apache-tika README;
// verify them against the version you have installed.
use Vaites\ApacheTika\Client;

// Define the path to the document (e.g., PDF, DOCX, etc.)
$filePath = '/path/to/your/document.pdf';

try {
    // Create a client for a Tika server listening on localhost:9998
    $client = Client::make('localhost', 9998);

    // Extract the plain-text content of the document
    $extractedText = $client->getText($filePath);
    echo "Extracted Text:\n" . $extractedText;
} catch (\Exception $e) {
    echo "Error extracting text: " . $e->getMessage();
}
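The heading above also covers HTML, so here is a minimal sketch of HTML extraction. It assumes the client exposes a getHTML() method, as the library's README describes; extractHtmlBody() is a hypothetical helper introduced only for illustration, and the path is a placeholder.

```php
<?php
// Sketch only: $client is assumed to be a Vaites\ApacheTika\Client instance.
// Any object exposing getHTML() works, which keeps the helper easy to try
// out without a running Tika server.

/**
 * Hypothetical helper: returns the markup inside <body> from Tika's HTML output.
 */
function extractHtmlBody($client, string $filePath): string
{
    // Ask Tika for the HTML rendering of the document
    $html = (string) $client->getHTML($filePath);

    // Keep only what sits between <body ...> and </body>, if present
    if (preg_match('~<body[^>]*>(.*)</body>~is', $html, $matches)) {
        return trim($matches[1]);
    }

    // No <body> element found: return the markup unchanged
    return trim($html);
}

// Usage (needs Composer's autoloader and a running Tika server):
// require_once 'vendor/autoload.php';
// $client = \Vaites\ApacheTika\Client::make('localhost', 9998);
// echo extractHtmlBody($client, '/path/to/your/document.pdf');
```

Passing the client in as an argument, rather than creating it inside the helper, is a design choice that lets the same function run against a stub object in tests.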
Metadata Extraction via PHP Library
Beyond text, documents often carry valuable metadata such as author information, creation dates, and file types. The PHP-Apache-Tika library can extract this metadata, allowing you to build richer applications. The following example retrieves the metadata of a document. The fields returned vary with the file type and its contents.
How to Extract Metadata using PHP Library?
<?php
require_once 'vendor/autoload.php';

// Class name follows the vaites/php-apache-tika README;
// verify it against the version you have installed.
use Vaites\ApacheTika\Client;

// Specify the document file path
$filePath = '/path/to/your/document.pdf';

try {
    // Create a client for a Tika server listening on localhost:9998
    $client = Client::make('localhost', 9998);

    // Extract metadata from the document
    $metadata = $client->getMetadata($filePath);
    echo "Extracted Metadata:\n";
    print_r($metadata);
} catch (\Exception $e) {
    echo "Error extracting metadata: " . $e->getMessage();
}
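Because the available fields depend on the file type, it can help to list whatever came back instead of hard-coding field names. The sketch below makes no assumption about the shape of the metadata result (object or array); describeMetadata() is a hypothetical helper, not part of the library.

```php
<?php
// Sketch only: $metadata stands for whatever getMetadata() returned.
// Neither its type nor its field names are assumed here.

/**
 * Hypothetical helper: turns a metadata result into "name: value" lines.
 */
function describeMetadata($metadata): array
{
    // Read public properties of an object, or use an array as-is
    $fields = is_object($metadata) ? get_object_vars($metadata) : (array) $metadata;

    $lines = [];
    foreach ($fields as $name => $value) {
        // Skip fields the document did not provide
        if ($value === null || $value === '') {
            continue;
        }

        if ($value instanceof DateTimeInterface) {
            // Dates become ISO 8601 strings
            $value = $value->format(DATE_ATOM);
        } elseif (!is_string($value)) {
            // Numbers, booleans, arrays, and nested objects become JSON
            $value = json_encode($value);
        }

        $lines[] = $name . ': ' . $value;
    }

    return $lines;
}

// Usage, given a $metadata value obtained from the client:
// echo implode("\n", describeMetadata($metadata));
```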
Handling Multiple File Formats
The power of Apache Tika lies in its support for multiple file formats. Whether you’re dealing with PDFs, DOCs, images, or even less common file types, this library helps ensure you can extract the necessary data without worrying about format-specific quirks. Imagine you’re developing a document management system where users can upload different file types. You might use the library to determine both the content and metadata for each file: