Free PHP Library for OCR operations on Images & PDFs

Open source PHP Optical Character Recognition (OCR) API for extracting text from images, scanned documents, and PDFs via the Tesseract PHP library.

What is Tesseract OCR for PHP?

Among the OCR tools available, Tesseract OCR is a leading choice. It is a robust engine that lets software developers build apps that recognize and extract text from a variety of visual sources. For PHP developers, Tesseract OCR for PHP is a convenient open source wrapper around that engine. The library exposes Tesseract's settings to PHP apps: you can handle multilingual content, select specific languages for better accuracy, and choose among different page segmentation modes, among other features. OCR accuracy also benefits from preprocessing images before recognition. Methods such as resizing, binarization, noise reduction, and deskewing make text clearer and remove obstacles that could hinder recognition.
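As an illustration of the binarization step mentioned above, here is a minimal sketch of Otsu's thresholding in plain PHP. It is not part of Tesseract OCR for PHP: the function names otsuThreshold and binarize are hypothetical, and the input is assumed to be a flat array of grayscale values (0 to 255) that you would read from an image with a library such as GD or Imagick.

```php
<?php
// Sketch only: picks a global threshold with Otsu's method.
// Assumes $pixels is a flat array of integer grayscale values in 0..255.
function otsuThreshold(array $pixels): int
{
    $histogram = array_fill(0, 256, 0);
    foreach ($pixels as $value) {
        $histogram[$value]++;
    }

    $total = count($pixels);
    $sumAll = 0;
    for ($i = 0; $i < 256; $i++) {
        $sumAll += $i * $histogram[$i];
    }

    $sumBackground = 0;
    $weightBackground = 0;
    $bestVariance = -1.0;
    $bestThreshold = 0;

    for ($t = 0; $t < 256; $t++) {
        $weightBackground += $histogram[$t];
        if ($weightBackground === 0) {
            continue; // no pixels at or below this level yet
        }
        $weightForeground = $total - $weightBackground;
        if ($weightForeground === 0) {
            break; // every pixel is already in the background class
        }
        $sumBackground += $t * $histogram[$t];
        $meanBackground = $sumBackground / $weightBackground;
        $meanForeground = ($sumAll - $sumBackground) / $weightForeground;

        // Between-class variance; the threshold that maximizes it wins.
        $variance = $weightBackground * $weightForeground
            * ($meanBackground - $meanForeground) ** 2;
        if ($variance > $bestVariance) {
            $bestVariance = $variance;
            $bestThreshold = $t;
        }
    }
    return $bestThreshold;
}

// Maps every value above the threshold to white (255) and the rest to black (0).
function binarize(array $pixels, int $threshold): array
{
    return array_map(fn($v) => $v > $threshold ? 255 : 0, $pixels);
}
```

In a real pipeline the binarized values would be written back to an image file before it is handed to Tesseract. A single global threshold tends to suit evenly lit scans; unevenly lit photos may need an adaptive method instead.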

Use the Tesseract PHP wrapper to pass the preprocessed image to the Tesseract OCR engine. The wrapper provides functions to run OCR and retrieve the recognized text. With just a few lines of PHP code, it supports use cases such as document digitization, data extraction, text analytics, and improved accessibility, and it can work with Tesseract models trained on custom fonts, symbols, or specific text patterns. The extracted text may need post-processing such as spell-checking, formatting, or language-specific modifications; PHP libraries like Symfony/string or Text_LanguageDetect can be used for these purposes. By integrating Tesseract OCR into PHP projects, software developers can streamline document processing, automate data extraction, and make their applications more efficient and accessible.
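The post-processing mentioned above can start with simple text cleanup before any spell-checking. The following is a minimal sketch in plain PHP; the function name cleanOcrText is hypothetical and not part of Tesseract OCR for PHP, and the rules shown are assumptions about typical OCR output rather than a complete solution.

```php
<?php
// Sketch only: normalizes whitespace and line breaks in raw OCR text.
function cleanOcrText(string $raw): string
{
    // Normalize Windows and old Mac line endings to "\n".
    $text = str_replace(["\r\n", "\r"], "\n", $raw);

    // Re-join words hyphenated across a line break ("docu-\nment" -> "document").
    // Caveat: this also merges genuine hyphenated compounds split at a line end.
    $text = preg_replace('/(\p{L})-\n(\p{L})/u', '$1$2', $text) ?? $text;

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);

    // Remove spaces directly before or after a line break.
    $text = preg_replace('/ *\n */', "\n", $text);

    // Allow at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text);

    return trim($text);
}
```

The cleaned string can then be passed to a spell-checker or a language detection library as a separate step.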

At A Glance

An overview of Tesseract OCR for PHP features.

Features Overview

Perform OCR
Extract Text from Image
Add OCR Capabilities
Recognize Image text
Convert images of text
Recognized Font text
Search PDF
Other Languages
Create OCR apps
Save to browser
Extract Text
Multi-threading Support

Tesseract OCR for PHP

Tesseract OCR for PHP supports the popular image file formats listed below.

Reader

PNG, JPEG, BMP, TIFF, TGA, DICOM

Writer

PNG, JPEG, BMP, TIFF

Platform Independence

Tesseract OCR for PHP requires a PHP runtime and, because it is a wrapper, an installed Tesseract OCR engine.

PHP 5.1 and above.

Getting Started with Tesseract OCR for PHP

The recommended way to install Tesseract OCR for PHP is with Composer. Use the following command for a smooth installation.

Install Tesseract OCR for PHP via Composer

$ composer require thiagoalessio/tesseract_ocr

Install Tesseract OCR for PHP via GitHub

git clone https://github.com/thiagoalessio/tesseract-ocr-for-php.git

You can also download the library source from the GitHub repository.

Extract Text from Image inside PHP Apps

The open source Tesseract OCR for PHP library provides useful features for extracting text from images with PHP. It offers different page segmentation modes to handle various layouts and text arrangements. Start by loading the image or document that contains the text you want to extract, then pass it (preprocessed if needed) to the Tesseract OCR engine through the wrapper and retrieve the recognized text. The following example shows the basic process of loading an image and extracting its text.

How to Load Image & Extract Text using PHP Code?

require_once 'vendor/autoload.php'; // Composer autoloader

use thiagoalessio\TesseractOCR\TesseractOCR;

$imagePath = '/path/to/your/image.jpg';

$tesseract = new TesseractOCR($imagePath);
$tesseract->lang('eng'); // Set the language used for text recognition

// Run OCR and print the recognized text
$text = $tesseract->run();
echo $text;
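The page segmentation modes and multilingual support described above can be combined in one call. The sketch below assumes Composer's autoloader has already been loaded and that the tesseract binary is on the PATH. The helper name recognize and the PSM_* constants are hypothetical conveniences; lang() and psm() are methods of the library.

```php
<?php
// Assumes vendor/autoload.php has already been required elsewhere.
use thiagoalessio\TesseractOCR\TesseractOCR;

// A few of Tesseract's page segmentation modes (see `tesseract --help-psm`).
const PSM_AUTO = 3;         // Fully automatic page segmentation (Tesseract's default)
const PSM_SINGLE_BLOCK = 6; // Assume a single uniform block of text
const PSM_SINGLE_LINE = 7;  // Treat the image as a single text line

// Hypothetical helper: runs OCR with the given languages and segmentation mode.
function recognize(string $imagePath, array $languages, int $psm = PSM_AUTO): string
{
    if ($psm < 0 || $psm > 13) {
        throw new InvalidArgumentException("PSM must be between 0 and 13, got $psm");
    }

    return (new TesseractOCR($imagePath))
        ->lang(...$languages) // e.g. 'eng', 'deu'; each needs its traineddata installed
        ->psm($psm)
        ->run();
}
```

For example, recognize('/path/to/your/image.jpg', ['eng', 'deu'], PSM_SINGLE_BLOCK) would treat the image as one block of mixed English and German text. Which mode works best depends on the layout, so it is worth trying more than one on representative samples.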

Handling OCR Output inside PHP Apps

The open source Tesseract OCR for PHP library includes useful features for saving and working with OCR output inside PHP applications. It can save the output in formats that Tesseract supports, such as TXT, searchable PDF, hOCR (HTML-based), and TSV. Depending on your application's requirements, you may need to further process or analyze the recognized text. Common tasks include data validation, text cleaning, spell checking, formatting, language-specific modifications, and integration with other systems for advanced processing. Software developers can then analyze large volumes of text extracted from documents, social media feeds, or customer feedback for purposes such as sentiment analysis or topic modeling.

How to Retrieve Image Data, Size & Save It in PDF Format via PHP API?

// $img is assumed to be an image already loaded with Imagick or GD.
// Use only one of the two options below, depending on the library.

// Option 1: using Imagick
$data = $img->getImageBlob();
$size = $img->getImageLength();

// Option 2: using GD
ob_start();
// Note that you can use any format supported by Tesseract
imagepng($img, null, 0);
$size = ob_get_length();
$data = ob_get_clean();

// Pass the in-memory image data to Tesseract and run OCR
$ocr = new TesseractOCR();
$ocr->imageData($data, $size);
$text = $ocr->run();

// Save the output to a PDF file
echo (new TesseractOCR('img.png'))
    ->configFile('pdf')
    ->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
    ->run();
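For the data validation and analysis tasks described above, Tesseract's TSV output is often easier to work with than plain text because it carries a confidence value per word. The sketch below is plain PHP and makes two assumptions: that the TSV was obtained with something like ->configFile('tsv') in place of ->configFile('pdf'), and that it uses Tesseract's standard TSV columns with conf and text as the last two. The function name wordsAboveConfidence is hypothetical.

```php
<?php
// Sketch only: returns the recognized words whose confidence is at least
// $minConfidence, given Tesseract TSV output as a string.
function wordsAboveConfidence(string $tsv, float $minConfidence): array
{
    $lines = preg_split('/\R/', trim($tsv, "\r\n"));
    $header = explode("\t", array_shift($lines));

    $words = [];
    foreach ($lines as $line) {
        $fields = explode("\t", $line);
        if (count($fields) !== count($header)) {
            continue; // skip malformed rows
        }
        $row = array_combine($header, $fields);

        // Page, block, and line rows have conf = -1 and empty text;
        // only word rows carry a real confidence and a non-empty text.
        if ((float) $row['conf'] >= $minConfidence && trim($row['text']) !== '') {
            $words[] = $row['text'];
        }
    }
    return $words;
}
```

Words that fall below the chosen confidence can be flagged for manual review or sent to a spell-checker rather than being discarded outright.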