Free PHP Library for OCR operations on Images & PDFs

Open source PHP optical character recognition (OCR) API for extracting text from images, scanned documents, and PDFs via the Tesseract PHP library.

What is Tesseract OCR for PHP?

For software developers, Tesseract OCR offers a robust API to extract text from various visual sources. When working with PHP, Tesseract OCR for PHP serves as an efficient wrapper. This open-source library enhances OCR accuracy through image preprocessing, including resizing, binarization, noise reduction, and deskewing. It provides advanced tools for multilingual content, custom language selection, and multiple page segmentation modes to optimize results within PHP applications.

Use the wrapper to pass preprocessed images to the OCR engine. With minimal code, it can improve recognition accuracy for specialized uses, such as working with Tesseract data trained on custom fonts, symbols, or specific text patterns. It supports document digitization, text analytics, data extraction, and accessibility enhancements. After extraction, libraries such as symfony/string can help with post-processing, for example trimming, normalizing, or reformatting the recognized text. Integrating Tesseract OCR into PHP projects streamlines document processing and automates data extraction.


Getting Started with Tesseract OCR for PHP

The recommended way to install Tesseract OCR for PHP is with Composer. The wrapper calls the Tesseract OCR engine, so the engine itself must also be installed on the system. Use the following command to install the library.

Install Tesseract OCR for PHP via Composer

$ composer require thiagoalessio/tesseract_ocr 

Install Tesseract OCR for PHP via GitHub

git clone https://github.com/thiagoalessio/tesseract-ocr-for-php.git 

You can also clone or download the source code from the GitHub repository.
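After installation, it can help to confirm that PHP is able to reach the Tesseract engine. The following is a minimal sketch, assuming Composer's autoloader lives at vendor/autoload.php next to the script and that the tesseract executable is on the system PATH. It uses the version() and availableLanguages() methods described in the wrapper's README.

```php
<?php
// Assumption: Composer's autoloader is in ./vendor; adjust for your project layout.
require __DIR__ . '/vendor/autoload.php';

use thiagoalessio\TesseractOCR\TesseractOCR;

// No image is needed just to query the installed engine.
$ocr = new TesseractOCR();

// Version of the Tesseract binary that the wrapper found
echo 'Tesseract version: ', $ocr->version(), PHP_EOL;

// Language data files available to the engine (for example "eng")
echo 'Installed languages: ', implode(', ', $ocr->availableLanguages()), PHP_EOL;
```

If this script throws an exception, the most likely cause is that the Tesseract engine is not installed or is not on the PATH.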

Extract Text from Image inside PHP Apps

The open source Tesseract OCR for PHP library makes it easy to extract text from images in PHP code. It exposes Tesseract's page segmentation modes, so you can handle different layouts and text arrangements. Start by loading the image or document that contains the text you want to extract, then pass it (preprocessed if necessary) to the Tesseract OCR engine through the wrapper. The wrapper runs the OCR and returns the recognized text. The following example shows the basic process of loading an image and extracting its text.

How to Load Image & Extract Text using PHP Code?

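// Load Composer's autoloader (adjust the path to your project layout)
require 'vendor/autoload.php';

use thiagoalessio\TesseractOCR\TesseractOCR;

$imagePath = '/path/to/your/image.jpg';

$tesseract = new TesseractOCR($imagePath);
$tesseract->lang('eng'); // Set the desired language for text recognition

// Run the OCR engine and print the recognized text
$text = $tesseract->run();
echo $text;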
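The page segmentation modes and language selection mentioned above can be combined in one call chain. The sketch below is illustrative rather than definitive: the image path is hypothetical, it assumes the English and German language data are installed, and it uses the wrapper's lang() and psm() methods. Mode 6 tells Tesseract to assume a single uniform block of text.

```php
<?php
// Assumption: Composer's autoloader is in ./vendor; adjust for your project layout.
require __DIR__ . '/vendor/autoload.php';

use thiagoalessio\TesseractOCR\TesseractOCR;

try {
    $text = (new TesseractOCR('/path/to/your/invoice.png')) // hypothetical path
        ->lang('eng', 'deu') // recognize English and German text
        ->psm(6)             // assume a single uniform block of text
        ->run();

    echo $text, PHP_EOL;
} catch (\Exception $e) {
    // The wrapper raises exceptions, for example when the binary or image cannot be found
    fwrite(STDERR, 'OCR failed: ' . $e->getMessage() . PHP_EOL);
}
```

Trying a different segmentation mode is often the quickest experiment when a layout such as a single line, a column, or sparse text is recognized poorly.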

Handling OCR Output inside PHP Apps

The open source Tesseract OCR for PHP library also includes features for saving and working with OCR output inside PHP applications. Output can be saved in formats supported by Tesseract, such as plain text, searchable PDF, hOCR (HTML-based), and TSV. Depending on your application's requirements, you may need to process or analyze the recognized text further. Common tasks include data validation, text cleaning, spell checking, formatting, language-specific modifications, and integration with other systems for advanced processing. Developers can also analyze large volumes of text extracted from documents, social media feeds, or customer feedback for tasks such as sentiment analysis or topic modeling.
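As an illustration of the text cleaning step, the sketch below normalizes raw OCR output using only built-in PHP functions. The helper name cleanOcrText is hypothetical and is not part of the wrapper. The re-joining of hyphenated words is a simple heuristic that can also merge genuine hyphenated compounds that happen to break across lines.

```php
<?php
/**
 * Normalize raw OCR text (hypothetical helper, not part of the wrapper).
 */
function cleanOcrText(string $raw): string
{
    // Unify Windows and old Mac line endings to "\n"
    $text = str_replace(["\r\n", "\r"], "\n", $raw);

    // Heuristic: re-join words hyphenated across a line break ("extrac-\ntion")
    $text = preg_replace('/(\p{L})-\n(\p{L})/u', '$1$2', $text) ?? $text;

    // Collapse runs of spaces and tabs into a single space
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Trim whitespace around each line
    $lines = array_map('trim', explode("\n", $text));

    // Collapse multiple blank lines into a single blank line
    $text = preg_replace('/\n{3,}/', "\n\n", implode("\n", $lines)) ?? $text;

    return trim($text);
}

// Example input resembling noisy OCR output
echo cleanOcrText("Invoice   No.  42\r\n\r\n\r\nData extrac-\ntion  done ");
```

The function takes a plain string, so it can be applied directly to the value returned by run().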

How to Retrieve Image Data, Size & Save It in PDF Format via PHP API?

// Option 1: Using Imagick ($img is an Imagick object)
$data = $img->getImageBlob();
$size = $img->getImageLength();

// Option 2: Using GD ($img is a GD image)
ob_start();
// Note that you can use any format supported by Tesseract
imagepng($img, null, 0);
$size = ob_get_length();
$data = ob_get_clean();

// Pass the in-memory image data to Tesseract instead of a file path
$ocr = new TesseractOCR();
$ocr->imageData($data, $size);
$text = $ocr->run();


// Save the Output to PDF file

echo (new TesseractOCR('img.png'))
    ->configFile('pdf')
    ->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
    ->run();
