Open Source PHP Library for OCR operations on Images
Free PHP Optical Character Recognition API to perform OCR operations on Images, Scanned Documents & PDFs using Tesseract PHP library.
Among the many OCR tools available, Tesseract OCR stands out as one of the most powerful and versatile engines, enabling software developers to build applications that recognize and extract text from images and other visual sources. Tesseract OCR for PHP is a handy wrapper for working with Tesseract OCR inside PHP applications. OCR accuracy usually improves when the image is preprocessed before it is passed to the library. Techniques such as resizing, binarization, noise removal, and deskewing make the text more visible and remove artifacts that may hinder recognition.
The Tesseract OCR for PHP library offers several advanced features and customization options for improving OCR results inside PHP applications. These include handling multilingual documents, specifying the desired language(s) for a recognition run to improve accuracy, choosing a page segmentation mode, and using Tesseract data trained on custom fonts, symbols, or specific text patterns for specialized applications. Typical uses include accessibility tooling, document digitization, text analytics, and data extraction.
Use the Tesseract PHP wrapper to pass the preprocessed image to the Tesseract OCR engine. The wrapper provides methods to run OCR and return the recognized text. The extracted text may need additional post-processing such as spell-checking, formatting, or language-specific adjustments; PHP libraries like symfony/string or Text_LanguageDetect can be used for these purposes. By integrating Tesseract OCR into PHP projects, software developers can streamline document processing, automate data extraction, and make their applications more efficient and accessible.
Getting Started with Tesseract OCR for PHP
The recommended way to install Tesseract OCR for PHP is with Composer. Use the following command for a smooth installation.
Install Tesseract OCR for PHP via Composer
$ composer require thiagoalessio/tesseract_ocr
Install Tesseract OCR for PHP via GitHub
git clone https://github.com/thiagoalessio/tesseract-ocr-for-php.git
Alternatively, you can download the library's source code directly from the GitHub repository.
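Because the wrapper drives the Tesseract engine installed on the system, it can help to confirm that the engine is reachable before running any OCR. The following is a minimal sketch, assuming the package was installed with Composer into `vendor/` and that the wrapper's `version()` and `availableLanguages()` methods behave as described in its README; the helper name `describeTesseract` is made up for illustration.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php'; // assumed Composer layout

use thiagoalessio\TesseractOCR\TesseractOCR;

/**
 * Hypothetical helper: summarise the Tesseract installation the wrapper can see.
 */
function describeTesseract(): string
{
    $ocr = new TesseractOCR();
    $version = $ocr->version();               // version reported by the tesseract binary
    $languages = $ocr->availableLanguages();  // installed language data, e.g. ['eng', 'deu']

    return sprintf('Tesseract %s, languages: %s', $version, implode(', ', $languages));
}
```

Calling `echo describeTesseract();` from a script in the project root should print the engine version and the language codes that can be used for recognition.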
Extract Text from Image inside PHP Apps
The open source Tesseract OCR for PHP library provides useful features for extracting text from images with PHP code, including different page segmentation modes for handling various layouts and text arrangements. Start by loading the image or document that contains the text you want to extract, then pass it to the wrapper, run OCR, and read back the recognized text. The following example shows the basic process of loading an image and extracting text from it.
How to Load Image & Extract Text using PHP Code?
require_once 'vendor/autoload.php'; // Composer autoloader
use thiagoalessio\TesseractOCR\TesseractOCR;
$imagePath = '/path/to/your/image.jpg';
$tesseract = new TesseractOCR($imagePath);
$tesseract->lang('eng'); // Set the desired language for text recognition
$text = $tesseract->run();
echo $text;
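Language selection and page segmentation can be combined on the same call chain. The sketch below assumes Composer's autoloader is at `vendor/autoload.php`, that the requested language data is installed, and that the wrapper's `lang()` and `psm()` options map to Tesseract's `-l` and `--psm` flags as its README describes; the function name `recognizeBlock` is a placeholder.

```php
<?php
require_once __DIR__ . '/vendor/autoload.php'; // assumed Composer layout

use thiagoalessio\TesseractOCR\TesseractOCR;

/**
 * Hypothetical helper: OCR an image that holds one uniform block of text.
 *
 * @param string   $imagePath Path to the image file.
 * @param string[] $languages Tesseract language codes, e.g. ['eng', 'deu'].
 */
function recognizeBlock(string $imagePath, array $languages = ['eng']): string
{
    return (new TesseractOCR($imagePath))
        ->lang(...$languages) // one or more languages for this run
        ->psm(6)              // page segmentation mode 6: a single uniform block of text
        ->run();
}
```

For example, `recognizeBlock('/path/to/your/image.jpg', ['eng', 'deu'])` would treat the image as a single text block containing English and German. Other modes, such as 3 for fully automatic page segmentation, may suit multi-column pages better.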
Handling OCR Output inside PHP Apps
The open source Tesseract OCR for PHP library includes useful features for saving and working with OCR output inside PHP applications. It can save the output in formats supported by Tesseract, such as plain text (TXT), searchable PDF, and hOCR (HTML), and it returns the recognized text so your code can handle it directly. Depending on your application's requirements, you may need to process or analyze the extracted text further. Common tasks include data validation, text cleaning, spell checking, formatting, language-specific adjustments, and integration with other systems for advanced processing. Software developers can then analyze large volumes of text extracted from documents, social media feeds, or customer feedback for insights, sentiment analysis, or topic modeling.
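Text cleaning is often the first of these steps. The following is a small sketch in plain PHP that does not depend on the wrapper; `cleanOcrText` is a hypothetical helper, and the rules it applies (rejoining hyphenated line breaks, collapsing whitespace) are only one reasonable starting point.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: normalise raw OCR output before validation or analysis.
 */
function cleanOcrText(string $raw): string
{
    // Unify Windows and old Mac line endings.
    $text = str_replace(["\r\n", "\r"], "\n", $raw);

    // Rejoin words hyphenated across a line break: "recog-\nnition" becomes "recognition".
    $text = preg_replace('/(\p{L})-\n(\p{L})/u', '$1$2', $text) ?? $text;

    // Collapse runs of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text) ?? $text;

    // Remove spaces left hanging around line breaks.
    $text = preg_replace('/ ?\n ?/', "\n", $text) ?? $text;

    // Keep at most one blank line between paragraphs.
    $text = preg_replace('/\n{3,}/', "\n\n", $text) ?? $text;

    return trim($text);
}
```

The result of `run()` can be passed straight through this function before it is stored, validated, or handed to a spell checker.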
Retrieve Image Data, Size & Save It in PDF Format via PHP API
// $img is an image you have already loaded with Imagick or GD; use one of the two options below.
// Using Imagick
$data = $img->getImageBlob();
$size = $img->getImageLength();
// Using GD
ob_start();
// Note that you can use any format supported by tesseract
imagepng($img, null, 0);
$size = ob_get_length();
$data = ob_get_clean();
// Pass the in-memory image data to Tesseract and run OCR
$ocr = new TesseractOCR();
$ocr->imageData($data, $size);
$text = $ocr->run();
// Save the output to a PDF file
echo (new TesseractOCR('img.png'))
    ->configFile('pdf')
    ->setOutputFile('/PATH_TO_MY_OUTPUTFILE/searchable.pdf')
    ->run();
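The preprocessing mentioned earlier can be done with GD before the image data is handed to the wrapper. The following is a rough sketch, assuming PHP 8 with the GD extension; `preprocessForOcr` is a hypothetical helper, and the scale factor and contrast value are illustrative starting points rather than tuned settings.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: upscale, grayscale and boost contrast, then return PNG bytes.
 */
function preprocessForOcr(\GdImage $img, float $scale = 2.0): string
{
    // Upscale so small glyphs cover more pixels.
    $width  = (int) round(imagesx($img) * $scale);
    $height = (int) round(imagesy($img) * $scale);
    $scaled = imagescale($img, $width, $height);
    if ($scaled === false) {
        throw new RuntimeException('Could not scale image');
    }

    // Drop colour information, then increase contrast (negative values mean more contrast).
    imagefilter($scaled, IMG_FILTER_GRAYSCALE);
    imagefilter($scaled, IMG_FILTER_CONTRAST, -40);

    // Capture the PNG encoding in memory instead of writing a file.
    ob_start();
    imagepng($scaled, null, 0);
    return (string) ob_get_clean();
}
```

The returned string can be passed to `imageData($data, strlen($data))` in place of the `$data` and `$size` values shown above.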