Free Java Library for OCR Text Extraction & Document Analysis
An open source Java OCR library that lets developers add OCR capabilities to Java apps and extract text from images and scanned documents.
In today's digital age, Optical Character Recognition (OCR) has become an essential tool for extracting text from images and scanned documents. OCR technology enables the conversion of printed or handwritten text into machine-readable data, opening up numerous possibilities for document analysis, data extraction, and automation. Among the many OCR solutions available, Tess4J stands out as a powerful open-source library that combines the versatility of the Tesseract OCR engine with the simplicity of Java programming.
The Tess4J library lets Java developers add OCR capabilities to their applications. It is a Java wrapper, built on JNA, for Tesseract, an OCR engine originally developed by Hewlett-Packard and later developed as an open source project under Google's sponsorship. Tesseract is well regarded for its accuracy and relies on machine learning techniques to extract text from images reliably. Because Tess4J runs on the JVM, OCR-enabled applications built with it can run on different platforms, including Windows, Linux, and macOS, provided the native Tesseract libraries are available there.
Tess4J provides a straightforward and well-documented API, making it easy for developers to integrate OCR into their Java applications. With its support for multiple languages, image preprocessing features, PDF conversion capabilities, and confidence scoring system, Tess4J provides an efficient and reliable solution for text extraction and document analysis.
Getting Started with Tess4J
The recommended way to install Tess4J is through Maven. Add the following dependency to your pom.xml, replacing X.X.X with the release you want to use.
Maven Dependency for Tess4J
<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>X.X.X</version>
    </dependency>
</dependencies>
Install Tess4J via GitHub
git clone https://github.com/nguyenq/tess4j.git
You can also install it manually by downloading the latest release files directly from the GitHub repository.
Content Extraction via Java API
The open source Tess4J library allows software developers to extract text from various types of images inside Java applications, so that applications can analyze and process the textual content. This capability is useful in areas such as sentiment analysis, text summarization, and information retrieval. With a few lines of code, developers can load the Tesseract OCR engine, run content extraction on a specified image, and print the extracted text to the console.
Perform Content Extraction using Java OCR Library
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class ContentExtractionExample {
    public static void main(String[] args) {
        // Path to the tessdata directory that holds the language data files
        String tessDataPath = "path/to/tessdata";

        // Initialize the Tesseract instance
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tessDataPath);

        try {
            // Set the language for OCR (e.g., "eng" for English)
            tesseract.setLanguage("eng");

            // Path to the image file for content extraction
            String imagePath = "path/to/image.jpg";

            // Perform content extraction and print the result
            String extractedText = tesseract.doOCR(new File(imagePath));
            System.out.println(extractedText);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
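Recognition quality usually depends on the input image, so a small preprocessing step before OCR can help. The following is a minimal sketch that uses only the JDK: it converts an image to grayscale and upscales it. The class name OcrPreprocessor, the method name prepare, and the scale factor are hypothetical choices for illustration and are not part of Tess4J; whether this step improves results depends on the images involved.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of Tess4J: prepares an image for OCR.
public class OcrPreprocessor {

    /**
     * Returns a grayscale copy of the source image, scaled by the given factor.
     * Drawing onto a TYPE_BYTE_GRAY image performs the color conversion.
     */
    public static BufferedImage prepare(BufferedImage source, double factor) {
        int width = Math.max(1, (int) Math.round(source.getWidth() * factor));
        int height = Math.max(1, (int) Math.round(source.getHeight() * factor));

        BufferedImage result =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = result.createGraphics();
        try {
            // Bicubic interpolation keeps character edges smoother when upscaling
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(source, 0, 0, width, height, null);
        } finally {
            g.dispose();
        }
        return result;
    }
}
```

The prepared image can then be handed to the engine with tesseract.doOCR(preparedImage), since Tess4J's doOCR also accepts a BufferedImage.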
PDF Conversion to Plain Text via Java API
The open source Tess4J library can also turn PDF documents into plain text inside Java applications, so developers can extract content from PDF files for further analysis or data processing. Scanned PDFs need OCR, while a searchable PDF already carries a text layer that a PDF library such as Apache PDFBox can read directly. The following example shows how software developers can convert an existing PDF file into plain text, using Apache PDFBox to load the document and process it page by page.
How to Convert an Existing PDF File into Plain Text?
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class PDFToTextConverter {
    public static void main(String[] args) {
        // Path to the PDF file
        String filePath = "path/to/your/pdf/file.pdf";

        // Create an instance of the Tesseract OCR engine
        Tesseract tesseract = new Tesseract();
        // Set the path to the tessdata directory (containing language data)
        tesseract.setDatapath("path/to/your/tessdata/directory");

        // Load the PDF document (PDFBox 2.x API); try-with-resources closes it
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            PDFRenderer renderer = new PDFRenderer(document);

            // Iterate over each page of the PDF document
            for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
                // Render the page to an image at 300 DPI; doOCR takes an image
                // or a file as input, not a text string
                BufferedImage pageImage = renderer.renderImageWithDPI(pageIndex, 300);

                // Perform OCR on the rendered page
                String ocrText = tesseract.doOCR(pageImage);

                // Output the OCR result
                System.out.println("Page " + (pageIndex + 1) + " OCR Result:");
                System.out.println(ocrText);
                System.out.println("--------------------------------------");
            }
        } catch (IOException | TesseractException e) {
            e.printStackTrace();
        }
    }
}
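Text recognized from document pages often keeps the hard line breaks of the original layout, including words hyphenated across lines, which can get in the way of later analysis. The sketch below shows one possible cleanup step using only the JDK. The class name OcrTextCleaner and the method name clean are hypothetical and not part of Tess4J, and the hyphen rule is a simple heuristic: it also merges genuine hyphenated compounds that happen to break at a line end.

```java
// Hypothetical helper, not part of Tess4J: tidies raw OCR output.
public class OcrTextCleaner {

    /**
     * Joins words that were hyphenated across a line break, then collapses
     * every run of whitespace (including newlines) into a single space.
     */
    public static String clean(String raw) {
        if (raw == null) {
            return "";
        }
        // "docu-\nment" becomes "document"; \R matches any line break sequence
        String joined = raw.replaceAll("(\\p{L})-\\s*\\R\\s*(\\p{L})", "$1$2");
        // Collapse the remaining whitespace and trim both ends
        return joined.replaceAll("\\s+", " ").trim();
    }
}
```

A call such as OcrTextCleaner.clean(ocrText) could be applied to each page's result before it is printed or stored.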