Free Java Library for OCR Text Extraction & Document Analysis

An open source Java OCR library that adds OCR capabilities to Java apps and extracts text from images as well as scanned documents.

In today's digital age, Optical Character Recognition (OCR) has become an essential tool for extracting text from images and scanned documents. OCR technology enables the conversion of printed or handwritten text into machine-readable data, opening up numerous possibilities for document analysis, data extraction, and automation. Among the many OCR solutions available, Tess4J stands out as a powerful open-source library that combines the versatility of the Tesseract OCR engine with the simplicity of Java programming.

The Tess4J library lets Java developers incorporate OCR capabilities into their applications. It is a Java wrapper for Tesseract, an OCR engine originally developed by Hewlett-Packard and later developed further as an open source project with sponsorship from Google. Tesseract is known for its accuracy and uses machine learning techniques to extract text reliably from images. Tess4J can be used on different platforms, including Windows, Linux, and macOS.

Tess4J provides a straightforward and well-documented API, making it easy for developers to integrate OCR capabilities into their Java applications. With its support for multiple languages, image preprocessing features, PDF conversion capabilities, and confidence scoring, Tess4J provides an efficient and reliable solution for text extraction and document analysis.

At A Glance

An overview of Tess4J features.

Features Overview

Perform OCR
Add OCR Capabilities
Recognize Image text
Convert images of text
Recognize Font text
Search PDF
Over 100 Languages
Create OCR apps
Save to browser
Extract Text
Multi-threading Support

Tess4J

Tess4J supports popular image file formats listed below.

Reader

PNG, JPEG, BMP, TIFF, TGA, DICOM

Writer

PNG, JPEG, BMP, TIFF
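
An application that accepts arbitrary uploads may want to check the file type before starting OCR. The following is a minimal sketch of such a check against the reader formats listed above. The class `SupportedFormats` is hypothetical and not part of Tess4J, and the mapping from format names to file extensions (for example `dcm` for DICOM) is an assumption.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tess4J: checks a file name against the
// reader formats listed above before the file is handed to the OCR engine.
final class SupportedFormats {
    // Assumed extensions for PNG, JPEG, BMP, TIFF, TGA and DICOM.
    private static final Set<String> READABLE = Set.of(
            "png", "jpg", "jpeg", "bmp", "tif", "tiff", "tga", "dcm");

    private SupportedFormats() {
    }

    // Returns true when the file name ends in one of the assumed extensions.
    static boolean isReadable(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return false; // no extension present
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return READABLE.contains(extension);
    }
}
```

A caller could skip or reject a file when `SupportedFormats.isReadable(fileName)` returns false. The check looks only at the name, not at the file contents.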

Platform Independence

Tess4J can work with any Java-based programming language

Java

Getting Started with Tess4J

The recommended way to install Tess4J is using Maven. Add the following dependency to your pom.xml for a smooth installation.

Maven Dependency for Tess4J


<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>X.X.X</version>
    </dependency>
</dependencies>

Install Tess4J via GitHub

 git clone https://github.com/nguyenq/tess4j.git

You can also install it manually by downloading the latest release files directly from the GitHub repository.

Content Extraction via Java API

The open source Tess4J library allows software developers to extract text from various types of images inside Java applications, so that applications can analyze and process the textual content. This capability is useful in areas such as sentiment analysis, text summarization, and information retrieval. The example below loads the Tesseract OCR engine, performs content extraction on the specified image, and prints the extracted text to the console.

Perform Content Extraction using Java OCR Library

import java.io.File;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class ContentExtractionExample {
    public static void main(String[] args) {
        // Path to the tessdata directory (containing the language data files)
        String tessDataPath = "path/to/tessdata";

        // Initialize Tesseract instance
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tessDataPath);

        try {
            // Set the language for OCR (e.g., "eng" for English)
            tesseract.setLanguage("eng");

            // Path to the image file for content extraction
            String imagePath = "path/to/image.jpg";

            // Perform content extraction
            String extractedText = tesseract.doOCR(new File(imagePath));
            System.out.println(extractedText);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
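
Scans with uneven contrast often benefit from a preprocessing pass before recognition. The following is a minimal sketch, using only the standard `java.awt.image` classes, that converts an image to black and white with a fixed luminance threshold. The class name `OcrPreprocessor` and the choice of a fixed threshold are illustrative assumptions, not part of Tess4J; real scans may need an adaptive method instead.

```java
import java.awt.image.BufferedImage;

// Hypothetical preprocessing helper, not part of Tess4J.
final class OcrPreprocessor {
    private OcrPreprocessor() {
    }

    // Returns a black-and-white copy of the source image. Pixels whose
    // luminance is at or above the threshold (0-255) become white,
    // all others become black.
    static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int red = (rgb >> 16) & 0xFF;
                int green = (rgb >> 8) & 0xFF;
                int blue = rgb & 0xFF;

                // Integer approximation of the common luma weights
                int luminance = (299 * red + 587 * green + 114 * blue) / 1000;

                result.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return result;
    }
}
```

The returned `BufferedImage` could then be passed to `tesseract.doOCR(...)` in place of the `File` used in the example above, for instance after loading the source with `ImageIO.read(new File(imagePath))`. A threshold of 128 is a reasonable starting point to experiment with, not a recommended value.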

PDF Conversion to Plain Text via Java API

The open source Tess4J library provides functionality for loading PDF documents and converting them into plain text inside Java applications, so developers can extract content from PDF files and perform further analysis or data processing. The following example shows how software developers can convert an existing PDF file into plain text inside Java applications.

How to Convert an Existing PDF File into Plain Text?

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class PDFToTextConverter {
    public static void main(String[] args) {
        // Path to the PDF file
        String filePath = "path/to/your/pdf/file.pdf";

        // Create an instance of Tesseract OCR engine
        Tesseract tesseract = new Tesseract();

        // Set the path to the tessdata directory (containing language data)
        tesseract.setDatapath("path/to/your/tessdata/directory");

        // Load the PDF document (PDFBox 2.x API); try-with-resources closes it
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            // The renderer turns PDF pages into images that Tesseract can read
            PDFRenderer renderer = new PDFRenderer(document);

            // Iterate over each page of the PDF document
            for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
                // Render the current page to an image; doOCR accepts images
                // or files, not strings. 300 DPI is an assumed resolution.
                BufferedImage pageImage = renderer.renderImageWithDPI(pageIndex, 300);

                // Perform OCR on the rendered page
                String ocrText = tesseract.doOCR(pageImage);

                // Output the OCR result
                System.out.println("Page " + (pageIndex + 1) + " OCR Result:");
                System.out.println(ocrText);
                System.out.println("--------------------------------------");
            }
        } catch (IOException | TesseractException e) {
            e.printStackTrace();
        }
    }
}
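
The example above prints each page to the console. To end up with an actual plain text file, the per-page results can be collected and written out in one step. The following is a minimal sketch using only the Java standard library. The class `PlainTextWriter` is hypothetical, and joining pages with a form feed character is an assumed convention for marking page breaks, not something Tess4J prescribes.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Tess4J: stores OCR results as a text file.
final class PlainTextWriter {
    // Form feed, used here as an assumed page-break marker between pages.
    static final String PAGE_SEPARATOR = "\f";

    private PlainTextWriter() {
    }

    // Joins the per-page texts and writes them to the target file as UTF-8,
    // replacing the file if it already exists. Returns the target path.
    static Path write(List<String> pageTexts, Path target) throws IOException {
        String joined = String.join(PAGE_SEPARATOR, pageTexts);
        return Files.write(target, joined.getBytes(StandardCharsets.UTF_8));
    }
}
```

In the conversion loop, each `ocrText` value could be added to a `List<String>`, and `PlainTextWriter.write(pages, Paths.get("output.txt"))` called once after the last page; the output file name here is only a placeholder.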