Free Java Library for OCR Text Extraction & Document Analysis

An open source Java OCR library that adds OCR capabilities to Java apps and extracts text from images as well as scanned documents.

For software developers seeking an accessible Java OCR API, Tess4J is an open source Java OCR library that integrates the well-known Tesseract engine into Java applications. It is designed to perform OCR operations on JPEG, BMP, and PNG files, as well as a range of other image formats. It lets applications recognize text in images and convert static pictures into machine-readable data. This capability is fundamental for automating data entry, digitizing archives, and building systems that rely on Java optical character recognition to process scanned documents and digital images.

A key strength of this free Java OCR library is its broad format support. Developers can use its straightforward API to extract text from PNG images, perform OCR operations on GIF files, and handle other common image formats. Tess4J is a Java JNA wrapper for the Tesseract OCR engine, which was originally developed at HP and later sponsored by Google. It works across platforms and supports features such as image preprocessing and multi-language recognition, which help improve text extraction from low-quality scans. This makes it a practical choice for projects that need to convert visual information into usable text.


Getting Started with Tess4J

The recommended way to install Tess4J is through Maven. Add the following dependency to your pom.xml, replacing X.X.X with the release version you want to use.

Maven Dependency for Tess4J


<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>X.X.X</version>
    </dependency>
</dependencies>

Install Tess4J via GitHub

 git clone https://github.com/nguyenq/tess4j.git  

You can also install it manually by downloading the latest release files directly from the GitHub repository.

Content Extraction via Java API

The open source Tess4J library allows software developers to extract text from various types of images inside Java applications, so that applications can analyze and process the textual content. This capability is useful in areas such as sentiment analysis, text summarization, and information retrieval. The example below loads the Tesseract OCR engine, performs content extraction on the specified image, and prints the extracted text to the console.

How to Perform Content Extraction using Java OCR Library?

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class ContentExtractionExample {
    public static void main(String[] args) {
        // Path to the tessdata directory (contains the language data files, e.g. eng.traineddata)
        String tessDataPath = "path/to/tessdata";

        // Initialize Tesseract instance
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath(tessDataPath);

        try {
            // Set the language for OCR (e.g., "eng" for English)
            tesseract.setLanguage("eng");

            // Path to the image file for content extraction
            String imagePath = "path/to/image.jpg";

            // Perform content extraction
            String extractedText = tesseract.doOCR(new File(imagePath));
            System.out.println(extractedText);
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
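
Low-quality scans often benefit from the image preprocessing mentioned earlier. The following is a minimal sketch of one common step, converting an image to pure black and white with a fixed threshold, using only the standard java.awt classes. The class name OcrPreprocessor, the method name binarize, and the threshold value are illustrative assumptions, not part of the Tess4J API.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of Tess4J): prepares an image for OCR.
public class OcrPreprocessor {

    // Converts each pixel to a luminance value and maps it to black or white.
    // Pixels with luminance >= threshold become white, all others black.
    public static BufferedImage binarize(BufferedImage source, int threshold) {
        int width = source.getWidth();
        int height = source.getHeight();
        BufferedImage result =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY);

        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int rgb = source.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;

                // Integer approximation of the common luma weights (0.299, 0.587, 0.114)
                int luminance = (299 * r + 587 * g + 114 * b) / 1000;

                result.setRGB(x, y, luminance >= threshold ? 0xFFFFFFFF : 0xFF000000);
            }
        }
        return result;
    }
}
```

The resulting BufferedImage can be passed to the doOCR overload of Tesseract that accepts a BufferedImage instead of a File. A fixed threshold such as 128 is only a starting point; unevenly lit scans may need a different value or an adaptive method.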

PDF Conversion to Plain Text via Java API

The open source Tess4J library can also be used to convert PDF documents into plain text inside Java applications. This lets developers extract content from PDF files, including scanned ones, and perform further analysis or data processing. The following example shows how software developers can convert an existing PDF file into plain text inside Java applications.

How to Convert an Existing PDF File into Plain Text?

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;

public class PDFToTextConverter {
    public static void main(String[] args) {
        // Path to the PDF file
        String filePath = "path/to/your/pdf/file.pdf";

        // Create an instance of Tesseract OCR engine
        Tesseract tesseract = new Tesseract();

        // Set the path to the tessdata directory (containing language data)
        tesseract.setDatapath("path/to/your/tessdata/directory");

        // Load the PDF document; try-with-resources closes it automatically.
        // PDDocument.load is the PDFBox 2.x call (PDFBox 3.x uses Loader.loadPDF instead).
        try (PDDocument document = PDDocument.load(new File(filePath))) {
            // The renderer turns PDF pages into images that Tesseract can read
            PDFRenderer renderer = new PDFRenderer(document);

            // Iterate over each page of the PDF document
            for (int pageIndex = 0; pageIndex < document.getNumberOfPages(); pageIndex++) {
                // Render the current page to an image; 300 DPI is a common choice for OCR
                BufferedImage pageImage = renderer.renderImageWithDPI(pageIndex, 300);

                // Perform OCR on the rendered page image
                String ocrText = tesseract.doOCR(pageImage);

                // Output the OCR result
                System.out.println("Page " + (pageIndex + 1) + " OCR Result:");
                System.out.println(ocrText);
                System.out.println("--------------------------------------");
            }
        } catch (IOException | TesseractException e) {
            e.printStackTrace();
        }
    }
}
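
Text recognized from scanned pages usually needs some cleanup before further analysis or data processing. The sketch below shows one possible post-processing step using only the Java standard library. The class name OcrTextCleaner and its rules are illustrative assumptions rather than anything Tess4J provides, and the hyphen-joining rule is a heuristic that can also merge genuinely hyphenated compounds that happen to break at a line end.

```java
// Hypothetical helper (not part of Tess4J): tidies raw OCR output.
public class OcrTextCleaner {

    public static String clean(String rawText) {
        return rawText
                // Normalize Windows line endings
                .replace("\r\n", "\n")
                // Rejoin words that were hyphenated across a line break
                .replaceAll("(\\p{L})-\\n(\\p{L})", "$1$2")
                // Collapse runs of spaces and tabs into a single space
                .replaceAll("[ \\t]+", " ")
                // Remove spaces directly before or after a line break
                .replaceAll(" ?\\n ?", "\n")
                // Allow at most one blank line between paragraphs
                .replaceAll("\\n{3,}", "\n\n")
                .trim();
    }
}
```

The method could be applied to the string returned by doOCR for each page before the text is stored or analyzed.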
