Open Source .NET API for OCR To Process Text & Images

Open Source .NET Optical Character Recognition (OCR) API used to convert images (scanned images & PDF files) containing text into machine-readable text.

Tesseract is a powerful open source optical character recognition (OCR) engine that converts many types of images containing text into machine-readable text. The library described here is a .NET wrapper for tesseract-ocr that lets software developers run the engine inside their own .NET applications. Open source tools like this make it easier for developers to access and integrate powerful libraries in their applications. The wrapper can be used in a wide range of applications, from document scanning and data extraction to automated image recognition and translation.

Tesseract was originally developed in the 1980s by Hewlett-Packard and was released as an open source project in 2005. Since then, it has become one of the most widely used OCR engines in the world, with support for Unicode (UTF-8), over 100 languages, and a wide range of image formats. The API covers tasks such as document scanning, document digitization, making documents searchable, creating machine-readable documents, and tuning OCR performance.

Tesseract is easy to use and is designed to recognize text in digital images in common formats such as JPEG, BMP, PNG, and TIFF. The library is highly customizable, with options that can be used to tune OCR performance for different types of images and text. Whether you're working on document scanning and digitization, data extraction, or image recognition and translation, Tesseract offers a powerful and reliable solution.

At A Glance

An overview of Tesseract features.

Features Overview

Perform OCR
Add OCR Capabilities
Recognize Image text
Convert images of text
Recognized Font text
Search PDF
Over 100 Languages
Create OCR apps
Save to browser
Extract Text
Multi-threading Support

Tesseract

Tesseract supports popular image file formats listed below.

Reader

PNG, JPEG, BMP, TIFF, TGA, DICOM

Writer

PNG, JPEG, BMP, TIFF

Platform Independence

Tesseract can work with any .NET programming language

.NET Framework 4.8

Getting Started with Tesseract

The recommended way to install Tesseract is via NuGet. Run the following command in the Package Manager Console.

Install Tesseract via NuGet

 Install-Package Tesseract

Install Tesseract via GitHub

 git clone https://github.com/charlesw/tesseract.git

Extract Basic Text from an Image via C#

The open source C# library Tesseract enables software developers to extract text from an image inside their own .NET applications. This makes it easy to retrieve the text content of scanned documents or images and use it for further processing or analysis. To do so, import the Tesseract namespace in your code file, create an instance of the Tesseract engine, and process the image with it. The following example extracts the text from an image and writes it to the console.

How to Extract Basic Text from an Image via the C# API

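using System;
using Tesseract;

namespace MyNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // "./tessdata" must contain the trained data for the chosen language (eng.traineddata).
            using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
            // Load the image as a Pix, so no System.Drawing reference is needed.
            using (var image = Pix.LoadFromFile(@"C:\path\to\your\image.jpg"))
            using (var page = engine.Process(image))
            {
                // Read the text while the page is still alive; the using blocks dispose everything afterwards.
                var text = page.GetText();
                Console.WriteLine(text);
            }
        }
    }
}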
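
The customization options mentioned in the overview are set on the engine before an image is processed. The snippet below is a minimal sketch rather than a recommended configuration: it loads two languages at once, sets one Tesseract configuration variable, and tells the engine to treat the image as a single block of text. The file name, the "eng+deu" language pair, and the chosen variable are illustrative assumptions. Both eng.traineddata and deu.traineddata would need to be present in ./tessdata, and whether a given variable has any effect depends on the underlying Tesseract version and engine mode.

```csharp
using System;
using Tesseract;

class OptionsExample
{
    static void Main()
    {
        // "eng+deu" asks Tesseract to load both language models (assumed to be installed).
        using (var engine = new TesseractEngine("./tessdata", "eng+deu", EngineMode.Default))
        {
            // Set a Tesseract configuration variable by name; the call returns false if it is rejected.
            bool accepted = engine.SetVariable("preserve_interword_spaces", "1");
            Console.WriteLine("Variable accepted: {0}", accepted);

            using (var image = Pix.LoadFromFile("letter.png")) // hypothetical input file
            // SingleBlock assumes the image holds one uniform block of text.
            using (var page = engine.Process(image, PageSegMode.SingleBlock))
            {
                Console.WriteLine(page.GetText());
            }
        }
    }
}
```

Choosing a page segmentation mode that matches the layout (a single line, a single block, a full page) is often the simplest adjustment to try when the default result is poor.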

Convert Image to Searchable PDF via C# .NET

The open source C# library Tesseract includes features for converting images to searchable PDF documents from C# code. It also supports other output formats, such as plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, and ALTO. OCR quality depends heavily on the input, so developers should improve the quality of the images before passing them to Tesseract. Every output format starts from the same step, which is processing the image with the engine. The following example performs that step and prints the recognized text together with its mean confidence.

How to Convert Image to Searchable PDF using C# .NET

using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
{
    // testImagePath is the path to the input image, defined elsewhere.
    using (var img = Pix.LoadFromFile(testImagePath))
    {
        using (var page = engine.Process(img))
        {
            var text = page.GetText();
            Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());

            Console.WriteLine("Text (GetText): \r\n{0}", text);
        }
    }
}
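
The listing above prints the recognized text to the console. To write a searchable PDF instead, the wrapper provides result renderers. The following is a minimal sketch under a few assumptions: it targets a wrapper version in which ResultRenderer.CreatePdfRenderer takes an output base name, a font directory, and a text-only flag (older releases may expose a different signature); the tessdata folder also contains the pdf.ttf font file that Tesseract's PDF output relies on; and the file names and document title are placeholders.

```csharp
using System;
using Tesseract;

class SearchablePdfExample
{
    static void Main()
    {
        const string tessData = "./tessdata";        // assumed to also contain pdf.ttf
        const string inputImage = "scan.png";        // hypothetical input file
        const string outputBase = "scan-searchable"; // output name without extension; ".pdf" is appended

        using (var engine = new TesseractEngine(tessData, "eng", EngineMode.Default))
        // textonly: false keeps the page image and lays the recognized text invisibly over it.
        using (var renderer = ResultRenderer.CreatePdfRenderer(outputBase, tessData, false))
        // The document is closed when this scope is disposed, after the page has been added.
        using (renderer.BeginDocument("Scanned document"))
        using (var img = Pix.LoadFromFile(inputImage))
        using (var page = engine.Process(img, inputImage))
        {
            if (!renderer.AddPage(page))
            {
                Console.Error.WriteLine("Failed to add the page to the PDF.");
            }
        }
    }
}
```

Passing true for the text-only flag should correspond to the invisible-text-only PDF output mentioned above, which omits the page image. For multi-page documents, the same renderer can receive one AddPage call per processed image inside a single BeginDocument scope.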