Open Source .NET API for OCR To Process Text & Images
Open Source .NET Optical Character Recognition (OCR) API used to Convert Images (Scanned Images & PDF Files) Containing Text into Machine-Readable Text.
What is Tesseract for .NET?
For software engineers and web developers looking for a cost-effective yet capable solution, Tesseract is an open source OCR engine that can be used from .NET. It integrates into your own C# applications, letting you run OCR on scanned images and convert image content to text. The library recognizes text in image files across common formats such as JPEG, PNG, TIFF, and BMP. This makes Tesseract a solid foundation for document digitization and automated data extraction, without the cost of proprietary software.
The Tesseract C# API builds on a mature engine that was originally developed at Hewlett-Packard and later sponsored by Google. It converts scanned images into machine-readable text, and PDF pages can be processed the same way once they are rendered to images. With support for over 100 languages, the engine is well suited to creating searchable archives and managing multilingual content. It is also highly configurable, so developers can tune it for specific use cases when building applications that need accurate text recognition.
Getting Started with Tesseract
The recommended way to install Tesseract is via NuGet. Use the following command to install the package.
Install Tesseract via NuGet
Install-Package Tesseract
Install Tesseract via GitHub
git clone https://github.com/charlesw/tesseract.git
Extract Basic Text from an Image via C#
The open source C# library Tesseract enables software developers to extract text from an image inside their own .NET applications. It retrieves the text content of scanned documents or images so that it can be used for further processing or analysis. To do this, first import the Tesseract namespace in your code file and create an instance of the Tesseract engine. The following example shows how to extract the text from an image and write it to the console.
How to Extract the Basic Text from Image via C# API?
using System;
using Tesseract;

namespace MyNamespace
{
    class Program
    {
        static void Main(string[] args)
        {
            // "./tessdata" must contain the language data file (eng.traineddata).
            using (var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default))
            // Pix.LoadFromFile loads the image without depending on System.Drawing.
            using (var image = Pix.LoadFromFile(@"C:\path\to\your\image.jpg"))
            using (var page = engine.Process(image))
            {
                // The using statements dispose page, image, and engine in reverse order.
                Console.WriteLine(page.GetText());
            }
        }
    }
}
Convert Image to Searchable PDF via C# .NET
The open source C# library Tesseract includes features for converting images to searchable PDF documents using C# code. It also supports other output formats, such as plain text, hOCR (HTML), PDF, invisible-text-only PDF, TSV, and ALTO. For better OCR results, developers should improve the quality of the images they pass to Tesseract. The following example runs OCR on an image and prints the recognized text together with its mean confidence, which is the recognition step that a searchable PDF workflow builds on.
How to Convert Image to Searchable PDF using C# .NET?
using System;
using Tesseract;

// Placeholder path; point this at the image to recognize.
var testImagePath = @"C:\path\to\your\image.jpg";

using (var engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
{
    using (var img = Pix.LoadFromFile(testImagePath))
    {
        using (var page = engine.Process(img))
        {
            var text = page.GetText();
            // Report how confident the engine is in the page as a whole.
            Console.WriteLine("Mean confidence: {0}", page.GetMeanConfidence());
            Console.WriteLine("Text (GetText): \r\n{0}", text);
        }
    }
}
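The recognition step can be extended to write an actual searchable PDF. The following is a minimal sketch, not a definitive implementation. It assumes the result renderer API of the charlesw/tesseract wrapper (ResultRenderer.CreatePdfRenderer, BeginDocument, and AddPage) and assumes that the tessdata folder also contains the pdf.ttf font file that Tesseract's PDF output relies on. All paths and the document title are placeholders.

```csharp
using System;
using Tesseract;

class SearchablePdfSketch
{
    static void Main()
    {
        // Placeholder paths (assumptions); adjust them to your environment.
        const string tessDataPath = "./tessdata";
        const string imagePath = @"C:\path\to\your\image.jpg";
        // Output path without extension; the renderer is expected to append ".pdf".
        const string outputBase = @"C:\path\to\your\output";

        using (var engine = new TesseractEngine(tessDataPath, "eng", EngineMode.Default))
        // Second argument: folder holding pdf.ttf (assumed to be the tessdata folder).
        // Third argument: false keeps the page image and adds the text layer to it.
        using (var renderer = ResultRenderer.CreatePdfRenderer(outputBase, tessDataPath, false))
        // BeginDocument returns a handle that closes the document when disposed.
        using (renderer.BeginDocument("OCR result"))
        using (var img = Pix.LoadFromFile(imagePath))
        // Passing the image path as the input name lets the renderer identify the source.
        using (var page = engine.Process(img, imagePath))
        {
            bool added = renderer.AddPage(page);
            Console.WriteLine(added ? "Page added to PDF." : "Failed to add page.");
        }
    }
}
```

The stacked using statements dispose resources in reverse order, so the page and image are released first, then the document is closed, and the renderer and engine are disposed last. For a multi-page document, the same renderer could receive several AddPage calls inside one BeginDocument scope.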