Free Node.js API to Add OCR Capabilities to JS Projects.

Open Source Node.js OCR Library That Allows Programmers to Recognize & Extract Text from Various File Formats, Including Images (JPEG, PNG), PDFs, and Documents, for Free in Multiple Languages.

What is Node-Tesseract-OCR?

Extracting text from images and documents is a crucial task in industries such as document management, data processing, and artificial intelligence. Optical Character Recognition (OCR) technology converts scanned documents, images, and PDFs into editable text. Node-Tesseract-OCR is an open source API that builds on the Tesseract OCR engine to provide an efficient way to perform OCR tasks in Node.js applications.

Node-Tesseract-OCR is a Node.js wrapper for the Tesseract OCR engine that lets software developers use Tesseract's text recognition features within a Node.js environment. The API is maintained in its GitHub repository and suits a range of use cases, from simple text extraction to more complex document processing tasks, including text extraction from images and documents in multiple languages.

Because OCR accuracy depends heavily on input quality, Node-Tesseract-OCR is typically combined with image processing steps such as filtering, resizing, and cropping so that the extracted text is accurate and reliable. Through Tesseract it supports over 100 languages, which makes it suitable for OCR tasks in diverse environments. Software developers can extract text from images, PDFs, and documents, receive it as plain text, and convert it in their own code to formats such as JSON or XML. The library is designed to be lightweight, flexible, and easy to use, and its promise-based interface reports failures as ordinary errors, making it a good fit for developers who want to add OCR capabilities to their projects.


Getting Started with Node-Tesseract-OCR

The recommended way to install Node-Tesseract-OCR is via npm. Use the following command for a smooth installation.

Install Node-Tesseract-OCR via npm

npm install node-tesseract-ocr 

You can also install it manually by downloading the latest release files directly from the GitHub repository.
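Because Node-Tesseract-OCR is a wrapper, it relies on the Tesseract engine being installed on the machine as well. The following is a minimal sketch, assuming the engine is exposed as a `tesseract` executable on the PATH. The helper name `getTesseractVersion` is hypothetical and is not part of the library.

```javascript
const { spawnSync } = require("node:child_process");

// Hypothetical helper (not part of node-tesseract-ocr): returns the first line
// of `tesseract --version`, or null when the binary cannot be started.
function getTesseractVersion(binary = "tesseract") {
  const result = spawnSync(binary, ["--version"], { encoding: "utf8" });
  if (result.error || result.status !== 0) {
    return null; // Binary missing from PATH or it exited with an error
  }
  // Some Tesseract builds print the version to stderr, so check both streams.
  const output = `${result.stdout}${result.stderr}`;
  return output.split(/\r?\n/)[0].trim();
}

console.log(getTesseractVersion() ?? "Tesseract was not found on the PATH");
```

Running a check like this at application startup tends to give a clearer message than a failure in the middle of the first recognition call.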

Text Extraction from Images in Node.js API

The open source Node-Tesseract-OCR library makes it easy for software developers to create applications that automatically extract text from images inside Node.js applications. It supports text extraction from scanned documents, PDFs, camera photos or photos of receipts. This can be useful for creating searchable archives, automating data entry, or processing large volumes of documents in sectors like finance and healthcare. Here is a simple example that shows how to programmatically extract text from images inside Node.js applications.

How to Extract Text from Images inside Node.js Environment?

const tesseract = require("node-tesseract-ocr");

tesseract.recognize("path/to/image.jpg")
  .then(text => {
    console.log("Recognized Text:", text);
  })
  .catch(error => {
    console.error("Error:", error.message);
  });
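For larger volumes of documents, the same call can be wrapped in a small batch helper. The sketch below is illustrative only: `recognizeAll` and `loadRecognize` are hypothetical names, and the helper assumes that `recognize(path, config)` returns a promise of the recognized text, as in the example above. Images are processed one at a time so that a single failure does not stop the rest of the batch.

```javascript
// Load the wrapper lazily so the helper can also be tried with a stub function.
function loadRecognize() {
  return require("node-tesseract-ocr").recognize;
}

// Hypothetical batch helper: runs OCR on each image in turn and records
// either the trimmed text or the error message for that image.
async function recognizeAll(imagePaths, config = { lang: "eng" }, recognize = loadRecognize()) {
  const results = [];
  for (const imagePath of imagePaths) {
    try {
      const text = await recognize(imagePath, config);
      results.push({ imagePath, ok: true, text: text.trim() });
    } catch (error) {
      results.push({ imagePath, ok: false, error: error.message });
    }
  }
  return results;
}

// Usage (paths are placeholders):
// recognizeAll(["path/to/receipt-1.jpg", "path/to/receipt-2.jpg"])
//   .then(results => console.log(results));
```

Sequential processing keeps memory and CPU use predictable; running several recognitions in parallel is possible but starts one Tesseract process per image.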

Better Image Preprocessing inside Node.js

Preprocessing images before applying OCR can significantly improve the accuracy of text recognition. Node-Tesseract-OCR can be combined with basic preprocessing techniques such as resizing, binarization, and deskewing, implemented with additional Node.js libraries like sharp or jimp. The following example shows how software developers can use preprocessing steps to improve recognition, especially with lower-quality images.

How to Apply Preprocessing Steps to Improve Recognition via Node.js API?

const sharp = require("sharp");
const tesseract = require("node-tesseract-ocr");

sharp("path/to/input.jpg")
  .resize(800, 600, { fit: "inside" }) // Resize within 800x600 without cropping or distorting the text
  .greyscale() // Convert to greyscale
  .toBuffer()
  .then(data => {
    return tesseract.recognize(data, { lang: "eng" });
  })
  .then(text => {
    console.log("Preprocessed Image Text:", text);
  })
  .catch(error => {
    console.error("Error:", error.message);
  });
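Binarization, mentioned above, is not shown in the example. As a rough illustration of the idea, the sketch below applies Otsu's method to a single-channel, 8-bit greyscale pixel buffer. It assumes the pixels have already been decoded into a `Uint8Array` (for example with an image library), and the function names `otsuThreshold` and `binarize` are hypothetical, not part of Node-Tesseract-OCR or sharp.

```javascript
// Otsu's method: pick the grey level that maximises the variance between
// the "dark" and "light" pixel classes.
function otsuThreshold(pixels) {
  const histogram = new Array(256).fill(0);
  for (const value of pixels) histogram[value] += 1;

  const total = pixels.length;
  let sumAll = 0;
  for (let level = 0; level < 256; level += 1) sumAll += level * histogram[level];

  let sumDark = 0;
  let countDark = 0;
  let bestVariance = -1;
  let bestThreshold = 0;

  for (let level = 0; level < 256; level += 1) {
    countDark += histogram[level];
    if (countDark === 0) continue; // No pixels at or below this level yet
    const countLight = total - countDark;
    if (countLight === 0) break; // Every pixel is already in the dark class

    sumDark += level * histogram[level];
    const meanDark = sumDark / countDark;
    const meanLight = (sumAll - sumDark) / countLight;
    const variance = countDark * countLight * (meanDark - meanLight) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      bestThreshold = level;
    }
  }
  return bestThreshold;
}

// Map every pixel to pure black (0) or pure white (255).
function binarize(pixels, threshold = otsuThreshold(pixels)) {
  return Uint8Array.from(pixels, value => (value > threshold ? 255 : 0));
}

// Tiny synthetic "image": three dark pixels followed by three light ones.
const sample = Uint8Array.from([10, 12, 11, 200, 205, 210]);
console.log(Array.from(binarize(sample)));
```

In practice an image library's built-in threshold operation is usually simpler; the point here is only to show what the binarization step does to the pixel data before OCR.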

Recognize Text in Multiple Languages

One of the standout features of Node-Tesseract-OCR is its extensive multi-language support. The Tesseract OCR library supports over 100 languages, making it a good choice for applications that need to process documents in various languages. Software developers can specify the language(s) they want Tesseract to use, improving recognition accuracy for non-English texts. Here is an example that shows how software developers can recognize text in French inside Node.js applications.

How to Recognize Text from an Image in French via JavaScript API?

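const tesseract = require("node-tesseract-ocr");

const config = {
  lang: "fra", // French language support (the French language data must be installed)
  oem: 1, // OCR engine mode: LSTM neural network engine
  psm: 3 // Page segmentation mode: fully automatic, without orientation detection
};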

tesseract.recognize("path/to/french-text-image.jpg", config)
  .then(text => {
    console.log("Recognized Text in French:", text);
  })
  .catch(error => {
    console.error("Error:", error.message);
  });
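Tesseract also accepts several languages at once, written as codes joined with a plus sign (for example `eng+fra`). The sketch below builds a config object of the same shape as the one above from a list of language codes. The helper name `buildOcrConfig` is hypothetical, and the accepted ranges for `oem` (0 to 3) and `psm` (0 to 13) are assumptions based on current Tesseract releases.

```javascript
// Hypothetical helper: builds a { lang, oem, psm } config for one or more languages.
function buildOcrConfig(languages, { oem = 1, psm = 3 } = {}) {
  const valid = Array.isArray(languages)
    && languages.length > 0
    && languages.every(code => typeof code === "string" && code.length > 0);
  if (!valid) {
    throw new TypeError("languages must be a non-empty array of language codes");
  }
  if (!Number.isInteger(oem) || oem < 0 || oem > 3) {
    throw new RangeError("oem is expected to be an integer from 0 to 3");
  }
  if (!Number.isInteger(psm) || psm < 0 || psm > 13) {
    throw new RangeError("psm is expected to be an integer from 0 to 13");
  }
  // Tesseract expects multiple languages joined with "+", e.g. "eng+fra".
  return { lang: languages.join("+"), oem, psm };
}

console.log(buildOcrConfig(["eng", "fra"]));
// Usage with the wrapper (path is a placeholder):
// tesseract.recognize("path/to/mixed-text-image.jpg", buildOcrConfig(["eng", "fra"]));
```

Each listed language needs its trained data installed alongside Tesseract, and adding languages that do not appear in the document can slow recognition down without improving accuracy.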