
Free Ruby Library to Load & Extract Text from Images

Open Source Ruby OCR API that allows Software Developers to Load, Recognize and Extract Text from Images (scanned images & PDF files)

What is Ruby-Tesseract-OCR?

Optical Character Recognition (OCR) is the technology that lets computers extract and digitize text from images, PDFs, and scanned documents. For developers working in Ruby, the Ruby-Tesseract-OCR gem provides a convenient way to use it. This open source library is a Ruby wrapper for the Tesseract OCR engine, which was originally developed at Hewlett-Packard, was later sponsored by Google, and is known for its accuracy and broad language support. By integrating this gem, Ruby developers can add OCR capabilities to their applications to automate data entry, digitize printed materials, and streamline document processing.

The Ruby-Tesseract-OCR library goes beyond basic text extraction. Software developers can specify a Region of Interest (ROI) to confine OCR analysis to a particular section of an image, which is useful when only part of a complex document matters. The library can also load existing image files and generate hOCR (HTML-based OCR) output, which carries layout and coordinate information alongside the recognized text. Its small interface makes the Tesseract engine straightforward to integrate into projects such as invoice processing, content digitization, and workflow automation.


Getting Started with Ruby-Tesseract-OCR

The recommended way to install Ruby-Tesseract-OCR is with RubyGems. Use the following command to install it.

Install Ruby-Tesseract-OCR via Rubygems

gem install tesseract-ocr 

Alternatively, you can download the compiled shared library from the project's GitHub repository.
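
After installation, it can be useful to confirm that the gem is visible to RubyGems before writing any OCR code. The following is a minimal sketch that relies only on the standard RubyGems API; the helper name gem_installed? is an illustrative assumption and is not part of Ruby-Tesseract-OCR. The check covers the gem only, since the native Tesseract library that the gem wraps is installed separately.

```ruby
require 'rubygems'

# Hypothetical helper (not part of the gem): returns true when a gem with
# the given name is installed and visible to RubyGems, false otherwise.
def gem_installed?(name)
  !Gem::Specification.find_all_by_name(name).empty?
end

if gem_installed?('tesseract-ocr')
  puts 'tesseract-ocr gem found'
else
  puts 'tesseract-ocr gem missing, run: gem install tesseract-ocr'
end
```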

Extract Text from Images & Scanned Documents via Ruby

Ruby-Tesseract-OCR is an open source library that allows software developers to load images and extract text from them with just a couple of lines of Ruby code. It works with images, PDFs, and scanned documents. The typical workflow involves loading an image, configuring the OCR parameters, and invoking the OCR engine to recognize the text. Developers need to provide the path to the image they want to process and call the text_for method, which returns the recognized text as a string that can then be printed or processed further. The library offers various configuration options for controlling OCR behavior, such as the recognition language, page segmentation mode, and character whitelists and blacklists. The following example shows how software developers can load a PNG image and extract text from it inside a Ruby application.

How to Extract Text from Images using Ruby Commands?

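require 'tesseract'

# Create an engine for English text; blacklisted characters are never recognized.
e = Tesseract::Engine.new {|e|
  e.language  = :eng
  e.blacklist = '|'
}

# Run OCR on the image and trim surrounding whitespace.
e.text_for('test/first.png').strip # => 'ABC'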
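
The load, configure, and recognize workflow described above can be applied to several files at once. The following is a minimal sketch, not part of the gem: extract_texts is a hypothetical helper, and it assumes only that the engine object responds to text_for(path) and returns a string, as the Tesseract::Engine instance in the example does.

```ruby
# Hypothetical helper (not part of the gem): runs OCR over several image
# files and returns a hash that maps each path to its recognized text.
# `engine` is any object that responds to text_for(path).
def extract_texts(engine, paths)
  paths.each_with_object({}) do |path, results|
    # Skip missing files instead of letting the engine fail on them.
    unless File.file?(path)
      warn "skipping #{path}: file not found"
      next
    end
    results[path] = engine.text_for(path).strip
  end
end
```

With the engine e from the example above, extract_texts(e, ['test/first.png']) would return a hash with one entry per file that exists. Passing the engine in as an argument keeps the helper easy to test with a stand-in object.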

Extract Text from a Particular Image Area via Ruby

The open source Ruby-Tesseract-OCR library goes beyond basic OCR and offers additional features for advanced use cases. For instance, users can specify a region of interest (ROI) within an image to limit the OCR analysis to a specific area. This is particularly useful when dealing with complex documents or when only one section of a page contains the text that is needed. The library also provides methods for obtaining hOCR output, an HTML-based format that includes not only the recognized text but also the layout and coordinates of the text elements. hOCR output is helpful when you need more granular data or want to perform further analysis on the text structure.

How to Generate hOCR Output for an Image via Ruby Library?

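require 'tesseract'

# Create an engine for English text; blacklisted characters are never recognized.
e = Tesseract::Engine.new {|e|
  e.language  = :eng
  e.blacklist = '|'
}

# Print the hOCR (HTML) markup, which includes text positions.
puts e.hocr_for('test/first.png')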
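
Because hOCR output is HTML with coordinates stored in title attributes, the word positions can be read back with a little post-processing. The following is a rough sketch and not part of the gem: HocrWord, parse_hocr_words, and words_in_region are hypothetical names, and the regular expression assumes the common Tesseract layout in which each word is a span with class ocrx_word whose class attribute comes before a title attribute starting with "bbox left top right bottom". HTML entities are not decoded, and the sample markup is made up for illustration. A full HTML parser would be more robust for production use.

```ruby
# Hypothetical helpers (not part of the gem) for reading word boxes from hOCR.
HocrWord = Struct.new(:text, :left, :top, :right, :bottom)

# Matches <span class='ocrx_word' ... title='bbox L T R B; ...'>text</span>.
# Assumes the class attribute appears before the title attribute.
WORD_PATTERN = /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]bbox (\d+) (\d+) (\d+) (\d+)[^'"]*['"][^>]*>(.*?)<\/span>/m

# Returns one HocrWord per word span found in the hOCR string.
def parse_hocr_words(hocr)
  hocr.scan(WORD_PATTERN).map do |left, top, right, bottom, inner|
    text = inner.gsub(/<[^>]+>/, '').strip # drop nested tags such as <strong>
    HocrWord.new(text, left.to_i, top.to_i, right.to_i, bottom.to_i)
  end
end

# Keeps only the words whose boxes lie fully inside the given rectangle.
def words_in_region(words, x, y, width, height)
  words.select do |w|
    w.left >= x && w.top >= y && w.right <= x + width && w.bottom <= y + height
  end
end

# Made-up sample markup in the shape described above.
sample = <<~HOCR
  <span class='ocr_line' id='line_1_1' title='bbox 36 92 480 116'>
  <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 90'>Hello</span>
  <span class='ocrx_word' id='word_1_2' title='bbox 400 92 480 116; x_wconf 88'>World</span>
  </span>
HOCR

words = parse_hocr_words(sample)
words_in_region(words, 0, 0, 200, 200).each do |w|
  puts "#{w.text} at (#{w.left}, #{w.top})" # prints: Hello at (36, 92)
end
```

Filtering parsed words by rectangle is a post-processing alternative to the engine's own region of interest support: the whole image is still recognized, and only the words inside the rectangle are kept.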