Free Ruby Library to Convert Image to Text & Searchable PDFs

Open Source Ruby OCR Library That Enables Software Developers to Perform Optical Character Recognition to Extract Text from Scanned Documents, Images, or Even Screenshots

Optical Character Recognition (OCR) enables machines to extract text from images, scanned documents, and other visual media. For Ruby developers, the open source RTesseract library offers an accessible, efficient way to integrate Tesseract OCR into their applications. RTesseract is a Ruby interface to the Tesseract OCR engine, one of the most established and accurate open source OCR tools available. Originally developed by Hewlett-Packard and later supported by Google, Tesseract has become a go-to solution for image-to-text conversion. RTesseract wraps the Tesseract command-line tool, so developers can add OCR capabilities to Ruby projects without invoking and managing that tool directly.

RTesseract is a powerful and flexible Ruby library that simplifies the process of extracting text from images and is designed to work seamlessly with Ruby applications. It supports a variety of image formats, including PNG, JPEG, BMP, and TIFF. This flexibility ensures that you can work with almost any type of image file. It supports multiple languages by leveraging Tesseract’s language data files. You can specify the language of the text in the image to improve recognition accuracy. As an open-source project, RTesseract is free to use and modify. In addition to extracting text, the library can also provide confidence scores for each recognized word. This feature is useful for evaluating the accuracy of the OCR results.

At A Glance

An overview of RTesseract features.

Features Overview

Convert Image to Text
Add OCR Capabilities
Recognize Image text
Load Images via URL
Convert PDF to Text
Recognize Font Text
Image to Searchable PDF
Other Languages
Create OCR apps
Save to browser
Extract Text
Multi-threading Support

RTesseract

RTesseract supports the popular image file formats listed below.

Reader

PNG, JPEG, BMP, TIFF, TGA, DICOM

Writer

PNG, JPEG, BMP, TIFF

Platform Independence

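RTesseract requires a Ruby runtime and a working installation of the Tesseract OCR engine.

Use a Ruby version supported by the rtesseract gem; check the gem's requirements for the exact minimum.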

Getting Started with RTesseract

The recommended way to install RTesseract is via RubyGems. Use the following command to install it.

Install RTesseract via Rubygems

$ gem install rtesseract

Install RTesseract via GitHub

$ git clone https://github.com/dannnylo/rtesseract.git

You can also download the source code directly from the GitHub repository.

Image to Text Conversion via Ruby API

The RTesseract library makes it easy for developers to load an image and convert it to text inside Ruby applications. The most straightforward use case is converting an image file into a string of text, which takes only a few lines of code. The following example loads the image, processes it with Tesseract, and returns the recognized text as a Ruby string.

How to Load an Image and Convert It to Text via Ruby API?

require 'rtesseract'
image = RTesseract.new("path/to/your_image.jpg")
text = image.to_s
puts "Extracted Text: #{text}"
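The single-image example above extends naturally to a batch of files. The following is a minimal sketch, not part of RTesseract: `ocr_all` and `DEFAULT_ENGINE` are hypothetical names, and the engine is passed in as a callable so a stand-in can be substituted where Tesseract is not installed.

```ruby
# Batch OCR sketch. `ocr_all` and `DEFAULT_ENGINE` are illustrative names,
# not part of the RTesseract API.

# Default engine: loads rtesseract lazily and converts one image to text.
DEFAULT_ENGINE = lambda do |path|
  require 'rtesseract'
  RTesseract.new(path).to_s
end

# Returns a hash of { path => stripped text }. A path whose conversion
# raises an error is reported on stderr and mapped to nil.
def ocr_all(paths, engine: DEFAULT_ENGINE)
  paths.each_with_object({}) do |path, results|
    results[path] = begin
      engine.call(path).strip
    rescue StandardError => e
      warn "OCR failed for #{path}: #{e.message}"
      nil
    end
  end
end
```

A call such as ocr_all(Dir.glob('scans/*.png')) would then process every PNG in a scans directory; the directory name is only an example.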

Image Conversion to Searchable PDF via Ruby

The open source RTesseract library supports converting an image to a searchable PDF that preserves the image’s layout and colors. The following example demonstrates how software developers can generate a searchable PDF document from an image inside Ruby applications.

How to Convert a JPEG Image to Searchable PDF File via Ruby Library?

require 'rtesseract'

image = RTesseract.new("path/to/my_image.jpg")
pdf_file = image.to_pdf  # Returns an open file handle for the PDF
# Write in binary mode so the PDF bytes are not altered on any platform
File.binwrite("output.pdf", pdf_file.read)
pdf_file.close
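Because the PDF comes back as an open file handle, saving it can be wrapped in a small helper. The sketch below assumes only that the handle is a readable IO object; `save_pdf` is a hypothetical helper name, not an RTesseract method. It streams the data in binary mode instead of reading the whole file into memory.

```ruby
require 'fileutils'

# Copies an open, readable IO (such as the handle returned by to_pdf in the
# example above) to `destination` in binary mode, then closes the source.
# `save_pdf` is an illustrative helper, not part of RTesseract.
def save_pdf(io, destination)
  FileUtils.mkdir_p(File.dirname(destination)) # create missing folders
  io.rewind if io.respond_to?(:rewind)         # start from the first byte
  File.open(destination, 'wb') { |out| IO.copy_stream(io, out) }
  destination
ensure
  io.close if io.respond_to?(:close) && !io.closed?
end
```

With the objects from the example above, save_pdf(image.to_pdf, "reports/output.pdf") would write the searchable PDF and return its path; the reports folder is only an example.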

Custom Configuration for Tesseract

The open source RTesseract library lets software developers configure Tesseract’s settings, such as the language, the page segmentation mode, the OCR engine mode, and restricting recognition to digits. This lets you fine-tune the OCR process for better accuracy. Settings are passed as options when creating the RTesseract object. In the following example, psm (page segmentation mode) is set to 6, which treats the image as a single uniform block of text, and oem (OCR engine mode) is set to 1, which selects the LSTM neural network engine.

How to Customize Tesseract’s Settings inside Ruby Apps?

require 'rtesseract'

# psm: 6 = single uniform block of text; oem: 1 = LSTM engine only
image = RTesseract.new('path/to/image.png', psm: 6, oem: 1)
text = image.to_s

puts text
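The numeric psm and oem values are easy to mix up, so one option is to give them readable names. The sketch below is an illustration only: the constants and the `ocr_options` helper are hypothetical and not part of RTesseract, while the numeric values follow Tesseract's documented page segmentation and engine modes.

```ruby
# Readable names for a few Tesseract page segmentation modes (psm) and
# OCR engine modes (oem). Constant and method names are illustrative.
PSM_MODES = {
  auto: 3,          # fully automatic page segmentation (Tesseract default)
  single_column: 4, # a single column of text of variable sizes
  single_block: 6,  # a single uniform block of text
  single_line: 7,   # a single text line
  single_word: 8    # a single word
}.freeze

OEM_MODES = {
  legacy: 0,   # legacy engine only
  lstm: 1,     # LSTM neural network engine only
  combined: 2, # legacy and LSTM engines together
  default: 3   # whichever engine is available
}.freeze

# Builds a hash with the psm and oem values; raises KeyError for an
# unknown name so typos fail early.
def ocr_options(layout: :auto, engine: :default)
  { psm: PSM_MODES.fetch(layout), oem: OEM_MODES.fetch(engine) }
end
```

For instance, ocr_options(layout: :single_block, engine: :lstm) yields the psm and oem values used in the example above, ready to be supplied as the settings for RTesseract.new.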

Multi-Language Support

If your image contains text in a specific language, you can improve accuracy by specifying that language. The library supports languages such as English (default), German, French, Italian, Dutch, Portuguese, Spanish, and Vietnamese, among others. Make sure that the corresponding language data is installed with Tesseract. In the following example, the lang option is set to 'fra' for French. You can use any language code supported by Tesseract.

How to Convert Image to Text in Other Languages via Ruby Library?

require 'rtesseract'

image = RTesseract.new('path/to/image.png', lang: 'fra') # French language
text = image.to_s

puts text
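Tesseract also accepts several language codes joined with a plus sign (for example eng+fra) for images that mix languages. The sketch below builds such a value for the lang option; `lang_option` is a hypothetical helper, not part of RTesseract, and each listed language still needs its data installed with Tesseract.

```ruby
# Builds a lang value from one or more Tesseract language codes.
# `lang_option` is an illustrative helper, not part of RTesseract.
def lang_option(*codes)
  # Accept strings, symbols, or arrays; drop blanks and duplicates.
  cleaned = codes.flatten.map { |code| code.to_s.strip }.reject(&:empty?).uniq
  raise ArgumentError, 'at least one language code is required' if cleaned.empty?

  cleaned.join('+')
end
```

Here lang_option('eng', 'fra') produces the string 'eng+fra', which can be used as the lang value in place of 'fra' in the example above.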