Free Java PDF Library to Extract Tables from PDFs

An open source Java API that programmatically extracts tables from PDF files. It converts extracted tables to CSV and turns locked PDF tables into structured data formats.

What is Tabula-Java?

PDF files are everywhere in the business world—reports, invoices, research papers, and financial statements. But there's a persistent problem: extracting tabular data from these documents is notoriously difficult. Copy-pasting often results in mangled data, and manual retyping is tedious and error-prone. That's where Tabula-Java comes in. Tabula-Java is a powerful open source library that programmatically extracts tables from PDF documents. Built on solid PDF processing foundations, this Java library transforms locked PDF tables into structured data formats you can actually work with. Whether you're building data pipelines, automating report processing, or simply tired of wrestling with PDF tables, Tabula-Java offers an elegant solution.

Unlike generic PDF parsers that treat everything as unstructured text, Tabula-Java understands table structures. It recognizes rows, columns, and cells, preserving the relationships between data points. This distinction is crucial—the difference between getting "John 42 Engineer" as a blob of text versus structured records with proper field separation. The library operates as a pure Java implementation, making it perfect for enterprise applications, backend services, and any system where you need reliable, automated PDF table extraction without manual intervention. It's the engine behind the popular Tabula desktop application, proven through years of real-world use by journalists, researchers, and data analysts worldwide. Its intelligent detection algorithms, flexible extraction modes, and straightforward API make it an essential tool for anyone working with tabular data trapped in PDFs.

Getting Started with Tabula-Java

Add the following Maven dependency to your project's pom.xml.

Tabula-Java Maven Dependency

<dependency>
  <groupId>technology.tabula</groupId>
  <artifactId>tabula</artifactId>
  <version>1.0.5</version>
</dependency>

Install Tabula-Java via Gradle

implementation 'technology.tabula:tabula:1.0.5' 

Basic Table Extraction via Java API

The simplest way to use the open source Tabula-Java library is to let it detect all tables in a PDF automatically. This works well for PDFs with clear, spreadsheet-like tables. The example below shows the basic workflow: load the PDF with Apache PDFBox (Tabula-Java's underlying PDF engine), create an extraction algorithm, and iterate through the pages, printing every table found.

How to Extract Table from PDF via Java API?

import technology.tabula.*;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.IOException;
import java.util.List;

public class BasicExtraction {
    public static void main(String[] args) throws IOException {
        // Load the PDF document
        File pdfFile = new File("data_report.pdf");
        PDDocument document = PDDocument.load(pdfFile);
        
        // Create an extraction algorithm
        SpreadsheetExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
        
        // Create a page iterator
        ObjectExtractor objectExtractor = new ObjectExtractor(document);
        PageIterator pages = objectExtractor.extract();
        
        // Process each page
        while (pages.hasNext()) {
            Page page = pages.next();
            
            // Extract tables from the page
            List<Table> tables = extractor.extract(page);
            
            // Print each table
            for (Table table : tables) {
                System.out.println("Found table with " + table.getRowCount() + " rows");
                
                // Iterate through rows
                for (List<RectangularTextContainer> row : table.getRows()) {
                    for (RectangularTextContainer cell : row) {
                        System.out.print(cell.getText() + "\t");
                    }
                    System.out.println();
                }
            }
        }
        
        document.close();
    }
}


Extract Table from Specific Page Regions

Often, you don't want to extract everything. You might need data from one table at a known location on the page. Tabula-Java lets you define a "selection area" using coordinates. The Rectangle class describes an area on the page in points (72 points equal one inch). The getArea() method crops the page to this region before extraction, which improves accuracy and performance when you only need data from a specific location. The following example extracts tables from a region of page 1.

How to Extract a Table from a Specific Page Region using Java API?

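import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Rectangle;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.IOException;
import java.util.List;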

public class RegionExtraction {
    public static void extractRegion(String pdfPath) throws IOException {
        File pdfFile = new File(pdfPath);
        PDDocument document = PDDocument.load(pdfFile);
        
        ObjectExtractor objectExtractor = new ObjectExtractor(document);
        Page page = objectExtractor.extract(1);
        
        // Define the region as (top, left, width, height)
        // Values are in points (72 points = 1 inch)
        Rectangle region = new Rectangle(50, 100, 500, 400);
        Page subPage = page.getArea(region);
        
        SpreadsheetExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
        List<Table> tables = extractor.extract(subPage);
        
        // Process the extracted table
        for (Table table : tables) {
            System.out.println("Extracted " + table.getRowCount() + " rows from region");
        }
        
        document.close();
    }
}
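
Because regions are given in points, it can be easier to measure them in inches and convert. The sketch below is a hypothetical helper (RegionUnits and its methods are not part of Tabula-Java) that applies the 72-points-per-inch rule and returns the values in the order top, left, width, height, which is assumed here to match the Rectangle constructor arguments.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Tabula-Java: converts a region
// measured in inches into points (72 points = 1 inch).
final class RegionUnits {
    static final float POINTS_PER_INCH = 72f;

    static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    // Returns {top, left, width, height} in points.
    // The argument order is an assumption chosen to mirror Rectangle.
    static float[] regionInPoints(float topIn, float leftIn, float widthIn, float heightIn) {
        return new float[] {
            inchesToPoints(topIn),
            inchesToPoints(leftIn),
            inchesToPoints(widthIn),
            inchesToPoints(heightIn)
        };
    }

    public static void main(String[] args) {
        // A region 1 inch from the top, 0.5 inch from the left, 7 x 5 inches in size
        float[] region = regionInPoints(1f, 0.5f, 7f, 5f);
        System.out.println(Arrays.toString(region)); // prints [72.0, 36.0, 504.0, 360.0]
    }
}
```

The four resulting values could then be passed to new Rectangle(...) in the example above.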

Converting Tables to CSV via Java API

A common use case is exporting extracted tables directly to CSV format for use in spreadsheet applications or databases. The CSVWriter class handles the details of CSV formatting, including escaping of special characters and multi-line cell content. This makes it easy to integrate Tabula-Java into data processing pipelines that expect CSV input. The following example writes the first table found on page 1 of a PDF to a CSV file.

How to Export Extracted Tables to CSV Format via Java API?

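import technology.tabula.ObjectExtractor;
import technology.tabula.Page;
import technology.tabula.Table;
import technology.tabula.extractors.SpreadsheetExtractionAlgorithm;
import technology.tabula.writers.CSVWriter;
import org.apache.pdfbox.pdmodel.PDDocument;

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;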

public class CSVExtraction {
    public static void exportToCSV(String pdfPath, String csvPath) throws IOException {
        File pdfFile = new File(pdfPath);
        PDDocument document = PDDocument.load(pdfFile);
        
        SpreadsheetExtractionAlgorithm extractor = new SpreadsheetExtractionAlgorithm();
        ObjectExtractor objectExtractor = new ObjectExtractor(document);
        Page page = objectExtractor.extract(1);
        
        List<Table> tables = extractor.extract(page);
        
        // Write to CSV file
        if (!tables.isEmpty()) {
            Table table = tables.get(0);
            
            try (FileWriter writer = new FileWriter(csvPath)) {
                CSVWriter csvWriter = new CSVWriter();
                csvWriter.write(writer, table);
            }
        }
        
        document.close();
    }
}
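
To make "proper escaping" concrete, here is a minimal sketch of the usual CSV quoting rules (RFC 4180 style): a field is wrapped in double quotes if it contains a comma, a quote, or a line break, and embedded quotes are doubled. This is an illustration only; CsvEscapeSketch is a hypothetical class and not how CSVWriter is implemented.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative only: shows the kind of escaping a CSV writer has to perform.
final class CsvEscapeSketch {
    // Quotes a field when it contains a comma, a double quote, or a line break.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are escaped by doubling them
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped cells of one row with commas.
    static String toCsvLine(List<String> cells) {
        return cells.stream()
                .map(CsvEscapeSketch::escape)
                .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        String line = toCsvLine(List.of("Smith, John", "42", "said \"hi\""));
        System.out.println(line); // prints "Smith, John",42,"said ""hi"""
    }
}
```

In practice you would rely on CSVWriter for this, as in the example above; the sketch only shows why naive comma-joining of cell text is not enough.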

Processing Multiple Pages via Java API

Many real-world PDFs contain tables spread across multiple pages. The library supports processing multiple pages and lets developers iterate through all pages in a document, extracting and reporting on the tables found. Page numbering starts at 1 (not 0), which matches how people typically reference page numbers.