Free C# .NET API for PDF Text Extraction and Analysis

A Leading Open Source .NET Library for Extracting Text, Images, Metadata, and PDF Structure Information from PDF Files with Ease. Add New Pages, Tables, Text, and Shapes to PDFs.

What is PdfPig Library?

PdfPig is an open-source .NET library for reading, parsing, and extracting content from PDF files in C#. It was developed as a port of Apache PDFBox to the .NET ecosystem and is widely used for PDF text extraction, document analysis, and low-level PDF processing. It targets .NET Standard as well as several .NET versions, so it fits into most modern development environments. The library is well suited to document indexing, search engines, AI/ML document preprocessing, OCR post-processing, invoice automation, PDF accessibility tools, and data extraction systems.

PDF files are everywhere: contracts, reports, invoices, eBooks, bank statements, research papers, and forms. Extracting structured information from PDFs is often difficult because PDF is primarily a presentation format, not a semantic document format. PdfPig simplifies this problem by giving developers access to raw text and words, character positions and bounding boxes, fonts and styling information, images, metadata, and document layout details. Unlike many PDF libraries that mainly focus on generating PDFs, PdfPig specializes in reading existing PDF documents. This makes it an excellent choice for developers building search engines, OCR pipelines, invoice parsers, document analyzers, compliance tools, or content extraction systems.

At A Glance

An overview of PdfPig features.

Features Overview

PDF Text Extraction
Read PDF
Layout Analysis
OCR Post-Processing
Metadata Reading
Add Images
Add Watermarks
Document Automation
TIFF to PDF
Insert Text
Add Shapes
Invoice Parsing
Search Indexing

Platform Independence

PdfPig can work with .NET Framework 4.6.1, .NET Standard 2.0, or .NET Standard 2.1.

Getting Started with PdfPig

The PdfPig library is available as a NuGet package, so using NuGet is the recommended way to add PdfPig to your project. Use one of the following commands to install it.

Install PdfPig via NuGet Package Manager

// Package Manager
Install-Package PdfPig

// .NET CLI
dotnet add package PdfPig

You can also install it manually by downloading the latest release files directly from the GitHub repository.

Extract Text from PDF Documents via .NET

One of PdfPig’s most common use cases is extracting plain text from PDF documents. Software developers can open a PDF and read its text page by page or extract the entire content. This is especially useful for search indexing, keyword scanning, and content analysis. Here is a simple example that demonstrates how to extract text from PDF documents inside C# .NET applications. It opens the PDF file, iterates through all pages, and prints the text extracted from each page.

How to Extract Text from PDF Documents via C# API?

using System;
using UglyToad.PdfPig;

using (var document = PdfDocument.Open("sample.pdf"))
{
    foreach (var page in document.GetPages())
    {
        string text = page.Text;
        Console.WriteLine(text);
    }
}
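Because PdfPig also exposes word positions and bounding boxes, the same page loop can report where each word sits on the page. The following is a minimal sketch rather than a definitive implementation: the `WordDump` class and its `Describe` helper are hypothetical names introduced for illustration, and it assumes that coordinates are reported in PDF points measured from the bottom-left corner of the page.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

static class WordDump
{
    // Hypothetical helper (not part of PdfPig): formats a word together with
    // the lower-left corner of its bounding box, using invariant culture.
    public static string Describe(string text, double left, double bottom) =>
        FormattableString.Invariant($"{text} @ ({left:F1}, {bottom:F1})");

    public static void PrintWords(string pdfPath)
    {
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            foreach (Page page in document.GetPages())
            {
                Console.WriteLine($"--- Page {page.Number} ---");

                // GetWords() groups the page's letters into words using
                // PdfPig's default word extractor.
                foreach (Word word in page.GetWords())
                {
                    Console.WriteLine(
                        Describe(word.Text, word.BoundingBox.Left, word.BoundingBox.Bottom));
                }
            }
        }
    }
}
```

Calling `WordDump.PrintWords("sample.pdf")` prints one line per word. Positional output of this kind is a common starting point for invoice parsing or layout analysis, where a value is located by its position relative to a label.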

Interactive Forms & Embedded File Extraction

The open source PdfPig library supports extracting interactive Acrobat forms (AcroForms) and reading files embedded in a PDF as attachments. Developers can check whether a document defines a form, iterate through its typed form fields (such as checkboxes, text fields, or push buttons), and access the raw bytes of embedded files so they can be passed on to other systems.

How to Extract Interactive Form Data via C# .NET Library?

using System;
using System.Collections.Generic;
using UglyToad.PdfPig;
using UglyToad.PdfPig.AcroForms;

class Program
{
    static void ExtractInteractiveContent(string pdfPath)
    {
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            // Extracting Interactive Form Data
            if (document.TryGetForm(out AcroForm form))
            {
                foreach (var field in form.Fields)
                {
                    // PartialName is the field's own name; FieldType is the typed field kind
                    Console.WriteLine($"Form Field Name: {field.Information.PartialName} | Type: {field.FieldType}");
                }
            }

            // Extracting Embedded Documents/Attachments
            if (document.Advanced.TryGetEmbeddedFiles(out var embeddedFiles))
            {
                foreach (var file in embeddedFiles)
                {
                    // Depending on the PdfPig version, Bytes may expose Length instead of Count
                    Console.WriteLine($"Found Attachment: {file.Name} ({file.Bytes.Count} bytes)");
                    // File.WriteAllBytes(file.Name, file.Bytes.ToArray()); // Save attachment locally
                }
            }
        }
    }
}

Seamless Multi-File PDF Merging via C#

Beyond content extraction, PdfPig offers straightforward document management utilities. With the PdfMerger class, developers can combine separate PDF files into a single document, returned as a byte array. This is an effective way to concatenate generated reports, invoices, or client statements without adding bulky external dependencies to your .NET pipeline. The following example demonstrates how to combine multiple PDF documents inside .NET applications.

How to Merge Multiple PDF Files inside C# .NET Apps?

using System.IO;
using UglyToad.PdfPig.Writer;

class Program
{
    static void MergeDocuments(string firstPdf, string secondPdf, string outputPdf)
    {
        // Merge both files into a single PDF, returned as a byte array
        byte[] mergedFileBytes = PdfMerger.Merge(firstPdf, secondPdf);

        // Write the merged bytes to disk
        File.WriteAllBytes(outputPdf, mergedFileBytes);
    }
}

Read PDF Metadata, Forms, and Annotations

Beyond text extraction, the open source PdfPig library exposes document metadata (title, author, producer), interactive form fields, hyperlinks, and bookmarks. This is invaluable for document classification systems, automated form processing, and content indexing, where you need both content and context.
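
As an illustration of the metadata and hyperlink access described above, here is a minimal sketch. The `MetadataDump` class and its `OrUnknown` helper are hypothetical names introduced for this example, and it assumes that metadata entries a PDF does not define come back as null or empty strings.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Content;

static class MetadataDump
{
    // Hypothetical helper (not part of PdfPig): substitutes a placeholder
    // when a metadata entry is missing or blank.
    public static string OrUnknown(string value) =>
        string.IsNullOrWhiteSpace(value) ? "(not set)" : value;

    public static void PrintMetadata(string pdfPath)
    {
        using (PdfDocument document = PdfDocument.Open(pdfPath))
        {
            // Entries from the document information dictionary
            var info = document.Information;
            Console.WriteLine($"Title:    {OrUnknown(info.Title)}");
            Console.WriteLine($"Author:   {OrUnknown(info.Author)}");
            Console.WriteLine($"Producer: {OrUnknown(info.Producer)}");
            Console.WriteLine($"Pages:    {document.NumberOfPages}");

            // Hyperlinks are read per page
            foreach (Page page in document.GetPages())
            {
                foreach (var link in page.GetHyperlinks())
                {
                    Console.WriteLine($"Page {page.Number}: {link.Text} -> {link.Uri}");
                }
            }
        }
    }
}
```

Calling `MetadataDump.PrintMetadata("sample.pdf")` prints the basic document properties followed by one line per hyperlink, which is often enough to route a document to the right classifier or index.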