Free Java API to Extract, Parse & Process HTML

Open Source Java Library for Fetching URLs, Loading and Parsing HTML, and Extracting and Manipulating Data, using the best of HTML5 DOM methods and CSS selectors.

Jsoup is a Java library that lets software developers handle HTML-related tasks inside their own Java applications, and it has become a popular choice for web scraping and HTML parsing in Java. It provides a convenient way to parse HTML documents, extract data from them, manipulate the DOM, and traverse the HTML structure. Jsoup is an open source project distributed under the liberal MIT license. As a Java library, it integrates directly with existing Java projects, which makes it a natural fit for Java developers.

Jsoup is a free-to-use Java library that enables software developers to extract and manipulate data from HTML and XML documents. It acts as a bridge between Java and web scraping, offering features for fetching, parsing, manipulating, and traversing HTML content. Whether you need to extract specific data from a web page, scrape multiple pages, or modify the structure of HTML documents, Jsoup provides a user-friendly API for the task.

Jsoup provides a clean API that is beginner-friendly and quick to learn. With straightforward methods for parsing, manipulating, and traversing HTML, developers can start scraping web data quickly. The library is also extensible: it supports user-defined attributes, custom filters, and custom traversers, so software developers can build their own functionality on top of it and tailor it to their specific needs. Whether you need to extract data, manipulate the DOM, or handle complex HTML scenarios, Jsoup simplifies the process.

At A Glance

An overview of Jsoup features.

Features Overview

Web Scraping
Manipulate HTML Files
Extract Images from HTML
Open HTML
Read HTML
Parse HTML
HTML rendering
HTML Viewer
HTML to PDF
Extract TOC
Extract plain text

Jsoup

Jsoup supports the HTML file format as well as industry-standard formats for export.

Reader

HTML

Writer

TXT, HTML, PDF

Platform Independence

Jsoup requires only a Java runtime.

Java 8 and above.

Getting Started with Jsoup

The recommended way to use Jsoup is to add the Maven or Gradle dependency to your build configuration. Alternatively, you can import the JAR file into your project manually.

Jsoup Maven Dependency

<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.16.1</version>
</dependency>

Install Jsoup Library via Gradle

// jsoup HTML parser library @ https://jsoup.org/
implementation 'org.jsoup:jsoup:1.16.1'

Or via GitHub

git clone https://github.com/jhy/jsoup.git
cd jsoup
mvn install

You can also install it manually by downloading the latest release files directly from the GitHub repository.

Parsing HTML File using Java API

One of the core features of the open source Jsoup library is its ability to parse HTML documents inside Java applications. The library can read HTML from a URL pointing to a webpage, from a raw HTML string, or from a file loaded from disk. To parse HTML held in a string, software developers can use the Jsoup.parse() method, which accepts the HTML content as a string and returns a Document object representing the parsed HTML. To fetch and parse a page from a URL, they can use Jsoup.connect() followed by get(), which also returns a Document. From there, developers can navigate the DOM tree and extract the desired elements using selectors or traversal methods. The following example shows how to parse a webpage into a DOM and select the headlines from it in Java.

How to Parse HTML File and Extract Headlines from it via Java API?

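import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Fetch the Wikipedia home page and parse it into a DOM
Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
System.out.println(doc.title());
// Select the headline links in the "In the news" box
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  System.out.printf("%s%n\t%s%n",
    headline.attr("title"), headline.absUrl("href"));
}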
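
The other two input sources mentioned above, a raw HTML string and a file on disk, can be handled in a similar way. The following is a minimal sketch rather than an official sample: the class name ParseSources, the helper method names, and the sample markup are assumptions made for illustration, while Jsoup.parse(String) and Jsoup.parse(File, String) are the library calls being demonstrated.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ParseSources {

    // Parse an in-memory HTML string; no network or disk access is involved.
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // Parse a file on disk; the charset name tells jsoup how to decode the bytes.
    static Document fromFile(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample markup used only for this sketch.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><h1>Hello</h1></body></html>";

        Document parsedString = fromString(html);
        System.out.println(parsedString.title());

        // Write the same markup to a temporary file and parse it back.
        Path tmp = Files.createTempFile("sample", ".html");
        Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
        Document parsedFile = fromFile(tmp.toFile());
        System.out.println(parsedFile.select("h1").text());
        Files.delete(tmp);
    }
}
```

Both helpers return the same Document type as the URL-based example, so the selector code that follows parsing does not need to change with the input source.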

Extract Data from HTML File using Java

The Jsoup library includes very useful functions for loading HTML documents and extracting data from them inside Java applications. It offers intuitive methods for working with HTML elements, covering text extraction, attribute retrieval, and HTML serialization, so developers have the tools they need to extract and manipulate data as required. This makes it easier to integrate web scraping functionality into Java applications. The following example shows how programmers can extract attributes, text, and HTML from elements inside Java applications.

How to Extract Attributes, Text, and HTML from Elements via Java API?

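import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first(); // first <a> element in the document

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"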
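
Attribute retrieval is also the usual way to collect image or link addresses from a page. The sketch below is one possible approach, not a prescribed one: the class name ExtractImages, the method imageUrls, the sample markup, and the base URI are assumptions for illustration. It relies on Jsoup.parse(String, String), which takes a base URI, and on absUrl(), which resolves relative addresses against that base.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class ExtractImages {

    // Collect the absolute URL of every <img> element that has a src attribute.
    static List<String> imageUrls(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.absUrl("src")); // relative paths are resolved against baseUri
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical markup with one relative and one absolute image address.
        String html = "<p><img src='/logo.png' alt='Logo'>"
                + "<img src='http://example.com/photo.jpg'></p>";
        for (String url : imageUrls(html, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```

Passing a base URI matters when the HTML comes from a string or file, because without it a relative src cannot be turned into an absolute URL.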

Edit and Manipulate HTML Files via Java API

The open source Jsoup library allows software developers to load and modify HTML documents effortlessly inside their own Java applications. Whether it's adding, removing, or modifying elements, Jsoup provides a convenient API to manipulate the HTML structure. This feature proves invaluable when scraping data and saving it in a desired format or when building tools that modify HTML content programmatically.
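
As a rough illustration of adding, removing, and modifying elements, the following sketch applies a few typical edits to a small document. The class name EditHtml, the method edit, the sample markup, and the chosen class and attribute values are all assumptions for this example. The jsoup calls used are Elements.attr(), Elements.remove(), Element.addClass(), and Element.appendElement().

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class EditHtml {

    // Apply a few typical edits and return the modified document.
    static Document edit(String html) {
        Document doc = Jsoup.parse(html);
        doc.select("a").attr("rel", "nofollow");   // set an attribute on every link
        doc.select("span.ad").remove();            // remove unwanted elements
        Element div = doc.selectFirst("div");      // null if the selector matches nothing
        if (div != null) {
            div.addClass("content");               // modify an existing element
            div.appendElement("p").text("Added by jsoup"); // add a new child element
        }
        return doc;
    }

    public static void main(String[] args) {
        // Hypothetical input markup used only for this sketch.
        String html = "<div><a href='http://example.com/'>link</a>"
                + "<span class='ad'>ad</span></div>";
        // Serialize the edited body back to an HTML string.
        System.out.println(edit(html).body().html());
    }
}
```

The edited Document can be serialized with html() or outerHtml() and then written to a file or passed to other code.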