Free Python Library to Extract Contents from HTML Pages

Open Source Python Library for Extracting Web Page Content Such as Images, Text, Publication Date, and Language Information

The web holds an abundance of data, but extracting the relevant parts of a page can be cumbersome. Python-Goose is a library that simplifies this kind of web scraping by extracting structured data from web pages. It was developed by Julian Moreno Patiño and is designed to extract meaningful content, such as articles, from web pages. The library uses a set of heuristics and natural language processing techniques to analyze HTML documents and identify the relevant textual content.

Python-Goose is easy to install and supports multilingual websites. Its language detection capability automatically identifies the language of a web page's content. Other notable features include content aggregation, structured data extraction for research and analysis, video extraction, article summary generation, preprocessing of web data for training machine learning models, image extraction from web pages, and more.

Python-Goose is an open source library written in Python that provides a simple yet effective way to extract valuable information from web pages inside Python applications. It employs various algorithms and heuristics to identify the most relevant text and discard irrelevant elements. Whether you are a data scientist, a market researcher, or a developer building a data-driven application, Python-Goose can streamline your workflow and help you extract insights from web content.

At A Glance

An overview of Python-Goose features.

Features Overview

Web Scraping
Web Pages Extraction
Extract Images
Extract Text
Unicode support
Extract Publication Date
Language Detection
Video extraction
Parse HTML
HTML rendering
HTML Viewer
HTML to PDF

Python-Goose

Python-Goose supports the HTML file format for input as well as industry-standard formats for export.

Reader

HTML

Writer

TXT, HTML, PDF


Platform Independence

The original Python-Goose is tested with Python 2.6 and higher. The goose3 package used in the installation command on this page targets Python 3.



Getting Started with Python-Goose

The recommended and easiest way to install Python-Goose is with pip, the package installer for Python. Use the following command for a smooth installation.

Install Python-Goose via pip

pip install goose3

You can also install it manually by downloading the latest release files directly from the GitHub repository.

Article Extraction using Python API

The open source Python-Goose library provides features for loading and extracting content from many types of websites. It specializes in extracting articles from web pages and filtering out unnecessary elements such as advertisements, headers, footers, and sidebars. It focuses on retrieving the core content, including the title, text, author, publication date, and other relevant metadata. In the following example, you define the URL of the web page you want to scrape and then read properties of the returned article object, such as the title, authors, publish date, and cleaned text.

How to Extract Articles from a Website via Python API?

from goose3 import Goose

url = "https://example.com/article"
g = Goose()
article = g.extract(url)

print("Title:", article.title)
print("Authors:", article.authors)
print("Publish Date:", article.publish_date)
print("Article Text:", article.cleaned_text)
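The filtering idea described above (keep the article body, drop headers, footers, and sidebars) can be illustrated with a small standard-library sketch. This is not Goose's actual algorithm. The tag list in SKIP_TAGS, the ParagraphCollector class, and the extract_main_text helper are all hypothetical names chosen for illustration, and the rule "keep paragraphs outside boilerplate containers" is a deliberately simplified assumption.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch.
# This list is an assumption for illustration, not Goose's rule set.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class ParagraphCollector(HTMLParser):
    """Collect text inside <p> tags that are not nested in a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate container
        self.in_paragraph = False  # True while inside a kept <p>
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace before storing the paragraph.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return kept paragraphs joined by blank lines."""
    parser = ParagraphCollector()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


sample = """
<html><body>
<header><p>Site menu</p></header>
<article><h1>Title</h1><p>First paragraph.</p><p>Second   paragraph.</p></article>
<aside><p>Advertisement</p></aside>
<footer><p>Copyright</p></footer>
</body></html>
"""
print(extract_main_text(sample))
```

On the sample page, only the two paragraphs inside the article element survive. A real extractor such as Goose has to work on pages without such clean semantic markup, which is why it relies on scoring heuristics rather than a fixed tag list.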

Extract Particular Webpage Info in Multiple Languages via Python

The Python-Goose library incorporates language detection functionality, allowing it to identify the language of a web page's content automatically. This feature is particularly useful when dealing with multilingual websites or when you want to filter content based on language. The library also lets software developers access and extract a particular part of a web page, such as the main text of an article, its images, the meta description, meta tags, and so on. The following example shows how to extract Spanish content using Python code.

How to Extract Spanish Content from a Webpage inside Python Apps?

from goose3 import Goose

url = 'http://sociedad.elpais.com/sociedad/2012/10/27/actualidad/1351332873_157836.html'
g = Goose()
article = g.extract(url=url)

# Title of the Spanish article
print(article.title)
# Las listas de espera se agravan

# First 150 characters of the cleaned article text
print(article.cleaned_text[:150])
# Los recortes pasan factura a los pacientes. De diciembre de 2010 a junio de 2012 las listas de espera para operarse aumentaron un 125%. Hay más ciudad
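When automatic detection picks the wrong language, the language can be set explicitly through the configuration passed to Goose. The sketch below assumes goose3 accepts the same use_meta_language and target_language configuration keys documented for the original python-goose. The spanish_config and extract_spanish helpers are hypothetical names introduced only for this example.

```python
def spanish_config():
    """Settings that skip meta-tag language detection and force Spanish.

    The key names are assumed to match the python-goose configuration.
    """
    return {"use_meta_language": False, "target_language": "es"}


def extract_spanish(url):
    """Extract an article from url, treating its content as Spanish."""
    # Imported here so the config helper can be used without goose3 installed.
    from goose3 import Goose

    g = Goose(spanish_config())
    return g.extract(url=url)
```

Calling extract_spanish with the URL of a Spanish article returns the same kind of article object as in the examples above, so title and cleaned_text can be read in the usual way.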