Open Source Musings The thoughts, ideas, and opinions of an open source guy

Doing data journalism with open source software

A couple of newspapers and a pair of glasses

(Note: This post was first published, in a slightly different form, elsewhere. It appears here via a CC-BY-SA 4.0 license.)

When I was in journalism school back in the late 1980s, gathering data for a story usually involved eye-numbing hours of poring over printed documents or microfiche.

A lot has changed since then. While printed resources are still useful, more and more information is available to journalists on the web. That’s helped fuel a boom in what’s come to be known as data journalism. At its most basic, data journalism is the act of finding and telling stories using data — like census data, crime statistics, demographics, and more.

There are a number of powerful and expensive tools that enable journalists to gather, clean, analyze, and visualize data for their stories. But many smaller or struggling news organizations, let alone independent journalists, just don’t have the budget for those tools. That doesn’t mean, though, that they’re out in the cold.

There are a number of solid open source tools for data journalists that do the job both efficiently and impressively. This article looks at six tools that can help data journalists get the information that they need.

Grabbing the data

Much of the data that journalists find on the web they can download as a spreadsheet or as CSV or PDF files. But there’s a lot of information that’s embedded in web pages. Instead of manually copying and pasting that information, a trick just about every data journalist uses is scraping. Scraping is the act of using an automated tool to grab information embedded in a web page, often in the form of an HTML table.

If you, or someone in your organization, is of a technical bent, then Scrapy might be the tool for you. Written in Python, Scrapy can quickly extract structured data from web pages. It can be a bit challenging to install and set up, but once it’s up and running it offers a number of useful features, and Python-savvy programmers can quickly extend them.
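Scrapy itself takes some installing, but the core idea behind scraping — walking a page’s HTML and pulling the text out of a table — can be sketched using nothing but Python’s standard library. The HTML snippet below is invented for illustration; a real scraper would fetch the page first:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of each <td>/<th> cell, one list per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows
        self._row = None      # row currently being built
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())

# Made-up page fragment standing in for a downloaded web page
page = """<table>
<tr><th>City</th><th>Population</th></tr>
<tr><td>Raleigh</td><td>467665</td></tr>
</table>"""

scraper = TableScraper()
scraper.feed(page)
print(scraper.rows)  # [['City', 'Population'], ['Raleigh', '467665']]
```

Scrapy adds crawling, throttling, and structured export on top of this basic parse-and-collect loop, which is why it is worth the setup effort for anything beyond a one-off grab.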

Spreadsheets are one of the basic tools of the data journalist. In the open source world, LibreOffice Calc is the most widely used spreadsheet editor. Calc isn’t just for viewing and manipulating data. By taking advantage of its Web Page Query import filter, you can point Calc to a web page containing data in tables and grab one or all of the tables on the page. While it’s not as fast or efficient as Scrapy, Calc gets the job done nicely.

Dealing with PDFs

Whether by accident or by design, a lot of data on the web is locked in PDF files. Many of those PDFs can contain useful information. If you’ve done any work with PDFs, you know that getting data out of them can be a chore.

That’s where DocHive, a tool developed by the Raleigh Public Record for extracting both data and images from PDFs, comes in. DocHive takes a PDF, separates it into smaller pieces, and then uses optical character recognition to read the text and write it into a CSV file.

Tabula is similar to DocHive. It’s designed to grab tabular information in a PDF and convert it to a CSV file or a Microsoft Excel spreadsheet. All you need to do is find a table in the PDF, select it, and let Tabula do the rest. It’s quite fast and efficient.
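What both tools ultimately hand you is rows and columns, and once you have rows in hand, Python’s standard csv module writes them out in a couple of lines. The table below is invented sample data, and the in-memory buffer stands in for a real output file:

```python
import csv
import io

# Invented rows, shaped like what a PDF table extractor might return
rows = [
    ["County", "Incidents", "Year"],
    ["Wake", "1204", "2014"],
    ["Durham", "987", "2014"],
]

buf = io.StringIO()  # stands in for open("output.csv", "w", newline="")
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

Having the data in CSV form is what makes the later steps — cleaning and visualizing — possible with almost any tool.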

Cleaning your data

Often, the data you’ll grab may contain spelling and formatting errors, or problems with character encoding. That makes the data inconsistent and unreliable, and makes cleaning the data essential.

If you have a small data set, one that consists of a few hundred rows of information, then you can use LibreOffice Calc and your eyeballs to do the cleanup. But if you have larger data sets, doing the job manually will be a long, slow, inefficient process.

Instead, turn to OpenRefine. It automates the process of manipulating and cleaning your data. OpenRefine can sort your data, automatically find duplicate entries, and reorder your data. The real power of OpenRefine comes from facets. Facets are like filters in spreadsheets that let you zoom in on specific rows of data. You can use facets to ferret out blank cells and duplicate data, as well as see how often certain values appear in the data.
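OpenRefine does this work interactively, but the underlying ideas — normalize values, drop blanks, collapse duplicates — are simple to see in plain Python. This sketch uses a made-up CSV snippet with the kinds of flaws described above:

```python
import csv
import io

# Invented messy data: inconsistent capitalization, stray spaces,
# a blank name, and a duplicate entry
raw = """name,city
Ada Lovelace,London
ada lovelace ,London
,Raleigh
Grace Hopper,New York
"""

reader = csv.DictReader(io.StringIO(raw))
seen = set()
clean, blanks, dupes = [], 0, 0

for row in reader:
    name = row["name"].strip().title()  # normalize the value
    if not name:
        blanks += 1   # what a blank-cell facet would surface
        continue
    if name in seen:
        dupes += 1    # what a duplicate facet would surface
        continue
    seen.add(name)
    clean.append({**row, "name": name})

print(len(clean), blanks, dupes)  # 2 1 1
```

The difference is that OpenRefine lets you apply these operations by clicking on facets rather than writing a script, and it keeps an undo history of every change.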

OpenRefine can do a lot more than that. You can get an idea about what OpenRefine can do by browsing the documentation.

Visualizing your data

Having the data and writing a story with it is all well and good, but a good graphic based on that data can go a long way toward helping readers understand it. That explains the popularity of infographics on the web and in print.

If your needs aren’t too complex, you don’t need to be a wizard with graphic design to create an effective visualization. Datawrapper is an online tool that breaks creating a visualization into four steps: copy data from a spreadsheet, describe your data, choose the type of image you want, then generate the graphic. You don’t get a wide range of image types, but the process couldn’t be easier.
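The essence of that last step — turning counts into proportional marks — fits in a few lines of Python, here as a crude text bar chart over an invented poll result:

```python
from collections import Counter

# Made-up poll responses standing in for a cleaned data column
votes = ["yes", "no", "yes", "yes", "abstain", "no", "yes"]
counts = Counter(votes)

# One bar per value, length proportional to its count
for label, n in counts.most_common():
    print(f"{label:8} {'#' * n} ({n})")
```

Tools like Datawrapper wrap this same tallying step in polished chart types, labels, and styling that are ready for publication.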

Obviously, this isn’t an exhaustive list of open source data journalism tools. But the tools discussed in this article provide a solid platform for a journalism organization on a budget, or even an intrepid freelancer, to use data to generate story ideas and to back those stories up.

Thoughts? Let's start a conversation on Twitter.

Did you enjoy this post or find it useful? Then please consider supporting this blog with a micropayment via Liberapay. Thanks!