Alle artikelen

From Mussels to Municipalities: Taming the CBS Open Data APIs

The Dutch government is a great source for datasets. When a ministry has a question like, how many mussels the Dutch economy produces, the Centraal Bureau voor Statistiek (CBS) has them covered. CBS is the dedicated government agency for building datasets and creating statistics about all things Dutch society.

But the CBS services don’t stop at serving the government. Are you building a startup that predicts when house prices will finally drop, so you can buy your first home? The CBS can help you with that. Do you want to know if the Dutch economy is on pace with implementing renewable energy? CBS already measured the economy’s dirty secrets.

Despite this wide variety of data sources, the CBS website and API can be hard to understand and navigate. What is StatLine? Why are some manuals more than 10 years old? How do OData-based APIs work? For new developers, this can be overwhelming.

This guide tackles exactly that. It's divided into two parts. We'll start by covering the basic CBS concepts and how the agency organises its data. Then, we'll move into a hands-on Python tutorial to fetch your first CBS dataset.

How CBS works

When the CBS finishes a dataset and statistical research about a particular topic, the agency publishes it in two ways. The relevant ministry responsible for the investigated subject will get a ‘news message’ which contains all the relevant results. Then, at 00:00 of the publication date, the data will also be published in the databank of the CBS called StatLine.

Sometimes the CBS also publishes an accompanying news article, especially when the research is deemed relevant to society at large. More information on the CBS’ publish policy can be found here.

What is StatLine?

StatLine is the official database (or 'databank') of the CBS. It consists all the datasets associated to the agency's statistical research. The data is published in the OData format on three, generic API endpoints.

What is OData?

The Open Data Protocol, abbreviated as OData, is an ISO-certified standard for building and querying RESTful APIs. It is maintained by OASIS and initiated by Microsoft in 2007.

OData is an almost literal implementation of REST via HTTP. It sees 'everything as a resource', which is also shown in how an Odata-based API would publish a data resource. For instance, this is how a company called Wolk would be retrieved:

GET <https://services.odata.org/v4/CompanyCatalog/Company>('Wolk Tech') HTTP/1.1

How is StatLine data published?

All the agency's data can be found on the https://opendata.cbs.nl/ subdomain, which StatLine divides over three domain paths:

  • /ODataCatalog/ - An OData endpoint for searching all the available datasets in StatLine.

  • /ODataApi/ - A general, but limited OData endpoint for retrieving datasets.

  • /ODataFeed/ - A general, but unlimited OData endpoint for retrieving datasets.

Despite what the path names (and CBS' StatLine introduction) tell you, you can view every endpoint as its own OData API, but providing different scopes of CBS data. This specific API design is, we think, a deliberate choice by CBS.

The API design enable users to easily discover and find datasets. Based on their data requirements, the users then can retrieve it via either the general or bulk API endpoints.

Hands-on with Python

With the CBS context out of the way, let's get our hands dirty with some Python code. For this practical example, we're going to retrieve the population size per year, per municipality in the Netherlands. We'll cover all the discussed opendata.cbs.nl paths in this tutorial.

Setting up

We're going to use the good ol' requests package and the following Python constants:

import requests

BASE_URL = "<https://opendata.cbs.nl>"
CATALOG_URL = f"{BASE_URL}/ODataCatalog/Tables"
API_URL = f"{BASE_URL}/ODataApi/odata"
FEED_URL = f"{BASE_URL}/ODataFeed/odata"

Due to the setup of the Catalog endpoint, we directly query the Tables/-section, which contains all the available datasets in StatLine.

Note: We are deliberately not using the unofficial, open source cbsodata package, because it never uses the regular API endpoint. Also, we would like to keep the Python examples as vanilla and explainable as possible.

Searching for the right dataset

Let's go looking for the dataset containing the population statistics of the Netherlands. We can build up our OData query as follows:

params = {
    "$format": "json",
    "$filter": "substringof('Bevolking', ShortTitle) and substringof('gemeente', ShortDescription)",
    "$select": "Identifier,ShortTitle,RecordCount",
    "$top": 5,
    "$orderby": "RecordCount desc",
}

response = requests.get(CATALOG_URL, params=params)
response.raise_for_status()

Remember, everything is a resource in OData, so we only need to define URL parameters to filter the results down to what we want. We only need to $select a handful of columns (to keep our results readable), retrieve the data in JSON $format and $filter the catalogue tables where ShortTitle contains Bevolking (= population) and ShortDescription mentions gemeente (= municipality).

This gets us the following results:

search_results = response.json().get("value", [])

for item in search_results:
    print(item)
{'Identifier': '84727NED', 'ShortTitle': 'Bevolking;nationaliteit,regio;1995-2022', 'RecordCount': 62233920}

{'Identifier': '03759ned', 'ShortTitle': 'Bevolking; geslacht, leeftijd, regio', 'RecordCount': 54911520}

{'Identifier': '85644NED', 'ShortTitle': 'Bevolking; nationaliteit en regio, 1 jan', 'RecordCount': 23369472}

The second row for dataset ID 03759ned contains the population per gender and age. Also, it contains 54.911.520 rows. This might be the one we need to get the population sizes per municipality (which is essentially a formal 'region').

Retrieving and exploring the dataset

Now let's verify this. We need to know if the dataset has the dimensions we need to filter it down to the municipality level. The DataProperties endpoint tells us what columns are available:

table_id = "03759ned"
url = f"{API_URL}/{table_id}/DataProperties"
params = {"$format": "json"}

response = requests.get(url, params=params)
response.raise_for_status()

We query /DataProperties because the ODataApi URL contains multiple sub-endpoints. This one describes the dataset's columns: dimensions (for filtering) and topics (the measured values).

Note: The same filters applied to the Catalog query can also be used here. The OData protocol enables extensive filtering capabilities.

To verify if this dataset contains the data we need, we print the first row:

properties = response.json().get("value", [])

for prop in properties:
    print(f"{prop['Key']}: {prop['Title']} ({prop['Type']})")
Geslacht: Geslacht (Dimension)
Leeftijd: Leeftijd (Dimension)
BurgerlijkeStaat: Burgerlijke staat (Dimension)
RegioS: Regio's (GeoDimension)
Perioden: Perioden (TimeDimension)
BevolkingOp1Januari_1: Bevolking op 1 januari (Topic)
GemiddeldeBevolking_2: Gemiddelde bevolking  (Topic)

The dataset has a GeoDimension for RegioS (= regions) and a TimeDimension for Perioden (= periods). But where can we find the data about municipalities?

Drilling regions down to municipalities

In OData, each dimension has its own endpoint (= an OData resource) where we can explore the available values. Let's update our query to further research the RegioS endpoint:

from collections import Counter

url = f"{API_URL}/{table_id}/RegioS"
params = {"$format": "json"}

response = requests.get(url, params=params)
response.raise_for_status()

regions = response.json().get("value", [])

prefixes = Counter(r["Key"][:2].strip() for r in regions)
for prefix, count in prefixes.most_common():
    print(f"{prefix}: {count} entries")
GM: 835 entries
CR: 40 entries
PV: 12 entries
LD: 4 entries
NL: 1 entries

We're specifically querying the "Key" here. This returns a fixed key format (double letters + number) representing a particular region type.

By checking this dataset region in the StatLine web application, it reveals the region hierarchy: NL for the country, LD for landsdelen (country parts), PV for provinces, CR for COROP regions, and GM for municipalities (= gemeente).

Let's verify GM with a few examples:

municipalities = [r for r in regions if r["Key"].startswith("GM")][:5]

for item in municipalities:
    print(f"{item['Key']}: {item['Title']}")
GM1680: Aa en Hunze
GM0738: Aalburg
GM0358: Aalsmeer
GM0197: Aalten
GM0480: Ter Aar

These records check out: they are official municipalities in the Netherlands.

Bulk it up!

Now that we found the right dataset, we're going to retrieve the dataset in full! This is where the /ODataFeed/ endpoint comes in, which contains all the data with no row restrictions.

With our prior research, we can build up our OData query:

import csv
import json

url = f"{FEED_URL}/{table_id}/TypedDataSet"

odata_filter = (
    "substringof('GM', RegioS) "        # Only municipalities
    "and Geslacht eq 'T001038' "        # Total of all genders
    "and Leeftijd eq '10000' "          # Total of all ages
    "and BurgerlijkeStaat eq 'T001019'" # Total of all marital statuses
)

params = {
    "$format": "json",
    "$filter": odata_filter,
    "$select": "RegioS,Perioden,BevolkingOp1Januari_1",
}

The specific codes used in the OData filter can be found by querying the {API_URL}/03759ned/{dim} endpoint, where {dim} is a particular dimension. For instance, Leeftijd eq '10000' code will return all possible ages in the dataset.

When sending queries to the ODataFeed endpoint, it returns an odata.nextLink, which is the API's pagination. To query the full paginated dataset:

all_data = []

while url:
    response = requests.get(url, params=params)
    response.raise_for_status()

    result = response.json()
    all_data.extend(result.get("value", []))

    url = result.get("odata.nextLink")
    params = {}  # nextLink includes all params

print(f"Retrieved {len(all_data)} records")
Retrieved 31730 records

Finally, we can export the retrieved data to JSON and/or CSV:

with open("population_per_municipality.json", "w") as f:
    json.dump(all_data, f, indent=2)

with open("population_per_municipality.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_data[0].keys())
    writer.writeheader()
    writer.writerows(all_data)

And that's it! You now have the population counts per Dutch municipality, per year, ready for analysis.

Conclusion

When CBS publishes their new dataset about how many mussels the Dutch economy produces on StatLine, this guide will get you covered with kick-starting your data analysis. The big 'feature' of the StatLine databank is the OData API standard, which makes the retrieval process easier by making every dataset a resource, something which can be queried.

This guide showed how this can be done by leveraging the three main endpoints provided by the CBS. We first discovered the datasets for population sizes, checked them via the regular API and ended up filtering and importing a dataset from the bulk endpoint.

We used the resource querying extensively when we filtered down from the dataset's DataProperties, to the RegioS properties, which in turn were used to retrieve the municipality codes. Here's where we discovered two things: sometimes, it's better to quickly view the dataset in the StatLine web application to examine its properties, and that filtering is an essential part of working with OData in general.

Filtering will help to get your required dataset to directly match your data requirements, all while querying the API. This lowers the data bandwidth and can make the data directly usable. To get good at filtering with OData APIs, we recommend reading the very good Microsoft documentation about it. If you like large code standard specifications, we recommend reading the official OData spec.

However, this guide will not last forever. CBS is in the process of migrating its OData API from version 3.0 to version 4.0. This may change the way you query datasets.


Blijf op de hoogte!

Schrijf je in voor onze nieuwsbrief, de Wolkskrant, om de laatste tools, trends en tips van de industrie te ontvangen.

Subscribe