Abstract
This post gives a brief introduction to fetching data from Eurostat for data ingestion in a continuous integration (CI) framework for data scientists. It may help data scientists interested in devising projects with a CI and deployment focus to add extra value to the research phase of a project.
In this post you can review a proposal for retrieving data from Eurostat to completely or partially automate data set updates using R. Although it provides useful links for a fully integrated development using SDMX web services and APIs, the focus is on understanding Eurostat's Bulk Download facility and how to organize the methods to fetch and prepare the data.
Update: Have a look at this RPubs using the eurostat package: https://rpubs.com/IbelenI/macrofcts.
When approaching the challenge of systematically maintaining data ingestion from Eurostat with few DevOps resources, one can use Eurostat's Bulk Download facility.
Updates are done twice a day, at 11:00 and 23:00, and the data is available in two formats: tsv (tab separated values) and SDMX (Statistical Data and Metadata eXchange).
These give access to updated plain-text versions of the table of contents (TOC) of Eurostat's data structure and to the data sets themselves through predictable REST request patterns, which can be easily implemented in R- or Python-minded research projects.
There are 3 simple steps to retrieve data from Eurostat using R:
1. Search Eurostat's TOC to retrieve time-series codes.
2. Fetch data sets or time series.
3. Retrieve related information from dictionaries.
There is, however, a recommended preliminary step: understanding how Eurostat structures things.
Have a look at the Bulk Download Listing.
There you will easily understand what information is available. In particular, review the PDF document BulkDownload_Guidelines.pdf.
Eurostat’s Bulk Download Content.
The first thing to notice is that the lists describing the data structure include not only information about data sets and tables but also information on folders and files.
An advantage of official statistics is that they rely on well-established standards and have been developed over quite a while. This means that once a table/data flow is properly identified, the REST requests to its files/folders remain consistently in place. We can therefore rely on this structure to remain stable and build upon it in our CI framework.
To keep it straightforward, I will start by downloading the English version of the table of contents, then look for time series related to the Excessive Deficit Procedure (EDP) and retrieve them.
Start by downloading the English version of the table of contents, the TOC file table_of_contents_en.txt, available in the Bulk Download facility as plain text (a tab-separated values file, or TSV):
Eurostat’s English TOC. Tab separated values file.
# Packages used throughout this post:
library(stringr)    # str_trim, str_sub, str_replace_all
library(dplyr)      # %>%, filter, mutate, select
library(tidyr)      # pivot_longer
library(ggplot2)    # plots
library(lubridate)  # year()
library(knitr)      # kable
library(eurostat)   # get_eurostat, get_eurostat_dic
# Local paths are used as a
# variable within the code.
# Set your paths here:
loca <- paste0(getwd(), "//auxi//")
# We are going to need REST
# requests to connect with:
# The table of contents:
con_toc <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.txt"
# Data directory:
con_dir <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&"
# Data files: built using the helper functions below.
# Create a string with the
# REST request to a TSV data file,
# e.g. con_tsv("tipsgo10").
# Input is a string with the
# id of the data file, e.g. "tipsgo10".
# Output is a string.
con_tsv <- function(datid) {
  contsv <- paste0(con_dir, "file=data%2F",
                   datid, ".tsv.gz")
  return(contsv)
}
# Create a transformed (decompressing)
# connection from the REST request to
# a TSV data file,
# e.g. contr_tsv("tipsgo10").
# Input is a string with the
# id of the data file, e.g. "tipsgo10".
# Output is a connection.
# Dependencies: con_tsv()
contr_tsv <- function(datid) {
  contsvt <- gzcon(
    url(con_tsv(datid))
  )
  return(contsvt)
}
After downloading the TOC file and reading it into a data frame in R…
# TOC in English:
toc <- read.table (
con_toc, header = T, sep = "\t")
# Trim title string
toc$title <- str_trim (toc$title)
… we can see that there are 1536 tables.
Searching the text for the table codes we are interested in does not require advanced programming skills, yet it allows for systematic integration.
Specific data sets are available in the “data” directory.
File names differ according to whether the format is SDMX or TSV:
SDMX: aact_ali01.sdmx.zip
TSV: aact_ali01.tsv.gz
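These names can be generated mechanically from a dataset code; a minimal sketch (`bulk_names` is a hypothetical helper of mine, not part of Eurostat's tooling):

```r
# Build the two bulk-download file names for a dataset code.
bulk_names <- function(datid) {
  c(sdmx = paste0(datid, ".sdmx.zip"),
    tsv  = paste0(datid, ".tsv.gz"))
}
bulk_names("aact_ali01")
# -> sdmx: "aact_ali01.sdmx.zip", tsv: "aact_ali01.tsv.gz"
```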
Each data set has a unique code, which we need in order to build the appropriate REST request, following the pattern:
[con_dir]file=data%2F[data flow code][file extension]
For example, to capture the data table on international trade and foreign direct investment, whose online data code is tet00045, the REST request to access/download the compressed TSV file is (attention: opening the URL downloads the file): https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftet00045.tsv.gz
But using this method we need an extra step to read the file, as it is compressed; see the third step for a way around this.
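The same request string can be assembled with the con_tsv() helper from the first section; the fence below repeats the con_dir definition so it is self-contained:

```r
# Base of the Bulk Download REST requests (as defined earlier):
con_dir <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&"
# Same helper as in the first section:
con_tsv <- function(datid) {
  paste0(con_dir, "file=data%2F", datid, ".tsv.gz")
}
con_tsv("tet00045")
# -> "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftet00045.tsv.gz"
```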
Let’s consider we want to identify series related to Public Debt, in the context of the Macroeconomic Surveillance Procedure of the European Commission.
We can have a quick look in the table of contents. I am looking for the string “EDP” (Excessive Deficit Procedure) in the table titles inside the TOC I’ve just saved:
# Note that the TOC includes
# different types of elements:
# "comext", "dataset", "folder"
# and "table". Make sure you
# are searching within the
# right element type.
# Filter for tables in the TOC:
ftab <- toc %>%
  filter(type == "table")
# Data tables that match the search:
kable(
  ftab[grep("EDP", ftab$title), c("code", "title")]
)
 | code | title |
---|---|---|
822 | tipsgo10 | General government gross debt (EDP concept), consolidated - annual data |
888 | tipsgo20 | General government gross debt (EDP concept), consolidated - quarterly data |
Keep in mind that these REST requests download compressed files, so you need either to decompress them manually or find a way to have them ready for your R/Python/whatever context. See the next section for an R example.
Once the code is identified, let's explore two options: download and decompress, or use the get_eurostat method from the eurostat package.
R has a method, gzcon, that wraps an existing connection and decompresses reads (or compresses writes) through it. It was used to devise the function contr_tsv earlier, which can be used here.
# REST request for the
# data of interest:
# Fetch TOC: Save it and
# trim the title string.
toc <- read.table (
con_toc, header = T, sep = "\t")
toc$title <- str_trim (
toc$title, side = "both")
# Filter for tables in the TOC.
ftab <- toc %>%
filter (type=="table")
#There are 2 data sets with
# "EDP" string in the title,
# let's fetch the first one
# "tipsgo10":
code <- ftab [grep ("EDP",
ftab$title),
"code"][1]
# Transformed connection to
# a specific file.
con_file <- contr_tsv ( code )
#Read as text file.
tempf <- readLines(con_file)
# Pass it into a data frame;
# remember it is a TSV file with a header.
tipsgo10 <- read.table(
textConnection(tempf),
encoding = "UTF-8",
sep = "\t", header = T)
# Have a look at the first 3 columns
kable(
head (tipsgo10[,c(1:3)]),
caption = "First Three columns of the data set"
)
na_item.sector.unit.geo.time | X1995 | X1996 |
---|---|---|
GD,S13,MIO_NAC,AT | 120660.0 | 124601.6 |
GD,S13,MIO_NAC,BE | 276343.7 | 276440.2 |
GD,S13,MIO_NAC,BG | : | : |
GD,S13,MIO_NAC,CY | 3765.7 | 4012.6 |
GD,S13,MIO_NAC,CZ | 216645.0 | 211777.0 |
GD,S13,MIO_NAC,DE | 1040187.3 | 1110441.1 |
Looking at the format of the TSV file, we can see that it is a set of time series with all descriptive elements wrapped in the first column. So to work with the data, we still have a little work to do.
For example, if we want to filter the data by country (geo) or national account item (na_item), we need to split the first column into separate variables.
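The unwrapping itself boils down to a single strsplit() call on the composite key; for one row:

```r
# One composite key from the first column of the TSV:
key <- "GD,S13,MIO_NAC,AT"
# Split into na_item, sector, unit, geo:
parts <- strsplit(key, ",")[[1]]
parts
# -> "GD" "S13" "MIO_NAC" "AT"
```

The function below applies this idea to the whole data frame, and adds some bookkeeping columns along the way.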
eu_tsv <- function (x) {
# Small helper to unwrap the
# first column, report the number
# of NAs, the first and last year,
# and the number of observations
# in each time series.
# The input is a data frame
# from Eurostat with annual data.
# First and last observation
# are also in the TOC data set,
# although I prefer to take
# them from the actual data.
#--- 1- Functions to find the ###
# first and last observation,
# NAs between them, the number
# of observations, and to remove
# flags from observations.
# findlast(): year (column name) of the
# first non-NA observation in the row;
# with years in ascending column order
# this is the series start.
findlast <- function(y) {
  str_sub(names(y)[-1][which(!is.na(y[-1]))][1], 2, 5)
}
# findfirst(): year of the last non-NA
# observation; with ascending year
# columns, the series end.
findfirst <- function(y) {
  nn <- names(y[-1])[which(!is.na(y[-1]))]
  str_sub(nn[length(nn)], 2, 5)
}
# Number of NAs
findnas <- function (y){
length (which (is.na(y)))
}
# Number of valid observations
numobs <- function (y){
length (which (!is.na(y)))
}
# Remove flags/labels and coerce to numeric:
# keep only digits, minus sign and decimal
# point, so ":" and letter flags yield NA.
laboff <- function(y) {
  as.numeric(gsub("[^0-9.-]", "", y))
}
#--- 2- Apply previous functions ###
# to obtain the resulting data frame.
f <- data.frame (apply (x[,-1], 2, laboff))
e <- data.frame (
end = as.integer (apply (x,1,findfirst)),
start = as.integer (apply (x,1,findlast)),
# Note: despite their names, findfirst()
# picks the last non-NA column and
# findlast() the first, so end/start come
# out right for ascending year columns.
NAs = apply (x,1,findnas),
obs = apply (x, 1, numobs)
)
#--- 3- Split first column ###
# First column into character.
char <- as.character (x[,1])
# Split strings into a list.
l <- strsplit (char, ",")
# Pass it into a data frame.
d <- data.frame (
matrix ( unlist(l),
nrow=length(l), byrow=T)
)
# Name new variables in the data set:
# Character vector with names.
namd <- unlist (strsplit (
str_replace_all (
names (x)[1], "\\.",","),
","))
# Assign names; drop the trailing "time"
# label (the years are already column names).
names (d) <- namd[-length(namd)]
#--- 4- Devise output and return ###
d <- cbind (e, d, f)
return (d)
}
Let’s use the function with the table we just downloaded:
eu_gd <- eu_tsv (tipsgo10)
# Note: Column NAs are counting non-available
# data between the first and the last available
# period. It does not consider NAs introduced at
# the beginning/end of series with less data to
# fit in to the data frame.
# Have a look
kable(
head (eu_gd[, c(1:10)]),
caption = "Data set after applying eu_tsv method"
)
end | start | NAs | obs | na_item | sector | unit | geo | X1995 | X1996 |
---|---|---|---|---|---|---|---|---|---|
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | AT | 120660.0 | 124601.6 |
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | BE | 276343.7 | 276440.2 |
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | BG | NA | NA |
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | CY | 3765.7 | 4012.6 |
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | CZ | 216645.0 | 211777.0 |
2019 | 1995 | 0 | 26 | GD | S13 | MIO_NAC | DE | 1040187.3 | 1110441.1 |
Let’s use the output: simple plots of Gross government debt for Spain and Belgium since 2002.
# Reshape and filter:
leu_gd <- eu_gd %>%
  pivot_longer(               # long format
    c(X1999:X2019),
    names_to = "time",
    values_to = "values") %>%
  mutate(time =               # time as numeric
    as.integer(gsub("X", "", time))) %>%
  filter(time > 2001) %>%     # select period
  filter(geo %in% c("ES", "BE")) %>%
  filter(unit == "PC_GDP") %>%
  mutate(time = factor(time)) %>%
  select(geo, values, time)   # select variables
# Basic line plot
theme_set(theme_minimal())
p <- ggplot(
data = leu_gd,
aes(x = time,
y = values,
group=geo,color=geo)
)+
geom_line(size = 2)
p
From the previous step you can see that I went through a lot of trouble to devise a function simply to arrange the data.
While researching for the function I came across a method created in the R package eurostat which can make life easier.
Using get_eurostat simplifies the process, as the output is an object that is ready to work with. Although each method one has not developed oneself should be properly tested, the first look presented here is promising.
In the example below I use it in a way that is comparable with the previous section, but the package offers other useful methods to customize data queries.
# Call the package
# library(eurostat)
# Retrieve the data:
eu_gdme <- get_eurostat ("tipsgo10", cache=F)
# Have a look: it looks better
kable(
head (eu_gdme),
caption = "Data set fetched using eurostat package"
)
na_item | sector | unit | geo | time | values |
---|---|---|---|---|---|
GD | S13 | MIO_NAC | AT | 1995-01-01 | 120660.0 |
GD | S13 | MIO_NAC | BE | 1995-01-01 | 276343.7 |
GD | S13 | MIO_NAC | CY | 1995-01-01 | 3765.7 |
GD | S13 | MIO_NAC | CZ | 1995-01-01 | 216645.0 |
GD | S13 | MIO_NAC | DE | 1995-01-01 | 1040187.3 |
GD | S13 | MIO_NAC | EE | 1995-01-01 | 227.5 |
That was fast, and the data looks just right. Let's use the data seamlessly as before.
# Filter and plot: Gross government
# debt for Spain and Belgium since 2002.
leu_gdme <-
eu_gdme %>%
filter (year(time)>2001)%>%
filter (geo%in%c("ES","BE"))%>%
filter (unit=="PC_GDP") %>%
mutate (year = factor(year (time))) %>%
select (geo, values, year)
# Basic line plot
theme_set(theme_minimal())
p <- ggplot(
data = leu_gdme,
aes(x = year, y = values,
group=geo,color=geo)
)+
geom_line(size = 2)
p
# Note: remember the package may cache
# downloads in an eurostat folder.
# If you won't need the cached files,
# you can set cache=F in the
# "get_eurostat" function, as I did.
# To clean the cache:
# clean_eurostat_cache(cache_dir = NULL)
Update: I have a more detailed example on the applied use of the eurostat package here: https://rpubs.com/IbelenI/tsmacro.
How do I know that the variable in the example refers to “Government consolidated gross debt”?
In the file the national account item (na_item) is “GD”.
One must consult the Bulk Download facility to retrieve the necessary dictionaries (alternatively, Data Structure Definition files are available in SDMX).
The Bulk Download facility's dictionaries are organized by concept; for example, there is one dictionary for the variable GEO with country codes, and another for national account items (na_item).
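Mirroring the con_tsv() helper above, one can sketch a request builder for the dictionary files. The dic%2F[lang]%2F[id].dic path is an assumption of mine based on the facility's directory layout, so verify it against the Bulk Download Listing before relying on it:

```r
# Base of the Bulk Download REST requests (as defined earlier):
con_dir <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&"
# Hypothetical helper: REST request to a dictionary file.
# The "dic%2F<lang>%2F<id>.dic" path is an assumption.
con_dic <- function(dicid, lang = "en") {
  paste0(con_dir, "file=dic%2F", lang, "%2F", dicid, ".dic")
}
con_dic("na_item")
```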
Again, there is a method in the eurostat package that allows you to easily retrieve dictionaries in a specific language (English, French or German).
# For example, let's retrieve
# the national accounts dictionary:
naitemdic <- get_eurostat_dic(
"na_item", lang = "en")
GD stands for: Government consolidated gross debt
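The lookup itself is a simple subset. Here is a sketch with a toy stand-in for the dictionary (the real one comes from get_eurostat_dic; the code_name/full_name column names are an assumption about the package's output and should be checked against your version):

```r
# Toy stand-in for the na_item dictionary (illustrative only;
# the real data frame comes from get_eurostat_dic("na_item")):
naitemdic <- data.frame(
  code_name = c("GD", "B9"),
  full_name = c("Government consolidated gross debt",
                "Net lending (+)/net borrowing (-)"),
  stringsAsFactors = FALSE
)
# Look up the label for the "GD" code:
naitemdic$full_name[naitemdic$code_name == "GD"]
# -> "Government consolidated gross debt"
```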
The methods in the eurostat package seem to work pretty smoothly and fit the research needs we have right now; see the RPubs example linked above.
From here, “the sky is your limit” ;)
Eurostat’s Web Services: SDMX, JSON and UNICODE.
When there are enough DevOps resources, in a continuous integration (CI) framework, Eurostat's SDMX, JSON and UNICODE web services provide REST request structures and APIs to work with.
Eurostat’s Data Structure in a SDMX file.
Useful links for SDMX Web Services usage:
Instructions and usage information for:
Notes:
I have also found a package, rsdmx, to open SDMX files, which would allow downloading specific queries:
https://cran.r-project.org/web/packages/rsdmx/vignettes/quickstart.html