1 Introduction

When approaching the challenge of systematically maintain data ingestion from Eurostat with few DevOps resources, one can use Eurostat’s Bulk Download Facility

Updates are done twice a day, at 11:00 and 23:00, and the data is available in two formats: tsv (tab separated values) and SDMX (Statistical Data and Metadata eXchange).

Those allow access to updated “plain text” versions of table of content (TOC) of their data structure and data sets using (REST requests) patterns available, which can be easily implemented in “R/Phyton-minded” research projects.

There are 3 simple steps to retrieve data from Eurostat using R:
1. Search in Eurostat’s TOC to retrieve time series codes. 2. Fetch data-sets or time-series. 3. Retrieve related information from dictionaries.

Although there’s is a recommended previous step which is understanding how Eurostat structures things.

2 Basics of Bulk Download Facility.

Have a look at the Bulk Download Listing .

Here you will easily understand what information is available. Specially review the PDF document BulkDownload_Guidelines.pdf.

Eurostat’s Bulk Download Content.

The first thing you need to notice is that lists describing the Data Structure will include not only information about data sets or tables but also information on folders and files.

An advantage of official statistics is that they rely on well-established standards and have been developing from quite a while. Meaning that once a table/data flow is properly identified its REST request statement to files/folders remain consistently in place. Therefore we can rely on this structure to remain stable and build upon it in our CI framework.

3 One, two, three… data!

To keep it straight forward I will start downloading the English version of the table of content, then I will look for time series related to Excessive Dept Procedure (EDP) and retrieve it.

3.1 Retrieve list of available data in the facility.

Start downloading the English version of the table of content TOC file: table_of_contents_en.txt available in the Bulk Download Facility which is “plain text” (Tab separated values file or tsv):

Eurostat’s English TOC. Tab separated values file.

# Local paths are used as a
# variable within the code.
# Set your paths here:
loca <- paste0(getwd(), "//auxi//")
# We are going to need
# REST requests to connect
# with:

  # The table of contents:
    con_toc <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=table_of_contents_en.txt"
  
  # Data directory:
    con_dir <- "https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&"

  # Data file: is built using methods.
  
    
  # Create string with
  # REST request
  # to a TSV data file.
  # e.g. con_tsv("tipsgo10")
  
    # Input is a string with the 
    # id of the data file "tipsgo10".
    # Output is a string.

    con_tsv <- function (datid) {
      contsv <- paste0 (con_dir,"file=data%2F",
              datid ,".tsv.gz")
      return (contsv)
    }
  
  
  # Create string with 
  # transformed REST request
  # to a TSV data file.
  # e.g. contr_tsv ("tipsgo10")
    
    # Input is a string with the 
    # id of the data file "tipsgo10".
    # Output is a string.
    # Dependencies: con_tsv()
      contr_tsv <- function (datid) {
        contsvt <- gzcon ( 
          url (con_tsv(datid))
          )
          return (contsvt)
      }

After downloading the TOC file and pass it to a data set in R…

  # TOC in English:
  toc <- read.table (
  con_toc, header = T, sep = "\t")
  
  # Trim title string
  toc$title <- str_trim (toc$title) 

… we can see that there are 1536 tables.

Searching through the text for the table codes we may be interested in, does not require high programming skills while it allows for systematic integration.

3.2 Identify the series of data in the TOC

Specific data sets are available in the “data” directory.

File name differ according if it is SDMX or TSV.
  SDMX:   aact_ali01.sdmx.zip
  TSV:    aact_ali01.tsv.gz
  

Each data set has a unique code which we need to identify to use the appropriate file REST request with the pattern:

[con]file=data%2F[data flow code][file extension]

For example, to capture the data table on International trade and foreign direct investment which online data code is tet00045, REST request to access/download the compressed TSV file will be (attention: clicking on the URL will download the file):

https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftgs00045.tsv.gz

But using this method we need an extra step to read the file as it is compressed. See 3rd step to see a way around this.

Let’s consider we want to identify series related to Public Debt, in the context of the Macroeconomic Surveillance Procedure of the European Commission.

We can have a quick look in the table of content- I am looking for the string “EDP” (Excessive Debt Procedure) in the table title inside the TOC I’ve just saved:

  #Note that TOC includes 
  # different type of elements:
  # "comext", "dataset", "folders" 
  # and "table". You should make 
  # sure that you are searching 
  # for the right element.

    #Filter for tables in the TOC.
    ftab <- toc %>%
      filter (type=="table")
    
    #Data tables that match the case:
    kable ( 
      ftab [grep ("EDP", ftab$title),c("code", "title")]
      )
code title
822 tipsgo10 General government gross debt (EDP concept), consolidated - annual data
888 tipsgo20 General government gross debt (EDP concept), consolidated - quarterly data

You should keep in mind those REST request are downloading compressed files, so you need either to manually decompress them or find your way around to have them ready for you R/Python/whatever context you use. See next section for R example.

4 Access and explore the data

Once the code is identified let’s explore to options: download and decompress, or use get_eurostat method from eurostat package.

5 Dictionaries

How do I know that the variable in the example was referring to “Government consolidated gross debt”.

In the file the national account item (na_item) is “GD”.

One must consult the dictionaries in the Bulk download facility to retrieve the necessary dictionaries (alternatively Data Structure files are available in SXML).

The Bulk Download Facility’s dictionaries are organized by concepts, por example there is one dictionary for the variable “GEO” with country codes, or for national account items (na_items).

Again, there is a method in the eurostat package that allows to easily retrieve dictionaries in a specific language (English, French or German).

#For example, let's retrieve 
# national account dictionary:
naitemdic <- get_eurostat_dic(
  "na_item", lang = "en")

GD stands for: Government consolidated gross debt

The methods in Eurostat package seem to work pretty smoothly and fit the research needs we have right now. See example.

From here, “the sky is your limit” ;)

6 Addendum

Eurostat’s Web Services: SDMX, JSON and UNICODE.

When having enough DevOps Resources, in a Continuous Integration (CI) framework, either SDMX, JSAON or UNICODE Web Services from Eurostat provide REST request structures and APIs to work with.

Eurostat’s Data Structure in a SDMX file.

Useful links for SDMX Web Services usage:

Instructions and usage information for:

Notes:

I have also localized a package to open SDMX files which would allow download specific queries.
https://cran.r-project.org/web/packages/rsdmx/vignettes/quickstart.html