Another weekend, another small visualization project with publicly available data. This time, I have tackled unemployment in Europe: how has it changed per country since the start of the crisis in 2007 and the subsequent European debt meltdown? Surely we can plot it to get a better look.

First, let's get the data. EuroStat is the place to go for anything involving European numbers. In this case, I have used this dataset. This very long URL will allow any interested person to download exactly the same tables I used. The dataset also includes countries such as Iceland, which I am not removing.

I've done two different visualizations of the unemployment data. The first is the standard one, showing unemployment rates with very little stratification (currently, the dataset only breaks the statistics down by sex and by age, under 25 or over 25). For the second, I thought it would be cool to find the minimum unemployment rate for each country before the crisis started (around 2007) and use those values for normalization. That way, we can see how much each rate has increased as the different crises come and go.

So, as always, with R, let's start by loading the required libraries and cleaning the data:

library(ggplot2)
theme_set(theme_bw(20))
rm(list = ls())
euro <- read.csv("data/une_rt_m_1_Data.csv", stringsAsFactors = FALSE)

# Let's clean the data
euro$Value <- as.numeric(euro$Value)
euro$Flag.and.Footnotes <- NULL
euro$AGE <- factor(euro$AGE)
euro$GEO <- factor(euro$GEO)
euro$S_ADJ <- factor(euro$S_ADJ)
euro$SEX <- factor(euro$SEX)

# Shorten Germany's label (Eurostat appends a note about the pre-1990 territory)
l1 <- levels(euro$GEO)
l1[grepl("Germany", l1)] <- "Germany"
levels(euro$GEO) <- l1

# Get true date, and month and year from euro$TIME variable
tmp1 <- strsplit(euro$TIME, "M")
tmp1 <- do.call(rbind, tmp1)
euro$year <- as.numeric(tmp1[, 1])
euro$month <- as.numeric(tmp1[, 2])
euro$date <- as.Date(sprintf("%s-%02d-01", euro$year, euro$month))

# Get only countries we want (remove all average European metrics)
euro <- euro[!grepl("Euro", euro$GEO), ]
euro$GEO <- factor(euro$GEO)

# Only use seasonally adjusted data
euro <- euro[euro$S_ADJ == "Seasonally adjusted data", ]
euro$S_ADJ <- NULL

Actually, some of the filtering I am doing here could have been done when generating the data table, but I discovered it afterwards and it was faster for me to add a couple of lines to my script than go back and fix it at the source.
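The date handling in the cleaning script is the least obvious step, so here is a minimal sketch of it in isolation. It assumes the TIME column holds strings shaped like "2007M01"; the three values below are invented for illustration and do not come from the dataset.

```r
# Toy input in the same "<year>M<month>" shape as the TIME column (values invented)
time_raw <- c("2007M01", "2007M12", "2013M06")

# Split on the literal "M" and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(time_raw, "M", fixed = TRUE))
year  <- as.integer(parts[, 1])
month <- as.integer(parts[, 2])

# Build a proper Date, pinning every observation to the first of the month
dates <- as.Date(sprintf("%d-%02d-01", year, month))
print(dates)
```

Having a real Date column is what lets ggplot2 draw a continuous time axis instead of treating each month as a separate label.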

Now I can compute the minimum (and the maximum, while I'm at it) so I can normalize afterwards:

avg_unemployment <- aggregate(Value ~ year + SEX + GEO + AGE, euro, mean, 
                          na.rm = TRUE)
# And now get the minimum (and the maximum for later). The minimum is taken before
# the crisis
min_unemployment <- aggregate(Value ~ SEX + GEO + AGE, 
                              avg_unemployment[avg_unemployment$year < 2007, ], 
                              min, 
                              na.rm = TRUE)
# The maximum is taken after the crisis
max_unemployment <- aggregate(Value ~ SEX + GEO + AGE, 
                              avg_unemployment[avg_unemployment$year >= 2007, ], 
                              max, 
                              na.rm = TRUE)
names(min_unemployment) <- c("SEX", "GEO", "AGE", "min_unemployment")
names(max_unemployment) <- c("SEX", "GEO", "AGE", "max_unemployment")

Almost there. Now I only have to normalize...

euro <- merge(euro, min_unemployment)
euro <- merge(euro, max_unemployment)
euro$norm_unemployment <- euro$Value / euro$min_unemployment
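To make the normalization concrete, here is a small sketch of the same aggregate-then-merge pattern on made-up numbers. The countries "A" and "B" and all the values are hypothetical, and the sex and age grouping is left out to keep the example short.

```r
# Invented yearly averages for two hypothetical countries "A" and "B"
toy <- data.frame(
  GEO   = rep(c("A", "B"), each = 4),
  year  = rep(c(2005, 2006, 2008, 2010), times = 2),
  Value = c(8, 6, 9, 12,   4, 5, 5, 10)
)

# Baseline: lowest value per country before 2007
toy_min <- aggregate(Value ~ GEO, toy[toy$year < 2007, ], min)
names(toy_min) <- c("GEO", "min_unemployment")

# merge() joins on the shared column (GEO), repeating the baseline on every row
toy <- merge(toy, toy_min)
toy$norm_unemployment <- toy$Value / toy$min_unemployment
toy <- toy[order(toy$GEO, toy$year), ]
print(toy)
```

A normalized value of 2 means the rate is twice its pre-2007 minimum, so countries with very different absolute rates become comparable in terms of relative increase.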

... and plot. For the plotting part, I am going to reorder the GEO factor so that the countries with the highest unemployment after the crisis are shown first. The ordering of the following panels therefore won't be alphabetical, but will carry some meaning instead (the alphabetical version can easily be generated by commenting out the reorder call in the code chunk below). I am using the country as the faceting factor, so each country goes into its own panel, because I think it helps make sense of this bunch of data (consider that we also have sex and age as pieces of information we want to represent). I am also going to let the y axis vary freely between panels, because there are big differences in scale between countries. So, here it is:

# Compute the ratio of the maximum to the minimum
euro$ratio <- with(euro, max_unemployment / min_unemployment)
euro$GEO <- reorder(euro$GEO, -euro$max_unemployment, min)

euro$group <- paste(euro$SEX, euro$AGE)

euro_total <- euro[euro$SEX == "Total" & euro$AGE == "Total", ]
euro_partial <- euro[euro$SEX != "Total" & euro$AGE != "Total", ]

plt1 <- ggplot(euro_partial) + 
        geom_line(data = euro_total, aes(x = date, y = Value),
                  color = "black", alpha = 0.2, size = 2) +
        geom_line(aes(x = date,
                      y = Value, 
                      linetype = AGE,
                      color = SEX,
                      group = group), size = 1) +
        facet_wrap(~ GEO, ncol = 5, scales = "free_y") +
        scale_linetype_manual(values = c(2, 3), name = "Age") +
        scale_color_manual(values = c("red", "blue"), name = "Sex") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ylab("Unemployment rate [%]\n") + xlab("\nDate") +
        ggtitle("Unemployment rates in Europe, 2002 - 2015\nGray line represents total average\n")

print(plt1)
ggsave(plt1, filename = "/tmp/mean_unemployment.pdf", height = 20, width = 25)
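The reorder call is what controls the panel order, since facet_wrap lays out panels in factor level order. Here is a minimal sketch of how it behaves, using invented values for three hypothetical countries:

```r
# Invented per-row values: each country appears twice, as in the merged table
geo     <- factor(c("A", "A", "B", "B", "C", "C"))
max_une <- c(10, 12, 25, 20, 15, 15)

# reorder() sorts levels by FUN applied per level, in increasing order.
# Negating the values and taking min puts the largest peak first.
geo_sorted <- reorder(geo, -max_une, min)
print(levels(geo_sorted))
```

Because a country has several rows with different maxima (one per sex and age group), taking min over the negated values means each country is ranked by its single highest peak.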

And now let's do the same in order to obtain the normalized rates:

# Order according to the ratio
euro$GEO <- reorder(euro$GEO, -euro$ratio, min)

euro$group <- paste(euro$SEX, euro$AGE)
euro_total <- euro[euro$SEX == "Total" & euro$AGE == "Total", ]
euro_partial <- euro[euro$SEX != "Total" & euro$AGE != "Total", ]

plt2 <- ggplot(euro_partial) + 
        geom_line(data = euro_total, aes(x = date, y = norm_unemployment),
                  color = "black", alpha = 0.2, size = 2) +
        geom_line(aes(x = date,
                      y = norm_unemployment, 
                      linetype = AGE,
                      color = SEX,
                      group = group), size = 1) +
        facet_wrap(~ GEO, ncol = 5, scales = "free_y") +
        scale_linetype_manual(values = c(2, 3), name = "Age") +
        scale_color_manual(values = c("red", "blue"), name = "Sex") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        ylab("Increase in unemployment rates from minimum point before 2007\n") + 
        xlab("\nDate") +
        ggtitle("Increase in unemployment rates in Europe, 2002 - 2015, from minimum levels before 2007\nGray line represents total average\n")

print(plt2)
ggsave(plt2, filename = "/tmp/mean_norm_unemployment.pdf", height = 20, width = 25)

So this is basically it. I am not going to do any interpretation of the visualizations, as some more data would be needed (like how each country measures unemployment, or what kinds of contracts are counted: fixed term, mini-jobs, etc.). I think these visualizations, as they stand now, are a good starting point for building a complete picture of this issue.

The R code used to generate these plots can be found here.

Update: after several people suggested it, I've generated the plots with a fixed y axis. Here they are: unemployment, normalized unemployment.