My knowledge of history is patchy at best. I've read Judt on post-war Europe, digested several Preston books on the Spanish Civil War, laid my hands on Kershaw's two-volume biography of Adolf Hitler (and, by extension, of the Second World War), and a bit more. Canada, where I am living now, remains more or less a mystery (save for a lightning-fast history lesson I got thanks to my outstanding French teacher), though I plan on fixing that as soon as I have some free time. The rest of the planet is pretty much a black hole, except for isolated details. I find myself in the same situation as that hypothetical Martian who arrives on Earth on a cold night with no clue what is going on and needs to get up to speed fast. What would I do if I were in his / her / its place?

There has to be some way of understanding how the world has changed in the past century that involves programming and data processing, so as to get a quick (even if very approximate) grasp of recent history. I started thinking that the record of votes in the General Assembly of the United Nations could be a good indicator. While there is no good official dataset for this (building one seems to involve too much scraping for a quick solution), these guys have done an outstanding job at providing a comprehensive record.

So, after downloading several files from that site, I have all voting records from the UN General Assembly from 1946 to 2013. If you want all the code in one go, here's the public GitHub repository for this project, but the step-by-step plan is:

  • Read the data, inspect it, and clean it. This will be pretty easy in this case, as the files are already in good shape. Also, as the vote outcome can be encoded in different ways (see the Codebook file for an extended explanation), I will re-encode it to simplify: 1 for yes, 0 for any kind of abstention or non-vote, -1 for no.
# Library loading (package guesses based on the functions used below;
# please refer to the full code file for the definitive list)
library(readr)    # read_tsv
library(dplyr)    # left_join
library(reshape2) # dcast
library(lsa)      # cosine
library(tsne)     # tsne
library(ggplot2)
library(ggrepel)  # geom_text_repel

# From https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.1/12379
un_data <- read_tsv("~/Dropbox/data/un/Dyadicdata.tab.gz")
# Raw voting data
raw_data <- read_tsv("~/Dropbox/data/un/RawVotingdata.tab.gz")
# Use it for country codes
cc_data <- read_tsv("~/Dropbox/data/un/Idealpoints.tab.gz")
cc_data <- cc_data[, c("ccode", "CountryName")]
cc_data <- cc_data[!duplicated(cc_data$ccode), ]

# Get raw voting data and obtain year and country name
raw_data$year <- 1945 + raw_data$session
# Remove non-members (vote code 9)
raw_data <- raw_data[which(raw_data$vote != 9), ]

# Vote recoding.
# Ternary vote:
# 1  -> Yes
# -1 -> No
# 0  -> Abstain or no vote
raw_data$vote3 <- ifelse(raw_data$vote == 1, 1,
                         ifelse(raw_data$vote == 3, -1, 0))

# Merge voting data
raw_data <- left_join(raw_data, cc_data, by = "ccode")
# Let's suppose I can get rid of country 511 (doesn't have name) and NA.
# Can't find it in the codebook (?)
raw_data <- raw_data[!is.na(raw_data$CountryName), ]
  • With this, I can proceed and compute the cosine similarity between each pair of countries for each year. The output of this mathematical function will be 1 if two countries always voted the same way, and 0 if they were completely opposed. It is widely used in the construction of recommender systems and yields a pretty good indicator of how similar a given country's voting is to any other's. We can also explore temporal changes, but more on this later.
# Generate the cosine matrices. One per year, returned as a list
years <- unique(raw_data$year)
cosined_data <- lapply(years, function(year) {
    cat("Computing cosine metrics for year", year, "\n")
    raw_data_year <- raw_data[raw_data$year == year, ]
    raw_data_year$CountryName <- factor(raw_data_year$CountryName)
    raw_data_year_wide <- dcast(raw_data_year, rcid + vote3 ~ CountryName, fun.aggregate = length)
    cos <- cosine(as.matrix(raw_data_year_wide[, -c(1, 2)]))
    return(cos)
    })
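To see what the cosine similarity is actually measuring here, a toy check may help. `cosine_sim` is a hand-rolled helper (the analysis above uses the `cosine` function from a package instead), and the vote vectors are made up: each entry says whether a country cast a given (roll call, vote) combination, mirroring the wide matrix produced by `dcast`.

```r
# Toy check (synthetic data, not part of the analysis): cosine similarity
# between binary "did this country cast this (rcid, vote3) combination" vectors.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

a <- c(1, 1, 0, 1)      # hypothetical voting pattern for country A
b <- c(1, 1, 0, 1)      # country B votes identically
c_vec <- c(0, 0, 1, 0)  # country C never coincides with A

cosine_sim(a, b)      # 1: identical voting
cosine_sim(a, c_vec)  # 0: completely opposed (no overlap)
```

Since the vectors are non-negative counts, the similarity lives in [0, 1] here, which is why 1 - cos works as a distance in the next step.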
  • If cos is the cosine similarity between any pair of countries, 1 - cos is the dissimilarity, which I will treat as a distance. This distance matrix can be used as the input for t-SNE, which will output an x, y coordinate pair per country to generate cluster maps: the idea is that we can visualize a map of the world in which the positioning is driven by voting behaviour rather than geography. In code, it looks something like this:
for (year in years) {
    cat("Processing year", year, "...\n")
    cos <- cosined_data[[year - min(raw_data$year) + 1]] # 1946 -> 1, and so on
    t_year <- tsne(as.dist(1 - cos), k = 2, max_iter = 3000,
                   perplexity = 10, whiten = FALSE)

    t_year <- data.frame(x = t_year[, 1], y = t_year[, 2],
                         country = rownames(cos))

    plt1 <- ggplot(t_year) + geom_text_repel(aes(x = x, y = y, label = country)) +
            geom_point(aes(x = x, y = y)) +
            ggtitle(sprintf("The world in %d", year))
    plot(plt1)
    ggsave(sprintf("/tmp/un_cluster_%s.pdf", year), plot = plt1,
           width = 15, height = 15)
}

That loop above was me being overoptimistic about the perplexity parameter, assuming that it would work just the same for all yearly datasets. Anyone who has worked with t-SNE will tell you that this will hardly be the case. As a consequence, when I show clusters later on in this post, please keep in mind that each one of them will surely have been produced with slightly different values. For a very comprehensive introduction to this topic, please read this excellent piece.

  • Also, as I am interested in seeing how different countries have modified their relations with time, I will have a function to extract precisely that so it can be easily plotted.
get_time_evolution <- function(cosined_data, country1, country2) {
    # Returns the evolution of the "agreement" between country1 and country2
    # through the different years. If one year does not contain data, put NA
    # in place.

    offset <- 1945 # year offset
    c_str <- paste(country1, country2, sep = " - ")

    agreement <- sapply(cosined_data, function(m) {
        # m is a matrix with the proper row / column names. Let's try to
        # find the requested countries in there.
        country1 <- which(grepl(country1, rownames(m)))
        country2 <- which(grepl(country2, colnames(m)))
        if (length(country1) == 1 && length(country2) == 1) {
            return(m[country1, country2])
        } else {
            return(NA)
        }
    })

    # Build data.frame
    res <- data.frame(agreement = agreement,
                      countries = c_str,
                      years = seq_along(cosined_data) + offset)
    return(res)
}
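The lookup inside that function is worth a small illustration. This toy run (entirely synthetic: fictional countries, made-up similarity values) shows how the name matching works per year, and how a missing country yields NA:

```r
# Synthetic example of the per-year lookup used in get_time_evolution:
# one named cosine matrix per "year"; Borduria is missing in year 2.
m1 <- matrix(c(1, 0.8, 0.8, 1), nrow = 2, ncol = 2,
             dimnames = list(c("Atlantis", "Borduria"),
                             c("Atlantis", "Borduria")))
m2 <- matrix(1, nrow = 1, ncol = 1,
             dimnames = list("Atlantis", "Atlantis"))
toy_data <- list(m1, m2)

agreement <- sapply(toy_data, function(m) {
    i <- which(grepl("Atlantis", rownames(m)))
    j <- which(grepl("Borduria", colnames(m)))
    if (length(i) == 1 && length(j) == 1) m[i, j] else NA
})
agreement  # 0.8 for year 1, NA for year 2
```

Note that `grepl` does partial matching, so an ambiguous query (one that matches several country names) also falls through to NA.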

Ok, we are done. Let's take a look at the world in 1948, which is a good example of the final result.

The world in 1948

Seems to make sense. There is one very clear cluster of mostly South American countries, a "Western" bloc, a group of mostly Arab countries, and the Communist bloc. x and y have no special meaning in this plot, so they can't be directly read as social or economic axes.

Also, keep in mind that the dataset tries to keep country names consistent across time. So, Russia is Russia in 2013 and in 1948.

But 1948 had very few countries as UN members. We can advance a bit more, to 1966, for instance. This is what it looks like:

The world in 1966

Clusters are not so clear (there are many more data points now), but look closely at the center of the image: Libya, Tunisia, Syria, Sri Lanka, Indonesia, India, Morocco... it's a pretty good approximation of a cluster of countries that formed the Non-Aligned Movement (thanks, Antonia, for pointing this out). Then we have the other two typical blocs and some smaller clusters that are more difficult to make sense of.

And, just to finish this part, the world in 2013, which I will not comment on:

The world in 2013


In any case, these are static images, a snapshot of what the UN voting record is telling us for each particular year. But it can be more informative to plot the agreement (in our case, the similarity) between any given pair of countries across time. We can do that easily with the utility function I presented above. For example:

ccdata <- rbind(get_time_evolution(cosined_data, "United States of America", "Israel"),
                get_time_evolution(cosined_data, "United States of America", "Iran"),
                get_time_evolution(cosined_data, "United States of America", "United Kingdom"),
                get_time_evolution(cosined_data, "United States of America", "Russia"))

plt1 <- ggplot(ccdata) + geom_line(aes(x = years, y = agreement, color = countries), size = 1.2) +
        scale_color_brewer(palette = "Set1", name = "") +
        scale_x_continuous(breaks = seq(min(ccdata$years), max(ccdata$years), 5),
                           minor_breaks = seq(min(ccdata$years), max(ccdata$years), 1)) +
        theme(legend.position = "bottom", legend.direction = "vertical")
plot(plt1)

US agreement with several countries

I have added cosined_data.Rda to the repository so you can directly load it and use this utility function to plot whatever you like.


As usual with these projects, there are many caveats worth mentioning. Mainly, this is a very (very!) rough simplification. The votes have been encoded in a way that probably loses useful information and, more importantly, the UN General Assembly votes will not capture all the subtleties and nuances of international politics. Also, one must take into account that I am computing the similarity using the complete voting records for a year, but a clustering done after separating the votes into different sub-issues (for instance, environmental issues or Palestine) would probably yield more fine-grained results. It's good for having fun, but one should be careful when interpreting the results.
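The per-issue refinement would only change the filtering step. As a sketch (with a synthetic data frame and a hypothetical `issue` column; the real dataset tags topics with its own flags, described in the codebook), one would subset the roll calls by topic before the dcast/cosine step:

```r
# Synthetic sketch: restrict roll calls to one topic before computing
# per-year cosine matrices. The `issue` column is hypothetical.
toy_votes <- data.frame(
    rcid        = c(1, 1, 2, 2),
    CountryName = c("Atlantis", "Borduria", "Atlantis", "Borduria"),
    vote3       = c(1, 1, -1, 1),
    issue       = c("environment", "environment", "palestine", "palestine"))

env_votes <- toy_votes[toy_votes$issue == "environment", ]
# ... then dcast() + cosine() per year, exactly as before
nrow(env_votes)  # 2
```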

In any case, if you play around with this dataset and find something interesting, drop me a line!


References

  • Voeten E, Strezhnev A, and Bailey M, United Nations General Assembly Voting Data, Harvard Dataverse, 2016. hdl:1902.1/12379.

  • Also, check these two articles [1, 2] for a different approach.