A lot of clustering / data exploration tutorials out there use the famous iris
dataset to show how PCA, t-SNE, MDS and other techniques work. Something like
this:
pca1 <- prcomp(iris[, -5])
plot(pca1$x[, 1], pca1$x[, 2], col = iris$Species, pch = 19)
This is alright, but it starts from a premise that is not always true: that we know which class each data point belongs to. In many real-life problems, however, the dimensionality reduction step is done precisely because we don't know how the data cluster together (if at all) and we want to check what is really going on.
Consider the following dataset and transformations:
dataset1 <- data.frame(feature1 = c(rnorm(40, 0, 1), rnorm(60, 4, 1)),
                       feature2 = c(rnorm(20, 10, 2), rnorm(20, 0, 2), rnorm(60, 5, 2)),
                       feature3 = rnorm(100, 0, 1),
                       feature4 = c(rnorm(50, 0, 1), rnorm(50, 5, 1)),
                       feature5 = c(rnorm(40, 2, 1), rnorm(60, 6, 2)))
library(tsne)
tsne1 <- tsne(scale(dataset1), perplexity = 10)
plot(tsne1[, 1], tsne1[, 2])
We don't know (though we can take an educated guess) how many classes are in
this dataset or which features are useful to separate them. So, though we can
see some very clear clusters, the only thing we know about them is their t-SNE x
and y coordinates. We can now go back to our original dataset, take a subset of
the data (say, the rows for which the transformation yielded x < -150) and then
check what is going on in the original featureN space.
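A minimal sketch of that manual check follows. So that it runs on its own, it uses hypothetical stand-ins (demo_data, demo_coords and a threshold of 0 are my assumptions, not objects from this post), and it takes the 2-D coordinates from prcomp rather than tsne; any two-column embedding could be used the same way.

```r
# Hypothetical stand-ins: in the post these would be dataset1 and tsne1.
set.seed(42)
demo_data <- data.frame(feature1 = c(rnorm(40, 0, 1), rnorm(60, 4, 1)),
                        feature4 = c(rnorm(50, 0, 1), rnorm(50, 5, 1)))

# Any 2-column embedding works here; PCA is used only to keep the sketch light.
demo_coords <- prcomp(scale(demo_data))$x[, 1:2]

# Logical mask over the rows, built from the embedding's x coordinate.
threshold <- 0  # assumed cut-off; read the real one off your own plot
selected <- demo_coords[, 1] < threshold

# Compare the selected rows with the rest in the original feature space.
summary(demo_data[selected, ])
summary(demo_data[!selected, ])
```

With the objects from this post, the equivalent mask would be tsne1[, 1] < -150, applied to the rows of dataset1. Since t-SNE output changes from run to run, the cut-off has to be read off each new plot.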
But the other day I found a graphical way of doing this. It uses plotly (locally) and ggplot. Let's first build the ggplot object for this:
library(ggplot2)
plotdata <- data.frame(tsne_x = tsne1[, 1], tsne_y = tsne1[, 2])
plt1 <- ggplot(plotdata) + geom_point(aes(x = tsne_x, y = tsne_y))
plot(plt1)
Nothing special here, just the usual static ggplot graph (I won't even include
it here). What would be cool, though, is having a system that lets us explore
the data visually: when the mouse pointer is placed over a given point, its
featureN values are shown on screen. We can do that with plotly and a couple of
extra lines:
library(plotly)
# Let's generate a vector with the info:
hover_text <- apply(dataset1, 1, function(x) {
  n <- names(x)
  t <- paste(n, x, sep = ": ", collapse = "<br>")
  return(t)
})
plotdata <- data.frame(tsne_x = tsne1[, 1], tsne_y = tsne1[, 2],
                       hover_text = hover_text)
plt2 <- ggplot(plotdata) +
  geom_point(aes(x = tsne_x, y = tsne_y, text = hover_text))
ggplotly(plt2)
So, what are we doing here? We are creating a hover_text value for each row of
our dataset (with the help of the apply function) that simply prints the name
of each column in the dataset followed by the actual value for that row,
separated by HTML's <br> tag, which makes plotly present each variable on a new
line. We then build the ggplot object as before, with the addition of the text
aesthetic (ggplot2 does not know this aesthetic, so it produces a warning, but
let's not worry about that) and then, instead of plotting it, we pass it to
plotly's ggplotly function.
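To see what a single hover label looks like, here is a small sketch on a hypothetical two-row data frame (demo is made up for illustration). Rounding the values before pasting is my own addition, assuming shorter labels are wanted; the code above pastes the raw values.

```r
# Hypothetical toy data, only to inspect the label format.
demo <- data.frame(feature1 = c(0.123456, 4.5),
                   feature2 = c(10.98765, 0.2))

# Same idea as hover_text above, but with values rounded to 2 decimals first.
demo_hover <- apply(round(demo, 2), 1, function(x) {
  paste(names(x), x, sep = ": ", collapse = "<br>")
})

demo_hover[1]
# "feature1: 0.12<br>feature2: 10.99"
```

Each element is a single string, so it fits in one column of plotdata, and plotly turns every <br> into a line break in the tooltip.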
And we get something like this:
That's it. We can now explore our data (including the original values before the t-SNE transformation) in a visual way, much faster than having to go back and forth to the initial data.frame.