A lot of clustering / data exploration tutorials out there use the famous iris dataset to show how PCA, t-SNE, MDS and other techniques work. Something like this:

pca1 <- prcomp(iris[, -5])
plot(pca1$x[, 1], pca1$x[, 2], col = iris$Species, pch = 19)

[Figure: simple PCA plot]

This is fine, but it starts from a premise that is not always true: that we know which class each data point belongs to. In many real-life problems, however, the dimensionality reduction step is done precisely because we don't know how the data cluster together (if at all) and we want to check what is really going on.

Consider the following dataset and transformations:

dataset1 <- data.frame(feature1 = c(rnorm(40, 0, 1), rnorm(60, 4, 1)),
                       feature2 = c(rnorm(20, 10, 2), rnorm(20, 0, 2), rnorm(60, 5, 2)),
                       feature3 = rnorm(100, 0, 1),
                       feature4 = c(rnorm(50, 0, 1), rnorm(50, 5, 1)),
                       feature5 = c(rnorm(40, 2, 1), rnorm(60, 6, 2)))

library(tsne)
tsne1 <- tsne(scale(dataset1), perplexity = 10)
plot(tsne1[, 1], tsne1[, 2])

[Figure: t-SNE plot with synthetic data]

We don't know (though we can take an educated guess) how many classes are in this dataset or which features are useful to separate them. So, although we can see some very clear clusters, the only thing we know about them is their t-SNE x and y coordinates. We could now go back to our original dataset, take a subset of the data (say, the points for which the transformation yielded x < -150), and then check what is going on in the original featureN space.
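As a rough sketch of that manual check, the snippet below compares the feature means of the selected points against the rest. The helper `compare_selection`, the `toy_data` frame and the `toy_embedding` matrix are all hypothetical names introduced for illustration, and `prcomp` is used only as a stand-in for t-SNE so that the block runs on its own with base R.

```r
# Hypothetical helper (not part of any package): compare the mean of every
# original feature between the selected points and all the others.
compare_selection <- function(data, selected) {
    stopifnot(is.logical(selected), length(selected) == nrow(data))
    rbind(selected = colMeans(data[selected, , drop = FALSE]),
          rest     = colMeans(data[!selected, , drop = FALSE]))
}

# Toy stand-ins for dataset1 and tsne1, so the example is self-contained.
set.seed(1)
toy_data <- data.frame(feature1 = c(rnorm(40, 0, 1), rnorm(60, 4, 1)),
                       feature4 = c(rnorm(50, 0, 1), rnorm(50, 5, 1)))
# Assumption: a 2-D PCA projection plays the role of the t-SNE coordinates.
toy_embedding <- prcomp(scale(toy_data))$x[, 1:2]

# Select points by their position in the embedding (threshold is arbitrary).
selected <- toy_embedding[, 1] < 0
comparison <- compare_selection(toy_data, selected)
print(round(comparison, 2))
```

With the objects from this post, the equivalent call would be something like `compare_selection(dataset1, tsne1[, 1] < -150)`, where the threshold depends on the particular t-SNE run.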

But the other day I found a graphical way of doing this. It uses plotly (locally) and ggplot2. Let's first build the ggplot object for this:

library(ggplot2)
plotdata <- data.frame(tsne_x = tsne1[, 1], tsne_y = tsne1[, 2])
plt1 <- ggplot(plotdata) + geom_point(aes(x = tsne_x, y = tsne_y))
plot(plt1)

Nothing special here, just the usual static ggplot graph (I won't even include it here). What would be cool, though, is having a system that lets us explore the data visually: when the mouse pointer is placed over a given point, its featureN values are shown on screen. We can do that with plotly and a couple of extra lines:

library(plotly)
# Build one hover label per row: "feature1: value<br>feature2: value<br>..."
hover_text <- apply(dataset1, 1, function(x) {
    n <- names(x)
    # <br> makes plotly show each feature on its own line
    label <- paste(n, x, sep = ": ", collapse = "<br>")
    return(label)
})
plotdata <- data.frame(tsne_x = tsne1[, 1], tsne_y = tsne1[, 2],
                       hover_text = hover_text)
plt2 <- ggplot(plotdata) + 
    geom_point(aes(x = tsne_x, y = tsne_y, text = hover_text))
ggplotly(plt2)

So, what are we doing here? We are creating a hover_text value for each row of our dataset (with the help of the apply function). Each value is simply the name of every column followed by the actual value for that row, separated by HTML's <br> tag, which makes plotly present each variable on a new line. We then build the ggplot object as before, adding the text aesthetic (which will produce a warning, but let's not worry about that), and then, instead of plotting it, we pass it to plotly's ggplotly function.
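To see what these labels look like, here is a minimal sketch on a tiny, made-up data frame (`toy` and `hover_toy` are hypothetical names). It also rounds the values to two decimals, which is my own addition and not part of the code above, since raw `rnorm` output prints with many digits.

```r
# Tiny made-up data frame standing in for dataset1.
toy <- data.frame(feature1 = c(0.5123, 3.9),
                  feature2 = c(10.2468, 4.8))

# Same idea as hover_text, with rounding added (an assumption of this sketch).
hover_toy <- apply(toy, 1, function(row) {
    paste(names(row), round(row, 2), sep = ": ", collapse = "<br>")
})

print(hover_toy)
```

Each element is a single string such as `feature1: 0.51<br>feature2: 10.25`, which plotly renders as two lines in the tooltip.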

And we get something like this:

[Figure: plotly hover text example]

That's it. We can now explore our data (including the original values before the t-SNE transformation) in a visual way, much faster than having to go back and forth to the initial data.frame.
