How to connect R to several different DBs

I spend a lot of time querying different DB systems. If the resulting dataset is big, I query the DB and then dump the result into a CSV file that I then proceed to read. However, in many cases the result is well under several thousand rows; in that case, it might be desirable to run those queries within the script we are working with. In the case of R, which makes exploratory data analysis a breeze, it makes a lot of sense to embed queries into the R Markdown document, so the generated HTML (or PDF, or whatever) report always has the latest data.

With time, I have developed my own set of functions to connect to a bunch of corporate DBs. Some are very easy and straightforward: just load the required R library and you're basically done. Others (Hive, I'm looking at you) are harder, but can be made to work in the end.

This below is a summary of what I use on a daily basis.

SQLite

Probably the easiest one:

library(RSQLite)
mydb <- dbConnect(RSQLite::SQLite(), "/path/to/file.db")
dd <- dbGetQuery(mydb, "SELECT * FROM tablename")
dbDisconnect(mydb)

Ms-SQL (t-SQL)

Also quite easy using its own library in addition to the DBI interface:

library(RSQLServer)
mydb <- dbConnect(RSQLServer::SQLServer(), server = "hostname",
                  properties = list(useNTLMv2 = "false",
                                    user = "username",
                                    password = "password")
)
# dbGetQuery is equivalent to these two lines
req <- dbSendQuery(mydb, "SELECT * FROM TABLENAME")
dd <- fetch(req, -1)
dbDisconnect(mydb)

MySQL

Also, piece if cake, just use the corresponding library:

library(RMySQL)
mydb <- dbConnect(MySQL(), user = "username",
                  password = "password",
                  dbname = "database_name",
                  host = "hostname")
dd <- dbGetQuery(mydb, "SELECT * FROM tablename")
dbDisconnect(mydb)

PostgreSQL

Same as above, with its own package.

library(RPostgreSQL)
mydb <- dbConnect(PostgreSQL(), user = "username",
                  password = "password",
                  dbname = "database_name",
                  host = "hostname")
dd <- dbGetQuery(mydb, "SELECT * FROM tablename")
dbDisconnect(mydb)

Vertica

After trying different things, I got this working by using R's JDBC connector. It's just a bit tricker than any of the DBs above, but not much, once you have your system up and running (yes, rJava, I'm looking at you.)

library(RJDBC)
vDriver <- JDBC(driverClass = "com.vertica.jdbc.Driver",
                classPath = "/path/to/jdbc_driver.jar")
vertica <- dbConnect(vDriver, "jdbc:vertica://hostname:port/dbname",
                     "username", "password")
dd <- dbGetQuery(vertica, "SELECT * FROM tablename")
dbDisconnect(vertica)

You'll need to obtain the proper JAR file for the Vertica version you're using (which you can obtain simply by running SELECT version(); on your server). Please refer to the Vertica docs for more information. Beware: it happened to me that, after loading the wrong version by mistake, R didn't seem to catch up when I tried to load a different JAR. I needed to restart my R session for that.

Hive

I based my solution on this article. Connecting to Hive is similar to Vertica, but a bit trickier, as it needs several additional JAR files in the classpath, and it's not entirely clear to me which ones. What I ended up doing, which was a very brute force approach, was to piggyback on my DBeaver¹ installation, as I already had a connection set up, so the necessary JAR files had to be there. I know that with this solution I am adding a lot of files I won't need, but this works. Another solution is finding and downloading a JAR file for Hadoop / Hive that contains all the required dependencies.

Additionally, you will need to download hive-jdbc-standalone.jar (for your specific version) from Maven directly.

library(rJava)
library(RJDBC)
# Add all JARs to classpath
cp <- sapply(list.files("/path/to/dbeaver-drivers/maven/maven-central",
                        pattern = ".jar", recursive = TRUE),
             function(f) {
                 file.path("/path/to/dbeaver-drivers/maven/maven-central", f)
})
# Add the extra -standalone JAR.
cp <- c(cp, "/ext/libs/hive-jdbc-standalone.jar")
.jinit(classpath = cp)
# The classpath should be correct, we can proceed.
drv <- JDBC("org.apache.hive.jdbc.HiveDriver",
            "/ext/libs/hive-jdbc-standalone.jar",
            identifier.quote="`")
conn <- dbConnect(drv, "jdbc:hive2://hostname:port/dbname", "user", "password")
dd <- dbGetQuery(conn, "SELECT * FROM TABLE")
dbDisconnect(conn)

DBeaver is a multi-DB IDE, and it's basically my endpoint for all DBs I need to use on a daily basis. ↩

There is no comment system. If you want to tell me something about this article, you can do so via e-mail or Mastodon.