As I have mentioned before, I like running queries directly from R so I can analyze the results right away. One downside is that some queries take a few minutes to finish, so I would be better off saving the result and reading it from disk; but then, whenever I modify the SQL, I also have to replace the previously saved results file. That defeats the purpose of running the entire thing from R.
So I have resorted to something very simple but effective:
library(digest)
get_db_or_cache <- function(query, get_db_function,
                            cachefolder = '/tmp/', verbose = FALSE) {
  hash <- sha1(query)
  filepath <- file.path(cachefolder, sprintf("%s.Rda", hash))
  if (file.exists(filepath)) {
    load(filepath)  # restores the object that was saved as 'res'
    if (verbose) {
      cat(sprintf("Data loaded from %s\n", filepath))
    }
  } else {
    res <- get_db_function(query)
    save(res, file = filepath)
    if (verbose) {
      cat(sprintf("Data saved to %s\n", filepath))
    }
  }
  return(res)
}
In the above example, get_db_function is any of the functions that I mentioned in the article that I linked above. For example:
library(RJDBC)

get_vertica <- function(query) {
  vDriver <- JDBC(driverClass = "com.vertica.jdbc.Driver",
                  classPath = "/path/to/jdbc_driver.jar")
  vertica <- dbConnect(vDriver, "jdbc:vertica://hostname:port/dbname",
                       "username", "password")
  dd <- dbGetQuery(vertica, query)
  dbDisconnect(vertica)
  return(dd)
}
The query runs the first time I use it, and as long as I don't modify it I get the same results back immediately. This assumes that serving a stale cached result is acceptable, which may not be true in all cases. The function can be extended (for instance, to return the cached result only if it's not older than N hours), but you get the idea.
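As a sketch of that last idea, the cache check can compare the file's modification time against a maximum age. The function and argument names below (get_db_or_cache_ttl, max_age_hours) are my own, not part of the original:

```r
library(digest)

get_db_or_cache_ttl <- function(query, get_db_function,
                                cachefolder = '/tmp/', max_age_hours = 24,
                                verbose = FALSE) {
  filepath <- file.path(cachefolder, sprintf("%s.Rda", sha1(query)))
  # Use the cache only if the file exists and is recent enough
  fresh <- file.exists(filepath) &&
    difftime(Sys.time(), file.mtime(filepath), units = "hours") < max_age_hours
  if (fresh) {
    load(filepath)  # restores the object that was saved as 'res'
    if (verbose) cat(sprintf("Data loaded from %s\n", filepath))
  } else {
    res <- get_db_function(query)
    save(res, file = filepath)
    if (verbose) cat(sprintf("Data saved to %s\n", filepath))
  }
  return(res)
}
```

Any function that takes a query string works as get_db_function here, so this drop-in variant behaves exactly like the original until the cached file ages past max_age_hours, at which point the query is re-run and the cache refreshed.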