  • A new Kaggle competition: Truly native? Dato is sponsoring this competition with the noble goal of making native advertising live up to its name. With a dataset of over 300,000 raw HTML files containing text, links, and downloadable images, they also want to give Kagglers a challenge that encourages creativity. Given the HTML of websites served to users of StumbleUpon, your challenge is to identify the paid content disguised as just another internet gem you've stumbled upon.

  • Tufte in R. If you haven't read his books, this is an excellent moment to do so.

  • Turns out you can plot a map of New York using taxi routes.

  • Don't like cars? Me neither. Let's go biking. In Seattle. From the article: Last year I wrote a blog post examining trends in Seattle bicycling and how they relate to weather, daylight, day of the week, and other factors. Here I want to revisit the same data from a different perspective: rather than making assumptions in order to build models that might describe the data, I'll instead wipe the slate clean and ask what information we can extract from the data themselves, without reliance on any model assumptions. In other words, where the previous post examined the data using a supervised machine learning approach for data modeling, this post will examine the data using an unsupervised learning approach for data exploration.

  • Over the weekend I replicated most of the above post using R (the original was written in Python). Check it out!

  • The New Science of Sentencing. From the article: Pennsylvania is on the verge of becoming one of the first states in the country to base criminal sentences not only on what crimes people have been convicted of, but also on whether they are deemed likely to commit additional crimes. As early as next year, judges there could receive statistically derived tools known as risk assessments to help them decide how much prison time — if any — to assign.

  • Hacker News Discussion: The Internet is getting lame. What's next? It's probably that we are getting old.

  • Nobody ever got fired for using Hadoop on a cluster (PDF, technical report).

  • Spatio-temporal techniques for user identification by means of GPS mobility data (academic paper).
