New data analysis competitions

  • Kaggle has published its new March Madness contest. It is split into two competitions: NCAA Men and NCAA Women. The objective is to predict the results of the 2018 tournament using historical data and any external dataset you find relevant, which may make the entire thing very interesting.



  • The Car of the Future Will Sell Your Data. The future is starting to look like a place in which, even though you pay for the service, you are also a product the company sells. That is now taken for granted when you use something free (Google, Facebook), but you never know where we'll end up.

    Automakers have been installing wireless connections in vehicles and collecting data for decades. But the sheer volume of software and sensors in new vehicles, combined with artificial intelligence that can sift through data at ever-quickening speeds, means new services and revenue streams are quickly emerging. The big question for automakers now is whether they can profit off all the driver data they’re capable of collecting without alienating consumers or risking backlash from Washington.

  • Booking Flights: Our Data Flies with Us (via).

    What's in a PNR? Our Passenger Name Records (PNRs) are commonly displayed as six-digit codes, but they are actually data-rich records generated every time we book flights or hotels. In this article we look at what data is included in a PNR and how this data might be used against politically-engaged individuals.

  • The above link allowed me to discover Data Detox, which I think fits very well into this category.

    Welcome to your 8-day data detox! In just half an hour or less per day, you'll be well on your way to a healthier and more in-control digital self. What are you waiting for?


  • Meet the Chinese Finance Giant That’s Secretly an AI Company.

    If you get into a car accident in China in the near future, you'll be able to pull out your smartphone, take a photo, and file an insurance claim with an AI system.

    That system, from Ant Financial, will automatically decide how serious the ding was and process the claim accordingly with an insurer. It shows how the company—which already operates a hugely successful smartphone payments business in China—aims to upend many areas of personal finance using machine learning and AI.

  • 'Automating Inequality': Algorithms In Public Services Often Fail The Most Vulnerable.

    In the fall of 2008, Omega Young got a letter prompting her to recertify for Medicaid.

    But she was unable to make the appointment because she was suffering from ovarian cancer. She called her local Indiana office to say she was in the hospital.

    Her benefits were cut off anyway. The reason: "failure to cooperate."

  • The Geeks Who Put a Stop to Pennsylvania's Partisan Gerrymandering. This presents a nice contrast with the news piece immediately above. There are good social uses for algos. You just need to find a proper human being behind the screen.

    Carnegie Mellon mathematician Wes Pegden had already written an academic paper proving that the Pennsylvania map was drawn with partisan intent. His challenge in the courtroom was to convince a room full of non-mathematicians. So he came armed with an analogy.

    Imagine, Pegden told the court, you’ve touched down in a new city and asked your taxi driver to drop you at any restaurant, something that would give you a sense of the local culinary scene. You give the cabbie a fat tip, go inside the restaurant, and have a terrible meal. Did the driver bring you to a bad restaurant on purpose? Or is it a true reflection of all of the restaurants in the city?

    To answer that question, you could always sample every single restaurant, but that would take too long. A more efficient, but still effective option: test every restaurant immediately surrounding the bad one. If they're all bad, the driver really did pick a representative dining establishment. If they’re all really good? The driver screwed you over.

  • The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. PDF of the report here. The summary states:

    Artificial intelligence and machine learning capabilities are growing at an unprecedented rate. These technologies have many widely beneficial applications, ranging from machine translation to medical image analysis. Countless more such applications are being developed and can be expected over the long term. Less attention has historically been paid to the ways in which artificial intelligence can be used maliciously. This report surveys the landscape of potential security threats from malicious uses of artificial intelligence technologies, and proposes ways to better forecast, prevent, and mitigate these threats. We analyze, but do not conclusively resolve, the question of what the long-term equilibrium between attackers and defenders will be. We focus instead on what sorts of attacks we are likely to see soon if adequate defenses are not developed.

  • Parkland Conspiracies Overwhelm the Internet's Broken Trending Tools. Viewed very coldly, these algorithms are doing their job, which is to detect what is trending so more people can go and click, creating a positive reinforcement loop that lasts until some recency metric decides the topic is too old and we can move on to the next cool thing. However, this paragraph is misleading:

    YouTube, Facebook, and Twitter all have a section designed to surface the most newsworthy, relevant information in the midst of a vast sea of content.

    These videos are neither newsworthy nor relevant. They're just the most immediate, mob-attracting attention-grabbers.

  • After years of testing, The Wall Street Journal has built a paywall that bends to the individual reader. Hacker News discussion here.

    Non-subscribed visitors now each receive a propensity score based on more than 60 signals, such as whether the reader is visiting for the first time, the operating system they’re using, the device they’re reading on, what they chose to click on, and their location (plus a whole host of other demographic info it infers from that location). Using machine learning to inform a more flexible paywall takes away guesswork around how many stories, or what kinds of stories, to let readers read for free, and whether readers will respond to hitting paywall by paying for access or simply leaving.
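
As a side note, the neighborhood test in Pegden's taxi analogy from the gerrymandering item can be sketched in a few lines of Python. This is only a toy illustration of the idea as the quote describes it, not the method from his paper: the grid of restaurant scores, the function name `neighbor_outlier_fraction`, and the `radius` parameter are all assumptions made up for this example.

```python
def neighbor_outlier_fraction(quality, row, col, radius=1):
    """Return the fraction of nearby restaurants that score strictly
    better than the one the driver picked at (row, col).

    `quality` is a hypothetical grid (list of lists) of restaurant scores.
    A value near 1.0 suggests the pick is unusually bad for its surroundings;
    a value near 0.0 suggests it is representative (or better).
    """
    rows, cols = len(quality), len(quality[0])
    picked = quality[row][col]

    # Collect the scores of every restaurant within `radius` cells,
    # excluding the picked restaurant itself.
    neighbors = [
        quality[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
        if (r, c) != (row, col)
    ]
    if not neighbors:
        raise ValueError("the picked restaurant has no neighbors to compare with")

    better = sum(1 for score in neighbors if score > picked)
    return better / len(neighbors)


# A city of good restaurants with one bad pick in the middle.
good_city = [
    [8, 8, 8],
    [8, 2, 8],
    [8, 8, 8],
]

# A city where every restaurant is equally bad.
bad_city = [
    [2, 2, 2],
    [2, 2, 2],
    [2, 2, 2],
]

print(neighbor_outlier_fraction(good_city, 1, 1))
print(neighbor_outlier_fraction(bad_city, 1, 1))
```

In the first city every neighbor beats the pick, which is the "driver screwed you over" case; in the second none does, so the bad meal is a fair sample. The point of the analogy is that you only need to look around the suspicious pick, not survey the whole city.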


Data Links is a periodic blog post, published on Sundays (the specific time may vary), that collects interesting links about data science, machine learning, and related topics. You can subscribe to it using the general blog RSS feed or this one, which only contains these articles, if you are not interested in other things I might publish.

Have you read an article you liked and would you like to suggest it for the next issue? Just contact me!