New data analysis competitions


  • A web scraping tutorial using rvest.

    The purpose of this tutorial is to show a concrete example of how web scraping can be used to build a dataset purely from an external, non-preformatted source of data.
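
    For a flavor of what the tutorial does (the tutorial itself uses R's rvest; the sketch below is an analogous, hypothetical example in Python with requests and BeautifulSoup, where the URL and selectors are placeholders, not taken from the tutorial):

    ```python
    # Hypothetical Python analogue of an rvest scrape: fetch a page, pull out an
    # HTML table, and persist it as a CSV dataset. URL and selectors are placeholders.
    import csv

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/films"  # placeholder page containing an HTML table

    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    rows = []
    for tr in soup.select("table tr"):  # every row of every table on the page
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)

    # The scraped rows become a dataset you can load into pandas, R, etc.
    with open("films.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)
    ```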


  • The Sublime and Scary Future of Cameras With A.I. Brains.

    There’s a new generation of cameras that understand what they see. They’re eyes connected to brains, machines that no longer just see what you put in front of them, but can act on it — creating intriguing and sometimes eerie possibilities.

    At first, these cameras will promise to let us take better pictures, to capture moments that might not have been possible with every dumb camera that came before. That’s the pitch Google is making with Clips, a new camera that went on sale on Tuesday. It uses so-called machine learning to automatically take snapshots of people, pets and other things it finds interesting.

    Others are using artificial intelligence to make cameras more useful. You’ve heard how Apple’s newest iPhone uses face recognition to unlock your phone. A start-up called Lighthouse AI wants to do something similar for your home, using a security camera that adds a layer of visual intelligence to the images it sees. When you mount its camera in your entryway, it can constantly analyze the scene, alerting you if your dog walker doesn’t show up, or if your kids aren’t home by a certain time after school.

  • China using big data to detain people before crime is committed: report. Hacker News discussion here.

    Called the Integrated Joint Operations Platform, or IJOP, it assembles and parses data from facial-recognition cameras, WiFi internet sniffers, licence-plate cameras, police checkpoints, banking records and police reports made on mobile apps from home visits, a new report from Human Rights Watch finds.

    If the system flags anything suspicious – a large purchase of fertilizer, perhaps, or stockpiles of food considered a marker of terrorism – it notifies police, who are expected to respond the same day and act according to what they find. "Who ought to be taken, should be taken," says a work report located by the rights organization.

  • The City That Remembers Everything (via). This article is a very good summary of the different surveillance technologies intruding on our daily lives. But, you know, terrorists!

    As the city becomes a forensic tool for recording its residents, an obvious question looms: How might people opt out of the smart city? What does privacy even mean, for example, when body temperature is now subject to capture at thermal screening stations, when whispered conversations can be isolated by audio algorithms, or even when the unique seismic imprint of a gait can reveal who has just entered a room? Does the modern city need a privacy bill of rights for shielding people, and their data, from ubiquitous capture?

  • Boffins baffled as AI training leaks secrets to canny thieves. This is the paper referenced in the article. From the abstract:

    This paper presents exposure: a simple-to-compute metric that can be applied to any deep learning model for measuring the memorization of secrets. Using this metric, we show how to extract those secrets efficiently using black-box API access. Further, we show that unintended memorization occurs early, is not due to over-fitting, and is a persistent issue across different types of models, hyperparameters, and training strategies. We experiment with both real-world models (e.g., a state-of-the-art translation model) and datasets (e.g., the Enron email dataset, which contains users' credit card numbers) to demonstrate both the utility of measuring exposure and the ability to extract secrets.
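
    The metric itself is simple enough to sketch. In the paper, exposure is defined as log2 |R| minus log2 of the rank of the true secret among all candidate secrets in the randomness space R, ordered by the model's log-perplexity. A toy Python illustration, with made-up perplexities standing in for queries to a real trained model:

    ```python
    # Toy sketch of the paper's exposure metric. A fully memorized secret ranks
    # first among all candidates, so its exposure hits the maximum, log2(|R|).
    import math

    def exposure(secret, candidates, log_perplexity):
        """candidates: every value the secret could have been (the space R)."""
        ranked = sorted(candidates, key=log_perplexity)  # most likely first
        rank = ranked.index(secret) + 1                  # 1-based rank of the true secret
        return math.log2(len(candidates)) - math.log2(rank)

    # Made-up log-perplexities; a real test would query the model's API instead.
    fake_scores = {"1234": 0.5, "9999": 3.2, "0000": 4.1, "4321": 5.0}
    print(exposure("1234", list(fake_scores), fake_scores.get))  # 2.0 == log2(4)
    ```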


  • An AI just beat top lawyers at their own game. Hacker News discussion here. As with all results coming from the company that develops the product in question, take these with a grain of salt.

    The nation's top lawyers recently battled artificial intelligence in a competition to interpret contracts — and they lost.

    A new study, conducted by legal AI platform LawGeex in consultation with law professors from Stanford University, Duke University School of Law, and University of Southern California, pitted twenty experienced lawyers against an AI trained to evaluate legal contracts.

    Competitors were given four hours to review five non-disclosure agreements (NDAs) and identify 30 legal issues, including arbitration, confidentiality of relationship, and indemnification. They were scored by how accurately they identified each issue.

    Unfortunately for humanity, we lost the competition — badly.

  • Very convenient, right after the last link. Replacing Judges with Computers Is Risky.

    Technology cannot replace the depth of judicial knowledge, experience, and expertise in law enforcement that prosecutors and defendants’ attorneys possess. Complete evaluation and determination of whether to hold or release an accused defendant on bail for any particular defendant accused of any specific crime requires every bit of these combined skills.

    Remember: no two cases — no two defendants, victims or pattern of facts — are alike. Many different defendants may be charged with the same penal code violation, but each crime and circumstance is unique. Each individual and case is unique. Each requires human judgment and the vital and very natural emotion of empathy—two things artificial intelligence systems cannot provide. The California Judicial Council has recommended a laboratory approach destined to fail precisely because it cannot take human responses into account.

  • A Data Scientist Was Sick of Seeing Spam on His Facebook so He Built a Fake News Detector.

    Tired of seeing his friends and family sharing questionable content on his Facebook feed, data scientist Zach Estela decided to take action. He built a tool that scans a website’s most recent 100 posts and analyzes them to determine whether the site is fake news, heavily biased, or a legit news source.
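
    The article doesn't describe Estela's model in any detail, but a bare-bones sketch of how such a classifier might work, with made-up training texts and toy labels, could look like this:

    ```python
    # Illustrative only: a tiny TF-IDF + logistic regression text classifier.
    # A real detector would train on thousands of labeled articles, not two.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_texts = [
        "shocking secret the elites don't want you to know",
        "senate passes budget bill after lengthy debate",
    ]
    train_labels = ["fake", "legit"]  # toy label set: fake / biased / legit

    clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
    clf.fit(train_texts, train_labels)

    # Classify a site's recent posts; a majority vote over ~100 posts would
    # then label the site as a whole.
    recent_posts = ["you won't believe this one weird miracle cure"]
    print(clf.predict(recent_posts))
    ```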

  • UK's New 'Extremist Content' Filter Will Probably Just End Up Clogged With Innocuous Content. A true anti-extremist-content system would start flagging Margaret Thatcher's speeches right away, wouldn't it?

    Is such an amazing tool really that amazing? It depends on who you ask. The UK government says it's so great it may not even need to mandate its use. The developers also think their baby is pretty damn cute. But what does "94% blocking with 99.995% accuracy" actually mean when scaled? Well, The Register did some math and noticed it adds up to a whole lot of false positives.
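
    The base-rate arithmetic behind "a whole lot of false positives" is easy to reproduce. A back-of-the-envelope sketch, with a purely illustrative scanning volume rather than The Register's actual figures:

    ```python
    # Even a 0.005% false-positive rate snowballs at platform scale.
    # The daily volume below is an assumption for illustration only.
    accuracy = 0.99995                   # the claimed "99.995% accuracy"
    false_positive_rate = 1 - accuracy   # 0.005% of innocuous items get flagged

    daily_items = 10_000_000             # hypothetical items scanned per day
    fp_per_day = daily_items * false_positive_rate

    print(f"{fp_per_day:,.0f} innocuous items flagged per day")  # 500
    print(f"{fp_per_day * 365:,.0f} flagged per year")           # 182,500
    ```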

    [...] And if it's a system, it will be gamed. Terrorists will figure out how to sneak stuff past the filters while innocent users pay the price for algorithmic proxy censorship. Savvy non-terrorist users will also game the system, flagging content they don't like as questionable, possibly resulting in even more non-extremist content being removed from platforms.

  • Continuing from the last quoted paragraph: gaming a system. We already knew you could make a killing on Amazon e-books by fooling the algo. In Spain, the main authors' rights management organization (SGAE) was caught piggy-backing on late-night TV. The same scheme is now working on Spotify (via).

  • Palantir has secretly been using New Orleans to test its predictive policing technology. I would say this is parallel to what's happening in China right now, except parallel lines never cross, and these eventually will. Wait for it.

    Predictive policing technology has proven highly controversial wherever it is implemented, but in New Orleans, the program escaped public notice, partly because Palantir established it as a philanthropic relationship with the city through Mayor Mitch Landrieu’s signature NOLA For Life program. Thanks to its philanthropic status, as well as New Orleans’ “strong mayor” model of government, the agreement never passed through a public procurement process.

    In fact, key city council members and attorneys contacted by The Verge had no idea that the city had any sort of relationship with Palantir, nor were they aware that Palantir used its program in New Orleans to market its services to another law enforcement agency for a multimillion-dollar contract.


Data Links is a periodic blog post, published on Sundays (specific time may vary), containing interesting links about data science, machine learning and related topics. You can subscribe to it using the general blog RSS feed, or this dedicated one, which only contains these posts, in case you are not interested in other things I might publish.

Have you read an article you liked and would like to suggest it for the next issue? Just contact me!