New data analysis competitions


  • Another week, another AWS bucket left unattended. This one was big, though: Defense Department Spied On Social Media, Left All Its Collected Data Exposed To Anyone. Original investigation.

    The UpGuard Cyber Risk Team can now disclose that three publicly downloadable cloud-based storage servers exposed a massive amount of data collected in apparent Department of Defense intelligence-gathering operations. The repositories appear to contain billions of public internet posts and news commentary scraped from the writings of many individuals from a broad array of countries, including the United States, by CENTCOM and PACOM, two Pentagon unified combatant commands charged with US military operations across the Middle East, Asia, and the South Pacific.


  • New tool quantifies power imbalance between female and male characters in Hollywood movie scripts.

    The [University of Washington] team used machine-learning-based tools to analyze the language in nearly 800 movie scripts, quantifying how much power and agency those scripts give to individual characters. In their study, recently presented in Denmark at the 2017 Conference on Empirical Methods in Natural Language Processing, the researchers found subtle but widespread gender bias in the way male and female characters are portrayed.

  • This one has been controversial this week. Original: The Ivory Tower Can’t Keep Ignoring Tech. A reply: Cathy O’Neil Sleepwalks into Punditry. The original point:

    These days, big data, artificial intelligence and the tech platforms that put them to work have huge influence and power. Algorithms choose the information we see when we go online, the jobs we get, the colleges to which we’re admitted and the credit cards and insurance we are issued. It goes without saying that when computers are making decisions, a lot can go wrong.

    Our lawmakers desperately need this explained to them in an unbiased way so they can appropriately regulate, and tech companies need to be held accountable for their influence over all elements of our lives. But academics have been asleep at the wheel, leaving the responsibility for this education to well-paid lobbyists and employees who’ve abandoned the academy.

  • The Brutal Fight to Mine Your Data and Sell It to Your Boss. Plus Hacker News discussion.

    On May 23, an email landed in the sales inbox of a San Francisco startup called HiQ Labs, politely asking the company to go out of business. HiQ is a “people analytics” firm that creates software tools for corporate human resources departments. Its Skill Mapper graphically represents the credentials and abilities of a workforce; its Keeper service identifies when employees are at risk of leaving for another job. Both draw the overwhelming majority of their data from a single trove: the material that is posted—with varying degrees of timeliness, detail, accuracy, and self-awareness—by the 500 million people on the social networking site LinkedIn.

  • Meet The Spreadsheet That Can Solve NYC Transit (and the Man Who Made It). True Data Science, but not sexy.

    The spreadsheet is called the Balanced Transportation Analyzer, or BTA. It has 72 separate worksheets, many of which contain over a thousand rows and dozens of columns. Komanoff made the spreadsheet for a single purpose: to be the most comprehensive accounting possible of how a congestion charge in Manhattan would affect New York City.


Data Links is a periodic blog post, published on Sundays (the specific time may vary), that collects interesting links about data science, machine learning and related topics. You can subscribe through the general blog RSS feed or, if you are not interested in other things I might publish, through this one, which contains only these articles.

Have you read an article you liked that you would like to suggest for the next issue? Just contact me!