New data analysis competitions


This post is part of a series covering the exercises from Andrew Ng's machine learning class on Coursera. The original code, exercise text, and data files for this post are available here.


In many ways, the migration of clinical scientists into technology corporations that are focused on gathering, analysing and storing information is long overdue. Because of the costs and difficulties of obtaining data about health and disease, scientists conducting clinical or population studies have rarely been able to track sufficient numbers of patients closely enough to make anything other than coarse predictions. Given such limitations, who wouldn't want access to Internet-scale, multidimensional health data; teams of engineers who can build sensors for data collection and algorithms for analysis; and the resources to conduct projects at scales and speeds unthinkable in the public sector?

Yet there is a major downside to monoliths such as Google or smaller companies such as consumer-genetics firm 23andMe owning health data — or indeed, controlling the tools and methods used to match people's digital health profiles to specific services.

The angrier, more negative tweets from Donald Trump's Twitter account are mostly written by the presidential candidate himself, while campaign staffers are responsible for the calmer announcements and pictures, according to an analysis by a data scientist.

Using Instagram data from 166 individuals, we applied machine learning tools to successfully identify markers of depression. Statistical features were computationally extracted from 43,950 participant Instagram photos, using color analysis, metadata components, and algorithmic face detection. Resulting models outperformed general practitioners' average diagnostic success rate for depression. These results held even when the analysis was restricted to posts made before depressed individuals were first diagnosed. Photos posted by depressed individuals were more likely to be bluer, grayer, and darker. Human ratings of photo attributes (happy, sad, etc.) were weaker predictors of depression, and were uncorrelated with computationally-generated features. These findings suggest new avenues for early screening and detection of mental illness.
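As a rough illustration of what "color analysis" could mean here, the sketch below averages hue, saturation, and brightness over a photo's pixels. The abstract does not say which color space or features the authors used, so the HSV choice and the function name `mean_hsv` are assumptions made for illustration. Under this reading, "bluer" would correspond to a hue near 2/3, "grayer" to lower saturation, and "darker" to lower brightness.

```python
import colorsys


def mean_hsv(pixels):
    """Return the average (hue, saturation, value) of an iterable of RGB pixels.

    Each pixel is an (r, g, b) tuple with channels in the range 0-255.
    All three returned values lie in [0, 1].

    Assumption: hue is averaged arithmetically. Hue is circular (0 and 1 are
    both red), so this simple mean can mislead on photos with a lot of red.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("need at least one pixel")

    h_sum = s_sum = v_sum = 0.0
    for r, g, b in pixels:
        # colorsys expects channels scaled to [0, 1]
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        h_sum += h
        s_sum += s
        v_sum += v

    n = len(pixels)
    return h_sum / n, s_sum / n, v_sum / n


if __name__ == "__main__":
    # Two hypothetical one-pixel "photos": saturated blue and dark gray.
    print(mean_hsv([(0, 0, 255)]))
    print(mean_hsv([(64, 64, 64)]))
```

In practice the pixel tuples would come from an image library, and per-photo averages like these would be only one group of inputs alongside the metadata and face-detection features the abstract mentions.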