TL;DR: I did a very simple analysis on Spotify data to try to find the saddest and happiest songs out there. I created playlists with the results: saddest songs [Spotify, YouTube], happiest songs [Spotify, YouTube]. All the code used for this project is available here.
Earlier this year I attended McHacks with a friend. We had never participated in a hackathon before and were curious to see what the whole thing was about. Though we went there with our own project (a sentiment analysis system for tweets about refugees that never worked), we ended up partnering with a team that had another idea in mind that involved a lot of different fields: retrieve a copy of several sad and happy songs, extract their features using signal processing techniques and build a classifier able to predict which songs belong to which mood.
We never completed it, but we had made good progress by the time we had to stop (it didn't help that we decided to go home to sleep and that I, all my fault, decided to get a proper brunch the following morning before returning to coding). We were using Spotify's API through spotipy to look for playlists that contained our key terms (sad and happy) and then, using the wonderful youtube-dl library for Python, downloading the most frequently tagged songs from YouTube in MP3 format. That much we managed to accomplish. We ran out of time while trying to figure out how to properly use librosa for the feature extraction step, and we never started coding the classifier.
This gave me an idea for a small side project aimed at answering the following question: what are the saddest and happiest songs out there? Surely there is enough data in Spotify to give it a try. So this is the idea: get the songs included in all playlists that match the relevant search terms and related synonyms (for happy, for sad) and then check which songs appear the most.
For that, as during the hackathon, I used Python with the spotipy library and stored the results in SQLite databases. I took care not to include data from any playlist twice. For the final data analysis I went to R, as I normally do. All the code used for this project can be found in this GitHub repository.
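The "no playlist twice" rule can be sketched with Python's built-in sqlite3 module. This is only an illustration under assumptions, not the code from the repository: the column names `term`, `uri`, `artist` and `title` come from the queries shown in this post, but treating `uri` as the playlist's URI, the `open_db` and `store_playlist` helpers, and the shape of the `tracks` list are all made up here. In the real script the track list would come from spotipy rather than being typed in by hand.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the songs table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS songs "
        "(term TEXT, uri TEXT, artist TEXT, title TEXT)"
    )
    return conn


def store_playlist(conn, term, playlist_uri, tracks):
    """Store one row per track; return False if the playlist was already stored."""
    # Assumption: uri holds the playlist URI, so an existing row means "seen".
    seen = conn.execute(
        "SELECT 1 FROM songs WHERE uri = ? LIMIT 1", (playlist_uri,)
    ).fetchone()
    if seen is not None:
        return False
    with conn:  # commit all rows of the playlist together
        conn.executemany(
            "INSERT INTO songs (term, uri, artist, title) VALUES (?, ?, ?, ?)",
            [(term, playlist_uri, t["artist"], t["title"]) for t in tracks],
        )
    return True


conn = open_db()
tracks = [{"artist": "Adele", "title": "Hello"}]  # made-up playlist contents
print(store_playlist(conn, "sad", "spotify:playlist:a", tracks))     # stored
print(store_playlist(conn, "gloomy", "spotify:playlist:a", tracks))  # skipped
```

A playlist that matches two search terms (say, sad and gloomy) is therefore only counted under the first term that finds it.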
For the sad songs, I used the following terms:
sqlite> SELECT DISTINCT(term) FROM songs;
term
----------
sad
low
down
bitter
dismal
heartbroke
sorrow
somber
sorry
gloomy
glum
grieve
hurt
troubled
weep
I retrieved a total of 1410998 entries (327986 different songs) from 11330 playlists. I specifically didn't include the sad term blue because it is too closely tied to a music genre.
Same thing for the happy songs:
term
----------
happy
cheerful
delighted
ecstatic
elated
glad
joy
jubilant
thrilled
upbeat
sunny
blissful
content
In this case, I got 1483833 entries (358937 different songs) from 8931 playlists.
One idea that comes directly to mind is to compute how many times a given song appears in each set of playlists and use that as an indicator. Let's see how that would work. I loaded the data into R with the RSQLite library, using the following query:
SELECT (SELECT COUNT(DISTINCT(uri)) FROM songs) AS total, COUNT(*) AS count, artist, title FROM songs GROUP BY artist, title
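For readers who would rather poke at this query outside R, here is a small Python sketch that runs the same SQL against a toy in-memory table. The three rows are invented, and the post does not spell out what `uri` holds; the sketch assumes it identifies the playlist, which would make `total` the number of playlists.

```python
import sqlite3

# The query from the post, unchanged apart from line breaks.
QUERY = """
SELECT (SELECT COUNT(DISTINCT(uri)) FROM songs) AS total,
       COUNT(*) AS count, artist, title
FROM songs
GROUP BY artist, title
"""


def song_counts(conn):
    """Return (total, count, artist, title) rows, most frequent song first."""
    rows = conn.execute(QUERY).fetchall()
    return sorted(rows, key=lambda row: -row[1])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (term TEXT, uri TEXT, artist TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO songs VALUES (?, ?, ?, ?)",
    [  # invented rows: two playlists, one song shared by both
        ("sad", "playlist:1", "Adele", "Hello"),
        ("gloomy", "playlist:2", "Adele", "Hello"),
        ("sad", "playlist:2", "Sia", "Breathe Me"),
    ],
)

for total, count, artist, title in song_counts(conn):
    print(total, count, artist, title)
```

Every row carries the same `total`, so each song's count can later be turned into a proportion without running a second query.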
So, let's see the top 5 sad songs:
head(sad[order(-sad$count), c("artist", "title", "count")], 5)
       artist             title          count
7382   Adele              Hello          2236
147256 Justin Bieber      Love Yourself  1969
147296 Justin Bieber      Sorry          1609
4336   A Great Big World  Say Something  1479
85452  Ed Sheeran         Photograph     1395
What about the happy ones?
       artist         title               count
161801 Justin Bieber  Sorry               1815
161761 Justin Bieber  Love Yourself       1537
161820 Justin Bieber  What Do You Mean?   1435
323564 The Weeknd     Can't Feel My Face  1392
343933 WALK THE MOON  Shut Up and Dance   1334
You can immediately see there is something very fishy here. Justin Bieber's Love Yourself and Sorry are, simultaneously, among the happiest and saddest songs out there. There are two issues here: 1) people don't always tag in a very coherent way; 2) more importantly, very popular songs may pop up in our datasets just because they are popular. We cannot do anything to correct for the first problem, but the second one can be solved by getting another dataset: a control one, with playlists obtained with neutral terms that have nothing to do with the moods we are trying to isolate. I downloaded such a control dataset using these terms:
term
----------
music
song
top
hits
favourite
like
best
With that I downloaded 12311154 entries (1602312 different songs) from 50102 playlists with terms as dull as the ones you just read; my script crashed before retrieving all available playlists, but I chose not to restart it, as I had already gathered more than enough data. I will use this database to control for the popularity of a given song.
What I did next was the following, and I know this is a very rough approximation, but I couldn't think of a better way of doing it: I modeled the data as a binomial distribution, with the number of different playlists as the number of trials and the number of times each song is included as the number of successes. I used R's prop.test function for this, as binom.test was too slow and my sample was big. I computed the upper bound of the 95% confidence interval for the control and the lower bound for each mood, and I then divided the mood bound by the control bound. With that, I obtain a conservative estimate of how much more often a given song is tagged per mood compared to the general population provided by the control playlists.
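To make the ratio concrete, here is a rough Python sketch of the same idea. It is an approximation of the R analysis, not a port of it: it uses the plain Wilson score interval, whereas R's prop.test applies a continuity correction by default, so the numbers would differ slightly. The function names and the hit counts (50 and 20) are invented for illustration; only the playlist totals (11330 sad, 50102 control) come from this post.

```python
import math

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wilson_interval(successes, trials, z=Z_95):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


def conservative_ratio(mood_hits, mood_playlists, control_hits, control_playlists):
    """Lower bound of the mood proportion over upper bound of the control one."""
    mood_low, _ = wilson_interval(mood_hits, mood_playlists)
    _, control_high = wilson_interval(control_hits, control_playlists)
    return mood_low / control_high


# Invented example: a song found in 50 of the 11330 sad playlists
# and in 20 of the 50102 control playlists.
score = conservative_ratio(50, 11330, 20, 50102)
naive = (50 / 11330) / (20 / 50102)
print(f"conservative: {score:.2f}, naive: {naive:.2f}")
```

Pairing the pessimistic bound for the mood with the optimistic bound for the control pulls the score below the naive ratio of proportions, and the gap is widest for songs with few observations, which is what keeps rarely tagged songs from dominating the ranking by chance.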
As a precaution, and to avoid very spurious results, I removed the songs that were tagged fewer than 10 times. If this threshold is modified, the final classification will change, and more popular songs will show up, so I've generated two different rankings: one with the previous threshold, and an extra-popular one with a threshold of 200. Here's the first one, from sad to happy:
artitle  orig_mood  label
Kim Taylor - Build You Up  11.980122  sad
Seahaven - Honeybee  10.572491  sad
The Cinematic Orchestra - To Build A Home (feat. Patrick Watson)  9.122422  sad
The Story So Far - Navy Blue  8.523226  sad
Michael Schulte - You Said You'd Grow Old With Me  8.318033  sad
Andrew Belle - In My Veins - Feat. Erin Mccarley  8.186986  sad
Ingrid Michaelson - Over You  8.035577  sad
Mikelwj - Please Don't Cut  7.478537  sad
Birdy - Not About Angels  7.063241  sad
Jamestown Story - Goodbye I'm Sorry  6.991294  sad
Andrew Belle - Make It Without You  6.862819  sad
Daughter - Medicine  6.614898  sad
Flatsound - Don't Call Me At All  6.482429  sad
Keaton Henson - You Don't Know How Lucky You Are  6.168237  sad
Real Friends - Hebron  6.114325  sad
Katy McAllister - Another Empty Bottle  6.103101  sad
Real Friends - I've Given Up on You  5.930843  sad
Hotel Books - Nicole  5.927413  sad
Sia - Breathe Me  5.876277  sad
Sew Intricate - If You Knew  5.841148  sad
All Time Low - Lullabies  5.833872  sad
Kodaline - Love Like This - Acoustic  5.828360  sad
Keaton Henson - Flesh And Bone  5.693119  sad
Britt Nicole - When She Cries  5.564704  sad
Citizen - The Night I Drove Alone  5.491908  sad
Ingrid Michaelson - Be OK  2.848528  happy
CHAPPO - Come Home  2.861424  happy
Yuna - Rescue  2.885926  happy
The Griswolds - Beware the Dog  2.888610  happy
Rusted Root - Send Me On My Way  2.921322  happy
JR JR - Gone  2.926596  happy
Oh Honey - Be Okay  2.946616  happy
BØRNS - Seeing Stars  2.961350  happy
Perrin Lamb - Little Bit  2.967613  happy
Colbie Caillat - Brighter Than The Sun  2.972332  happy
Ray LaMontagne - You Are the Best Thing  3.061297  happy
Jimmy Cliff - Wonderful World, Beautiful People - Single Version  3.067851  happy
Hunter Hunted - Lucky Day  3.129721  happy
Wild Cub - Thunder Clatter  3.142382  happy
Morningsiders - Empress  3.173426  happy
Vinyl Pinups - Gold Rays  3.184869  happy
Sharon Jones & The Dap-Kings - I Just Dropped In To See What Condition My Condition Is In  3.248935  happy
Jamie Lidell - Another Day  3.257592  happy
Twin Forks - Back To You  3.268261  happy
Brett Dennen - Comeback Kid (That's My Dog)  3.540849  happy
Oh, Hush! - Happy Place (feat. Hanna Ashbrook)  3.681280  happy
The Well Pennies - Drive  3.900415  happy
Michael Franti & Spearhead - The Sound Of Sunshine  4.233915  happy
Michael Franti & Spearhead - I’m Alive (Life Sounds Like)  4.411475  happy
Ezra Vine - Celeste  4.664254  happy
The orig_mood column tells how many times more often the song was tagged with the given mood than with the control keywords. For instance, Kim Taylor's Build You Up is tagged as sad almost 12 times more often than it is tagged with the control keywords. The more popular list looks like this:
artitle  orig_mood  label
Birdy - Not About Angels  7.063241  sad
Sia - Breathe Me  5.876277  sad
All Time Low - Therapy  5.317348  sad
Tom Odell - Heal  5.287393  sad
Ron Pope - A Drop In The Ocean  4.640795  sad
All Time Low - Remembering Sunday  4.460468  sad
A Fine Frenzy - Almost Lover  4.407335  sad
The Cinematic Orchestra - To Build A Home  4.395410  sad
Kodaline - All I Want  4.284281  sad
Mayday Parade - Miserable At Best  4.254062  sad
Birdy - Tee Shirt  4.247338  sad
The Lumineers - Slow It Down  4.232987  sad
Alex & Sierra - Little Do You Know  4.193428  sad
Mayday Parade - Terrible Things  4.065268  sad
Damien Rice - 9 Crimes  3.877873  sad
Daughter - Youth  3.858343  sad
Justin Bieber - Nothing Like Us - Bonus Track  3.850651  sad
Maroon 5 - Sad  3.740168  sad
Sam Smith - Not In That Way  3.680314  sad
Iron & Wine - Flightless Bird, American Mouth  3.578837  sad
Amber Run - I Found  3.570509  sad
Haley Reinhart - Can't Help Falling in Love  3.514237  sad
Bastille - Oblivion  3.504239  sad
Jaymes Young - I'll Be Good  3.497401  sad
Birdy - Skinny Love  3.437574  sad
Stevie Wonder - Signed, Sealed, Delivered (I'm Yours)  2.040985  happy
Jack Johnson - Banana Pancakes  2.135602  happy
Bobby McFerrin - Don't Worry Be Happy  2.178093  happy
Vampire Weekend - Unbelievers  2.207791  happy
Passion Pit - Carried Away  2.212609  happy
Daryl Hall & John Oates - You Make My Dreams  2.224245  happy
Bleachers - I Wanna Get Better  2.260334  happy
Fitz and The Tantrums - The Walker  2.270052  happy
Daryl Hall & John Oates - You Make My Dreams - Remastered  2.283607  happy
Noah And The Whale - 5 Years Time  2.292561  happy
Grouplove - Tongue Tied  2.338586  happy
MisterWives - Our Own House  2.359122  happy
Matt and Kim - Daylight  2.386265  happy
Saint Motel - My Type  2.391831  happy
Grouplove - Ways To Go  2.433496  happy
Edward Sharpe & The Magnetic Zeros - Home  2.493877  happy
MisterWives - Reflections  2.530433  happy
NONONO - Pumpin Blood  2.587748  happy
The Mowgli's - San Francisco  2.685225  happy
BØRNS - Electric Love  2.718853  happy
Corinne Bailey Rae - Put Your Records On  2.732802  happy
The Mowgli's - I'm Good  2.747544  happy
Paul Simon - Me and Julio Down by the Schoolyard  2.796451  happy
Rusted Root - Send Me On My Way  2.921322  happy
Colbie Caillat - Brighter Than The Sun  2.972332  happy
I've experimented with ways of visualizing these results, but I couldn't come up with anything that looked good; I've left some of my experiments in the R analysis file stored in the code repository, if you want to check what I tried.