Top 100 Song Predictor

A machine learning project that predicts which songs from 2014 made it into the Spotify Top 100

Ally Cody, Kevin Jin, Jaiveer Kothari, and Jeanette Pranin

EECS 349 ‐ Machine Learning ‐ Northwestern University


Music has always had a profound influence on the way the world is shaped. It offers insight into the styles, trends, and even the topical concerns of its time. It is often ordered and pleasant to listen to, and people from many modern cultures use music as a creative outlet to express themselves. Of course, some types of music and some songs are more popular than others. There is no easy way to predict how popular a song will be, because popularity often depends on subjective qualities such as how emotional or catchy the song is. Nevertheless, we were curious to see whether we could use machine learning techniques to detect correlations between musical features and popularity and ultimately predict which songs will be popular.

To narrow this lofty goal, we decided to predict specifically whether a song released in 2014 would make it into the Spotify Top 100 2014, a playlist created by Spotify that collects the 100 most popular songs from the service over the past year. For this task, we used the Spotify and Echonest APIs to collect our data, which included the duration, explicitness, song ID, number of artists, danceability, energy, loudness, speechiness, and tempo of a track. We used Spotify and Echonest together because the two services partnered in recent years to create a seamless transition between their APIs, meaning we could use the Spotify IDs of all of the tracks to get information from Echonest.


To train and test our classifiers, we used a data set of 1,248 songs from 2014: 100 hits (songs in the Top 100) and 1,148 misses. We trained on 70% of this data set and tested on the remaining 30%. The proportion of hits to misses is the same in the training and test sets. We then used Weka to train and evaluate several classifiers on these sets to see which worked best.
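A 70/30 split that preserves the hit-to-miss ratio is usually called a stratified split. The following is a minimal sketch of that idea in plain Python, not the procedure we actually ran. The function name `stratified_split`, the fixed seed, and the rounding rule are assumptions made for illustration; only the class counts (100 hits, 1,148 misses) come from the description above.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices into train/test so each class keeps roughly the same share.

    Hypothetical helper for illustration; the rounding rule is an assumption.
    """
    rng = random.Random(seed)

    # Group the row indices by class label.
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])   # first share of each class goes to training
        test_idx.extend(indices[cut:])    # the remainder goes to testing
    return sorted(train_idx), sorted(test_idx)


# Class counts taken from the data set description: 100 hits and 1,148 misses.
labels = ["hit"] * 100 + ["miss"] * 1148
train_idx, test_idx = stratified_split(labels)

train_hits = sum(labels[i] == "hit" for i in train_idx)
test_hits = sum(labels[i] == "hit" for i in test_idx)
print(len(train_idx), len(test_idx), train_hits, test_hits)
```

Splitting each class separately, rather than shuffling the whole data set once, is what keeps the small hit class from being under-represented in either set by chance.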

We chose Naive Bayes, K-Star, and J48 as our classifiers in Weka. While J48 gave the highest accuracy and F1-measure, Naive Bayes was best at correctly classifying whether songs were in the Top 100. The most important feature was artist ID, a unique identifier that each artist in the Spotify database has and that is therefore equivalent to the artist name. We believe artist ID was the most important attribute because an artist with multiple tracks in the Top 100 is most likely a popular artist, and so is more likely to produce another popular song.
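One general reason that overall accuracy and per-class performance can point to different classifiers is class imbalance: with 100 hits among 1,248 songs, a model can score high accuracy while finding few hits. The sketch below illustrates this with a trivial baseline that labels every song a miss. It uses only the class counts stated above; the helper name `binary_metrics` is hypothetical, and the numbers are not results from our Weka runs.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive ("hit") class.

    Hypothetical helper for illustration. Undefined ratios (zero denominators)
    are reported as 0.0, which is a convention chosen here, not a Weka setting.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Baseline that predicts "miss" for all 1,248 songs:
# every one of the 100 hits is a false negative, every miss a true negative.
baseline = binary_metrics(tp=0, fp=0, fn=100, tn=1148)
print(baseline)
```

This baseline reaches an accuracy of about 0.92 without identifying a single hit, which is why accuracy alone says little about how well a classifier picks out Top 100 songs.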

Our final report, available below, goes deeper into our approach, the problems we encountered, and our results.

Click to view the final report