Trends and Classification of Popular Music from 1960-2020

Alex Holtz, Anthony Naritsin, Rose Riggs

Outline

  1. Introduction
  2. Gathering Data
    1. Billboard Hot 100 Web Scraping
    2. Spotify Track ID Collection
    3. ID Selection
    4. Get Audio Features
    5. Get Genre
  3. Exploratory Data Analysis
    1. Exploration of Features
    2. Exploration of Genres
  4. Hypothesis Testing & Machine Learning
  5. Conclusion

Introduction

Despite its fluid nature, popular music has always remained an important part of people's lives. Alongside that significance, a few questions about music have remained prevalent. First, what will the next hit song be? Second, how can we define eras of music using our knowledge of the top music of the past? Finally, we tend to classify music in broad terms, but what underlying characteristics define these categories?

We can attempt to answer all of these questions through the data science pipeline. By analyzing the features of past popular songs, we can investigate what made them popular and how this may have changed over time. Given the genres that categorize these songs, we can attempt to justify these classifications using the characteristics of the music itself.

Part 1: Gathering Data

There are many resources that provide datasets that have already been collected and formatted: Kaggle, Google Public Data Explorer, and The U.S. Government Database, just to name a few. However, we wanted to explore popular songs and their features over a long period of time, and we settled upon the Billboard Hot 100 Songs Year-End Charts, which were not well represented in pre-existing datasets. We chose this data because it would allow us to investigate our two main questions: how has popular music changed over time, and can we classify popular music?

Data can be difficult to find in a format that is standard and that provides everything you are looking for. In our case, the Billboard charts provide the rank, title, and artists, but lack other characteristics of the music. To compensate, we used the Spotify API to wrangle a variety of other features, as well as genre.

Throughout this tutorial we will be using Pandas, Numpy, and Matplotlib to store, manipulate, and visualize our data.

Part 1.1: Billboard Hot 100 Web Scraping

To get information on popular songs for each year, we decided to use the Billboard Hot 100 Songs Year End Charts. Initially we looked at the Billboard website, but the website only goes back to charts from 2005, and our goal was to span a much larger time frame than just 15 years. We eventually decided on using the data from Wikipedia, which includes lists of the Billboard Hot 100 songs for most years back to 1946. After 1960, there were no missing years, so we decided to work with the time frame of 1960-2020.

In order to get the songs from Wikipedia, we scrape the information from the webpage. For this we are using Requests and BeautifulSoup. Requests is a Python library for making HTTP requests, which we use to get the HTML content of the webpages. BeautifulSoup is a library that allows us to search within the HTML and parse it to get the data we are looking for.

We are using pandas to store and manipulate our data, as it allows us to store relational data within memory, and provides helpful functions for reading in and exporting data. Each year is a separate page on Wikipedia, but we found that the URL follows the pattern "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_[year]". Iterating over every year in the period we chose, we use the Requests library to get the HTML content of the page, then use BeautifulSoup to parse the string into an object. Using our browser developer tools to select the table element with the songs, we see that the table uses the class "wikitable". With this information, we can select the proper table and read it into a pandas DataFrame.
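A minimal sketch of this scraping step is below. The URL pattern and the "wikitable" class come from the discussion above; the helper names and the use of pandas' read_html for the final parse are our own simplifications, and the real pages may need extra column handling.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{year}"

def hot100_url(year):
    """Build the Wikipedia URL for one year-end chart."""
    return BASE.format(year=year)

def scrape_year(year):
    """Fetch one year-end chart and return it as a DataFrame with a Year column."""
    html = requests.get(hot100_url(year)).text
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")  # the chart table uses class "wikitable"
    df = pd.read_html(str(table))[0]
    df["Year"] = year
    return df

# Concatenate every year in our chosen range into one DataFrame:
# charts = pd.concat([scrape_year(y) for y in range(1960, 2021)], ignore_index=True)
```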

After gathering all of the songs into one DataFrame, we display some information about the DataFrame so we can check that we have all of the data we expect. If we had 100 songs for each of the 61 years we looked at, we would have 6100 rows in the DataFrame. Instead, we have 6101. Additionally, the "Rank" column has the data type object, even though every element should be an integer. Exploring our data further, we found that both problems have the same source. One rank is listed as 'Tie', which causes there to be more than 100 rows for the year 1969, as well as prevents the column from having type int64. Going to our original data source, we can confirm that this is a tie for rank 100 in 1969. In order to resolve the type problem, we reassign the rank for this row to 100, then cast the column to type int64.

Additionally, when we display the first 5 rows of our data, we can see that the song titles are surrounded by quotation marks, which could cause problems for our search later on, so we will remove them now.
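Together, those cleanup steps might look like the sketch below; the column names "Rank" and "Title" are assumptions about what the scraped table produced.

```python
import pandas as pd

def clean_chart(df):
    """Fix the 1969 'Tie' rank, cast ranks to integers, and unquote titles."""
    df = df.copy()
    # The lone 'Tie' entry shares rank 100 in 1969, so assign it rank 100 outright.
    df.loc[df["Rank"] == "Tie", "Rank"] = 100
    df["Rank"] = df["Rank"].astype("int64")
    # Titles arrive wrapped in quotation marks; strip them for cleaner searching.
    df["Title"] = df["Title"].str.strip('"')
    return df

# Checkpoint the DataFrame so a later failed request doesn't lose everything:
# clean_chart(charts).to_csv("hot100.csv", index=False)
```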

Periodically, we will be saving our current DataFrame to a .csv file. We decided to do this because the nature of our data collection, making many HTTP requests, has a high risk of uncontrollable errors, including server time-out, API rate-limiting, etc. This type of data collection can also be fairly slow. Saving the data to an external file allows us to access the data at a later time without rerunning the code, and also helps mitigate the risk of losing all of the previously collected information due to an unexpected error.

Part 1.2: Spotify Track ID Collection

Next we want some more detailed information about each of the songs. The way we chose to get this information is using the Spotify API. In order to access the endpoints we want to use, we will use the Spotipy library, which provides a wrapper for the Spotify Web API. In order to get any information about the tracks from Spotify, we need the track ID. We will be getting this using the Search endpoint of the API, which allows us to input a search query, and will return a number of matching tracks.

Another option for getting this information is the MusicBrainz database. We decided to use Spotify instead because Spotify includes an endpoint to access an audio analysis for each song, while MusicBrainz focuses more on the releases of songs than on their sound.

The names of songs and artists as they are listed in our DataFrame are not ideal for searching. Many tracks on Spotify don't actually list all of the artists, so we split the string in the "Artist(s)" column on words/characters like "featuring", "and", "&", and ",", keeping only the first artist; this is usually the artist associated with the album the song is on, and thus the artist associated with the song on Spotify. Additionally, characters like "'" appear in many song names but can cause problems with the search query, so we remove them. Also, some songs have names with a "/", such as "We Will Rock You / We Are the Champions". Testing this type of search, we got the most accurate results by removing everything after the "/", so we do that here.

Now that the artist and song names are more appropriate for our search query, we add a column, "Query", which formats the query as necessary for the Spotify API. Formatting the query, rather than just adding the song title and the artist name to the search, will give us only results where the song and artist match well. This is likely to give us more missing data, if a song name is slightly different in Spotify, but it prevents false matching of songs.
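The cleanup and query construction described above can be sketched as follows. The helper names are ours; the `track:` and `artist:` field filters are the Spotify search syntax that the formatted query relies on.

```python
import re

def primary_artist(artists):
    """Keep only the first-listed artist (split on 'featuring', 'and', '&', ',')."""
    parts = re.split(r"\s+featuring\s+|\s+and\s+|\s*&\s*|\s*,\s*",
                     artists, flags=re.IGNORECASE)
    return parts[0].strip()

def clean_title(title):
    """Drop apostrophes and anything after a '/', both of which confuse the search."""
    return title.split("/")[0].replace("'", "").strip()

def build_query(title, artist):
    # Field filters restrict matches to the right track *and* the right artist,
    # at the cost of more missing results when names differ slightly on Spotify.
    return f"track:{clean_title(title)} artist:{primary_artist(artist)}"
```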

In order to access the Spotify API, you need to create an application in the Spotify Developer Dashboard. This gives the Client ID and Client Secret needed for authentication in the HTTP requests. Our credentials are saved in a file config.json to avoid exposing the credentials to the internet. Here, we use these credentials to create a spotipy object.

In order to search for each track, we will be adding information to an existing object. This function gets the data from the track object from Spotify, and adds it to a row object that we can add back to a DataFrame.

Initially, we used only the query described above, but we found that many songs were missing. When performing data collection and wrangling, it is important to consider how to handle missing data. In this case, we can try a different, more general query. Applying this fallback query whenever the strict query returned no results, we end up missing only 31 songs. We do not have a means of replacing these in a meaningful way, so we drop the missing rows. In other cases, missing data may be interpolated using a variety of methods; IBM's material on handling missing data is a great resource if you wish to learn more.
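The fallback logic can be sketched like this. The row fields and helper names are our own choices, and `sp` stands for an authenticated `spotipy.Spotify` client (built with `SpotifyClientCredentials` from the Client ID and Client Secret in config.json):

```python
def track_row(track):
    """Flatten the fields we keep from a Spotify track object into a dict."""
    return {
        "ID": track["id"],
        "Release Date": track["album"]["release_date"],
        "Popularity": track["popularity"],
    }

def search_track(sp, strict_query, loose_query):
    """Try the field-filtered query first; fall back to a looser free-text search."""
    for query in (strict_query, loose_query):
        results = sp.search(q=query, type="track", limit=5)["tracks"]["items"]
        if results:
            return [track_row(t) for t in results]
    return []  # still unmatched: these are the ~31 songs we drop later
```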

Part 1.3: ID Selection

The Spotify Search API used to get the ID information above provides the top 5 results for each track based on the popularity metric. In order to get the audio features, we must select a single one of these results for each song in the top 100 of each year.

The Pandas library allows us to convert our column of strings into datetime objects. This is extremely useful because it allows for a meaningful comparison between the values.

When a release date was not provided, the value was set to "0000". We handle this by setting these values to None before converting to datetime objects.

The result most likely to be the version that appeared on the Billboard Hot 100 is the oldest one, as the oldest is closest to the original release. We also considered other selection methods, including choosing the track whose release date is closest to the song's Billboard year, and a hybrid that prefers the oldest release after the Billboard year but otherwise takes the release closest before it.
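The conversion and selection steps together might look like the sketch below; the column names "Release Date" and "Query" are assumptions carried over from our earlier steps.

```python
import pandas as pd

def parse_release_dates(dates):
    """Convert release-date strings to datetimes; Spotify uses '0000' for missing."""
    dates = dates.mask(dates == "0000")            # blank out the sentinel value
    return pd.to_datetime(dates, errors="coerce")  # unparseable values become NaT

def pick_oldest(results, key="Query"):
    """Keep the oldest-released Spotify match for each charted song."""
    return results.sort_values("Release Date").groupby(key, as_index=False).first()
```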

We expect the result to have 6070 entries (6101 total - 31 missing from search), and this matches our results here, so we can export our DataFrame.

In order to see the performance of this selection, we view the summary statistics for the differences between the release year and the year the song appeared on the Billboard Top 100.

The mean of 2.13 supports our idea that the oldest result is likely the Billboard version: on average, the selected Spotify listing of a track has a release date about 2 years after the song appeared on the Billboard Hot 100. There is, however, a standard deviation of 10 years; although this may seem significant, the other selection processes we tested performed similarly or worse.

Overall, the selection process we used not only performed well in comparison to other methods, but also tried to select the most original version of the song.

Part 1.4: Get Audio Features

We chose to query Spotify because the API provides audio features that numerically describe music. There are two types of data, categorical and quantitative, and for the purposes of viewing trends and later classifying the data, it is interesting and useful to have a mixture of both. Music is often divided into groups based on subjective factors; the Spotify audio features, by contrast, are objective (at least by Spotify's definitions).

When performing the search below, we found that the ID Spotify provided for the rank 80 song from 2005 does not work when searching for audio features. As a result we remove this row because we do not have the features it is associated with (similar reasoning to the removal of missing data described previously).

The Spotify Audio Features API allows a maximum of 100 song IDs per request. To accommodate this, we group by rank, which yields at most 62 songs per query (61 years, plus the 1969 tie at rank 100).
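The batching could also be done generically with a chunking helper; a sketch (`sp.audio_features` is the Spotipy wrapper for the endpoint, and `sp` again stands for an authenticated client):

```python
def batches(ids, size=100):
    """Yield successive chunks of at most `size` track IDs."""
    for i in range(0, len(ids), size):
        yield ids[i : i + size]

def fetch_audio_features(sp, ids):
    # The audio-features endpoint accepts at most 100 IDs per request.
    features = []
    for batch in batches(ids, 100):
        features.extend(sp.audio_features(batch))
    return features
```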

As seen in the info above, the result of the audio features query contains 6069 songs (6070 from previous data - 1 removed for missing data), and there are no columns that contain null values.

Part 1.5: Get Genre

Another feature we may want to look at is genre. However, the Spotify API has no way of directly obtaining the genre of a particular track. Instead, we must obtain the genre through the Artist ID. Since we may have several artists on the same track, we will choose to take the primary artist, whose genres should be just as, if not more, indicative of the genre of the track.

When deciding how to use a list of artist IDs to get genres, we have three options: use only one artist, use the intersection of the genres from all the artists, or use the union of the genres from all the artists. Finding the union or the intersection from all the artists would greatly increase the number of API requests we need to make, and would take more computation. When multiple artists collaborate on a song, they are often associated with similar genres, so it would have a minimal effect on the track's final genre category. Although it is likely to make little difference, the majority of the artist lists begin with the primary artist of the track. Therefore, our best option is to use only one artist, the first one in the list.

We take the list of genres returned and convert them into a string for storage and further processing.

As seen in the output above, the issue with the genre data is that the string can contain many different genres. In order to perform exploratory data analysis on genre, we want to place most of the songs into singular, meaningful categories. That means that we must convert entries like "dance pop, pop, post-teen pop" into just "pop" and entries like "contemporary country, country, country road, modern country rock" into just "country". Our strategy was to use key phrases like "pop" and "rock" to place songs into categories. The issue is that many songs have overlapping genres such as "dance pop, pop, pop rap, r&b, rap". Thus, we prioritized categories according to their prevalence. For instance, since "disco" was rarer than "pop", a string with "disco, dance pop" would be placed in the "disco" category. Since rock and pop had equivalent prevalence, we would choose one or the other based on the number of occurrences of the keyword. So "dance pop, teen pop, rock" would be classified as pop. Doing this, we condensed all of the genres into 8 categories: country, disco, edm, soul/r&b, hip hop/rap, alternative/indie, rock, and pop.
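A sketch of this keyword-priority scheme is below. The exact priority order among the rarer categories is a judgment call; the ordering shown here is one plausible choice, not necessarily the one we used for every overlapping case.

```python
# Rarer categories take priority over pop/rock; order within this dict matters.
PRIORITY_KEYWORDS = {
    "country": ["country"],
    "disco": ["disco"],
    "edm": ["edm"],
    "hip hop/rap": ["hip hop", "rap"],
    "soul/r&b": ["soul", "r&b"],
    "alternative/indie": ["alternative", "indie"],
}

def categorize(genre_string):
    """Collapse a comma-separated Spotify genre string into one of 8 categories."""
    g = genre_string.lower()
    for category, keywords in PRIORITY_KEYWORDS.items():
        if any(kw in g for kw in keywords):
            return category
    # Pop vs. rock had equal priority: whichever keyword occurs more often wins.
    pop_n, rock_n = g.count("pop"), g.count("rock")
    if pop_n == 0 and rock_n == 0:
        return None  # no recognizable keyword
    return "pop" if pop_n >= rock_n else "rock"
```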

Now that we have placed each song into a genre, we can continue on to exploratory data analysis.

Part 2: Exploratory Data Analysis

After collecting data, either from a premade dataset or a scraped one (like above), the next step is exploring what you have collected. Exploratory data analysis (EDA) includes analyzing a dataset to determine the main characteristics. Think about what relationships you can visualize from the data. Doing so can illuminate paths to follow further or even show a direct relationship that can be better isolated.

We will be using Seaborn and Matplotlib as tools for data visualization. The documentation for these libraries is a great resource for learning what types of graphs you can make to visualize possible relationships.

Part 2.1: Exploration of Features

For the Billboard Hot 100 with audio features dataset that we constructed, we started by exploring the relationship between features and time.

The audio features from Spotify's audio analysis included in our data are: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration, and time signature, along with Spotify's popularity score.

First, we plotted the average value of each of these features for each year.
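The per-year averaging and plotting can be sketched as below; the column names ("Year", and a feature column like "Tempo") are assumptions about our merged DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def yearly_means(df, feature):
    """Average a feature over all Hot 100 entries for each year."""
    return df.groupby("Year")[feature].mean()

def plot_feature_over_time(df, feature):
    """Line plot of a feature's yearly mean across the whole chart history."""
    means = yearly_means(df, feature)
    fig, ax = plt.subplots()
    ax.plot(means.index, means.values)
    ax.set_xlabel("Year")
    ax.set_ylabel(feature)
    return fig
```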

The graph above shows the relationship between tempo (bpm) and time. Below, we will analyze similar plots of the other features over time, as it is often helpful to explore individual trends before exploring their relationships. In the graph above, there does not appear to be a clear relationship between tempo and the year a song was on the Billboard chart. If there is any relationship, it is that the minimum tempo of charting songs increased fairly noticeably around 2009.

Here, we graphed loudness over time. This graph shows popular music increasing in loudness over time, with the yearly mean peaking around 2010. Overall, loudness has a mean of around -8 dB and a standard deviation of around 3 dB, so this trend seems significant. This could be a result of differences in recording quality over time, or it could be representative of a change in the type of music that is popular.

This graph shows the average duration of tracks over time. The y-axis has units of seconds, which is not the best for immediate comprehension; for reference, the scale ranges from 160 to 280 seconds, or about 2.67 to 4.67 minutes.

Now we will be plotting all of the features that have the same scale (0.0 to 1.0).

This graph demonstrates some features with minimal change over time, such as instrumentalness and liveness, along with other features with stronger relationships, such as popularity. Given that popularity is a value based on Spotify listening metrics, the upward trend is expected as the songs on the Billboard Hot 100 in more recent years are likely to still be popular today, while older songs are less likely to be listened to today. Another feature with a stronger trend is acousticness. Acousticness of the Billboard Hot 100 songs has decreased over time. There are also features with visible trends that change less overall, including a slight decrease in valence, and slight increases in danceability, energy, and speechiness. The increase in danceability, energy, and speechiness together could be indicative of the rise of certain genres, possibly including hip hop/rap. Since our data only analyzes the most popular songs, these trends may or may not hold for music overall.

After observing trends in your data, it is important to consider what this might lead you to explore next. Our observations here led us to consider trends in genres over time, as well as the relationship between audio features and different genres.

Next, we plotted the pairwise correlation coefficient between each of these features. This type of analysis helps determine how certain features are related to each other.

From this graph, we can see that we have very few strong correlations between different features. The strongest correlation is between loudness and energy, which is to be expected, as the Spotify documentation says that dynamic range and perceived loudness both contribute to the energy metric. Another interesting feature of this visualization is the lack of strong negative correlations: the strongest is that between acousticness and energy, and even its coefficient is greater than -0.6. Interestingly, acousticness correlates negatively with most other features, despite negative correlations being uncommon in this data overall.

Part 2.2: Exploration of Genres

In order to look at how the prevalence of different genres shifts over time, we will plot the proportion of tracks in the Billboard Hot 100 that belong to each genre year-to-year.
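One way to build such a stacked-proportion plot is sketched below, assuming "Year" and "Genre" columns in our DataFrame; the normalized crosstab does the per-year proportion math.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def genre_proportions(df):
    """Fraction of each year's chart entries that fall in each genre category."""
    counts = pd.crosstab(df["Year"], df["Genre"])
    return counts.div(counts.sum(axis=1), axis=0)  # rows now sum to 1

def plot_genre_proportions(df):
    """Stacked area chart of genre share per year."""
    props = genre_proportions(df)
    fig, ax = plt.subplots()
    ax.stackplot(props.index, props.T.values, labels=props.columns)
    ax.set_xlabel("Year")
    ax.set_ylabel("Proportion of Hot 100")
    ax.legend(loc="upper left", fontsize="small")
    return fig
```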

In the above plot, the 'height' of a colored region (i.e., the distance between its top and bottom) indicates the proportion of the corresponding genre. From this, we can see that the mix of genres on the Billboard Hot 100 changes greatly over time. For instance, rock constitutes the largest proportion of music from 1960-1990 before quickly disappearing. Hip hop/rap, on the other hand, grows significantly from 1990 onward. EDM and alternative/indie were nearly nonexistent until recently, with alternative/indie growing around 2000 and EDM around 2010. Disco constituted a large proportion in the 70s and 80s before subsequently disappearing. Pop was relatively consistent, though its share has grown recently. Country was prevalent in the 70s and 80s and remained in small proportions afterwards. Finally, soul/r&b made up a large proportion of the chart in the 60s and again from 1990-2000, but was small in other periods.

Overall, these patterns seem consistent with the trends of music in pop culture. Considering that different genres likely have distinct audio features, these shifts in proportion may also explain changes in the proportion of audio features. Thus, we will explicitly analyze the distributions of audio features in each genre.

The table above provides a brief insight into how feature values differ across genres. In order to get a better idea of these differences, we create a violin plot for each feature by genre.

Overall, some features vary across genre much more than others. Key, instrumentalness, liveness, duration, tempo, and time signature are mostly similar across all of the genres. Of the features that do change, here are some characteristics worth noting:

In general, violin plots are useful for most of these features, though not the best visualization for some (e.g., mode); still, using the same plot type throughout makes it easy to compare the features against one another.

Part 3: Hypothesis Testing & Machine Learning

Now to the fun stuff! Using our newly acquired genre data, we are going to attempt to classify songs based on the audio features given by Spotify. We will do this using various classifiers provided by scikit-learn. But first, we have to determine which features we are going to use.

We need to determine which audio features vary between genres. Thankfully, SciPy allows us to run hypothesis tests to determine whether any differences are statistically significant. An ANOVA test allows us to test variance between many groups, but it assumes that the data within each sample is normally distributed. Since we do not want to inspect all 96 histograms to check normality, we will use the Kruskal-Wallis H test, which does not carry that assumption.
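Using `scipy.stats.kruskal`, one test per feature might look like the sketch below; the "Genre" column name and the per-feature loop are assumptions about our DataFrame layout.

```python
import pandas as pd
from scipy import stats

def kruskal_by_genre(df, feature, genre_col="Genre"):
    """Kruskal-Wallis H test: could the genres share one distribution for this feature?"""
    # One sample per genre: that genre's values for the chosen feature.
    samples = [group[feature].values for _, group in df.groupby(genre_col)]
    return stats.kruskal(*samples)

# Run one test per audio feature:
# for feature in audio_feature_columns:
#     print(feature, kruskal_by_genre(df, feature))
```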

As we can see from the results, every audio feature produced a p-value less than 0.0001. This means that, for each audio feature, the probability of observing differences this large if all eight genres shared the same distribution is very low. Thus, for every audio feature, we can reject the null hypothesis that the genres are drawn from the same distribution. Since we have reason to believe that the audio features differ between genres, we can use them as predictors in our classification models.

For our purposes, we will be using various supervised learning algorithms. This means that our model will be fit to a training set, and then we will test how well the trained model predicts values in the test set. As a baseline, we will be using multi-class Linear Discriminant Analysis (LDA), which finds a linear combination of features that separates the data into classes. It is described in detail here. LDA has no hyperparameters (parameters used to control the machine learning process).

In our confusion matrix, the columns refer to the predicted label and the rows refer to the true label of the observations. Each square shows the number of true/predicted label pairs. For instance, we can see that 280 observations that were predicted as rock were actually rock. We also see that for some heavily populated squares the predicted and true genres do not match up. For instance, for observations that were actually country, 47 were predicted as pop and 116 were predicted as rock. This tells us that the features of some genres may be very similar, which makes sense for country, pop, and rock. It also tells us that we may not have placed songs in the correct genre in our original dataset. To limit the number of genres that had few or no correct predictions, let's perform the same analysis on only rock, pop, hip hop/rap, and soul/r&b.

Much better! Now we have a good amount of correct predictions for every genre. It still seems that a large number of pop and soul/r&b songs are being misclassified as rock. However, the majority of rock and hip hop/rap songs are correctly predicted. Looking at the classification report, we have three metrics: precision, recall, and f1 score. The precision for a genre measures the ability of the classifier to not label an observation of another genre as that genre (i.e., more false positives means lower precision). The recall for a genre measures the ability of the classifier to find all instances of that genre. F1 score is the harmonic mean of precision and recall. For all three metrics, a score of 1 is the highest. Applying these, we can see that hip hop/rap has the best precision, meaning that little that is not actually hip hop/rap is classified as hip hop/rap. Rock has the best recall, meaning that most instances of rock are found and labelled as such.

While LDA seems to do fairly well in categorizing songs by genre, it does not give us an idea of what distinguishes the different genres. Another model, the decision tree classifier, can make observations about a song (i.e., ask questions about its audio features) and conclude its genre. Read about the model here. Luckily, scikit-learn has an easy way to visualize the model's decision making process.

Also, an important practice in data science is parsimony: limiting the number of features you use in a predictive model. Having too many features, especially in a decision tree with a large depth, can contribute to overfitting. As such, we will include only the subset of features that displayed the smallest p-values in our Kruskal-Wallis test from earlier.

Another consideration for decision trees is their hyperparameters, which can be tuned to produce different results. For a decision tree, some of the parameters to consider are max depth, the splitter, the minimum number of samples required to split a node, and the minimum number of samples allowed in a leaf node. The splitter refers to the strategy used to split a node, which can be either "best" or "random"; choosing the "best" split can contribute to overfitting, which choosing "random" often mitigates. In general, scikit-learn's default values work, but for our purposes we will be using a max depth of three. By default, there is no maximum depth, which can contribute to both overfitting and performance issues; moreover, any depth less than three would have trouble separating our four genres.
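With scikit-learn, fitting and rendering the shallow tree might look like the sketch below (the feature subset is assumed to have been selected already; all other hyperparameters stay at their defaults):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

def fit_and_draw_tree(X, y, feature_names, max_depth=3, seed=42):
    """Fit a shallow decision tree and render its decision rules."""
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=seed).fit(X, y)
    fig, ax = plt.subplots(figsize=(12, 6))
    plot_tree(tree, feature_names=feature_names,
              class_names=sorted(set(y)), filled=True, ax=ax)
    return tree, fig
```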

Each node of the resulting tree shows a decision step in the classifying process. For instance, the first node asks whether the song has a speechiness above a certain threshold. If it does, the song goes to the right branch, if not it goes to the left. Once the song reaches a leaf, it is predicted to be whatever genre that leaf represents. From this, we can make observations like the fact that hip hop/rap tends to be speechy, which makes sense given the presence of spoken word in that genre. Let's test the classifier and see how it compares to LDA.

The result seems comparable to LDA. However, it seems that more songs were classified as soul/r&b, both correctly and incorrectly. It is important to note that a singular decision tree is extremely inconsistent in its results.

For our last classifier, we are going to try a random forest. A random forest classifier essentially works by averaging the results of many decision trees. It is meant to counter the issues of overfitting and inconsistency prevalent in singular decision trees. You can read more about them here.

Random forest classifiers also have a wide variety of hyperparameters to consider. For instance, you can choose the number of trees in the forest. You can also set any of the same parameters as the decision tree classifier, which will be applied to every tree in the forest. For our purposes, we will again use the default values, along with a max depth of three. The default number of trees is 100, which helps average out the results of many potentially poor decision trees.
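A sketch of the forest with those settings (again, feature/label extraction is assumed):

```python
from sklearn.ensemble import RandomForestClassifier

def fit_forest(X, y, max_depth=3, n_estimators=100, seed=42):
    """100 shallow trees (the scikit-learn default count) vote on each genre."""
    forest = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, random_state=seed)
    return forest.fit(X, y)
```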

Again, the model appears comparable to the other two. However, in this case, it seems to classify many more cases as rock, both correctly and incorrectly. In particular, a large number of soul/r&b and pop songs are classified as rock.

Comparing models through the confusion matrices alone is insufficient. For one, we based our analysis on a single training/test split. The performance of a model can vary greatly split-to-split. Secondly, it is difficult to compare models based on the quantity of actual/predicted pairings alone. How many of each genre were in the test split and what do we care about more, correctly classifying rock samples or soul/r&b?

These questions are answered through cross validation. K-fold cross validation works by breaking the data into K random folds, each representing 1/K of the data. For each fold, the model is trained on the remaining data and tested on that held-out fold. This is done K times, until every fold has been scored. In our case, we are scoring with the macro f1, which is applicable in multi-class scenarios and is the mean of the per-class f1 scores. As with the f1 score, a score of 1 is the highest.

Scikit provides a cross validation function for its models. We will be using 10 folds, so we will test each model on 10 folds, returning the f1 macro for each fold.
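Such a run might look like the sketch below (model construction as in the earlier steps); `scoring="f1_macro"` is the scikit-learn name for the macro-averaged f1.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def f1_macro_cv(model, X, y, folds=10):
    """Score a model on each of `folds` folds using the macro-averaged f1."""
    return cross_val_score(model, X, y, cv=folds, scoring="f1_macro")

# e.g. f1_macro_cv(LinearDiscriminantAnalysis(), X, y) -> array of 10 fold scores
```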

Judging by our cross validation procedure, LDA performed the best, with the highest mean macro f1 score. Random forest performed next best, and the decision tree performed the worst. A Kruskal-Wallis test on the fold scores produced a very small p-value, meaning we can reject, at the 0.05 level of significance, the null hypothesis that the models have the same mean f1 score. It is therefore very likely that the models genuinely perform differently.

In our case, it seems that LDA and the random forest performed fairly similarly, but the decision tree performed significantly worse. This showcases the fact that a singular decision tree tends to have inconsistent results as compared to a random forest. In general, it seems that classification by genre is relatively difficult with the data we have. Specifically, many pop, rock, and soul/r&b tracks seem to have similar audio features. It should be noted that these results are not final, however. One could attempt to tune the hyperparameters or could even try another model. Not all models are equally applicable to the same dataset, so maybe something other than LDA and random forests would be more appropriate.

Part 4: Conclusion

Congrats, you have made it to the end! At the start of this tutorial, we decided to use the Billboard Hot 100 Year-End charts along with more detailed information from the Spotify API to analyze popular music between 1960 and 2020.

Overall, the numerical analysis of music is a challenging task. Music is a form of art, and as such it varies drastically and in very expressive ways. Consequently, it remains difficult to find objective aspects of music that meaningfully distinguish songs from one another; this may be why we so often divide music along subjective, categorical lines. Through the data science process, we have found trends in certain aspects of top music over time, both in specific features and in overall genre. Despite these historical trends, it remains difficult to predict the future of music due to its fluid nature.

In our analysis, we learned how to scrape Wikipedia, call the Spotify API, and rearrange the data into a usable form. We also learned how to perform exploratory analysis and to leverage machine learning to classify data based on a set of features. But, we have yet to teach you the most important step of data analysis: repeating all the other steps! In fact, there were many points at which we could have gone in several directions.

In our case, we produced several models that could classify between 4 types of music using a subset of audio features to varying degrees of efficacy. We found that, given the data we had, classification is difficult. While many genres had features that made them distinguishable from others, such as the high speechiness of hip hop/rap, other genres, like country, were largely indistinguishable. This could either be due to the inherent similarities between genres, a product of the features we had access to, or a flaw in our process of placing tracks into categories. Regardless, the data and methods we chose had a profound effect on the results we had.

Perhaps our data, featuring all Billboard Hot 100 year-end charts since 1960, was better suited to time series analysis. We may have produced better results with data geared towards classification, such as songs taken from curated playlists for each genre. Perhaps it was our method of categorizing the data: maybe country should fall into the rock category due to their similarities, or maybe we should consider all artists on a track rather than just the primary one. Perhaps genres are inherently indistinguishable by Spotify's audio features. The data collection and analysis we performed throughout this tutorial was a step towards understanding the intricacies of popular music. These are further questions that can be answered through additional analysis. Are you up for the challenge?