An Exploration by Rebeca Guillen (riguillen) , Astha Singhal (hackerman084) , and Michelle Tu (vance02)
As we're typing this introduction, we're bopping to a random "Pump Up Music" playlist found 5 minutes ago (thank you apache_king!). Music has played, and continues to play, a very important part in our lives. Some of us have been playing instruments since we were very young; others have joined every band possible throughout elementary, middle, and high school.
Needless to say, music is important to us. It's how we process our emotions when upset, motivate ourselves when lazy, and push past our limits when working out. The emotions of a generation are forever captured by the songs of that time. Music defines the decades: the psychedelic rock of the Beatles dominating the 60s, the eccentric theatrical rock of Queen exploding in the 80s, and NSYNC embodying the odd time that was the 90s.
We wanted to investigate how "pop" music has changed across the decades, from the 50s through the 2010s. Every generation faces a different sociopolitical landscape, and we wanted to see if that was reflected in its music of choice. By the same token, by analyzing the popular songs of these decades, we want to construct a model that can predict the decade a song was created in based on its audio features.
To do this, we used Spotify's and Genius's Web APIs. Spotify's own "All Out" playlists are organized by decade, so we use them as our definition of the "pop songs" of each era. Spotify's API also provides information about the features of each track, from danceability to 'speechiness', which we will analyze and use as the features of our model.
The first thing we did was import everything we needed. Because we are accessing the Spotify and Genius APIs, we use the requests library, along with a few others, to parse the queried information into a dataframe. We also use the Natural Language Toolkit (NLTK) to meaningfully analyze text. Lastly, for preliminary machine learning, we imported several modules from scikit-learn.
# Request / Parsing
import requests
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from pandas import json_normalize  # pandas.io.json.json_normalize was deprecated in favor of this top-level import
import string
#Visualization and EDA
import sys
# The below 3 statements were only done because we were coding within a Jupyter notebook set up for our class (CMSC320).
# For those following at home, you will likely just need to pip install these in your command-line, instead of within
# the confines of the notebook itself.
!{sys.executable} -m pip install Pillow
!{sys.executable} -m pip install wordcloud
!{sys.executable} -m pip install -U nltk
import matplotlib.pyplot as plt
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
import seaborn as sns
#Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
import joblib  # sklearn.externals.joblib was removed; joblib is now a standalone package
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn import metrics
To be able to continue with this project, you need authentication tokens provided by Spotify. We would periodically request tokens to query with from this page: https://developer.spotify.com/console/get-playlist-tracks/?playlist_id=37i9dQZF1DX5Ejj0EkURtP&market=&fields=&limit=&offset= . This requires a Spotify account, so you will need to set one up. Similarly, Genius's API requires an account; we used this page to set up ours: https://docs.genius.com/#/getting-started-h1
Afterwards, store the tokens within variables.
# https://developer.spotify.com/console/get-playlist-tracks/?playlist_id=37i9dQZF1DX5Ejj0EkURtP&market=&fields=&limit=&offset=
# use above website to generate token
# https://genius.com/api-clients ==> use this website to generate token
token = "INSERT SPOTIFY TOKEN HERE"
genius_token = "INSERT GENIUS TOKEN HERE"
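As a side note, pasting tokens directly into a notebook makes them easy to leak. One alternative (our suggestion, not a requirement of either API) is to read them from environment variables; the variable names below are hypothetical:

```python
import os

# SPOTIFY_TOKEN / GENIUS_TOKEN are hypothetical names; set them in your shell
# first (e.g. `export SPOTIFY_TOKEN=...`). If they are unset, the placeholder
# strings are kept so the notebook still runs.
token = os.environ.get("SPOTIFY_TOKEN", "INSERT SPOTIFY TOKEN HERE")
genius_token = os.environ.get("GENIUS_TOKEN", "INSERT GENIUS TOKEN HERE")
```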
We used two different endpoints to get the data we needed to analyze trends over time. The first grabs each of the decade playlists provided by Spotify, from the 50s to the 2010s. These playlists are curated by Spotify employees, and we use them as a proxy for the most popular / most iconic songs of a specified decade.
r_10s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DX5Ejj0EkURtP", headers={'Authorization': 'Bearer ' + token})
r_00s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DX4o1oenSJRJd", headers={'Authorization': 'Bearer ' + token})
r_90s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DXbTxeAdrVG2l", headers={'Authorization': 'Bearer ' + token})
r_80s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DX4UtSsGT1Sbe", headers={'Authorization': 'Bearer ' + token})
r_70s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DWTJ7xPn4vNaz", headers={'Authorization': 'Bearer ' + token})
r_60s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DXaKIA8E7WcJj", headers={'Authorization': 'Bearer ' + token})
r_50s = requests.get("https://api.spotify.com/v1/playlists/37i9dQZF1DWSV3Tk4GO2fq", headers={'Authorization': 'Bearer ' + token})
We had to get the data into a more readily understandable form. The data returned from the playlist endpoint is JSON, and many of the important data points are nested deep within the JSON object. The code below grabs the important fields, including the nested ones, from the JSON returned by each call to the API and converts them into dataframes. This shows the results for the 2010s playlist; we then replicated it for all of the other decades.
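To make the flattening concrete, here is json_normalize on a toy object shaped like the playlist response (the field names match the API; the values are made up). Each artist on a track becomes its own row, while track-level meta columns like track.name repeat per artist:

```python
import pandas as pd

# A toy JSON object mirroring the shape of Spotify's playlist "items" array
# (field names match the real API; the values here are made up).
items = [
    {"added_at": "2019-11-01T00:00:00Z",
     "track": {"id": "abc123", "name": "Song A", "popularity": 80,
               "duration_ms": 210000,
               "artists": [{"name": "Artist X"}, {"name": "Artist Y"}]}},
]

# record_path walks down to the list of artist records; meta pulls the
# track-level fields up alongside each artist row
df = pd.json_normalize(
    items,
    record_path=['track', 'artists'],
    meta=['added_at', ['track', 'popularity'], ['track', 'id'],
          ['track', 'name'], ['track', 'duration_ms']],
    errors='ignore')
```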
json_data = r_10s.json()
playlists_10 = json_normalize(json_data["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_10 = playlists_10.drop(columns=['external_urls', 'href', 'type', 'uri'])
# To view it!
playlists_10.head()
json_data_00 = r_00s.json()
playlists_00 = json_normalize(json_data_00["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_00 = playlists_00.drop(columns=['external_urls', 'href', 'type', 'uri'])
json_data_90 = r_90s.json()
playlists_90 = json_normalize(json_data_90["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_90 = playlists_90.drop(columns=['external_urls', 'href', 'type', 'uri'])
json_data_80 = r_80s.json()
playlists_80 = json_normalize(json_data_80["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_80 = playlists_80.drop(columns=['external_urls', 'href', 'type', 'uri'])
json_data_70 = r_70s.json()
playlists_70 = json_normalize(json_data_70["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_70 = playlists_70.drop(columns=['external_urls', 'href', 'type', 'uri'])
json_data_60 = r_60s.json()
playlists_60 = json_normalize(json_data_60["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_60 = playlists_60.drop(columns=['external_urls', 'href', 'type', 'uri'])
json_data_50 = r_50s.json()
playlists_50 = json_normalize(json_data_50["tracks"]["items"], [['track', 'artists']], ['added_at', ['track', 'popularity'], ['track', 'id'], ['track', 'name'], ['track', 'duration_ms']], errors='ignore')
playlists_50 = playlists_50.drop(columns=['external_urls', 'href', 'type', 'uri'])
playlists_80.head()
Now that we have the information about each playlist, we want to grab information about each song inside the playlists. The code below first collapses the one-row-per-artist duplicates so each track appears once, then creates a list of all the track ids and uses it to call the second endpoint we use: the audio-features endpoint.
This endpoint gives us a lot of information about the more qualitative features of a song, such as valence, danceability, and instrumentalness. However, it also returns fields that are not useful for our analysis, such as mode, track_href, uri, and analysis_url. We thus make a batch of queries and compile the relevant information into a dataframe.
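The playlist frames contain one row per artist per track (an artifact of flattening the artists list), so the code below first groups by track name and keeps the first value in each column. A toy sketch of that dedup, with made-up values:

```python
import pandas as pd

# Toy playlist frame: two rows for 'Song A' because it has two artists
toy = pd.DataFrame({
    'track.name': ['Song A', 'Song A', 'Song B'],
    'name':       ['Artist X', 'Artist Y', 'Artist Z'],
    'track.id':   ['abc123', 'abc123', 'def456'],
})

# One row per track; each remaining column keeps the first value seen
dedup = toy.groupby('track.name', as_index=False).first()
```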
# Each column keeps the first value seen for a given track name, collapsing
# the one-row-per-artist duplicates into one row per track
aggregation_functions = {'name': 'first', 'track.id': 'first', 'track.duration_ms': 'first', 'added_at': 'first', 'track.popularity': 'first'}
df_new = playlists_10.groupby(playlists_10['track.name']).aggregate(aggregation_functions)
grouped_00 = playlists_00.groupby(playlists_00['track.name']).aggregate(aggregation_functions)
grouped_90 = playlists_90.groupby(playlists_90['track.name']).aggregate(aggregation_functions)
grouped_80 = playlists_80.groupby(playlists_80['track.name']).aggregate(aggregation_functions)
grouped_70 = playlists_70.groupby(playlists_70['track.name']).aggregate(aggregation_functions)
grouped_60 = playlists_60.groupby(playlists_60['track.name']).aggregate(aggregation_functions)
grouped_50 = playlists_50.groupby(playlists_50['track.name']).aggregate(aggregation_functions)
track_ids = df_new['track.id'].tolist()
track_str = '%2C'.join(track_ids)
t = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str, headers={'Authorization': 'Bearer ' + token})
track_info_df = json_normalize(t.json()['audio_features'])
track_info_df = track_info_df.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_00 = grouped_00['track.id'].tolist()
track_str_00 = '%2C'.join(track_ids_00)
t_00s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_00, headers={'Authorization': 'Bearer ' + token})
track_info_00 = json_normalize(t_00s.json()['audio_features'])
track_info_00 = track_info_00.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_90 = grouped_90['track.id'].tolist()
track_str_90 = '%2C'.join(track_ids_90)
t_90s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_90, headers={'Authorization': 'Bearer ' + token})
track_info_90 = json_normalize(t_90s.json()['audio_features'])
track_info_90 = track_info_90.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_80 = grouped_80['track.id'].tolist()
track_str_80 = '%2C'.join(track_ids_80)
t_80s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_80, headers={'Authorization': 'Bearer ' + token})
track_info_80 = json_normalize(t_80s.json()['audio_features'])
track_info_80 = track_info_80.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_70 = grouped_70['track.id'].tolist()
track_str_70 = '%2C'.join(track_ids_70)
t_70s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_70, headers={'Authorization': 'Bearer ' + token})
track_info_70 = json_normalize(t_70s.json()['audio_features'])
track_info_70 = track_info_70.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_60 = grouped_60['track.id'].tolist()
track_str_60 = '%2C'.join(track_ids_60)
t_60s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_60, headers={'Authorization': 'Bearer ' + token})
track_info_60 = json_normalize(t_60s.json()['audio_features'])
track_info_60 = track_info_60.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
track_ids_50 = grouped_50['track.id'].tolist()
track_str_50 = '%2C'.join(track_ids_50)
t_50s = requests.get("https://api.spotify.com/v1/audio-features?ids=" + track_str_50, headers={'Authorization': 'Bearer ' + token})
track_info_50 = json_normalize(t_50s.json()['audio_features'])
track_info_50 = track_info_50.drop(columns=['analysis_url', 'mode', 'track_href', 'type', 'uri'])
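A side note on the `'%2C'.join(...)` calls above: `%2C` is simply the URL-encoded comma that the `ids` query parameter expects. The same encoding can be produced automatically, as this small standard-library illustration shows (the IDs here are made up):

```python
from urllib.parse import urlencode

track_ids = ['abc123', 'def456']  # hypothetical track IDs
# urlencode percent-encodes the comma for us, so a plain ','.join suffices
query = urlencode({'ids': ','.join(track_ids)})
print(query)  # ids=abc123%2Cdef456
```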
#track.name appears to be in the index so we need to reindex it
grouped_50_index = grouped_50.index
grouped_50.index = range(len(grouped_50))
grouped_50["track.name"] = grouped_50_index
grouped_60_index = grouped_60.index
grouped_60.index = range(len(grouped_60))
grouped_60["track.name"] = grouped_60_index
grouped_70_index = grouped_70.index
grouped_70.index = range(len(grouped_70))
grouped_70["track.name"] = grouped_70_index
grouped_80_index = grouped_80.index
grouped_80.index = range(len(grouped_80))
grouped_80["track.name"] = grouped_80_index
grouped_90_index = grouped_90.index
grouped_90.index = range(len(grouped_90))
grouped_90["track.name"] = grouped_90_index
grouped_00_index = grouped_00.index
grouped_00.index = range(len(grouped_00))
grouped_00["track.name"] = grouped_00_index
df_index = df_new.index
df_new.index = range(len(df_new))
df_new["track.name"] = df_index
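The index-to-column shuffle above can also be done with pandas' built-in `reset_index`; a minimal demonstration on a toy frame:

```python
import pandas as pd

# Toy frame whose index plays the role of 'track.name' after the groupby
toy = pd.DataFrame({'track.id': ['abc123', 'def456']},
                   index=pd.Index(['Song A', 'Song B'], name='track.name'))

# reset_index moves 'track.name' out of the index into a regular column
# and rebuilds a clean 0..n-1 integer index
flat = toy.reset_index()
```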
Now we have two groups of dataframes: those with each decade's playlist information, and those with the audio features of each decade's songs. We want to merge them on a common column, which is the track id.
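On toy frames with made-up values, the merge looks like this:

```python
import pandas as pd

# Hypothetical playlist rows and audio-feature rows sharing track ids
left = pd.DataFrame({'track.id': ['abc123', 'def456'],
                     'track.name': ['Song A', 'Song B']})
right = pd.DataFrame({'id': ['abc123', 'def456'],
                      'danceability': [0.7, 0.5]})

# Inner join: rows pair up where left's 'track.id' equals right's 'id'
merged = pd.merge(left, right, left_on='track.id', right_on='id')
```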
data_2010s = pd.merge(df_new, track_info_df, left_on = 'track.id', right_on = 'id')
data_2000s = pd.merge(grouped_00, track_info_00, left_on = 'track.id', right_on = 'id')
data_1990s = pd.merge(grouped_90, track_info_90, left_on = 'track.id', right_on = 'id')
data_1980s = pd.merge(grouped_80, track_info_80, left_on = 'track.id', right_on = 'id')
data_1970s = pd.merge(grouped_70, track_info_70, left_on = 'track.id', right_on = 'id')
data_1960s = pd.merge(grouped_60, track_info_60, left_on = 'track.id', right_on = 'id')
data_1950s = pd.merge(grouped_50, track_info_50, left_on = 'track.id', right_on = 'id')
We now want to add another column named "decade" as our "label" column. This will be important for all of our further analysis after we combine all smaller dataframes into one large dataframe.
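On toy frames, the label-then-stack pattern the next two cells use looks like this (values made up):

```python
import pandas as pd

a = pd.DataFrame({'track.name': ['Song A'], 'energy': [0.8]})
b = pd.DataFrame({'track.name': ['Song B'], 'energy': [0.3]})

# Tag every row of each frame with its decade label, then stack the frames
a['decade'] = '2010s'
b['decade'] = '2000s'
both = pd.concat([a, b])

# concat keeps each frame's original index, so rebuild a clean 0..n-1 index
both.index = range(len(both))
```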
data_2010s['decade'] = '2010s'
data_2000s['decade'] = '2000s'
data_1990s['decade'] = '1990s'
data_1980s['decade'] = '1980s'
data_1970s['decade'] = '1970s'
data_1960s['decade'] = '1960s'
data_1950s['decade'] = '1950s'
# In the code below, we combine all the dataframes into one large dataframe
decades_data = pd.concat([data_2010s, data_2000s, data_1990s, data_1980s, data_1970s, data_1960s, data_1950s])
decades_data['energy'].groupby(decades_data['decade']).describe()
# fixing the index
decades_data.index = range(len(decades_data))
decades_data.head()
Now for the fun part! To answer the first part of our question, we need to delve into the data and see if there are any relationships across decades. First, we wanted to visualize each decade to get a better grasp of the data at hand!
Spotify's API characterizes each track with a number of features. To name a few of the metrics Spotify tracks for each song, you can find the level of acousticness, loudness, tempo, and regularity of a particular track. All of these audio features play an essential part in a track's unique makeup, so it is a good start for our data analysis to get a big picture of how the songs of each decade compare overall. We decided to create radar charts of a few key measures that we felt were interesting and could characterize the popular songs of each decade.
Each decade's profile is a radar chart of the average energy, speechiness, instrumentalness, danceability, and valence of the popular songs of that decade. We produced radar charts for each decade from the 1950s up to the 2010s, each giving a general overview of how the audio features combine to make up a typical song of the decade. Below we give a quick rundown on how Spotify defines the different audio features.
A Quick Breakdown of Some Spotify Audio Features:
- Energy: a perceptual measure of intensity and activity, from 0.0 to 1.0; energetic tracks feel fast, loud, and noisy.
- Speechiness: detects the presence of spoken words in a track.
- Instrumentalness: predicts whether a track contains no vocals; values closer to 1.0 suggest an instrumental track.
- Danceability: how suitable a track is for dancing, based on tempo, rhythm stability, beat strength, and overall regularity.
- Valence: the musical positiveness conveyed by a track; high-valence tracks sound happy and cheerful, low-valence tracks sound sad or angry.
You can look into other audio measures that Spotify collects for each track here: (https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/)
### Creating a Radar Chart for the metrics of the average song in a particular decade
mean_stats = pd.DataFrame()
# Obtaining mean for each metric for all the popular songs in a decade
mean_stats['energy'] = decades_data['energy'].groupby(decades_data['decade']).mean()
mean_stats['speechiness'] = decades_data['speechiness'].groupby(decades_data['decade']).mean()
mean_stats['instrumentalness'] = decades_data['instrumentalness'].groupby(decades_data['decade']).mean()
mean_stats['danceability'] = decades_data['danceability'].groupby(decades_data['decade']).mean()
mean_stats['valence'] = decades_data['valence'].groupby(decades_data['decade']).mean()
# Standardizing all the metrics so they can be scaled relative to each other in the radar chart
for name, values in mean_stats.items():  # iteritems() was removed in pandas 2.0
    mean = values.mean()
    std1 = values.std()
    mean_stats[name] = (values - mean)/std1
mean_stats
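The loop above z-scores each column: subtract the column mean and divide by the standard deviation, so a value of +1 means one standard deviation above the cross-decade average for that feature. A quick sanity check on a toy column:

```python
import pandas as pd

col = pd.Series([0.2, 0.4, 0.6])
# pandas' std() uses the sample estimate (ddof=1), so here std = 0.2
z = (col - col.mean()) / col.std()
# the three values land exactly at -1, 0, and +1 standard deviations
```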
Now, Python itself does not have built-in functionality for creating radar charts. Luckily, we were able to create them by following this tutorial from Kaggle. We listed it below if you want to explore this more!
Draw a Radar Chart with Python in a Simple Way by Chen Shuyao (https://www.kaggle.com/typewind/draw-a-radar-chart-with-python-in-a-simple-way)
# Creating the radar charts
labels = np.array(['energy', 'speechiness', 'instrumentalness', 'danceability', 'valence'])
colors = np.array(['blue', 'orange', 'green', 'red', 'purple', 'brown', 'pink'])
for i in range(7):
    # sets up the axes
    stats = mean_stats.iloc[i].values
    angles = np.linspace(0, 2*np.pi, len(labels), endpoint=False)
    # repeat the first point so the plot closes into a polygon
    angles = np.concatenate((angles, [angles[0]]))
    stats = np.concatenate((stats, [stats[0]]))
    fig = plt.figure()
    # when adding a subplot, set polar=True to plot the subplot on radial axes
    ax = fig.add_subplot(111, polar=True)
    ax.plot(angles, stats, '-o', color=colors[i])
    ax.fill(angles, stats, alpha=0.2, color=colors[i])
    # label only the original five angles (the last angle is the duplicated closing point)
    ax.set_thetagrids(angles[:-1] * 180/np.pi, labels)
    ax.set_title(mean_stats.index[i] + " Radar Chart")
    ax.set_yticklabels([])
    ax.grid(True)
The final radar charts allow us to visualize a profile of energy, valence, danceability, instrumentalness, and speechiness for the popular songs of each decade. The changes in shape across the decades reveal several distinct similarities and differences in track structure and musical content from the 1950s to the 2010s.
One can immediately group some of the decades in this period together due to the similar shape of their profiles. The shape of 1950s and 1960s profiles are very similar. For both the 1950s and the 1960s, tracks had high instrumentalness and valence, medium speechiness and energy (energy is lower for 1960s), and low danceability. Then in the 1970s, speechiness drops considerably while danceability and energy become more pronounced. The 1970s have high instrumentalness and valence, medium energy and danceability, and low speechiness.
The transition between the 1970s and 1980s shows a significant change in the track structure and audio features of popular songs. There is a drop in instrumentalness and valence and there is an increase in energy and danceability. The 1980s specifically had high energy and danceability, medium instrumentalness and valence, and low speechiness. Medium/low instrumentalness and valence and medium/high energy and danceability will persist for the decades after the 1980s.
The shapes of the radar charts for the 1990s, 2000s, and 2010s are similar. The three decades all have low instrumentalness and valence and high energy and danceability. The 1990s had a low speechiness while the 2000s and 2010s had high speechiness. The profile shape for the 2000s and 2010s are almost indiscernible.
How exactly do we know that a song is popular? Spotify and radio stations often curate these lists based on how often a song was played across various media outlets. We imagined the occasions on which we, as general music consumers, like to play music, and what makes a song so catchy or so fitting for singing along to in the car or swaying those hips at dance functions. Perhaps there is some relationship between the danceability, speechiness, and energy of songs across the decades.
We created a scatterplot to compare the danceability, energy, and speechiness of songs from the 1950s to the 2010s. Danceability, how suitable a track is for dancing, varies along the y axis. Energy, the intensity of a track, is noted by the hue of the point: light means low energy, dark means high energy. Speechiness, the presence of spoken words in a track, is marked by the size of the point; bigger points indicate that the track has more spoken words than tracks marked by smaller points.
# Creating a scatterplot for 4 variables
cmap = sns.cubehelix_palette(rot=-.2, as_cmap=True)
ax = sns.scatterplot(x="decade", y="danceability",
                     hue="energy", size="speechiness",
                     palette=cmap, sizes=(10, 200),
                     data=decades_data)
ax.legend(loc='upper right', bbox_to_anchor=(1, 0.5, 0.5, 0.5))
Over the decades, the danceability of popular tracks has increased, as there is a slight upward movement in the range of danceability over time. The 2000s stand out as the decade with the most speechiness and energy, since its points are among the largest and darkest. In contrast, the 1960s stand out as the decade whose popular tracks have the lowest energy and speechiness, given the light color and small size of its points.
Now that we see that there are certain trends forming over time, is there a better way of visualizing and examining them? Here we'll be doing more of a deep dive into the changes that seem to be forming over time!
The following code visualizes the trends in the different aspects of music by displaying the averages of the different audio features over time.
classes = ["1950s", "1960s", "1970s", "1980s", "1990s", "2000s", "2010s"]
decades_data['energy'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['energy'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Energy')
plt.xlabel('Decade')
plt.title('Energy in Songs Over Decades')
plt.show()
decades_data['track.popularity'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['track.popularity'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Popularity')
plt.xlabel('Decade')
plt.title('Popularity in Songs Over Decades')
plt.show()
decades_data['danceability'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['danceability'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Danceability')
plt.xlabel('Decade')
plt.title('Danceability in Songs Over Decades')
plt.show()
decades_data['liveness'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['liveness'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Liveness')
plt.xlabel('Decade')
plt.title('Liveness in Songs Over Decades')
plt.show()
decades_data['speechiness'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['speechiness'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Speechiness')
plt.xlabel('Decade')
plt.title('Speechiness in Songs Over Decades')
plt.show()
decades_data['instrumentalness'].groupby(decades_data['decade']).mean().plot(kind='line')
decades_data['instrumentalness'].groupby(decades_data['decade']).mean().plot(kind='bar')
plt.ylabel('Instrumentalness')
plt.xlabel('Decade')
plt.title('Instrumentalness in Songs Over Decades')
plt.show()
Here we can more easily see that the instrumentalness of songs changes most significantly over the decades, taking sharp plunges in the 1970s and the 2010s. Energy seems to be increasing slightly over time, as does danceability. Instrumentalness and speechiness looked like they might have an inverse relationship, so we created a small line graph below to check whether that was the case.
# This is comparing the instrumentalness to the speechiness of the music, to see whether there appears to be an inverse
# relation between the two
test = decades_data['instrumentalness'].groupby(decades_data['decade']).mean().to_frame()
test["liveness"] = decades_data['liveness'].groupby(decades_data['decade']).mean()
test["energy"] = decades_data['energy'].groupby(decades_data['decade']).mean()
test["popularity"] = decades_data['track.popularity'].groupby(decades_data['decade']).mean()
test["danceability"] = decades_data['danceability'].groupby(decades_data['decade']).mean()
test["speechiness"] = decades_data['speechiness'].groupby(decades_data['decade']).mean()
test['duration_ms'] = decades_data['duration_ms'].groupby(decades_data['decade']).mean()
test.plot.line(y=['instrumentalness', 'speechiness'])
plt.xticks([0, 1, 2, 3, 4, 5, 6], test.index.values.tolist())  # one tick per decade
plt.show()
It appears as though speechiness and instrumentalness are inversely related, especially in decades like the 60s and the 2010s.
Though averages give us a clue, they are notoriously sensitive to outliers. To ensure we weren't drawing erroneous conclusions, we also created violin plots to better visualize the distributions of the features over time.
The next image displays the popularity of each of the playlists. This does not represent the popularity of the songs during the corresponding decades but rather the popularity of them today. According to this link: https://www.statista.com/statistics/475821/spotify-users-age-usa/, about 55% of the current Spotify users are between the ages of 18-34. This age group grew up with the music in the 80s, 90s, and 00s, so it makes sense that the more popular playlists are from those decades.
ax = sns.violinplot(x="decade", y="track.popularity", data=decades_data, cut = 0)
The following shows the distribution of energy levels of the songs in each decade's playlist. As shown in the plot, energy has increased from the 1950s to the 2000s and 2010s.
sns.violinplot(x="decade", y="energy", data=decades_data, cut = 0)
The following shows the distribution of the loudness of the songs in each decade's playlist. Loudness is measured in decibels (dB); values are typically negative, and the closer a value is to 0, the louder the track. As shown in the plot, loudness in the last two decades sits lower than in the 70s and 80s, which makes sense because the 70s and 80s were the era of rock music, which tends to be very loud.
sns.violinplot(x="decade", y="loudness", data=decades_data)
This plot displays the valence of the songs in each decade. Valence describes the positiveness of a song. As shown in the plot, valence was much higher during the 1950s than it is today, when songs span the full range of valence. This makes sense given the genres popular during the 1950s, such as rock and roll, doo-wop, pop, and swing, which all tend to be very positive. Today there is a range of music whose positiveness varies with the subject of the song.
sns.violinplot(x="decade", y="valence", data=decades_data)
This plot displays the danceability of the songs. Danceability describes how suitable a track is for dancing based on a combination of musical elements, including tempo, rhythm stability, beat strength, and overall regularity. The plot shows that the danceability of popular songs has been consistent throughout the last seven decades.
sns.violinplot(x="decade", y="danceability", data=decades_data)
This plot displays the tempo distribution of the songs in each decade's playlist. Tempo is the overall estimated pace of a track in beats per minute (BPM). As expected, the 1950s has more songs with a higher tempo, owing to the high-spirited genres of that time. After that decade, the tempo distributions are quite consistent.
sns.violinplot(x="decade", y="tempo", data=decades_data)
This plot displays the duration (in milliseconds) of the songs for each decade. It shows that there were many more long songs during the 60s and 70s, which makes sense given the type of music of the time (like Dark Side of the Moon! Come on!). More songs in the 60s and 70s exceeded the usual average of 3-4 minutes than today, when a song rarely exceeds 5 minutes.
sns.violinplot(x="decade", y="duration_ms", data=decades_data)
Overall, the majority of the trends in the audio features of each decade's songs did match what we were expecting. What did surprise us was the popularity of the 80s, 90s, and 00s playlists: we assumed the 10s playlist would be more popular, it being the current decade. However, looking at the data on the age groups of Spotify users, it made sense why people would want to hear the songs they grew up with. The audio features of the songs in each decade's playlist related very closely to the genres of the corresponding decade.
Now that we've seen the changes in audio features, maybe there is also a change in the words used over time? We thought it would be interesting to compare decades and see whether the prevalence of certain words differs. To do so, we used the Genius API to get the lyrics of every song in our decades_data dataframe.
'''
This method requests the URL of a song by querying the song title and artist name
Params: song_title -> string, artist_name -> string
Returns: a string (either a song URL or an empty string if the song couldn't be found)
'''
def request_song_info(song_title, artist_name):
    # forming the request
    base_url = 'https://api.genius.com'
    headers = {'Authorization': 'Bearer ' + genius_token}
    search_url = base_url + '/search'
    data = {'q': song_title.lower() + ' ' + artist_name.lower()}
    # making the GET request; the search query goes in the URL's query string
    response = requests.get(search_url, params=data, headers=headers)
    json = response.json()
    remote_song_info = None
    # Search results: use the artist name to make sure we actually get the song we want
    for hit in json['response']['hits']:
        if artist_name.lower() in hit['result']['primary_artist']['name'].lower():
            remote_song_info = hit
            break
    if remote_song_info:  # returning the song URL if matched
        song_url = remote_song_info['result']['url']
        return song_url
    else:
        return ""
It's not enough just to be able to get the lyrics of a song. If we want meaningful data, we have to strip out words like "the", "an", and "it". These are known as "stop words", and luckily, NLTK has a corpus of them. We add a few stop words that NLTK's list doesn't cover (like chorus, verse, nt, prechorus, and bridge) and then filter all of them out of the lyrics.
'''
This method returns the stop-word set used and an array of the cleaned words within a song.
Params: url -> string
Returns: a tuple (stop_words set, list of cleaned words within the song)
'''
def get_cleaned_lyrics(url):
    page = requests.get(url)
    html = BeautifulSoup(page.text, 'html.parser')
    lyrics = html.find('div', class_='lyrics').get_text()
    tokens = word_tokenize(lyrics)  # split the lyrics into an array of strings
    # some data-cleaning
    tokens = [w.lower() for w in tokens]  # convert them all to lowercase
    table = str.maketrans('', '', string.punctuation)  # translation table that removes punctuation
    stripped = [w.translate(table) for w in tokens]
    words = [word for word in stripped if word.isalpha()]  # remove remaining tokens that are not alphabetic
    # filter out stop words, adding a few lyric-specific ones NLTK doesn't include
    stop_words = set(stopwords.words('english'))
    stop_words.update(["chorus", "verse", "nt", "prechorus", "bridge"])
    words = [w for w in words if w not in stop_words]
    return stop_words, words
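To illustrate the cleaning steps without hitting the network, here is a minimal sketch that applies the same lowercase / strip-punctuation / filter pipeline to an inline snippet. The snippet and the tiny hand-rolled stop list are made up for the example; the real function pulls its stop words from NLTK's corpus and its tokens from the scraped Genius page.

```python
import string

# made-up lyric snippet standing in for scraped Genius text
sample = "Verse 1: I can't stop the feeling! So just dance, dance, dance."

tokens = sample.lower().split()                    # lowercase and split into tokens
table = str.maketrans('', '', string.punctuation)  # translation table that deletes punctuation
stripped = [w.translate(table) for w in tokens]    # "can't" -> "cant", "dance," -> "dance"
words = [w for w in stripped if w.isalpha()]       # drops the non-alphabetic "1"
stop_words = {"i", "the", "so", "just", "verse", "cant"}  # tiny stand-in for NLTK's stop-word corpus
cleaned = [w for w in words if w not in stop_words]
print(cleaned)  # -> ['stop', 'feeling', 'dance', 'dance', 'dance']
```

Note how section markers like "Verse" survive tokenization and only disappear because we added them to the stop list, which is exactly why the real pipeline extends NLTK's defaults.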
'''
Generates a Word Cloud based on the frequency of words
Param: freq -> frequency distribution (dict-like)
Returns: WordCloud object
'''
def generate_frequency_wordcloud(freq):
    wc = WordCloud()
    wc.generate_from_frequencies(freq)
    return wc
'''
Displays a wordcloud
Params: wordcloud -> WordCloud object, decade -> string
Returns: Nothing, but displays the wordcloud with the decade string as the title
'''
def display_wordcloud(wordcloud, decade):
    plt.title(decade)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
'''
Displays the frequency of a given dictionary of words in a bar chart
Params: freq -> dictionary, title -> string
Returns: Nothing, but displays a bar chart with the given string as the title
'''
def display_barchart(freq, title):
    x_val = list(range(len(freq.keys())))
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.set_title(title)
    sns.barplot(ax=ax, x=x_val, y=list(freq.values())).set(xticklabels=list(freq.keys()))
    plt.show()
'''
Given an array of song titles and corresponding artist names, returns the stop-word set
and a list containing all of the lyric words for those songs
Params: songs_arr -> list of strings, artist_arr -> list of strings
Returns: a tuple (stop_words set, list of all lyric words for every song found)
'''
def decades_song(songs_arr, artist_arr):
    lyr = []
    stop_words = set()  # so we still return something if no song URL is found
    for song, artist in zip(songs_arr, artist_arr):
        url = request_song_info(song, artist)
        if url != "":  # note: 'is not ""' compares identity, not equality
            stop_words, lyrics = get_cleaned_lyrics(url)
            lyr += lyrics
    return stop_words, lyr
decades_group = decades_data.groupby("decade")  # grouping the dataframe by decade
decade_freq = {}  # storing all frequency distributions in a dictionary
for name, group in decades_group:
    stop_words, lyr = decades_song(group["track.name"], group["name"])
    decade_freq[name] = nltk.FreqDist(lyr)  # storing the frequency distribution for this decade
    display_wordcloud(generate_frequency_wordcloud(decade_freq[name]), name)
    display_barchart(dict(decade_freq[name].most_common(30)), name)
It seems like love will forever be an important feature of songs, no matter the decade, though one interesting thing to note is that the frequency of "love" does seem to be going down over time (from around 115 in the 50s, up to 200 in the 60s, and down to around 170 in the 2000s), though it shoots right back up again in the 2010s. It's interesting to see some of the shifts in the commonality of words over time, but it doesn't seem to reveal anything conclusive about the stylistic differences between decades.
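Since raw counts depend on how many lyrics each decade's playlist yields, a fairer comparison is a word's relative frequency. Here is a small sketch of that idea using `collections.Counter` as a stand-in for `nltk.FreqDist` (which supports the same dictionary-style lookups); the per-decade word lists are made up for illustration, not taken from our data.

```python
from collections import Counter

# made-up per-decade word lists standing in for the real cleaned lyrics
decade_words = {
    "1960s": ["love", "love", "baby", "love", "hand"],
    "2000s": ["love", "club", "baby", "money", "yeah"],
}

for decade, words in decade_words.items():
    freq = Counter(words)                    # nltk.FreqDist offers the same counts
    rel = freq["love"] / sum(freq.values())  # share of all tokens that are "love"
    print(decade, round(rel, 2))
```

With the toy lists above, "love" makes up 60% of the 1960s tokens but only 20% of the 2000s tokens, even though the raw counts (3 vs. 1) are closer together.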
We wanted to see whether we could create an accurate classifier based on the data we collected. We went into it with low expectations, and were unfortunately correct. As we'll explore below, the two classifiers we attempted had low accuracy, and by analyzing the data more carefully, it's easy to see why an accurate classifier is hard to build from the information we had.
I decided to remove features like "name" (which referred to the name of the artist) and "track.name" (which referred to the name of the song) because they have nothing to do with the decade of the song itself. I was also worried the model might overfit on the artist and always guess that a song was from the 60s merely because it was by the Beatles, not because of the qualities of the song itself. Similarly, I removed the "added_at" column because it refers to when the track was added to the playlist, which has no bearing on when the song was made. I also removed "track.id" because it was a duplicate of the song-ID column, which was unnecessary to begin with.
# Let's cut down the data to the features I think are relevant
ml_data = decades_data
ml_data = ml_data.drop(columns = ["name", "added_at", "track.name", "track.id"])
cols = ml_data.columns.tolist()
cols = cols[6:7] + cols[:6] + cols[8:] #moving id to be the first column
ml_data = ml_data[cols]
Figuring out which model to create was difficult because there are so many options, but luckily, the Internet had my back, and I found this awesome article: https://blog.statsbot.co/machine-learning-algorithms-183cc73197c which helped me narrow down my search. I knew I wanted to do supervised learning of some sort because I had labelled data, but quickly realized that what I really wanted were algorithms optimized for multi-class classification (because I have multiple labels, one for each decade).
I decided to try out both Random Forests and K-Nearest Neighbors because both seemed to fit well within the bounds of my problem. We felt that songs were similar within decades, so K-Nearest Neighbors fit well into that. Similarly, because we felt there would be distinguishing characteristics between decades, we thought the logic of a decision tree would work well.
To train the Random Forest model, I first one-hot encoded my labels, because they were categorical. Then, I split my data into training and testing sets. I scaled my features so that they would be standardized, and then fit a Random Forest model to them. Lastly, I cross-validated it with k = 10 and found the overall accuracy of the scores.
# One Hot Encoding
y_encoded = pd.get_dummies(ml_data["decade"])
features = ml_data.iloc[:, 1:-1]
features
#Splitting data
X_train, X_test, y_train, y_test = train_test_split(features, y_encoded, test_size = 0.25, random_state = 21)
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Fitting Random Forest Classification to the Training set
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
#Cross Validating
scores = cross_val_score(classifier, X_train, y_train, cv=10, scoring='accuracy')
# Predicting the Test Set Results
y_pred = classifier.predict(X_test)
print("Random Forest Classifier Score is " +str(classifier.score(X_test, y_test)))
Because the accuracy of the random forest model was so low, I wanted to try and figure out why that was the case. I created a residual plot, correlation matrix, and confusion matrix to try and understand the model's low accuracy. Upon looking at the residuals, I noticed how frequently the model predicted incorrectly. The confusion matrix really illuminates that, revealing that the model seemed to frequently guess the 1950s. I also wanted to see if perhaps the data itself was muddied, and looking at the correlations, it seemed to be the case. There was no one predictor of the decade a song was created in, and in fact, many of the correlations were low (below 0.3). This leads me to believe that our dataset is not strong or varied enough for a classifier to work, but I decided to create one last model just to confirm my hypothesis.
'''
Creates a residual plot
Params: y_pred -> list of floats, y_test -> list of floats, title -> string
Returns: Nothing, but shows the residual plot
'''
def residuals_plot(y_pred, y_test, title):
    fig, ax = plt.subplots(figsize=(15, 6))
    ax.set_title(title)
    ax.set(xlabel="Predicted Decade", ylabel="Actual Decade")
    sns.scatterplot(ax=ax, x=y_pred.argmax(axis=1), y=y_test.values.argmax(axis=1))
    plt.show()
'''
Creates a graph of a confusion matrix.
Params: cm -> NP Array, classes -> list of strings, title -> string
Returns: Nothing, but shows the confusion matrix
'''
def create_confusion_matrix(cm, classes, title):
    plt.figure()
    plt.title(title)
    sns.set(font_scale=1.4)  # for label size
    heatmap = sns.heatmap(cm, annot=True, annot_kws={"size": 16})  # annotation font size
    fontsize = 16
    heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=fontsize)
    heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=fontsize)
    heatmap.set(xticklabels=classes, yticklabels=classes)
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
'''
Creates a graph of a correlation matrix.
Params: cm -> DataFrame of correlations, title -> string
Returns: Nothing, but shows the correlation matrix
'''
def create_correlation_matrix(cm, title):
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.set_title(title)
    sns.heatmap(cm, ax=ax, xticklabels=cm.columns.values, yticklabels=cm.columns.values)
    plt.show()
cm = confusion_matrix(y_test.values.argmax(axis=1), y_pred.argmax(axis=1))
create_confusion_matrix(cm, classes, "Confusion Matrix for Random Forest")
corr_data = pd.concat([ml_data, y_encoded], axis = 1)
corr = corr_data.corr()
create_correlation_matrix(corr, "Correlation Matrix for Decades Data")
residuals_plot(y_pred, y_test, "Residuals for Random Forest")
Following this tutorial: https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/, I pretty much did the same thing for the KNN as I did for the Random Forest. I trained the model, checked the score, and decided to create a graph to see at which value of k the model performed best. I did this by trying odd values of k between 1 and 50 and graphing how often the model misclassified a point. This was largely based on the tutorial, but I found it to be a helpful way of determining the best value of k, which I had arbitrarily set to 7 initially.
# training a KNN classifier
knn = KNeighborsClassifier(n_neighbors = 7).fit(X_train, y_train)
accuracy = knn.score(X_test, y_test)
print ("Accuracy of KNN is: " +str(accuracy))
# creating a confusion matrix
knn_predictions = knn.predict(X_test)
create_confusion_matrix(confusion_matrix(y_test.values.argmax(axis=1), knn_predictions.argmax(axis=1)), classes, "Confusion Matrix of KNN")
# odd values of k to try for KNN
neighbors = list(filter(lambda x: x % 2 != 0, range(1, 50)))
# empty list that will hold cv scores
cv_scores = []
# perform 10-fold cross validation for each k
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# changing to misclassification error
MSE = [1 - x for x in cv_scores]
# determining the best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("The optimal number of neighbors is %d" % optimal_k)
# plot misclassification error vs k
plt.figure()
plt.title("Best Value of K for KNN")
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
In both cases, I found the accuracy to be too low, leading me to believe that there is likely something wrong within the data. Our dataset is pretty small, with only around 100 songs maximum per decade. It's also quite possible there is no real trend between adjacent decades of music, and that to see real differences, we would need to look at chunks of 20, 30, or even 40 years. Regardless, this was our personal exploration into data science through the thing we all know and love, and we had an interesting time doing it too!
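Coarsening the labels into wider chunks is easy to prototype: the sketch below re-bins decade labels into multi-decade "eras" before training. The label strings here are assumptions for illustration; the real decade labels live in the `decade` column of `decades_data`.

```python
# hypothetical decade labels mapped into wider "era" buckets
era_of = {
    "50s": "50s-60s", "60s": "50s-60s",
    "70s": "70s-80s", "80s": "70s-80s",
    "90s": "90s-10s", "00s": "90s-10s", "10s": "90s-10s",
}

labels = ["50s", "80s", "10s", "60s"]  # stand-in for ml_data["decade"]
eras = [era_of[d] for d in labels]     # with pandas: ml_data["decade"].map(era_of)
print(eras)  # -> ['50s-60s', '70s-80s', '90s-10s', '50s-60s']
```

Fewer, broader classes mean more songs per class and (plausibly) sharper stylistic differences between them, at the cost of a less precise prediction.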
With Spotify and Genius’s Web APIs, we were able to explore the changes in popular music over the decades from the 50s through the 2010s. While we don’t know what events of each decade contributed specifically to the transformation of the music landscape, we do know that popular tracks across the decades each have their own set of audio features unique to that decade. We were able to observe the performance profiles of popular tracks of each decade according to the metrics of energy, valence, danceability, instrumentalness, and speechiness, and the change in shape of these performance profiles over time. Furthermore, we were able to investigate the trends of isolated audio features across the decades and the distribution of tracks for a particular audio feature. We explored the relationship between certain audio features that we thought could be related to each other, like the inverse relationship between instrumentalness and speechiness. To see if certain decades focused on different topics and themes, we used the Genius API to explore the frequency of certain words in song lyrics.
We then attempted to create a decades classifier based on the data we collected and the trends of the metrics we observed. We predicted that our classifier would have low accuracy, and we were correct! A piece of advice for anyone else attempting to replicate a classifier for music over the decades would be to obtain a larger and more varied dataset with stronger predictors, in order to reduce the number of false positives and negatives that contributed to our low accuracy.
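One quick sanity check for anyone replicating this: compare the classifier's accuracy against a majority-class baseline. With seven roughly balanced decades, always guessing the biggest class scores near 1/7 ≈ 14%, so even a "low" accuracy may still beat chance. A minimal sketch on made-up labels (the label values and predictions below are invented for the example):

```python
from collections import Counter

# made-up true/predicted decade labels
y_true = ["60s", "60s", "80s", "90s", "60s", "80s"]
y_pred = ["60s", "80s", "80s", "60s", "60s", "80s"]

# majority-class baseline: always predict the most common true label
majority, count = Counter(y_true).most_common(1)[0]
baseline = count / len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(majority, baseline, accuracy)
```

Here the toy classifier scores 4/6 ≈ 0.67 against a 0.5 baseline, so it is genuinely better than guessing; the same comparison contextualizes our own low scores.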
All in all, we found this investigation into the data analysis of music through the decades very enlightening and enjoyable! Not only were we able to explore the trends of audio features for popular tracks over time, we were able to practice data analysis techniques like Random Forest classification versus K-Nearest Neighbors and apply them to real-life data. We highly encourage anyone else to work with Spotify and Genius’ APIs to see if further observations and findings can be made!