Tanetpong (Ned) Choungprayoon, Max Pachali, and Hannes Datta
The COVID-19 pandemic has had impacts across various sectors, including the music industry. Despite the challenges posed by the pandemic, including restrictions on live performances, the user base of music streaming platforms has continued to expand. This study investigates the differential effects of the pandemic on the followers of platform playlists curated by various stakeholders, such as Spotify itself, major record labels, and independent labels. Utilizing a unique dataset comprising weekly follower counts from 29 October 2019 to 4 October 2020, we analyze the changes in followers for 39,918 playlists both before and after the pandemic's declaration. Through fixed-effects modeling, we assess the pandemic's impact while controlling for genre and visibility on the platform's search page. Our findings indicate that Spotify-curated playlists exhibited the most resilience during the pandemic, with users showing a preference for playlists featuring more popular tracks over those with a higher share of major-label content. The analysis also reveals a non-linear effect of playlist popularity: while highly popular playlists experienced minimal disruption, the impact intensified for less popular, long-tail playlists. These insights not only reflect the pandemic's heterogeneous effects on the music streaming landscape but also highlight the strategic advantage of popular content curation during times of crisis.
The purpose of this repository is to publish the empirical research project conducted as part of my doctoral dissertation and to make the analysis part (in R) accessible, potentially inspiring further study. The slides related to this study are also available in this repository. The empirical data is (1) obtained from Chartmetric.com, a third-party company collecting the data via the Spotify Web API and storing it via AWS, and (2) collected from Everynoise.com/worldbrowser.cgi. This project made use of the Dutch national e-infrastructure with the support of the SURF Cooperative, using grant no. EINF-2074. Additional funding for a research visit to Tilburg University for project collaboration was provided by Torsten Söderbergs Stiftelse and Hans Werthénfonden.
slide_deck
contains the most recent slides
data
contains df_playlists.csv, simulated data similar to our dataset regarding playlist information, and df_weekly.csv (i.e., scanner data), simulated data similar to our dataset regarding weekly follower information. These two datasets are used in the walkthrough; Simulate_Data.R illustrates how they are simulated.
analysis
contains Python code (playlist_categorization.py) used for categorizing playlist genres by applying a machine learning technique (association rule mining) and R code (analysis.R) used for visualizing and analyzing the effects of the pandemic on playlists' followers
gen
contains results from playlist classification (df_playlists_cluster.csv)
img
contains equations and visualizations used in the walkthrough
For this project, we mainly use data.table for data simulation and aggregation, ggplot2 for visualization, plm for estimating fixed effects (fixest is also a good alternative), and car for calculating differential effects across curators and playlist percentiles. For the playlist classification in Python (adapted from this repository), we need the pandas, numpy, mlxtend, and os libraries/modules.
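If needed, all four R packages can be installed from CRAN (a minimal sketch; package versions are not pinned in this repository):

```r
# install the R packages used in this walkthrough
install.packages(c("data.table", "ggplot2", "plm", "car"))
```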
Load required packages (for R)
library(data.table)
library(ggplot2)
library(plm)
library(car)
Setting: We simulated playlists' information data (i.e., name, genres, percentile in the platform) and their weekly attributes (i.e., followers, percentage of tracks produced by major labels (ML), the number of tracks shared with other playlists). We assume the pandemic declaration occurs in week 30.
Import data
playlist_df <- fread("../data/df_playlists.csv")
playlist_weekly_df <- fread("../data/df_weekly.csv")
playlist_cluster <- fread("../gen/df_playlists_cluster.csv") #use wide-format table for merging with assigned category
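A quick look at the imported tables helps verify their structure (illustrative; the assignment of columns to tables follows the variable descriptions below):

```r
# inspect the playlist-level and the playlist-week-level tables
head(playlist_df)        # one row per playlist: id, genre, curator, percentile
head(playlist_weekly_df) # one row per playlist-week: id, week, followers, MLShare, ntracksotherpl, feature
```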
Variable | Description |
---|---|
id | Playlist id |
genre | Genre keywords (one playlist can have several genre keywords) |
curator | Curator of the playlist (Spotify, Major Label, Indie, Professional, User) |
percentile | Playlist's position in the platform follower distribution (Top 1%, Top 10%, Top 30%, and Tail) |
week | Week number |
MLShare | The percentage of tracks in the playlist produced by major labels (ML) |
ntracksotherpl | On average, how many other playlists on the platform also feature the tracks on this playlist |
feature | = 1 if the playlist is featured in a section on the platform's Search Page, 0 otherwise |
We adopted the playlist categorization method of Pachali & Datta (2022) (i.e., association rule mining) to assign our sample playlists to one or more relevant genres.
Using association rule mining on 100 playlists with a minimum confidence of 0.9 and a minimum support of 0.2, we get 140 association rules in total for 12 clusters (plus playlists related to mood) to assign to all playlists.
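For intuition, the metrics below are the standard ones for a rule such as "dance pop → pop" (the tag names here are illustrative), computed over the set of playlists considered:

$$
\text{support} = \frac{\#\{\text{playlists tagged with both}\}}{\#\{\text{playlists}\}}, \qquad
\text{confidence} = \frac{\text{support}(\text{dance pop} \rightarrow \text{pop})}{\text{support}(\text{dance pop})}, \qquad
\text{lift} = \frac{\text{confidence}}{\text{support}(\text{pop})}.
$$

A rule is kept if its lift exceeds 1, its confidence is at least 0.9, and its support is at least 0.2.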
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import os
# ---------------------------- #
# OBJECTIVE #
# ---------------------------- #
# Input -> Playlists with one or several genre keywords
# Output -> Association rules using tag genre to map relevant genre for all playlists
# ---------------------------- #
# DATA PREPARATION #
# ---------------------------- #
# mood keywords and relevant genres
moods = {'mood', 'fuzzy', 'feel', 'rage', 'anger', 'angry', 'annoying', 'aggresive', 'interest', 'interesting', 'optimism', 'optimistic', 'ecstasy', 'joy', 'serenity', 'love', 'trust', 'acceptance', 'accepting', 'submission', 'terror', 'fear', 'awe', 'amaze', 'amazing', 'surprise', 'surprising', 'distraction', 'distracting', 'grief', 'sadness', 'sad', 'pensiveness', 'pensive', 'remorse', 'loathing', 'disgust', 'boredom', 'boring', 'bored', 'chill', 'active', 'cheerful', 'reflective', 'gloomy', 'humorous', 'humor', 'melancholy', 'romantic', 'mysterious', 'ominous', 'calm', 'lighthearted', 'hope', 'hopeful', 'fearful', 'tense', 'lonely', 'alone', 'happy', 'good', 'bad', 'suave', 'vibe', 'breakup', 'depressed', 'depression', 'emo', 'bored', 'heart broken'}
genres = {'blues','classical','country','edm','dance','funk','hiphop','indie','jazz','pop','rnb','rock'}
# read data
print("import data")
playlists = pd.read_csv('../data/df_playlists.csv')
playlists_idtags = playlists[['id', 'genre']].drop_duplicates() # in case there are duplicated playlists
# prepare dataset, we need the data that each row has separate genre
def separate_genre_labels(playlist_df):
    '''turn comma-separated genre strings into one row per (id, genre) pair'''
    playlist_genre = []
    for index, row in playlist_df.iterrows():
        targeted_genre = row["genre"]
        # skip playlists without genre keywords (missing values come in as floats, i.e., NaN)
        if not isinstance(targeted_genre, str):
            continue
        if len(targeted_genre) > 0:
            for genre in targeted_genre.split(","):
                playlist_genre.append({
                    "id": row["id"],
                    "genre": genre.lower().strip()})
    return pd.DataFrame(playlist_genre)
playlists_df = separate_genre_labels(playlists_idtags)
# remove leading and trailing spaces and hyphens
playlists_df['genre'] = playlists_df['genre'].apply(lambda x: x.strip(" ")).apply(lambda x: x.replace("-", " "))
# create a random sample of playlists for learning
np.random.seed(1) # for reproducibility (every time the same sample and thus identical association rules)
df_ids = playlists_df['id'].unique()
df_sample_ids = np.random.choice(df_ids, 100, replace=False) # sample 100 distinct playlists for rule learning
df_sample = playlists_df[playlists_df.id.isin(df_sample_ids)]
# turn data into pivot table as input for association rule mining (rows: playlists, columns: genre)
basket_sample = (df_sample.groupby(['id', 'genre'])['id']
.count().unstack().reset_index().fillna(0)
.set_index('id', drop=True))
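# e.g., basket_sample now has one row per playlist and one 0/1 column per genre tag
# (illustrative tags and values):
# genre  dance pop  pop  rock
# id
# pl_1           1    1     0
# pl_2           0    1     1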
# --------------------------------------- #
# FIND MOODS-RELATED PLAYLIST #
# ------------------------------------- #
def moods_playlists(moods):
    '''identify mood playlists from playlist names (i.e., check if a mood keyword appears at the start of the playlist name)'''
    output = pd.DataFrame()
    for tag in moods:
        matches = playlists[playlists.name.str.match(tag + "$|" + tag + "[ ]") == True].copy()
        matches['cluster'] = 'mood' # every match is assigned to the 'mood' cluster
        output = pd.concat([output, matches]).reset_index(drop=True)
    return output[['id', 'cluster']]
# determine playlist ids related to moods
moods_matches = moods_playlists(moods)
print("identified " + str(len(moods_matches)) + " moods playlists")
# ---------------------------- #
# ASSOCIATION RULE MINING #
# ---------------------------- #
def tag_word_analysis(basket, genre, min_confidence, min_support):
    '''
    1. select all playlists that contain a given main genre (e.g., pop)
    2. derive association rules of length 2, of which the lift exceeds 1, the confidence is greater than 90%, and the support exceeds a given level
    3. consider association rules of which the consequent is the main genre (e.g., dance pop -> pop)
    '''
    tag_cluster = pd.DataFrame()
    basket_genre = basket[basket[genre] == 1] # all playlists that contain the main genre
    frequent_itemsets = apriori(basket_genre, min_support=min_support, use_colnames=True, max_len=2)
    rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1).sort_values(['confidence', 'support'], ascending=False).reset_index(drop=True)
    rules = rules.loc[rules['confidence'] >= min_confidence, ['antecedents', 'consequents']].reset_index(drop=True) # reset the index so the positional .loc below is safe
# restructure frozen set
for counter in range(len(rules)):
antecedent, = rules.loc[counter, 'antecedents']
consequent, = rules.loc[counter, 'consequents']
if consequent == genre and antecedent not in genres:
tag_cluster_temp = pd.DataFrame([[antecedent, genre], [genre, genre]], columns=['genre', 'cluster'])
tag_cluster = pd.concat([tag_cluster_temp, tag_cluster])
tag_cluster = tag_cluster.drop_duplicates()
return tag_cluster
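# e.g., a kept rule frozenset({'dance pop'}) -> frozenset({'pop'}) is restructured into the
# mapping rows ('dance pop' -> cluster 'pop') and ('pop' -> cluster 'pop') (illustrative tags)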
# determine association rules within each main genre (note that one tagword may be associated with multiple clusters (e.g., soft rock -> rock & soft rock -> pop))
min_support=0.2 # changing the support level has a major impact on the outcomes (higher support = fewer association rules = fewer clusters)
min_confidence = .90
print("starting association rule mining with minimum confidence of " + str(min_confidence) + " and minimum support of " + str(min_support))
tag_clusters = pd.concat([tag_word_analysis(basket_sample, genre, min_support=min_support, min_confidence=min_confidence) for genre in genres])
print("With association rule mining with minimum confidence of " + str(min_confidence) + " and minimum support of " + str(min_support) + ", We get " + str(len(tag_clusters)) + " total association rules " + "for " + str(len(tag_clusters.cluster.unique())) + " clusters")
print("The final cluster includes " + str(tag_clusters.cluster.unique()))
# ------------------------------- #
# ASSIGN PLAYLISTS TO CLUSTERS #
# ------------------------------- #
# add clusters to all playlists df
df_cluster = pd.merge(playlists_df,tag_clusters, left_on='genre', right_on='genre')[['id', 'genre', 'cluster']]
# calculate number of playlists without any cluster (because none of the tagwords are related to any of the association rules; choosing a lower minimum support level will remedy this problem to a certain extent)
all_playlists = len(playlists_idtags.id.unique())
num_playlists_output = len(df_cluster.id.unique())
print("{0:.1f}% of all playlists is not assigned to any cluster".format((all_playlists - num_playlists_output) / all_playlists * 100))
# reshape data frame (rows: playlists, columns: clusters + mood)
playlist_cluster = pd.concat([moods_matches, df_cluster[['id', 'cluster']]]).drop_duplicates()
playlist_cluster['values'] = 1
df_pivot = playlist_cluster.pivot(index='id', columns='cluster', values='values').fillna(0)
# This returns a wide-format df (long-to-wide) with columns: id, cluster_1, cluster_2, ..., cluster_n, meaning one playlist can belong to several clusters
# ------------------------------- #
# CLUSTER STATISTICS #
# ------------------------------- #
# The number of followers is stored as 'object'; for now we are only interested in the number of playlists per cluster and its share of the total
def calculate_cluster_indices(df_pivot):
    return {column: df_pivot.index[df_pivot[column] == 1].tolist() for column in df_pivot.columns}
def cluster_stats(df_pivot, playlists):
# collect playlist ids for each label
cluster_indices = calculate_cluster_indices(df_pivot)
    # calculate the number of playlists and its share of all playlists for each cluster
cluster_stats = pd.DataFrame()
for keys, values in cluster_indices.items():
cluster_temp = pd.DataFrame({'label':keys,
'num_playlists':len(values)},
index=[0])
cluster_stats = pd.concat([cluster_temp, cluster_stats])
cluster_stats['perc_playlists'] = cluster_stats['num_playlists'] / len(df_pivot) * 100
print(cluster_stats.sort_values('perc_playlists', ascending=False).reset_index(drop=True))
return cluster_stats
clusters_stats = cluster_stats(df_pivot, playlists)
# ------------------------------- #
# EXPORT RESULTS #
# ------------------------------- #
# we go with a minimum lift of 1 and a minimum confidence of 90%
# make the output directory if it does not already exist
path = '../gen'
os.makedirs(path, exist_ok=True)
# export file
df_pivot.to_csv(path + "/df_playlists_cluster.csv")
%reset -f
Of our 500 simulated playlists, the categorization is as follows:
label | num_playlists classified | (%) |
---|---|---|
Rock | 476 | 95.2 |
R&B | 480 | 96 |
Pop | 483 | 96.6 |
Mood | 393 | 78.6 |
Jazz | 479 | 95.8 |
Indie | 476 | 95.2 |
Hiphop | 477 | 95.4 |
Funk | 477 | 95.4 |
EDM | 479 | 95.8 |
Dance | 477 | 95.4 |
Country | 478 | 95.6 |
Classical | 485 | 97 |
Blues | 473 | 94.6 |
Merge relevant datasets and calculate relevant variables
# Create covid step variable
playlist_weekly_df[, covid_step:= ifelse(week >= 30, 1, 0)]
# Impose category
# Merge playlist information with the assigned clusters on id
playlistcluster_df <- merge(playlist_cluster, playlist_df, by = "id")
df <- playlist_weekly_df[playlistcluster_df, on = c("id"), nomatch = 0L] # nomatch = 0L returns only matching rows (i.e., an inner join)
df[, id := as.factor(id)]
df[, week := as.factor(week)]
df[, covid_step := as.factor(covid_step)]
df[, curator := as.factor(curator)]
df[, curator := relevel(curator, ref = "Spotify")]
df[, percentile := as.factor(percentile)]
df[, percentile := factor(percentile, levels = c("top1%","top10%","top30%","tail"))] # order the levels so the graphs display them sensibly
rm(list = c("playlist_df","playlist_weekly_df", "playlist_cluster", "playlistcluster_df"))
# Grand-mean center the genre indicators to avoid confusing interpretation related to genre
genres <- c('mood','blues','classical','country','edm','dance','funk','hiphop','indie','jazz','pop','rnb','rock')
for (i in genres) df[, paste0(i,'_mc'):=get(i)-mean(get(i))]
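# With this centering, the covid_step main effect in the model below is interpreted at the
# sample-average genre composition (all *_mc terms are zero at the mean) rather than at all-zero genre indicators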
#Create relevant variables (first difference)
df[, log_followers := log(followers+1)]
setorder(df, id, week)
lag_1 <- function(x, k = 1) head(c(rep(NA, k), x), length(x)) # shift a vector down by k positions, padding with NA
df[, dlog_followers := log_followers - lag_1(log_followers), by = c('id')]
df[, dmlshare := MLShare - lag_1(MLShare), by = c('id')]
df[, dlog_trackshare := log(ntracksotherpl) - lag_1(log(ntracksotherpl)), by = c('id')] # first difference of log(ntracksotherpl)
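Equivalently, data.table's built-in shift() can replace the custom lag_1() helper (a minimal sketch on the same df):

```r
# same first difference of log followers using data.table's shift()
df[, dlog_followers_alt := log_followers - shift(log_followers, 1L), by = id]
```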
Plot the average change in followers over time by curator, MLShare, and percentile
#MLShare
df[, MLSharegroup := cut(MLShare, breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
                         labels = c("Up to 20%", "20% - 40%", "40% - 60%", "60% - 80%", "80% - 100%"),
                         include.lowest = TRUE)]
df_summary_byMLshare <- df[, .(followers_mean = mean(dlog_followers, na.rm = TRUE),
followers_se = sd(dlog_followers, na.rm = TRUE) / sqrt(.N)),
by = .(week, MLSharegroup)]
df_summary_byMLshare[, week := as.numeric(as.character(week))]
ggplot(df_summary_byMLshare, aes(x = week, y = followers_mean, color = MLSharegroup)) +
geom_line() + # Line plot for the means
geom_point() +
geom_errorbar(aes(ymin = followers_mean - followers_se, ymax = followers_mean + followers_se), width = 0.2) +
geom_vline(xintercept = 30, linetype = "dashed", color = "red", size = 1.2) +
labs(
x = "Week",
y = "Follower Growth (Dlogfollowers)",
color = "% of MLShare in Track",
title = "Weekly Change of Followers by Week and Size of MLShare"
) +
theme_minimal()
#Curator
df_summary_bycurator <- df[, .(followers_mean = mean(dlog_followers, na.rm = TRUE),
followers_se = sd(dlog_followers, na.rm = TRUE) / sqrt(.N)),
by = .(week, curator)]
df_summary_bycurator[, week := as.numeric(as.character(week))]
ggplot(df_summary_bycurator, aes(x = week, y = followers_mean, color = curator)) +
geom_line() + # Line plot for the means
geom_point() +
geom_errorbar(aes(ymin = followers_mean - followers_se, ymax = followers_mean + followers_se), width = 0.2) +
geom_vline(xintercept = 30, linetype = "dashed", color = "red", size = 1.2) +
labs(
x = "Week",
y = "Follower Growth (Dlogfollowers)",
color = "Curator",
title = "Weekly Change of Followers by Week and Curator"
) +
theme_minimal()
#Percentile
df_summary_bypercentile <- df[, .(followers_mean = mean(dlog_followers, na.rm = TRUE),
followers_se = sd(dlog_followers, na.rm = TRUE) / sqrt(.N)),
by = .(week, percentile)]
df_summary_bypercentile[, week := as.numeric(as.character(week))]
ggplot(df_summary_bypercentile, aes(x = week, y = followers_mean, color = percentile)) +
geom_line() + # Line plot for the means
geom_point() +
geom_errorbar(aes(ymin = followers_mean - followers_se, ymax = followers_mean + followers_se), width = 0.2) +
geom_vline(xintercept = 30, linetype = "dashed", color = "red", size = 1.2) +
labs(
x = "Week",
y = "Follower Growth (Dlogfollowers)",
color = "percentile",
title = "Weekly Change of Followers by Week and Playlist Percentile"
) +
theme_minimal()
- Heterogeneous changes in followers across playlists curated by different curators, of different sizes (or percentiles), and composed of different shares of tracks produced by major labels.
We employ a fixed-effects model of follower growth as follows:
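A sketch of the specification implied by the plm call below, with playlist fixed effects $\alpha_i$ (the within transformation) and week dummies $\gamma_t$; $\text{Covid}_t$ is the post-declaration step dummy, and the time-invariant main effects (curator, percentile, centered genres) are absorbed by $\alpha_i$:

$$
\Delta \log(\text{followers})_{it} = \alpha_i + \gamma_t + \beta_0 \text{Covid}_t + \text{Covid}_t \times \left( \boldsymbol{\beta}_c' \text{Curator}_i + \boldsymbol{\beta}_p' \text{Percentile}_i + \boldsymbol{\beta}_g' \text{Genre}^{mc}_i + \beta_m \text{MLShare}_{it} + \beta_n \log(\text{ntracksotherpl}_{it}) \right) + \beta_f \text{feature}_{it} + \varepsilon_{it}
$$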
Estimating fixed effects
m <- plm(dlog_followers ~ covid_step + covid_step*curator + covid_step*percentile
+ covid_step*mood_mc + covid_step*blues_mc + covid_step*classical_mc + covid_step*country_mc
+ covid_step*edm_mc + covid_step*dance_mc + covid_step*funk_mc + covid_step*hiphop_mc
+ covid_step*indie_mc + covid_step*jazz_mc + covid_step*pop_mc + covid_step*rnb_mc + covid_step*rock_mc
+ covid_step*MLShare + covid_step*log(ntracksotherpl) + feature + week, data = df, index = c("id","week"), model = "within")
m_rob <- m
m_rob$vcov <- vcovHC(m, type = "HC1") # store the HC1 robust vcov in the model object so vcov(m_rob), used by deltaMethod() below, returns it
summary(m_rob)
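As noted above, fixest is a good alternative to plm. A minimal sketch of the same specification (assuming the same df as above; fixest automatically drops the main effects absorbed by the id and week fixed effects and reports which):

```r
# equivalent specification with fixest; vcov = "hetero" requests heteroskedasticity-robust SEs
library(fixest)
m_feols <- feols(dlog_followers ~ covid_step * (curator + percentile + mood_mc + blues_mc +
                   classical_mc + country_mc + edm_mc + dance_mc + funk_mc + hiphop_mc +
                   indie_mc + jazz_mc + pop_mc + rnb_mc + rock_mc + MLShare +
                   log(ntracksotherpl)) + feature | id + week,
                 data = df, vcov = "hetero")
summary(m_feols)
```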
Visualizing Results
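Each effect below is a linear combination of coefficients, e.g. $\hat{\theta}_{\text{Indie}} = \hat{\beta}_{\text{covid}} + \hat{\beta}_{\text{covid} \times \text{Indie}}$. deltaMethod() from car computes its standard error from the robust covariance matrix $\hat{V}$ stored above:

$$
\widehat{\mathrm{SE}}(\hat{\theta}) = \sqrt{\mathbf{a}^{\top} \hat{V} \mathbf{a}}, \qquad \mathbf{a} = \frac{\partial \theta}{\partial \boldsymbol{\beta}},
$$

which, for a simple sum of two coefficients, reduces to $\sqrt{\hat{V}_{11} + \hat{V}_{22} + 2\hat{V}_{12}}$.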
# Function to perform delta method calculations
calculate_effects <- function(calcs_list, model) {
calculations <- rbindlist(lapply(names(calcs_list), function(name) {
res <- data.frame(deltaMethod(model, calcs_list[[name]]))
colnames(res) <- c('est','se','q025','q975')
res$type <- name
res$est <- res$est * 100 # Convert to percentage
res$se <- res$se * 100 # Convert to percentage
return(res)
}))
return(calculations)
}
# Calculate different effects of curator
calcs_curator = list(Spotify='covid_step1',
Indie="covid_step1+`covid_step1:curatorIndie`",
MajorLabel="covid_step1+`covid_step1:curatorMajorLabel`",
Professional="covid_step1+`covid_step1:curatorProfessional`",
User="covid_step1+`covid_step1:curatorUser`"
)
m_rob_curatoreffect <- calculate_effects(calcs_curator, m_rob)
# Plot the effects of curator
ggplot(m_rob_curatoreffect, aes(x=type, y=est, fill = type)) +
geom_bar(stat="identity", alpha=0.7) +
geom_errorbar(aes(ymin=est-se, ymax=est+se), width=0.4, colour="darkgrey", alpha=0.9, size=0.9) +
scale_fill_brewer(palette = "Set1") +
theme_minimal() +
labs(title ="The estimated weekly effect of pandemic declaration (Using Delta Rule) across Playlist Curator",x="Curator", y="Estimated Effects of Covid-19 on Followers' Growth (%)")
# Calculate different effects of percentile
calcs_percentile = list(Top1='covid_step1',
Top10="covid_step1+`covid_step1:percentiletop10%`",
Top30="covid_step1+`covid_step1:percentiletop30%`",
Tail="covid_step1+`covid_step1:percentiletail`"
)
m_rob_percentileffect <- calculate_effects(calcs_percentile, m_rob)
# Plot the effects of percentile (with a horizontal bar chart this time)
ggplot(m_rob_percentileffect, aes(x = type, y = est, fill = type)) +
geom_bar(stat = "identity", width = 0.6, alpha = 0.7) +
geom_errorbar(aes(ymin = est - se, ymax = est + se), width = 0.2, colour = "orange", size = 1) +
coord_flip() + # Flip coordinates to make labels readable
scale_fill_brewer(palette = "Pastel1") + e
theme_minimal() +
theme(
legend.position = "none",
axis.title.y = element_blank(),
axis.text.y = element_text(size = 12),
plot.title = element_text(hjust = 0.5)
) +
labs(
x = "Percentile Group",
y = "Estimated Effects of Covid-19 on Followers' Growth (%)",
title = "The estimated weekly effect of pandemic declaration (Using Delta Rule) across Playlist Percentile"
)
- From our simulated data, we could not find significant evidence of an effect of curator or playlist popularity (i.e., percentile in the follower distribution) on follower growth.
Selected Coefficients | Dependent variable: dlog_followers |
---|---|
MLShare | -0.003 (0.017) |
Covid*MLShare | 0.009 (0.010) |
log(ntracksotherpl) | -0.001 (0.002) |
Covid*log(ntracksotherpl) | 0.001 (0.001) |
Note: *p<0.1; **p<0.05; ***p<0.01 |
- From our simulated data, we could not find significant evidence of an effect of content providers (MLShare) or track popularity (ntracksotherpl) on follower growth.