Predicting Song Popularity on Digital Streaming Platforms, High school final essays of Machine Learning

This report uses machine learning techniques to analyze and predict song popularity on digital streaming platforms such as Spotify, Apple Music, and Deezer. The dataset includes information on user listening preferences, song attributes, and platform-specific metrics such as playlist inclusion and chart performance. The analysis focuses on identifying the key factors that influence song popularity, including musical features, playlist inclusion, and cross-platform comparisons. The study employs several machine learning models (Random Forest, Neural Networks, and Logistic Regression) to predict song popularity and provide insights for the music industry. The findings aim to support strategic decision-making for artists, labels, and platform providers, helping them understand and anticipate music trends and consumer preferences.

Typology: High school final essays

2022/2023

Uploaded on 06/04/2024

melisa-oktay

T.C. ISTANBUL BILGI UNIVERSITY
MIS 303 TERM PROJECT
REPORT
Most Streamed Spotify Songs 2023
Zeynep Bora - 12052027
Melisa Oktay - 12152016
Ozan Tank - 12152014
Oğuzhan Sönmeztürk - 12152013
Yusuf Alev - 12152025
2023


Table of Contents

    1. INTRODUCTION
    • 1.1. Business Environment
    • 1.2. Dataset Description
    • 1.3. Objective(s)
    2. DESCRIPTIVES
    3. DATA PREPROCESSING
    4. MACHINE LEARNING IN ACTION
    • 4.1. Selected Algorithms
    • 4.2. Performance Measurement
    • 4.3. Experiments
    • 4.4. Results
    5. INSIGHTS
    6. REFERENCES

Shows the number of artists who contributed to the song.

• released_year, released_month, released_day: The year, month, and day the song was released.
• in_spotify_playlists: The number of Spotify playlists that include the song.
• in_spotify_charts: The song's presence and rank on the Spotify charts.
• streams: The total number of streams of the song on Spotify.
• in_apple_playlists: The number of Apple Music playlists that include the song.
• in_apple_charts: The song's presence and rank on the Apple Music charts.
• in_deezer_playlists: The number of Deezer playlists that include the song.
• in_deezer_charts: The song's presence and rank on the Deezer charts.
• in_shazam_charts: The song's presence and rank on the Shazam charts.
• bpm: The tempo of the song in beats per minute.
• key: The musical key of the song.
• mode: The mode of the song (major or minor).
• danceability_%: The danceability of the song, as a percentage.
• valence_%: The positivity of the song's musical content.
• energy_%: The perceived energy level of the song.
• acousticness_%: The amount of acoustic sound in the song.
• instrumentalness_%: The percentage of instrumental content in the song.
• liveness_%: The presence of live-performance elements in the song.
• speechiness_%: The percentage of spoken-word content in the song.

1.3. Objectives

The project's primary objectives are to examine the songs and artists that Spotify users listened to the most in 2023 and to identify industry trends. The project's outline is shown below:

Type of Analysis: Using statistical analysis and exploratory data analysis (EDA), this research seeks to identify patterns and relationships within the dataset. It also uses machine learning techniques to investigate the factors that influence a song's popularity.

Principal Analyses: Using the Spotify dataset, we will assess how musical characteristics such as tempo, danceability, and energy affect popularity. We will also look at how a song's position on charts and playlists affects its popularity across platforms.

Artist and Song Popularity: To identify the most popular musicians and songs, we will examine influential artists and popular songs in the music business.

Platform Comparisons: Using performance data from Shazam, Apple Music, Spotify, and Deezer, we will attempt to understand cross-platform competition in the music industry.

Machine Learning Model: Using the dataset's attributes, we plan to build a machine learning model that forecasts song popularity. The model helps determine which characteristics are indicative of song popularity.

Expectations: By identifying the musical traits and variables that influence song success, we hope to gain insight into industry patterns and anticipate future hit songs. We also want to assess how the popularity of songs and artists is affected by their performance on the various platforms. By identifying key elements in the music industry, this study and interpretation of the dataset aims to forecast future trends.

2. DESCRIPTIVES

  • This descriptive analysis explores and visualizes the minimum, mean, and maximum stream counts associated with each musical key. The line plot facilitates the

4. MACHINE LEARNING IN ACTION

4.1. Selected Algorithms

Supervised algorithms:

- Random Forest Classifier: The Random Forest Classifier is a supervised learning algorithm that operates as an ensemble of decision trees, which lets it capture complex relationships and interactions among diverse song features. In the project code, it helps identify the factors influencing song popularity and provides a robust predictive model for classification tasks.
- Neural Networks: Neural networks, a cornerstone of deep learning, can model intricate patterns and non-linear relationships in the music dataset through layers of interconnected nodes. In the project code, they are used to predict song popularity from several groups of features.
- Logistic Regression: Logistic Regression is a versatile and interpretable supervised learning algorithm. Despite its name, it is a classification method and is well suited to binary tasks. Here it provides a probabilistic framework for estimating the likelihood of a song attaining 'top100' status, offering insight into the predictive influence of individual features.

4.2. Performance Measurement

Logistic Regression Model

Accuracy Score of Logistic Regression Model: 89.17%
Precision of Logistic Regression Model: 57.14%
Recall of Logistic Regression Model: 22.22%

- High accuracy indicates good overall performance. However, because only 100 songs carry the top100 label, the classes are imbalanced, so accuracy should be read together with precision and recall.
- A precision of 57.14% means that roughly four in ten of the model's positive predictions are incorrect.
- A recall of 22.22% means that the model detected fewer than a quarter of the actual positives.

Random Forest Model

Accuracy Score of Random Forest Model: 94.27%
Precision of Random Forest Model: 84.62%
Recall of Random Forest Model: 61.11%

- High accuracy indicates that the overall success rate of the model is high.
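How the three metrics relate can be shown with a short worked example. The sketch below computes accuracy, precision, and recall from confusion-matrix counts. The counts are an assumption: they are one set of integers that reproduces the percentages reported above (a 157-song test set containing 18 positive songs), not figures stated in the report.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, chosen only because they match the reported percentages.
logreg_counts = dict(tp=4, fp=3, fn=14, tn=136)
forest_counts = dict(tp=11, fp=2, fn=7, tn=137)
# A baseline that never predicts the positive class, on the same assumed test set.
baseline_counts = dict(tp=0, fp=0, fn=18, tn=139)

for name, counts in [
    ("Logistic Regression", logreg_counts),
    ("Random Forest", forest_counts),
    ("Always-negative baseline", baseline_counts),
]:
    acc, prec, rec = classification_metrics(**counts)
    print(f"{name}: accuracy={acc:.2%} precision={prec:.2%} recall={rec:.2%}")
```

Under these assumed counts, the always-negative baseline already reaches roughly 88.5% accuracy, which suggests why precision and recall are the more informative figures when comparing the two classifiers.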

Model 1 - Playlists Only (Apple, Spotify, Deezer)
Mean Squared Error (MSE): 2.
Mean Absolute Error (MAE): 1.

Model 2 - Song Properties Only (BPM and Others)
Mean Squared Error (MSE): 1.
Mean Absolute Error (MAE): 1.

Model 3 - Both Playlists & Song Properties
Mean Squared Error (MSE): 0.54
Mean Absolute Error (MAE): 0.55

Interpretation:

MSE and MAE Values: Lower values of MSE and MAE indicate better model performance. Model 3, which uses both playlists and song properties as features, has the lowest MSE (0.54) and MAE (0.55), so it performs best among the three models.

Model 3 Significance: Including both playlists and song properties improves the predictive performance of the model. This suggests that the two groups of features together predict the number of streams better than either group alone.

Convergence Warning: All three models issued a convergence warning, indicating that the optimization did not converge within the maximum number of iterations (200). Increasing the number of iterations or adjusting other hyperparameters could improve convergence.

4.3. Experiments

- New features were created by processing the categorical data in the key and mode columns with the pd.get_dummies function, which turns each category value into a separate column holding 0 or 1.
- A new target variable named top100 was then created by sorting the rows by the streams column in descending order: the first 100 rows were assigned 1 and the remaining rows 0.
- As a result, the 100 most-streamed songs could be analyzed alongside the one-hot-encoded columns derived from the key and mode columns.

4.4. Results

Songs with more streams on Spotify are generally in the key of C#.

Stream counts correlate positively with the number of playlists a song appears in, whether on Spotify, Apple Music, or Deezer.

A meticulous preprocessing pipeline was implemented to ensure the dataset's readiness for machine learning tasks:

• Handling Missing Values: Robust techniques were applied to address missing values, ensuring the integrity of the dataset and preventing biases in subsequent analyses.
• Categorical Variable Encoding: Categorical variables were appropriately encoded, enabling machine learning models to interpret and leverage these features effectively.
• Normalization of Numerical Features: Numerical features underwent normalization to bring them to a comparable scale, avoiding dominance by features with larger magnitudes.
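The preprocessing steps can be sketched with pandas as follows. This is a minimal illustration on a toy table, not the project's actual code: the report does not say how missing values were filled or which normalization was used, so the "Unknown" placeholder for missing keys and the min-max scaling are assumptions, as is the helper name `preprocess`. The one-hot encoding with `pd.get_dummies` and the `top100` target follow the description in the Experiments section.

```python
import pandas as pd


def preprocess(df, top_n=100):
    """Toy preprocessing: fill missing keys, build top100, encode, normalize."""
    out = df.copy()
    # Missing values: mark missing keys explicitly (assumed strategy).
    out["key"] = out["key"].fillna("Unknown")
    # Target: 1 for the top_n most-streamed songs, 0 for the rest.
    top_index = out["streams"].nlargest(top_n).index
    out["top100"] = out.index.isin(top_index).astype(int)
    # One-hot encode the categorical key and mode columns.
    out = pd.get_dummies(out, columns=["key", "mode"], dtype=int)
    # Min-max normalize numeric features to the range [0, 1] (assumed method).
    for col in ["bpm", "danceability_%"]:
        low, high = out[col].min(), out[col].max()
        out[col] = (out[col] - low) / (high - low) if high > low else 0.0
    return out


# Made-up rows that reuse the dataset's column names.
songs = pd.DataFrame({
    "streams": [900, 150, 700, 300],
    "key": ["C#", None, "G", "C#"],
    "mode": ["Major", "Minor", "Major", "Minor"],
    "bpm": [120, 90, 150, 100],
    "danceability_%": [80, 55, 70, 60],
})

prepared = preprocess(songs, top_n=2)  # top_n=2 only because the toy table is tiny
print(prepared)
```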

Random Forest Classifier:

The Random Forest classifier was applied to predict playlist inclusion:

• Feature Importance: The model's feature importance analysis highlighted crucial factors influencing a song's inclusion in Spotify playlists. This information is invaluable for artists and producers aiming to optimize their track for playlist consideration.
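A feature importance analysis of this kind can be read directly from a fitted forest. The sketch below uses scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data is synthetic and the binary target is hypothetical (it is constructed from the playlist count purely so that the ranking has a known answer), so the output says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_songs = 400

# Synthetic features that reuse column names from the dataset.
X = pd.DataFrame({
    "in_spotify_playlists": rng.integers(0, 10_000, n_songs),
    "bpm": rng.integers(60, 200, n_songs),
    "danceability_%": rng.integers(20, 100, n_songs),
})
# Hypothetical binary target, driven only by the playlist count.
y = (X["in_spotify_playlists"] > 7_000).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances describe what the fitted model relied on; they are not evidence of a causal effect on popularity.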

Neural Network (MLPRegressor):

Neural networks, specifically MLPRegressor, were employed for regression tasks:

• Logarithmic Prediction: The logarithmic transformation of streaming values facilitated regression modeling, and the model's performance was assessed through Mean Squared Error (MSE) and Mean Absolute Error (MAE).
• Playlist vs. Song Properties: The model was trained on both playlist-related features and song properties, demonstrating improved predictive capabilities compared to models using only one set of features.
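The three-way comparison can be sketched as follows, assuming scikit-learn's `MLPRegressor` on a log-transformed target. The data is synthetic, and the hidden layer size, the feature scaling step, and the raised `max_iter` are assumptions for illustration (the default of 200 iterations is the limit named in the convergence warning). The printed errors are not the report's results.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_songs = 600

# Synthetic data that reuses column names from the dataset.
data = pd.DataFrame({
    "in_spotify_playlists": rng.integers(1, 20_000, n_songs),
    "in_apple_playlists": rng.integers(1, 500, n_songs),
    "in_deezer_playlists": rng.integers(1, 5_000, n_songs),
    "bpm": rng.integers(60, 200, n_songs),
    "danceability_%": rng.integers(20, 100, n_songs),
    "energy_%": rng.integers(10, 100, n_songs),
})
# Made-up stream counts that depend on playlists and, weakly, on energy.
data["streams"] = np.exp(
    10
    + 0.8 * np.log(data["in_spotify_playlists"])
    + 0.01 * data["energy_%"]
    + rng.normal(0, 0.3, n_songs)
)
log_streams = np.log(data["streams"])  # logarithmic transformation of the target

playlist_cols = ["in_spotify_playlists", "in_apple_playlists", "in_deezer_playlists"]
property_cols = ["bpm", "danceability_%", "energy_%"]
feature_sets = {
    "playlists_only": playlist_cols,
    "properties_only": property_cols,
    "both": playlist_cols + property_cols,
}


def evaluate(columns):
    """Train an MLP on the given columns and return (MSE, MAE) on held-out data."""
    X_train, X_test, y_train, y_test = train_test_split(
        data[columns], log_streams, test_size=0.2, random_state=0
    )
    model = make_pipeline(
        StandardScaler(),  # scaling usually helps the optimizer converge
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return mean_squared_error(y_test, predictions), mean_absolute_error(y_test, predictions)


results = {name: evaluate(cols) for name, cols in feature_sets.items()}
for name, (mse, mae) in results.items():
    print(f"{name}: MSE={mse:.3f} MAE={mae:.3f}")
```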

Logistic Regression:

Logistic Regression was used to predict binary outcomes related to a song's chart success:

• Likelihood Prediction: The model provided insights into the likelihood of a song entering the top 100 charts based on its features. This information could guide artists and labels in crafting songs with chart-topping potential.
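A likelihood prediction of this kind corresponds to the class probabilities of a fitted logistic regression. The sketch below uses scikit-learn's `LogisticRegression.predict_proba`. The training data and the rule that generates the binary label are synthetic assumptions, and the two candidate songs are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_songs = 500

# Synthetic features that reuse column names from the dataset.
X = pd.DataFrame({
    "in_spotify_playlists": rng.integers(0, 20_000, n_songs),
    "danceability_%": rng.integers(20, 100, n_songs),
})
# Hypothetical label: songs in many playlists are marked as top100.
y = (X["in_spotify_playlists"] > 16_000).astype(int)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Two hypothetical songs that differ only in playlist count.
candidates = pd.DataFrame({
    "in_spotify_playlists": [1_000, 19_000],
    "danceability_%": [70, 70],
})
# Column 1 of predict_proba holds the estimated probability of the positive class.
probabilities = model.predict_proba(candidates)[:, 1]
print(probabilities)
```

Working with probabilities rather than hard 0/1 predictions also makes it possible to lower the decision threshold when recall matters more than precision.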

5. INSIGHTS

Accurate popularity prediction is one of the main goals of the Random Forest model, which also supports the broader aims of feature analysis, playlist review, and label strategy. Because it outperformed Logistic Regression on accuracy, precision, and recall, it can be relied upon to help achieve the project's objectives.

This study took a holistic approach to predicting song popularity, using several machine learning models to reveal insights relevant to decision-making in the music industry. The main goals were predicting popularity, determining feature importance, examining playlist and chart presence, analyzing key and mode, assessing model performance, improving user experience, and supporting record labels' strategic decision-making.

The study used three machine learning models: Random Forest, Neural Network, and Logistic Regression. The two classifiers were evaluated with accuracy, precision, and recall, and the Neural Network regressor with mean squared error and mean absolute error. The Logistic Regression model served as a baseline for prediction, while the Random Forest model, with its built-in feature importance scores, supplied insight into the factors affecting song popularity. The Neural Network forecast stream counts from playlist data, song attributes, or both, and its errors were smallest when both were used.

Among the classifiers, the Random Forest model performed best (94.27% accuracy, 84.62% precision, 61.11% recall), which is why it is the model of choice for this application.