









This study uses machine learning techniques to analyze and predict song popularity on digital streaming platforms such as Spotify, Apple Music, and Deezer. The dataset includes information on user listening preferences, song attributes, and platform-specific metrics such as playlist inclusion and chart performance. The analysis focuses on identifying the key factors that influence song popularity, including musical features, playlist inclusion, and cross-platform comparisons. The study employs several machine learning models, including Random Forest, Neural Networks, and Logistic Regression, to predict song popularity and provide insights for the music industry. The findings aim to support strategic decision-making for artists, labels, and platform providers, helping them better understand and anticipate music trends and consumer preferences.
Typology: High school final essays
- Shows the number of artists who contributed to the song.
- released_year, released_month, released_day: The year, month, and day the song was released.
- in_spotify_playlists: The number of Spotify playlists that include the song.
- in_spotify_charts: The song's presence and rank on the Spotify charts.
- streams: The total number of streams of the song on Spotify.
- in_apple_playlists: The number of Apple Music playlists that include the song.
- in_apple_charts: The song's presence and rank on the Apple Music charts.
- in_deezer_playlists: The number of Deezer playlists that include the song.
- in_deezer_charts: The song's presence and rank on the Deezer charts.
- in_shazam_charts: The song's presence and rank on the Shazam charts.
- bpm: The tempo of the song in beats per minute.
- key: The musical key of the song.
- mode: The mode of the song (major or minor).
- danceability_%: The danceability of the song, as a percentage.
- valence_%: The positivity of the song's musical content.
- energy_%: The perceived energy level of the song.
- acousticness_%: The amount of acoustic sound in the song.
- instrumentalness_%: The percentage of instrumental content in the song.
- liveness_%: The presence of live performance elements in the song.
- speechiness_%: The percentage of spoken-word content in the song.

1.3. Objectives

The project's primary objectives are to examine the songs and artists that Spotify users listened to the most in 2023 and to identify industry trends. The project's outline is shown below:

Type of Analysis: Using statistical analysis and exploratory data analysis (EDA), this research seeks to identify patterns and relationships within the dataset. It also applies machine learning techniques to investigate the factors that influence a song's popularity.
Principal Analyses: Using the Spotify dataset, we will assess how different musical characteristics (such as tempo, danceability, and energy) affect popularity. We will also look at how a song's position on charts and playlists affects its popularity across different platforms.

Artist and Song Popularity: To identify the most popular artists and songs, we will examine the influential artists and the most popular songs in the music business.

Platform Comparisons: Using performance data from several platforms, including Shazam, Apple Music, Spotify, and Deezer, we will try to understand cross-platform competition in the music industry.

Machine Learning Model: Using the dataset's attributes, we plan to build a machine learning model that predicts song popularity. This model can help us determine which characteristics are indicative of song popularity.

Expectations: By identifying the musical traits and variables that influence song success, we hope to gain insight into industry patterns and anticipate future hit songs. We also want to assess how performance on different platforms affects the popularity of songs and artists. By identifying key factors in the music industry, a thorough analysis and interpretation of the dataset aims to forecast future trends.
4.1. Selected Algorithms

Supervised algorithms:
- Random Forest Classifier: The Random Forest Classifier is a supervised learning algorithm that operates as an ensemble of decision trees. It is well suited to capturing complex relationships and interactions among diverse song features. In the accompanying code, it helps identify the factors that influence song popularity and provides a robust predictive model for classification tasks.
- Neural Networks: Neural Networks, a cornerstone of deep learning, are used for their ability to model intricate patterns and non-linear relationships in the music dataset. Built from layers of interconnected nodes loosely inspired by the human brain, they can learn complex representations. In the accompanying code, they are used to predict song popularity from several groups of features.
- Logistic Regression: Logistic Regression is a versatile and interpretable supervised learning algorithm. Despite its name, it is a classification method that is well suited to binary tasks. Here it provides a probabilistic framework for estimating the likelihood that a song attains 'top100' status, and it offers insight into the predictive influence of individual features.

4.2. Performance Measurement

Logistic Regression Model
Accuracy Score of Logistic Regression Model: 89.17%
Precision of Logistic Regression Model: 57.14%
Recall of Logistic Regression Model: 22.22%
- High accuracy indicates good overall model performance, although accuracy alone can be misleading when the positive class (top100 songs) is a small minority.
- Low precision indicates that a substantial share (about 43%) of the model's positive predictions are incorrect.
- Low recall indicates that the model detected only about 22% of the actual positives.

Random Forest Model
Accuracy Score of Random Forest Model: 94.27%
Precision of Random Forest Model: 84.62%
Recall of Random Forest Model: 61.11%
- High accuracy indicates that the overall success rate of the model is high.
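To make the relationship between the three metrics concrete, the sketch below computes accuracy, precision, and recall from confusion-matrix counts. The counts are not reported in the source: they are hypothetical values, chosen because a test set of 157 songs with 18 positives happens to reproduce the percentages reported above. The function name `classification_metrics` is also illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of actual positives that the model found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts (an assumption, not taken from the source).
lr_metrics = classification_metrics(tp=4, fp=3, fn=14, tn=136)
rf_metrics = classification_metrics(tp=11, fp=2, fn=7, tn=137)

# Baseline under the same assumption: always predicting "not top100".
baseline_metrics = classification_metrics(tp=0, fp=0, fn=18, tn=139)

for name, metrics in [("Logistic Regression", lr_metrics),
                      ("Random Forest", rf_metrics),
                      ("Always negative", baseline_metrics)]:
    acc, prec, rec = (round(100 * m, 2) for m in metrics)
    print(f"{name}: accuracy={acc}% precision={prec}% recall={rec}%")
```

Under these assumed counts, a model that never predicts 'top100' would still reach an accuracy close to that of Logistic Regression, which is why precision and recall are the more informative figures for comparing the two classifiers.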
Model 1 - Playlists Only (Apple, Spotify, Deezer)
Mean Squared Error (MSE): 2.
Mean Absolute Error (MAE): 1.

Model 2 - Song Properties Only (BPM and Others)
Mean Squared Error (MSE): 1.
Mean Absolute Error (MAE): 1.

Model 3 - Both Playlists & Song Properties
Mean Squared Error (MSE): 0.
Mean Absolute Error (MAE): 0.

Interpretation:

MSE and MAE Values: Lower values of MSE and MAE indicate better model performance. Model 3, which uses both playlists and song properties as features, has the lowest MSE (0.54) and MAE (0.55), suggesting that it performs best among the three models.

Model 3 Significance: Including both playlists and song properties appears to improve the model's predictive performance. This suggests that the combination of playlist inclusion and song properties predicts the number of streams more effectively than either group of features alone.

Convergence Warning: All three models issued a convergence warning, indicating that the optimization process did not converge within the maximum number of iterations (200). Increasing the number of iterations or adjusting other hyperparameters could potentially improve convergence.

4.3. Experiments
- New features were created by processing the categorical data in the key and mode columns with the pd.get_dummies function. This turned each category value into a separate column holding a value of 0 or 1.
- Then, a new target variable named top100 was created by sorting the rows from largest to smallest according to the streams column. The first 100 rows of this column are filled with 1, and the remaining rows are filled with 0.
- As a result, the 100 most-streamed songs were analyzed using the one-hot-encoded columns derived from the key and mode columns.

4.4. Results

Songs that have more streams on Spotify are generally in the key of C#.
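The preprocessing described in Section 4.3, and the key comparison it enables, can be sketched as follows. This is a minimal illustration and not the project's actual code: the five-row DataFrame is made up, and the cutoff is 3 instead of 100 so that the result is visible on a tiny sample. The helper name `add_top_n_flag` is hypothetical; only `pd.get_dummies` and the column names key, mode, streams, and top100 come from the text.

```python
import pandas as pd

# Made-up sample with the same column names as the dataset.
df = pd.DataFrame({
    "streams": [500, 100, 900, 300, 700],
    "key": ["C#", "G", "C#", "A", "G"],
    "mode": ["Major", "Minor", "Major", "Minor", "Major"],
})

# One-hot encode key and mode: each category becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["key", "mode"], dtype=int)


def add_top_n_flag(frame, n, column="streams", flag="top100"):
    """Sort by `column` descending and mark the first n rows with 1."""
    ranked = frame.sort_values(column, ascending=False).reset_index(drop=True)
    ranked[flag] = (ranked.index < n).astype(int)
    return ranked


# n=3 stands in for the report's cutoff of 100.
ranked = add_top_n_flag(encoded, n=3)

# Count how often each key appears among the flagged songs.
key_columns = [c for c in ranked.columns if c.startswith("key_")]
key_counts = ranked.loc[ranked["top100"] == 1, key_columns].sum()
print(key_counts.sort_values(ascending=False))
```

On the real data, the same count over the top 100 rows is one way to check which key dominates the most-streamed songs.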
Stream counts are positively correlated with playlist inclusion: the most-streamed songs appear in more playlists, whether on Spotify, Apple Music, or Deezer.
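A correlation of this kind can be checked with a few lines of pandas. The sketch below uses made-up numbers and assumes the Pearson coefficient, since the source does not say which correlation measure was used; only the column names are taken from the dataset description.

```python
import pandas as pd

# Made-up values for illustration; column names follow the dataset.
df = pd.DataFrame({
    "streams": [100, 200, 300, 400, 500],
    "in_spotify_playlists": [10, 25, 30, 45, 60],
    "in_apple_playlists": [1, 3, 2, 5, 6],
    "in_deezer_playlists": [0, 2, 1, 3, 4],
})

# Pearson correlation of each playlist column with streams.
playlist_corr = df.corr(method="pearson")["streams"].drop("streams")
print(playlist_corr)
```

A coefficient close to 1 for a platform would indicate that songs on more of that platform's playlists also tend to have more streams; it does not by itself show that playlist inclusion causes the streams.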
Accurate popularity prediction is one of the main goals of the Random Forest model, which also supports the broader purposes of feature analysis, playlist review, and label strategy. Because of its better performance on the reported metrics, this model can be relied upon to help achieve the project's objectives.

This study took a holistic approach to predicting song popularity, using several machine learning models to reveal insights that are relevant to decision-making in the music industry. The main goals were predicting popularity, determining feature importance, examining playlist and chart presence, analyzing key and mode, assessing model performance, improving user experience, and supporting record labels' strategic decision-making.

The study used three machine learning models (Random Forest, Neural Network, and Logistic Regression) to predict popularity. Regression metrics were used to evaluate these models, with an emphasis on mean squared error and mean absolute error. The Logistic Regression model served as a baseline for prediction, while the Random Forest model, with its built-in ability to reveal feature importance, supplied insights into the factors that affect song popularity. The Neural Network was the more advanced model; it used playlist data, song attributes, or both to forecast popularity.

Regression metrics showed that the Random Forest model performed exceptionally well. Its mean absolute error and mean squared error were significantly smaller, particularly when both playlists and song attributes were taken into account. This suggests that the Random Forest model predicts song popularity well, which is why it is the model of choice for this application.