Classification of Explicit Songs Based on Lyrics Using Random Forest Algorithm

This study focuses on the potential negative impact of explicit songs on children and adolescents. Although an explicit song labeling program is currently in place, its coverage is limited to songs released by artists affiliated with the Recording Industry Association of America (RIAA). Consequently, songs falling outside the program's scope remain inadequately labeled. To address this issue, a machine learning model was developed to classify explicit songs and reduce mislabeling. A dataset of song lyrics was collected using web scraping techniques to construct the classification model. The model was trained using the TF-IDF vectorization method and the random forest algorithm. Several training-testing data distributions were compared to determine the optimal model. The best model used a training-testing data distribution ratio of 90:10 and achieved an accuracy of 96.3%, precision of 99.3%, recall of 93.5%, and an F1-score of 96.3%. The classification results revealed that explicit songs accounted for 39.22% of the dataset, and the visualization showed that the prevalence of explicit songs fluctuates over time. Additionally, the hip-hop/rap genre exhibited the highest proportion of explicit songs, at 92%.


INTRODUCTION
The current digital era has made digital content, including music, easily accessible to everyone. However, children and adolescents often have subjective perceptions and tend to consume information without proper filtering. Therefore, it is crucial to protect them from explicit songs that contain offensive language or depictions of violence, sex, or substance abuse, which are inappropriate for their age. One study [1] highlights a negative impact of such songs, finding that listening to songs with sexual lyrics can influence adolescent boys' later sexting behavior.
Several studies have utilized machine learning for song classification. Research [2] compared a dictionary-based method with a deep neural network for detecting explicit content in English songs and found that the complex model only slightly outperformed the simpler approach. Another study [3], which used naive Bayes to classify Indonesian songs based on age, reported low accuracy with this algorithm. Conversely, in research [4], an exploration of various vectorization methods and algorithms revealed that TF-IDF combined with the random forest algorithm achieved the highest accuracy. The effectiveness of the random forest algorithm was also demonstrated in a study [5] that classified explicit songs based on toxicity, where it outperformed logistic regression and XGBClassifier. Additionally, a study [6] focused on detecting explicit content in Korean songs using Adaboost and bagging techniques, and it determined that bagging with selective POS and IDF tags produced the most favorable results.
This study introduces a novel approach to classifying explicit songs that departs from previous methodologies, which used complete song lyrics. Instead, the study uses specific excerpts that best capture the explicitness of a song. This approach aims to address classification errors observed in prior research. By homing in on relevant lyrical segments, the study aims to improve classification accuracy and reduce data processing complexity. The research also explores four scenarios for training-testing data distribution to identify the optimal model. The classification results are then visualized to gain insight into the annual growth of explicit songs and to identify the music genres with the highest percentage of explicit content. By employing the random forest algorithm and song excerpts for classification, this research contributes to the development of more effective explicit song classification methods and provides information on growth trends and the musical genres associated with explicit songs.
This article contains explicit language and themes. Some readers may find these elements offensive, discriminatory, or vulgar. We encourage readers to approach the article objectively and respond thoughtfully to the language and explicit themes presented.

METHODS
The English song lyrics dataset used in this study was obtained through web scraping. Before classification, the data undergo several steps: labeling, a split lyric process that extracts a representative portion reflecting the song's explicitness, and preprocessing. Classification uses the random forest algorithm together with the TF-IDF vectorization method. Figure 1 illustrates the methods employed in this study.
Data were collected through web scraping using the Python programming language and the Beautiful Soup library. Beautiful Soup is a Python package for extracting structured data from web pages and offers a user-friendly interface [7], which makes it well suited to scraping song lyrics. The collected data comprise artist name, song title, song lyrics, year of release, and genre. Artist name, song title, and lyrics were scraped from the lirik.kapanlagi.com website, while year of release and genre were scraped from the Google search site.
Figure 2 shows a page from the lirik.kapanlagi.com site used as a data source for English song lyrics. The data collected from it comprise the artist name (highlighted by a green box), song title (highlighted by a red box), and song lyrics (highlighted by a blue box). Figure 3 shows a page from the Google search site used to gather the release year of a song. The web scraping process inserts the song title and artist name into the URL (highlighted by a red box) and then extracts the release year from the Google snippet feature (highlighted by a green box).
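The URL-construction step can be sketched as follows. This is a minimal illustration only: the function name build_search_url, the query wording, and the example song are assumptions, since the exact query format used in the study is not stated.

```python
from urllib.parse import urlencode

def build_search_url(title, artist):
    """Build a search URL whose query combines a song title and an artist name."""
    # Hypothetical query wording; the study does not state the exact phrasing.
    query = f"{title} {artist} release year"
    # urlencode escapes spaces and special characters so the URL stays valid.
    return "https://www.google.com/search?" + urlencode({"q": query})

# Illustrative song, not taken from the study's dataset.
url = build_search_url("Hello", "Adele")
print(url)
```

Encoding the query with urlencode rather than plain string concatenation keeps titles that contain characters such as "&" or "?" from corrupting the URL.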

Data Labeling
The process of data labeling involves assigning explicit or non-explicit labels to songs. This labeling process is carried out manually by referencing the available information on the Spotify music streaming platform. In particular, songs featuring an "E" icon on Spotify are designated as explicit. Figure 5 depicts the specific "E" icon referenced in this process.

Split Lyric
The term "split lyric" pertains to the automated process of segmenting lyrics into individual sentences. This split lyric process is executed utilizing the Python programming language. Following the division of lyrics, a meticulous selection procedure ensues to identify sentences that most accurately capture the explicit or non-explicit nature of the song. This selection process is conducted manually, considering explicit song criteria encompassing offensive language, depictions of violence, sexual behavior, substance abuse, offensive and discriminatory terms, as well as instances of physical or mental abuse.
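The automated splitting step might look like the sketch below. The splitting rule (sentence-ending punctuation or a line break) and the function name split_lyric are assumptions; the study only states that lyrics are segmented into sentences with Python. The sample text is taken from the non-explicit example used in this paper.

```python
import re

def split_lyric(lyric):
    """Split a lyric into sentences on ., !, ? followed by whitespace, or on line breaks."""
    parts = re.split(r"(?<=[.!?])\s+|\n+", lyric)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]

sentences = split_lyric(
    "It's been a long time coming. Not a single word comes to mind. I'll think of you."
)
for s in sentences:
    print(s)
```

The manual selection of representative sentences described above would then operate on the returned list.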

Preprocessing
Preprocessing is an important step carried out before data are classified. Its goal is to transform unstructured data into a more structured and uniform form. Preprocessing also aims to reduce data dimensions so that the data are easier to process. Preprocessing methods can improve dataset quality and text classification results [8]. In this study, preprocessing was carried out in the following stages.
a. Lowercasing converts all letters in the lyrics to lowercase.
b. Cleaning removes characters such as punctuation marks and numbers from the lyrics.
c. Stopwords removal eliminates very common words in a language that do not contribute much to the meaning of the text.
d. Lemmatization converts words into their base form based on the corpus.
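The four stages can be chained as in the sketch below. The STOPWORDS set and LEMMAS table are tiny hypothetical stand-ins chosen for this example; an actual pipeline would use full stopword lists and a corpus-based lemmatizer.

```python
import re

# Tiny illustrative stand-ins (assumptions), not the resources used in the study.
STOPWORDS = {"its", "been", "a", "of", "to", "the", "ill", "you", "not"}
LEMMAS = {"comes": "come", "words": "word"}

def preprocess(text):
    text = text.lower()                                       # a. lowercasing
    text = re.sub(r"[^a-z\s]", "", text)                      # b. cleaning: keep letters and whitespace only
    tokens = [t for t in text.split() if t not in STOPWORDS]  # c. stopwords removal
    tokens = [LEMMAS.get(t, t) for t in tokens]               # d. lemmatization via lookup (stand-in)
    return " ".join(tokens)

print(preprocess("Not a single word comes to mind."))  # single word come mind
```

Note that cleaning runs before stopwords removal, so contractions such as "it's" become "its" and must appear in the stopword list in that form.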

TF-IDF
TF-IDF is a method that determines the significance of a term by calculating its weight [9]. The TF-IDF calculation in this study aims to select the most informative terms. TF (Term Frequency) measures how often a term occurs in a document, and IDF (Inverse Document Frequency) is a method for determining the importance of a word in a collection of documents.

TF-IDF = TF x IDF
TF-IDF assigns a weight to a term in a document: a high value when the term appears many times in a small number of documents, a lower value when the term appears less frequently in a document or occurs in many documents, and the lowest value when the term appears in almost all documents.
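A small worked example makes this weighting behavior concrete. The sketch below assumes raw term counts for TF and the unsmoothed form log(N / df) for IDF; the study does not specify which variant was used, and libraries such as scikit-learn apply a smoothed IDF with normalization by default. The toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus (assumption): "time" occurs in every document, "long" in only one.
docs = [["long", "time", "long"], ["time", "mind"], ["time", "word"]]
w = tf_idf(docs)
print(w[0])
```

Here "long" receives the highest weight because it occurs twice in a single document, while "time" receives a weight of zero because it occurs in all three documents.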

Random Forest
In this study, the random forest algorithm is employed for explicit song classification. The random forest algorithm encompasses three crucial components. Firstly, it utilizes bootstrap sampling to construct multiple decision trees. Secondly, each decision tree independently generates predictions. Lastly, the random forest algorithm combines the outputs from each decision tree, employing a majority voting scheme to determine the final classification result [10].
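The three components can be illustrated with the toy sketch below. It is not the model used in the study: each "tree" is a depth-one stump that splits on one randomly chosen binary feature, standing in for a full decision tree, and the data, function names, and parameter values are assumptions made for this example.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Component 1: draw len(X) rows with replacement."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(X, y, rng):
    """Depth-one stand-in for a tree: split on one random binary feature."""
    feature = rng.randrange(len(X[0]))
    default = majority(y)
    leaf = {}
    for value in (0, 1):
        side = [label for row, label in zip(X, y) if row[feature] == value]
        # Fall back to the overall majority if no sampled row has this value.
        leaf[value] = majority(side) if side else default
    return feature, leaf

def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    return [train_stump(*bootstrap_sample(X, y, rng), rng) for _ in range(n_trees)]

def predict(forest, row):
    # Component 2: every stump predicts independently.
    votes = [leaf[row[feature]] for feature, leaf in forest]
    # Component 3: the final class is decided by majority vote.
    return majority(votes)

# Toy data (assumption): two binary word-presence features, label 1 = explicit.
X = [[1, 1], [1, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0]
forest = train_forest(X, y)
print(predict(forest, [1, 0]))
```

An odd number of trees avoids tied votes in the two-class setting. A full random forest would grow deeper trees and choose the best split among a random subset of features rather than a single random feature.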
The random forest algorithm is known for its simplicity and strong performance, particularly on high-dimensional datasets [4]. To keep processing fast and the model efficient, the number of trees must be chosen carefully, because the complexity of the prediction step grows as the number of decision trees increases [11]. Despite this added complexity, combining multiple decision trees gives random forests the potential to achieve significantly higher accuracy than a single classifier [12]. This capability makes the random forest algorithm a suitable choice for classifying explicit songs in this study. Figure 6 illustrates the workflow of the random forest algorithm.
A significant portion of the dataset lacks information on the year of release and genre because the related data were unavailable on the web pages from which the scraping was conducted. Figure 9 shows the distribution of songs across eleven popular music genres. There is a substantial disparity between the genres with the highest and lowest song counts: the rock genre has the most, with 12,803 songs, while the classic genre has the fewest, with 150 songs. Some songs may belong to multiple genres, as they share musical characteristics with several genres. An example of the web scraping results is presented in Table 1.

Data Labeling Result
The labeling process is conducted manually by referencing the label information available on the Spotify music streaming platform. This information is indicated by the presence of an "E" logo associated with a song when searched on Spotify. However, considering the potential for certain songs to contain explicit content yet lack the explicit label, songs without the "E" logo undergo manual verification. During this manual verification, the presence of offensive language, depictions of violence, sexual behavior, substance abuse, offensive and discriminatory words, as well as physical or mental abuse within the lyrics, is meticulously examined. To provide a glimpse into the results of the data labeling process, an example is presented in Table 2 below.

Lyric | Label
"See nigga we thugged out for a reason. Niggaz ain't thuggin, because, they like the look nigga. Or they like to be on these…" | Explicit
"It's been a long time coming. This breakdown in our sights. And I just keep on running. Back for the same fight. I'll look…" | Non-Explicit

The process of labeling song data yielded a total of 1,107 labeled songs, which were classified into explicit and non-explicit categories. Figure 10 provides a visual representation of the distribution, showcasing the respective counts for each category.

Split Lyric
The split lyric process involves dividing the lyrics into individual sentences, followed by the selection of representative excerpts that correspond to the song's explicit or non-explicit label. This selection is performed manually, taking into consideration the presence of offensive language, depictions of violence, sexual behavior, substance abuse, offensive and discriminatory words, as well as physical or mental abuse within the lyrics. The split lyric stage resulted in a collection of 3,000 lyrics, evenly distributed between explicit and non-explicit classes, with 1,500 data points in each category. Table 3 provides an example of the split lyric results.

Lyric | Label
See nigga ain't thugged out of a reason. | Explicit
Niggaz ain't thuggin, because, they like the look nigga. | Explicit
If it ain't one of these bitches the these niggaz won't test. | Explicit
It's been a long time coming. | Non-Explicit
Not a single word comes to mind. | Non-Explicit
I'll think of you. | Non-Explicit

Preprocessing
The preprocessing steps applied to the lyrics are lowercasing, cleaning, stopwords removal, and lemmatization. Lowercasing converts all letters to lowercase. Cleaning removes punctuation, numbers, and non-ASCII characters from the string. Stopwords removal eliminates highly frequent words in the language that lack substantial meaning on their own. The stopwords used combine lists from several sources, including the NLTK library and sklearn. Table 4 provides an example of the preprocessing results.

Step | Result
Original | If it ain't one of these bitches the these niggaz won't test.
Lowercasing | if it ain't one of these bitches the these niggaz won't test.
Cleaning | if it aint one of these bitches the these niggaz wont test
Stopwords removal | bitches niggaz test
Lemmatization | bitch niggaz test

Comparison of Classification Model
The classification model was developed using four different training-testing data distributions: 90:10, 80:20, 70:30, and 60:40. The random forest algorithm and the TF-IDF vectorization method were employed in constructing the classification model. Accuracy, precision, recall, and F1-score were used to evaluate and compare the models. Figure 11 presents a comparison of the classification models.
Figure 11. Comparison of Classification Model
The model with a training-testing data distribution of 90:10 achieved the highest accuracy, precision, recall, and F1-score: 96.3% accuracy, 99.3% precision, 93.5% recall, and a 96.3% F1-score. These values indicate excellent classification performance. Notably, this model surpassed previous studies that used the random forest algorithm. Compared to [4], our model exhibited a 12.3% higher accuracy, 7.6% higher precision, 18.4% higher recall, and 13.7% higher F1-score. In comparison to [5], our model demonstrated a 2.8% higher accuracy, 7.3% higher precision, and 2.3% higher F1-score. Given these results, the 90:10 training-testing data distribution was selected for the classification process.
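The four metrics follow directly from the counts of a confusion matrix, as the sketch below shows. The counts are hypothetical: the paper does not report its confusion matrix, and these values were chosen only because they describe a 300-item test set (10% of 3,000, assuming the split is applied to the split lyric data) and round to the reported figures.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # share of predicted explicit items that are explicit
    recall = tp / (tp + fn)      # share of explicit items that were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration; NOT taken from the paper.
acc, prec, rec, f1 = classification_metrics(tp=143, fp=1, fn=10, tn=146)
print(f"{acc:.1%} {prec:.1%} {rec:.1%} {f1:.1%}")  # 96.3% 99.3% 93.5% 96.3%
```

With these counts, very high precision alongside lower recall corresponds to few non-explicit items flagged as explicit and a larger number of explicit items missed.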

Visualization of Classification Result
After classification, several aspects of the results are visualized to provide an overview: the number of songs classified as explicit and non-explicit, the percentage of explicit songs within each genre, the annual growth of explicit songs, the frequency of words in explicit songs, and the singers with the highest number of explicit songs. Figure 12 shows the classification outcomes. Of the 199,563 songs in the dataset, 121,299 were classified as non-explicit and 78,264 as explicit. Although non-explicit songs outnumber explicit ones, both categories are well represented, with more than 78,000 songs each. These classifications support further analysis, such as identifying the music genres that tend to contain explicit content and observing trends in explicit songs over the years.

Figure 12. Classification results
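As a quick arithmetic check of the counts above, the share of each class can be computed as follows. The helper name share is an assumption; the counts are those reported in this section.

```python
def share(part, total):
    """Percentage of `part` within `total`."""
    return 100 * part / total

explicit, non_explicit = 78_264, 121_299
total = explicit + non_explicit

print(total)                                  # 199563
print(round(share(explicit, total), 2))       # 39.22
print(round(share(non_explicit, total), 2))   # 60.78
```

The explicit share of 39.22% matches the figure stated in the abstract.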
The percentage of explicit songs in a genre is calculated by dividing the number of explicit songs in that genre by the total number of songs in that genre, explicit and non-explicit combined. The Hip-Hop/Rap genre has the highest percentage of explicit songs, at 91.97% of all songs in the genre. This high percentage can be attributed to the frequent use of profanity in hip-hop songs. According to a study [13], hip-hop songs contain the highest number of profanity words, with one profanity word occurring approximately every 47 words. Conversely, the jazz genre has the lowest percentage of explicit songs, at 15.72% of the songs in that genre. Figure 13 visualizes the percentage of explicit songs for each genre.
Figure 15 visualizes word frequency in explicit songs, where the size of each word corresponds to its frequency of occurrence. Words such as 'hell', 'kill', 'drink', 'fck', and 'sht' feature prominently in explicit songs. This observation indicates that explicit songs frequently incorporate violent, vulgar, and aggressive content, which has the potential to influence the perceptions and behaviors of listeners, particularly children and adolescents. Figure 16 shows the top 10 singers with the highest number of explicit songs, based on the web scraping dataset used in this study.
The classification model, which uses the TF-IDF vectorization method and the random forest algorithm with song lyrics as input, proves effective in classifying explicit songs. The performance metrics, namely accuracy, precision, recall, and F1-score, consistently exceed 90%. The optimal model, with a training-testing data distribution of 90:10, achieves 96.3% accuracy, 99.3% precision, 93.5% recall, and a 96.3% F1-score. Notably, the accuracy of this model surpasses previous research using similar algorithms and tasks. The use of lyric snippets and efficient data processing contributes to the model's performance. Visualizing the classification results reveals that the number of explicit songs fluctuates from year to year without a discernible trend. This variability can be attributed to factors such as shifting musical preferences and changes in the popularity of different genres within a given year. In addition, the classification results highlight the dominance of the Hip-Hop/Rap genre in the explicit song category. This observation aligns with the visualization findings, which show that singers in the Hip-Hop/Rap genre have the highest number of explicit songs. The explicit nature of this genre is likely influenced by its frequent use of profanity. Thus, the visualization supports the conclusion that the Hip-Hop/Rap genre tends to produce explicit content.