Classification of Explicit Songs Based on Lyrics Using Random Forest Algorithm
Abstract
This study is motivated by the potential negative impact of explicit songs on children and adolescents. An explicit-song labeling program is currently in place, but it covers only songs released by artists affiliated with the Recording Industry Association of America (RIAA), so songs outside the program's scope remain inadequately labeled. To address this gap, a machine learning model was developed to classify explicit songs from their lyrics. A dataset of song lyrics was collected by web scraping, and the model was trained using TF-IDF vectorization and the random forest algorithm. Several training-to-testing split ratios were compared to determine the best model. The best model used a 90:10 training-to-testing split and achieved an accuracy of 96.3%, a precision of 99.3%, a recall of 93.5%, and an F1-score of 96.3%. The classification results showed that explicit songs accounted for 39.22% of the dataset, and a visualization showed that the prevalence of explicit songs fluctuated over time. The hip-hop/rap genre had the highest proportion of explicit songs, at 92%.
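As a reading aid, the reported metrics can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the paper: the function names are illustrative assumptions, and the only inputs are the precision and recall quoted in the abstract. It uses the standard definition of the F1-score as the harmonic mean of precision and recall.

```python
def precision(tp: int, fp: int) -> float:
    """Share of songs predicted explicit that are truly explicit."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly explicit songs that the model flags as explicit."""
    return tp / (tp + fn)


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Values quoted in the abstract (assumed here to be rounded to three digits).
reported_precision = 0.993
reported_recall = 0.935

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.963"
```

The result agrees with the reported F1-score of 96.3%. The gap between precision and recall suggests that the model rarely flags a clean song as explicit but misses roughly 6.5% of the explicit ones.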


Copyright (c) 2023 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.