Classification of Explicit Songs Based on Lyrics Using Random Forest Algorithm

  • Luh Kade Devi Dwiyani Udayana University, Indonesia
  • I Made Agus Dwi Suarjaya Udayana University, Indonesia
  • Ni Kadek Dwi Rusjayanthi Udayana University, Indonesia
Keywords: Explicit Song, Machine Learning, TF-IDF, Random Forest, Classification


This study addresses the potential negative impact of explicit songs on children and adolescents. An explicit song labeling program exists, but it covers only songs released by artists affiliated with the Recording Industry Association of America (RIAA), so songs outside the program's scope remain inadequately labeled. To address this gap, a machine learning model was developed to classify songs as explicit or non-explicit from their lyrics. A dataset of song lyrics was collected by web scraping to build the model. Lyrics were vectorized with TF-IDF and classified with the random forest algorithm. Several training-testing split ratios were compared to select the best model. The best model used a 90:10 training-testing split and achieved an accuracy of 96.3%, a precision of 99.3%, a recall of 93.5%, and an F1-score of 96.3%. The classification results showed that explicit songs made up 39.22% of the dataset, and visualization showed that the prevalence of explicit songs fluctuated over time. The hip-hop/rap genre had the highest proportion of explicit songs, at 92%.
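As a quick consistency check on the reported metrics, the F1-score can be recomputed from the precision and recall given above, since F1 is their harmonic mean. The sketch below is illustrative only: the function name `f1_score` and the variable names are assumptions introduced here, not taken from the paper, and the inputs are the rounded percentages from the abstract rather than the study's raw confusion-matrix counts.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Rounded values reported in the abstract for the 90:10 model.
reported_precision = 0.993
reported_recall = 0.935

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.963"
```

The recomputed value agrees with the reported F1-score of 96.3%. Accuracy cannot be checked the same way, because it also depends on the class balance of the test set, which the abstract does not state.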



How to Cite
Dwiyani, L. K., Suarjaya, I. M. A. D., & Rusjayanthi, N. K. D. (2023). Classification of Explicit Songs Based on Lyrics Using Random Forest Algorithm. Journal of Information Systems and Informatics, 5(2), 550-567.