Class-Level Behavior Analysis under Metric Disagreement in Imbalanced Multi-Label Indonesian Emotion Classification
DOI:
https://doi.org/10.63158/journalisi.v8i3.1664
Keywords:
evaluation metrics, metric divergence, multi-label classification, class imbalance, emotion classification
Abstract
This study aims to analyze class-level model behavior under metric disagreement in imbalanced multi-label Indonesian emotion classification, using the divergence between Macro F1 and Micro F1 as a diagnostic signal rather than a mere performance indicator. A machine-translated Indonesian version of the GoEmotions dataset, comprising approximately 58,000 samples across 28 fine-grained emotion categories, is used as the experimental setting. The translated dataset was not manually revalidated, and findings are scoped to this translated GoEmotions setting. Two transformer-based models are evaluated: IndoBERT, a monolingual Indonesian model, and DistilBERT, a multilingual model, both fine-tuned with class-specific threshold optimization. The results reveal opposing divergence patterns: IndoBERT achieves higher Micro F1 than Macro F1, indicating performance concentrated on high-frequency classes, while DistilBERT exhibits the reverse pattern, suggesting broader but less precise label activation. Per-class analysis further shows that most minority classes consistently fall into unstable or non-functional performance regimes across both models. This study concludes that aggregate metrics alone are insufficient for evaluating model behavior in imbalanced multi-label settings. A behavior-oriented interpretation framework for Macro–Micro F1 divergence and a regime-based class reliability categorization are proposed to support more structured and informative evaluation practices.
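To make the divergence between Macro F1 and Micro F1 concrete, the following is a minimal sketch and not the authors' implementation. It computes per-class F1, Macro F1, and Micro F1 from binary multi-label matrices. The toy data (one frequent label, two rare labels), the function names, and the three-label setup are illustrative assumptions, not values from the study.

```python
from typing import List, Tuple

# Rows are samples, columns are labels, entries are 0 or 1.
Matrix = List[List[int]]


def confusion_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return (tp, fp, fn) for every label column."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true: Matrix, y_pred: Matrix) -> Tuple[List[float], float, float]:
    """Macro F1 averages per-class F1; Micro F1 pools counts over all classes."""
    counts = confusion_counts(y_true, y_pred)
    per_class = [f1(*c) for c in counts]
    macro = sum(per_class) / len(per_class)
    total_tp = sum(c[0] for c in counts)
    total_fp = sum(c[1] for c in counts)
    total_fn = sum(c[2] for c in counts)
    micro = f1(total_tp, total_fp, total_fn)
    return per_class, macro, micro


# Hypothetical toy data: label 0 is frequent and always predicted correctly,
# labels 1 and 2 are rare and never predicted.
y_true = [[1, 0, 0] for _ in range(8)] + [[0, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0] for _ in range(8)] + [[0, 0, 0], [0, 0, 0]]

per_class, macro, micro = macro_micro_f1(y_true, y_pred)
print("per-class F1:", per_class)
print(f"Macro F1: {macro:.3f}  Micro F1: {micro:.3f}  gap: {micro - macro:.3f}")
```

In this toy case Micro F1 exceeds Macro F1 because the single frequent label dominates the pooled counts while both rare labels score zero. That is the kind of pattern the abstract attributes to performance concentrated on high-frequency classes; the per-class values, rather than either aggregate, show which labels are non-functional.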
License
Copyright (c) 2026 Journal of Information Systems and Informatics

This work is licensed under a Creative Commons Attribution 4.0 International License.