An Empirical Evaluation of Confidence Miscalibration in Vanilla BERT-Based Stress Detection on Social Media

Rizaldi; Kusrini; Utami Ema; Agastya I Made Artha

doi:10.63158/journalisi.v8i3.1634

Authors

Rizaldi, Universitas Amikom Yogyakarta, Indonesia
Kusrini, Universitas Amikom Yogyakarta, Indonesia
Utami Ema, Universitas Amikom Yogyakarta, Indonesia
Agastya I Made Artha, Universitas Amikom Yogyakarta, Indonesia

Keywords:

Stress detection, Vanilla BERT, expected calibration error, reliability diagram, uncertainty estimation

Abstract

This study evaluates the reliability of the confidence estimates produced by a Vanilla BERT classifier for stress detection on the Dreaddit benchmark. The dataset contains 3,553 labeled text segments; BERT-base-uncased was fine-tuned on the standard training split of 2,838 samples and evaluated on the 715 test samples. The model was assessed as a single diagnostic baseline, without additional linguistic features, label smoothing, post-hoc calibration, or any other calibration intervention. Evaluation combined discriminative performance metrics (accuracy, precision, recall, and F1-score) with probabilistic reliability metrics (Brier Score, Expected Calibration Error, and Adaptive Calibration Error) and a reliability diagram. The Vanilla BERT model achieved 79.02% accuracy, 78.00% precision, 82.65% recall, and 80.26% F1-score, indicating competitive classification performance for stress detection. However, the calibration results revealed noticeable miscalibration, with a Brier Score of 0.1565, an Expected Calibration Error of 0.0847, and an Adaptive Calibration Error of 0.0880. The largest gap between confidence and accuracy occurred in the 0.8–0.9 confidence interval, while the 0.9–1.0 interval contributed the most to Expected Calibration Error because it contains a larger proportion of the samples. These findings show that although Vanilla BERT distinguishes stressed from non-stressed text reasonably well, its predicted confidence does not consistently match its observed accuracy. This study therefore positions Vanilla BERT as a diagnostic reliability baseline and emphasizes the importance of evaluating stress detection models on both classification performance and probabilistic calibration.
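
To make the three calibration metrics named in the abstract concrete, the following is a minimal sketch of how they are commonly computed for a binary classifier. It is not the authors' evaluation code. The function names, the choice of 10 bins, the 0.5 decision threshold, and the equal-count binning used for the adaptive variant are all assumptions made for illustration; the abstract does not state these details.

```python
import numpy as np


def brier_score(p_pos, y):
    """Mean squared error between P(stress) and the 0/1 label."""
    p_pos = np.asarray(p_pos, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean((p_pos - y) ** 2))


def _top_label(p_pos, y):
    """Return the confidence of the predicted class and a 0/1 correctness flag."""
    p_pos = np.asarray(p_pos, dtype=float)
    y = np.asarray(y, dtype=int)
    pred = (p_pos >= 0.5).astype(int)          # assumed decision threshold
    conf = np.where(pred == 1, p_pos, 1.0 - p_pos)
    correct = (pred == y).astype(float)
    return conf, correct


def expected_calibration_error(p_pos, y, n_bins=10):
    """ECE with equal-width confidence bins (lo, hi] over [0, 1]."""
    conf, correct = _top_label(p_pos, y)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so the resulting indices run from 0 to n_bins - 1.
    idx = np.clip(np.digitize(conf, edges[1:-1], right=True), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            total += mask.mean() * gap         # weight by the bin's sample share
    return float(total)


def adaptive_calibration_error(p_pos, y, n_bins=10):
    """Simplified top-label ACE: bins hold (nearly) equal numbers of samples."""
    conf, correct = _top_label(p_pos, y)
    order = np.argsort(conf)
    total = 0.0
    for chunk in np.array_split(order, n_bins):
        if chunk.size:
            gap = abs(correct[chunk].mean() - conf[chunk].mean())
            total += (chunk.size / conf.size) * gap
    return float(total)


if __name__ == "__main__":
    # Toy values for illustration only; these are not data from the study.
    p = [0.95, 0.85, 0.25, 0.65]
    labels = [1, 0, 0, 1]
    print(brier_score(p, labels))
    print(expected_calibration_error(p, labels))
    print(adaptive_calibration_error(p, labels, n_bins=2))
```

Because each bin's gap is weighted by its share of the samples, a densely populated bin such as 0.9–1.0 can contribute the most to Expected Calibration Error even when a sparser bin such as 0.8–0.9 shows the larger gap between confidence and accuracy.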
