Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms

  • Bambang Krismono Triwijoyo Universitas Bumigora
  • Kartarina Kartarina Universitas Bumigora
Keywords: document clustering, cosine similarity, k-main


Clustering is a useful technique that organizes a large number of unordered text documents into a small number of meaningful and coherent clusters. Effective and efficient organization of documents is needed to support intuitive and informative tracking mechanisms. In this paper, we propose clustering documents using cosine similarity and k-means (k-main). Experimental results show that the accuracy of our method is 84.3%.
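
The combination of cosine similarity and k-means named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the plain term-frequency weighting, the fixed seed documents, the toy corpus, and all function names are assumptions chosen only to keep the example small and deterministic.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector, L2-normalized (assumed weighting)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_vector(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans_cosine(vectors, seeds, max_iter=20):
    """k-means loop that assigns each vector to its most similar centroid.

    `seeds` lists the indices of the documents used as initial centroids
    (a hypothetical, deterministic choice made for this illustration).
    """
    centroids = [vectors[i] for i in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_vector(members)
    return labels


# Toy corpus (made up for illustration): two pet documents, two finance documents.
docs = [
    "cat dog pet animal",
    "dog pet animal food",
    "stock market price trade",
    "market trade price bank",
]
vectors = [tf_vector(d) for d in docs]
labels = kmeans_cosine(vectors, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Using cosine similarity instead of Euclidean distance makes the assignment step depend on the direction of each document vector rather than its length, so long and short documents on the same topic can land in the same cluster.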



Author Biography

Kartarina Kartarina, Universitas Bumigora

Lecturer at the Information Technology Department, Bumigora University



How to Cite
Triwijoyo, B., & Kartarina, K. (2019). Analysis of Document Clustering based on Cosine Similarity and K-Main Algorithms. Journal of Information Systems and Informatics, 1(2), 164-177.