Document Type: Original Paper

Authors

Department of Statistics, Tarbiat Modares University, Tehran, Islamic Republic of Iran

Abstract

Clustering, a fundamental multivariate statistical method, is a valuable tool for extracting meaningful insights from complex datasets. Analyzing high-dimensional data, however, presents challenges, notably the curse of dimensionality. Although various methods have been developed for dimensionality reduction, most overlook the role of the response (dependent) variable. In contrast, supervised clustering exploits the information carried by the response variable, which offers substantial benefits in reducing data dimension and accelerating clustering computations. This paper evaluates the efficacy of supervised clustering in the analysis of Persian handwritten digit images. Because the data are multi-class, important variables are identified separately for each digit. This identification reduces the dimensionality of the data, and the reduction does not compromise the accuracy of predicting new observations at any stage of the algorithm. The approach also achieves relatively high accuracy in predicting the response variable. This study contributes a novel perspective on clustering methods, highlighting the integration of supervised techniques for improved performance in high-dimensional data analysis.
