Research in Data Clustering

For our research in Pattern Recognition and Image Processing, visit the PRIP page
For our research in biometrics, visit the Biometrics page

Large Scale Kernel-Based Data Clustering

Kernel-based clustering algorithms achieve better performance on real-world data than Euclidean distance-based clustering algorithms, but they pose two important challenges: (i) they scale poorly in run time and memory, since their complexity is quadratic in the number of data instances, which makes them inefficient for large data sets containing millions of data points, and (ii) the choice of kernel function is critical to the performance of the algorithm. In this project, we aim to develop efficient schemes that reduce the complexity of these clustering algorithms and learn appropriate kernel functions from the data. We employ randomized matrix approximation techniques to speed up kernel-based clustering and reduce its memory requirements. We evaluate the efficiency of our techniques in the domains of object categorization and document clustering.

Radha Chitta, Rong Jin, A. K. Jain. "Efficient Kernel Clustering using Random Fourier Features", ICDM, Brussels, Belgium, Dec. 10-13, 2012.
Radha Chitta, Timothy C. Havens, Rong Jin, A. K. Jain. "Approximate Kernel k-means: solution to Large Scale Kernel Clustering", KDD, San Diego, CA, August 21-24, 2011.
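
The randomized approximation idea described above can be illustrated with the generic random Fourier feature map for an RBF kernel. This is a minimal sketch under stated assumptions, not the algorithm from the cited papers: the kernel form exp(-gamma * ||x - y||^2), the helper names `rbf_kernel` and `make_rff_map`, and all parameter values are chosen here for illustration only. The point is that an inner product of explicit low-dimensional features can stand in for a kernel evaluation, so a linear clustering method can run on the features without forming the full n x n kernel matrix.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_rff_map(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map z with z(x).z(y) ~ rbf_kernel(x, y).

    Frequencies are drawn from N(0, 2*gamma*I), the spectral density of the
    RBF kernel above; phase offsets are uniform on [0, 2*pi].
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [norm * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, offsets)]

    return feature_map


# Hypothetical example points and parameters.
gamma = 0.5
feature_map = make_rff_map(dim=2, n_features=2000, gamma=gamma, seed=0)
x, y = [0.0, 1.0], [0.5, 0.2]

exact = rbf_kernel(x, y, gamma)
approx = sum(a * b for a, b in zip(feature_map(x), feature_map(y)))
print(f"exact kernel: {exact:.4f}, random-feature estimate: {approx:.4f}")
```

Storing n feature vectors of length D costs O(nD) memory instead of the O(n^2) needed for the kernel matrix; the approximation error shrinks as D grows.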

Crowdclustering

One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing.

Jinfeng Yi, Rong Jin, A. K. Jain, S. Jain, Tianbao Yang. "Semi-Crowdsourced Clustering: Generalizing Crowd Labeling by Robust Distance Metric Learning", NIPS, Lake Tahoe, NV, Dec. 3-6, 2012.
Jinfeng Yi, Rong Jin, A. K. Jain, S. Jain. "Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach", HCOMP, Toronto, Canada, July 22-23, 2012.
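
As a rough illustration of deriving pairwise similarity from crowd annotations, the sketch below averages worker agreement over the pairs each worker actually saw. It is an assumption-laden toy, not the method of the cited papers: the function name `crowd_similarity`, the dictionary-based annotation format, and the example labels are all hypothetical. Pairs that no worker annotated together are left as `None`, which is the kind of missing entry a completion or metric-learning step would have to fill in.

```python
def crowd_similarity(n_objects, worker_groupings):
    """Estimate pairwise similarity from partial worker annotations.

    Each element of worker_groupings maps object index -> group label for
    the subset of objects that one worker annotated. The similarity of a
    pair is the fraction of workers who saw both objects and put them in
    the same group; unobserved pairs are None.
    """
    agree = [[0] * n_objects for _ in range(n_objects)]
    seen = [[0] * n_objects for _ in range(n_objects)]
    for grouping in worker_groupings:
        for a, label_a in grouping.items():
            for b, label_b in grouping.items():
                seen[a][b] += 1
                if label_a == label_b:
                    agree[a][b] += 1
    return [[agree[a][b] / seen[a][b] if seen[a][b] else None
             for b in range(n_objects)]
            for a in range(n_objects)]


# Hypothetical annotations from three workers over four objects.
workers = [
    {0: "cat", 1: "cat", 2: "dog"},
    {1: "animal", 2: "animal", 3: "vehicle"},
    {0: "pet", 1: "pet"},
]
sim = crowd_similarity(4, workers)
for row in sim:
    print(row)
```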

Semi-supervised Boosting

Pavan K. Mallapragada, Rong Jin, A. K. Jain, Yi Liu. "SemiBoost: Boosting for Semi-supervised Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence (to appear).
Pavan K. Mallapragada, Rong Jin, A. K. Jain, Yi Liu. "SemiBoost: Boosting for Semi-supervised Learning", Technical Report MSU-CSE-07-197, Department of Computer Science and Engineering, Michigan State University.

Model Based Clustering

The standard algorithm for fitting a mixture of Gaussians to a data set is the classic EM algorithm. However, the EM algorithm has several known weaknesses: the number of components must be fixed beforehand, EM can converge to a poor local optimum, and EM can converge to a singular estimate at the boundary of the parameter space. These issues are addressed by the algorithm described in the following papers. The Matlab code for our algorithm is available for download.

M. Figueiredo, A.K. Jain, "Unsupervised Learning of Finite Mixture Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 3, March 2002, pp. 381-396. (Matlab code) (abstract at IEEE Xplore)
M. Figueiredo, A.K. Jain, "Unsupervised Selection and Estimation of Finite Mixture Models", in Proceedings of the International Conference on Pattern Recognition - ICPR'2000, vol. 2, pp. 87-90, Barcelona, September 2000. (ps.gz, pdf)
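
For reference, the classic EM baseline discussed above can be sketched for a one-dimensional Gaussian mixture as follows. This is the standard textbook procedure, not the algorithm from the papers listed here, and the function name `em_gmm_1d`, the variance floor, and the toy data are assumptions made for illustration. The sketch makes the weaknesses visible: the number of components is fixed by the length of `init_means`, the result depends on that initialization, and a variance floor is needed to keep a component from collapsing onto a single point.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=50, var_floor=1e-6):
    """Classic EM for a 1-D Gaussian mixture with a fixed component count."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
            total = sum(p)
            resp.append([p_j / total for p_j in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            # Floor the variance to avoid a singular (zero-variance) estimate.
            variances[j] = max(var, var_floor)
    return weights, means, variances


# Hypothetical toy data: two well-separated groups around 1 and 5.
data = [0.9, 1.0, 1.1, 1.2, 0.8, 4.9, 5.0, 5.1, 5.2, 4.8]
weights, means, variances = em_gmm_1d(data, init_means=[0.0, 6.0])
print(weights, means, variances)
```

On this toy data the two components settle near the two group means with roughly equal weights; a poor initialization or a wrong component count would not be corrected by EM itself.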

Multiobjective Data Clustering

Most clustering algorithms generate the output partition by explicitly or implicitly minimizing a single objective function. Unfortunately, clusters in real-world data sets are "heterogeneous" (of diverse shapes and data densities), and it is difficult for a single clustering algorithm to detect different types of clusters. The following paper explores how to use multiple clustering criteria simultaneously.

M. Law, A. Topchy, A. K. Jain. "Multiobjective Data Clustering", In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 424-430, 2004.

Cluster Ensembles

Combining multiple classifiers in supervised classification has been highly successful and is becoming one of the standard techniques in pattern recognition. However, little has been done to explore how to combine data partitions generated by different clustering algorithms. The following papers investigate various issues in combining the outputs of multiple clustering algorithms.

A. Fred, A.K. Jain. "Combining Multiple Clusterings Using Evidence Accumulation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, 2005. (Abstract in IEEE Xplore)
A. Topchy, A.K. Jain, W. Punch. "Clustering Ensembles: Models of Consensus and Weak Partitions", to appear in IEEE Transactions on Pattern Analysis and Machine Intelligence (under review).
A. Topchy, M. H. Law, A.K. Jain, A. Fred. "Analysis of Consensus Partition in Cluster Ensemble", in Proceedings of the Fourth IEEE International Conference on Data Mining, pp. 225-232, Brighton, UK, November 01-04, 2004.
A. Topchy, B. Minaei, A.K. Jain, W. Punch. "Adaptive Clustering Ensembles", in Proceedings of the International Conference on Pattern Recognition, Cambridge, United Kingdom, August 23-26, 2004.
A. Topchy, A.K. Jain, W. Punch, "A Mixture Model of Clustering Ensembles", in Proceedings of the SIAM International Conference on Data Mining, Lake Buena Vista, Florida, April 22-24, 2004.
B. Minaei, A. Topchy, W. Punch, "Ensembles of Partitions via Data Resampling", in Proceedings of the International Conference on Information Technology: Coding and Computing, ITCC 2004, Las Vegas, April 2004.
A. Topchy, A.K. Jain, W. Punch, "Combining Multiple Weak Clusterings", in Proceedings of the IEEE International Conf. Data Mining, pp. 331-338, Melbourne, Florida, USA, November 19-22, 2003.
A. Fred, A.K. Jain, "Data Clustering Using Evidence Accumulation", in Proceedings of the International Conference on Pattern Recognition (ICPR), Quebec City, August 11-15, 2002.
A. Fred, A.K. Jain, "Evidence Accumulation Clustering based on the K-means algorithm", in Proceedings of the International Workshops on Structural and Syntactic Pattern Recognition (SSPR), Windsor, Canada, August 6-9, 2002.
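
One common way to combine partitions, suggested by the "evidence accumulation" titles above, is to count how often each pair of objects is placed in the same cluster. The sketch below is a simplified illustration under stated assumptions and should not be read as the procedure from these papers: the function names `co_association` and `consensus_by_threshold` are hypothetical, and the consensus step used here (linking pairs that co-occur in more than half of the partitions and taking connected components) is a stand-in chosen for brevity.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of objects shares a cluster.

    Only equality of labels within one partition matters, so the result is
    invariant to how each partition happens to name its clusters.
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_by_threshold(coassoc, threshold=0.5):
    """Connected components of the graph linking pairs above the threshold."""
    n = len(coassoc)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and coassoc[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical partitions of six objects from three clustering runs.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping as the first run, labels swapped
    [0, 0, 1, 1, 2, 2],  # a run that splits the data differently
]
coassoc = co_association(partitions)
consensus = consensus_by_threshold(coassoc)
print(consensus)
```

Because the co-association matrix depends only on which objects share a label, the label swap in the second run does not affect the result, which sidesteps the cluster correspondence problem.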

Dimensionality Reduction

M. H. Law, A. K. Jain. "Incremental Nonlinear Dimensionality Reduction By Manifold Learning", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 3, pp. 377-391, March 2006.
M. Law, N. Zhang, A. K. Jain. "Nonlinear Manifold Learning for Data Stream", In Proceedings of SIAM Data Mining, pp. 33-44, Orlando, Florida, 2004. This paper received the Best Student Paper award.


Semi-supervised Clustering

Pavan K. Mallapragada, Rong Jin, A. K. Jain. "Active Query Selection for Semi-supervised Learning", in Proceedings of the International Conference on Pattern Recognition (ICPR), Tampa, Florida, December 7-13, 2008.
Yi Liu, Rong Jin, A. K. Jain. "BoostCluster: Boosting Clustering by Pairwise Constraints", In Proceedings of Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 450-459, 2007.
A. K. Jain, Pavan K. Mallapragada, M. Law. "Bayesian Feedback in Data Clustering", In Proceedings of the 18th International Conference on Pattern Recognition, Vol. 3, pp. 374-378, Hong Kong, August 20-24, 2006.
T. Lange, M. H. Law, A. K. Jain, J. Buhmann. "Learning With Constrained and Unlabelled Data", In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 730-737, June 2005.
M. H. Law, A. Topchy, A. K. Jain. "Model-based Clustering With Probabilistic Constraints", In Proceedings of SIAM Data Mining, pp. 641-645, 2005.
M. H. Law, A. Topchy, A. K. Jain, "Clustering with Soft and Group Constraints", In Proceedings of the Joint IAPR International Workshop on Structural, Syntactic, And Statistical Pattern Recognition (S+SSPR 2004), pp. 662-670, 2004.

Feature Selection in Unsupervised Learning

M. Law, M. A. T. Figueiredo, A. K. Jain. "Simultaneous Feature Selection and Clustering Using Mixture Models", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1154-1166, September 2004. (IEEE Xplore) (Matlab code)
M. Figueiredo, A.K. Jain, M. Law. "A Feature selection wrapper for mixtures", in Proceedings of the First Iberian Conference on Pattern Recognition and Image Analysis, Puerto de Andratx, Spain, June 2003.
M. Law, M. Figueiredo, A. K. Jain. "Feature selection in mixture-based clustering", in Advances in Neural Information Processing Systems 15 (NIPS 2002), pp. 609-616, Vancouver, Dec 2002.
A. K. Jain and D. Zongker. "Feature-Selection: Evaluation, Application, and Small Sample Performance", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 2, pp. 153-158, February 1997. (IEEE Xplore)

Other Clustering Related Papers

A. Fred, A.K. Jain, "Learning Pairwise Similarity for Data Clustering", Proceedings of 18th International Conference on Pattern Recognition (ICPR), Vol. 1, pp. 925-928, Hong Kong, August 20-24, 2006.
A. K. Jain, A. Topchy, M. Law, J. Buhmann. "Landscape of Clustering Algorithms", In Proceedings of the 17th International Conference on Pattern Recognition, pp. I-260--I-263, Cambridge UK, August 23-26, 2004.
A. K. Jain, S. Raudys. "Small sample size effects in statistical pattern recognition: recommendations for practitioners", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, March 1991.
K. Pettis, T. Bailey, A. K. Jain, R. Dubes. "An Intrinsic Dimensionality Estimator from Near-Neighbor Information", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, pp. 25-36, 1979.
R. Dubes, A. K. Jain. "Clustering Techniques: The User's Dilemma", Pattern Recognition, vol. 8, no. 4, pp. 247-260, 1976.

Recent Theses

Pavan Kumar Mallapragada. "Some contributions to semi-supervised learning", Ph.D. Thesis, 2010.
Hiu Chung Law. "Clustering, Dimensionality Reduction and Side Information", Ph.D. Thesis, 2006.