Research in Data Clustering


For our research in Pattern Recognition and Image Processing, visit the PRIP page

For our research in biometrics, visit the Biometrics page


Large Scale Kernel-Based Data Clustering

Kernel-based clustering algorithms achieve better performance on real-world data than Euclidean distance-based clustering algorithms, but pose two important challenges: (i) they do not scale well, since their run-time and memory complexity is quadratic in the number of data instances, rendering them inefficient for large data sets containing millions of points, and (ii) the choice of the kernel function is critical to the performance of the algorithm. In this project, we aim to develop efficient schemes that reduce the complexity of these clustering algorithms and learn appropriate kernel functions from the data. We employ matrix approximation techniques based on randomization to speed up kernel-based clustering and reduce its memory requirements. We evaluate the efficiency of our techniques on object categorization and document clustering tasks.
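The flavor of the randomized approximation can be sketched in a few lines: instead of forming the full n x n kernel matrix, each point is represented by its kernel similarities to a small random sample of landmark points, and ordinary k-means is run on those features. This is a minimal illustrative sketch, not the matrix-approximation method of the papers; the point set, kernel width, and landmark count are made up for the example.

```python
import math
import random

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel between two points given as sequences of floats.
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def landmark_features(points, m, gamma=1.0, seed=0):
    # Randomized approximation: rather than the full n x n kernel matrix,
    # represent each point by its similarities to m sampled landmarks,
    # an n x m matrix -- cost drops from O(n^2) to O(n*m).
    rng = random.Random(seed)
    landmarks = rng.sample(points, m)
    return [[rbf(p, l, gamma) for l in landmarks] for p in points]

def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(X, k, iters=50):
    # Plain k-means with deterministic farthest-point initialization.
    centers = [X[0]]
    while len(centers) < k:
        centers.append(max(X, key=lambda x: min(sqdist(x, c) for c in centers)))
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(x, centers[c])) for x in X]
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Two well-separated groups of 2-D points (toy data).
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
feats = landmark_features(pts, m=3, gamma=0.5)
labels = kmeans(feats, k=2)
```

On this toy data the landmark features separate the two groups, so k-means on the small n x m matrix recovers the same partition that clustering the full kernel matrix would.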


Crowdclustering

One of the main challenges in data clustering is to define an appropriate similarity measure between two objects. Crowdclustering addresses this challenge by defining the pairwise similarity based on the manual annotations obtained through crowdsourcing.
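A simple way to turn crowd annotations into a pairwise similarity, sketched below under the assumption that each annotator groups a subset of the items: similarity(i, j) is the fraction of annotators who put i and j in the same group, among those who labeled both. This is an illustrative baseline, not the probabilistic model from the crowdclustering work.

```python
from itertools import combinations

def crowd_similarity(annotations, n):
    # annotations: one dict per annotator, mapping item index -> group label;
    # an annotator may have labeled only a subset of the n items.
    same = [[0] * n for _ in range(n)]
    seen = [[0] * n for _ in range(n)]
    for ann in annotations:
        for i, j in combinations(sorted(ann), 2):
            seen[i][j] += 1
            seen[j][i] += 1
            if ann[i] == ann[j]:
                same[i][j] += 1
                same[j][i] += 1
    # Fraction of co-annotations that agree; 0.0 where no annotator saw both.
    return [[same[i][j] / seen[i][j] if seen[i][j] else 0.0
             for j in range(n)] for i in range(n)]

# Three annotators grouping (subsets of) three items; labels are per-annotator.
votes = [{0: "a", 1: "a", 2: "b"},
         {0: "x", 1: "x", 2: "y"},
         {1: "p", 2: "p"}]
S = crowd_similarity(votes, 3)
```

Items 0 and 1 are grouped together by both annotators who saw them (similarity 1.0), while items 1 and 2 agree for only one of three annotators (similarity 1/3); the resulting matrix can then be fed to any similarity-based clustering algorithm.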

Semi-supervised Boosting

Model Based Clustering

The standard algorithm for fitting a mixture of Gaussians to a data set is the classic EM algorithm. However, EM has several known weaknesses: the number of components must be fixed beforehand, it can converge to a poor local optimum, and it can converge to a singular estimate at the boundary of the parameter space. These issues are addressed by the algorithm described in the following papers. The Matlab code for our algorithm is available for download.
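For reference, here is a minimal 1-D sketch of the baseline EM fit being improved upon (not the authors' algorithm): the E-step computes soft responsibilities, the M-step re-estimates weights, means, and variances, the number of components k is fixed up front, and a small variance floor is needed to keep tight clusters from driving the estimate toward the singular boundary. The data and initialization scheme are illustrative.

```python
import math

def em_gmm_1d(data, k=2, iters=100):
    # Classic EM for a 1-D Gaussian mixture with a fixed number of
    # components k -- the baseline whose weaknesses are listed above.
    lo, hi = min(data), max(data)
    means = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]   # spread over range
    varis = [((hi - lo) / k) ** 2] * k
    weights = [1.0 / k] * k
    n = len(data)
    for _ in range(iters):
        # E-step: responsibility r[i][c] of component c for point i.
        r = []
        for x in data:
            p = [weights[c] / math.sqrt(2 * math.pi * varis[c])
                 * math.exp(-(x - means[c]) ** 2 / (2 * varis[c]))
                 for c in range(k)]
            s = sum(p)
            r.append([pc / s for pc in p])
        # M-step: re-estimate weights, means, and variances.
        for c in range(k):
            nc = sum(r[i][c] for i in range(n))
            weights[c] = nc / n
            means[c] = sum(r[i][c] * data[i] for i in range(n)) / nc
            varis[c] = max(
                sum(r[i][c] * (data[i] - means[c]) ** 2 for i in range(n)) / nc,
                1e-6,  # floor to avoid the singular (zero-variance) estimate
            )
    return weights, means, varis

data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
weights, means, varis = em_gmm_1d(data, k=2)
```

On this well-separated toy data EM recovers means near 0.1 and 5.1; with a poor initialization or overlapping components the same procedure can stall in a bad local optimum, which is exactly the behavior the papers address.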

Multiobjective Data Clustering

Most clustering algorithms generate the output partition by explicitly or implicitly minimizing a single objective function. Unfortunately, clusters in real-world data sets are "heterogeneous" (of diverse shapes and data densities), and it is difficult for a single clustering criterion to detect all the different types of clusters. We explore how to use multiple clustering criteria simultaneously in the following paper.
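One common way to use multiple criteria simultaneously is to score candidate partitions on several objectives and keep the Pareto-optimal ones. The sketch below, with two standard objectives (within-cluster compactness and between-cluster separation) and made-up candidate partitions, illustrates the idea only; it is not the method of the paper.

```python
from itertools import combinations

def sse(points, labels):
    # Compactness objective: within-cluster sum of squared distances
    # to the cluster centroids (lower is better).
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(sum((a - b) ** 2 for a, b in zip(p, centroid))
                     for p in members)
    return total

def separation(points, labels):
    # Separation objective: smallest distance between points placed
    # in different clusters (higher is better).
    return min(
        sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
        for i, j in combinations(range(len(points)), 2)
        if labels[i] != labels[j]
    )

def pareto_front(points, candidates):
    # Keep candidate partitions not dominated on (lower SSE, higher separation).
    scores = [(sse(points, c), separation(points, c)) for c in candidates]
    front = []
    for i, (s_i, d_i) in enumerate(scores):
        dominated = any(
            s_j <= s_i and d_j >= d_i and (s_j < s_i or d_j > d_i)
            for j, (s_j, d_j) in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(candidates[i])
    return front

points = [(0, 0), (0, 1), (5, 0), (5, 1)]
candidates = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
front = pareto_front(points, candidates)
```

Here the partition that is both compact and well separated dominates the other two candidates, so the front reduces to a single partition; on heterogeneous data the front typically retains several partitions that trade the two criteria off differently.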

Cluster Ensembles

Combining multiple classifiers in supervised classification has achieved great success and is becoming one of the standard techniques in pattern recognition. However, little has been done to explore how to combine data partitions generated by different clustering algorithms. The following papers investigate different issues in combining the outputs of multiple clustering algorithms.
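A standard way to combine partitions is evidence accumulation: build a co-association matrix whose (i, j) entry is the fraction of base partitions that put items i and j in the same cluster, then cluster that matrix. Below is a minimal sketch with a deliberately simple consensus step (single-link merging above a threshold via union-find); the base partitions and threshold are illustrative, and the papers study more sophisticated consensus functions.

```python
def coassociation(partitions, n):
    # co[i][j] = fraction of base partitions that place i and j together.
    co = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1
    m = len(partitions)
    return [[co[i][j] / m for j in range(n)] for i in range(n)]

def consensus(partitions, n, threshold=0.5):
    # Minimal consensus function: merge items whose co-association
    # exceeds the threshold, using union-find for the merging.
    co = coassociation(partitions, n)
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    for i in range(n):
        for j in range(i + 1, n):
            if co[i][j] > threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

# Three base partitions of four items, with some disagreement.
base = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
final = consensus(base, 4)
```

Items 0 and 1 co-occur in every base partition and items 2 and 3 in two of three, so the consensus groups {0, 1} and {2, 3} even though no single base partition is taken at face value.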

Dimensionality Reduction

Semi-supervised Clustering

Feature Selection in Unsupervised Learning

Other Clustering Related Papers

Recent Theses