Research in Data Clustering

Welcome to the data clustering page at Michigan State University!

For our general research in Pattern Recognition and Image Processing, please visit the PRIP page.
For our research in Biometric Authentication, please visit the Biometrics page.

Overview

The goal of data clustering, or unsupervised learning, is to discover "natural" groupings in a set of patterns, points, or objects, without prior knowledge of any class labels. Applications of cluster analysis include vector quantization, image segmentation, constructing the prototypes of classifiers, understanding genomic data, and market segmentation. Despite its long history, clustering still poses a number of open research problems. Two surveys on clustering are:

Below are some recent publications of our group in this area.

Fitting a Mixture of Gaussians

The standard algorithm for fitting a mixture of Gaussians to a data set is the classic EM algorithm. However, the EM algorithm has several known weaknesses: the number of components must be fixed beforehand, EM can converge to a poor local optimum, and EM can converge towards a singular estimate at the boundary of the parameter space. These issues are addressed by the algorithm described in the following papers. The Matlab code for our algorithm is available for download.
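
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic EM algorithm for a one-dimensional Gaussian mixture. It is not the algorithm from the papers or the Matlab code mentioned above. The function names, the quantile-based initialization, and the `var_floor` parameter are assumptions made for this illustration. The sketch makes two of the weaknesses visible: the number of components `k` must be supplied by the caller, and the variance has to be floored by hand to keep a component from collapsing onto a single point.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, k, n_iter=100, var_floor=1e-6):
    """Classic EM for a 1-D mixture of k Gaussians (illustrative sketch).

    k is fixed in advance, and var_floor is an ad hoc guard against
    singular (zero-variance) estimates.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumption: start the means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, var_floor)] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            if total == 0.0:  # every density underflowed; fall back to uniform
                resp.append([1.0 / k] * k)
            else:
                resp.append([p / total for p in joint])

        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:  # empty component; leave its parameters unchanged
                continue
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)

    return weights, means, variances


# Two well-separated groups of points, fitted with k = 2.
sample = [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0]
weights, means, variances = em_gmm_1d(sample, k=2)
print(weights, means, variances)
```

On less well-separated data the result depends on the starting point, which is the local-optimum weakness noted above.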

Feature Selection in Unsupervised Learning

Given a large number of features, feature selection finds a subset of the available features that is appropriate for the task at hand. Feature selection can be of tremendous help when one faces the "curse of dimensionality". Most previous work on feature selection is for supervised classification. We consider feature selection in unsupervised learning in the following papers.

Combination of Clustering Algorithms

Combining multiple classifiers has been highly successful in supervised classification and is becoming one of the standard techniques in pattern recognition. However, little has been done to explore how to combine data partitions generated by different clustering algorithms. The following papers investigate different issues in combining the outputs of multiple clustering algorithms.
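
One common way to combine partitions, shown here only as an illustrative sketch and not necessarily the method used in the papers, is a co-association matrix: for every pair of points, record the fraction of partitions that place them in the same cluster, then group points whose co-association exceeds a threshold. The function names and the default threshold of 0.5 are assumptions made for this example.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus_partition(partitions, threshold=0.5):
    """Link points whose co-association exceeds the threshold and return
    the connected components as the consensus clusters."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over the thresholded graph
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three partitions of six points; cluster label names need not agree.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [5, 5, 5, 7, 7, 7],
    [0, 0, 1, 1, 1, 1],
]
print(consensus_partition(partitions))  # prints [0, 0, 0, 1, 1, 1]
```

Because the matrix only records whether two points share a cluster, the input partitions may use different label names and even different numbers of clusters.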

Semi-supervised Learning

Multiobjective Data Clustering

Most clustering algorithms generate the output partition by explicitly or implicitly minimizing a single objective function. Unfortunately, clusters in real-world data sets are "heterogeneous" (of diverse shapes and data densities), and it is difficult for a single clustering algorithm to detect different types of clusters. We explore how to use multiple clustering criteria simultaneously in the following paper.
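
To make the idea of several simultaneous criteria concrete, the sketch below scores candidate partitions of one-dimensional data on two objectives and keeps those that no other candidate beats on both (the Pareto front). This is a generic illustration, not the method of the paper. The two criteria (within-cluster squared error and the number of points separated from their nearest neighbour), the function names, and the candidate names are all assumptions chosen for the example.

```python
def sse(points, labels):
    """Compactness: sum of squared distances to each cluster's mean."""
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    total = 0.0
    for members in clusters.values():
        centre = sum(members) / len(members)
        total += sum((x - centre) ** 2 for x in members)
    return total


def split_neighbours(points, labels):
    """Connectedness: number of points whose nearest neighbour lies in a
    different cluster (ties go to the lower index)."""
    count = 0
    for i, x in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = min(others, key=lambda j: abs(points[j] - x))
        if labels[nearest] != labels[i]:
            count += 1
    return count


def dominates(a, b):
    """True if score a is no worse than b everywhere and better somewhere
    (both objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(scored):
    """Names of candidates that no other candidate dominates."""
    return [name for name, s in scored.items()
            if not any(dominates(t, s)
                       for other, t in scored.items() if other != name)]


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
candidates = {
    "two_groups": [0, 0, 0, 1, 1, 1],
    "uneven_split": [0, 0, 1, 1, 1, 1],
    "singletons": [0, 1, 2, 3, 4, 5],
    "one_cluster": [0, 0, 0, 0, 0, 0],
}
scored = {name: (sse(points, labels), split_neighbours(points, labels))
          for name, labels in candidates.items()}
print(pareto_front(scored))  # prints ['two_groups', 'singletons']
```

Neither criterion alone is sufficient here: squared error by itself prefers the all-singletons partition, and the neighbour criterion by itself cannot distinguish one big cluster from the two natural groups.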

Nonlinear Dimensionality Reduction

Other Papers on Clustering and Dimensionality Reduction

Recent Theses


Comments and suggestions are welcome. Please direct them to Pavan.