Clustering gene expression data with repeated measurements


Ka Yee Yeung, Mario Medvedovic and Roger E. Bumgarner


Abstract
Background: Clustering is a widely used methodology for the analysis of array data, and numerous clustering algorithms are in common use. Many research laboratories are also generating array data with repeated measurements. While there are proposals in the literature that use replicate-derived statistics to improve the selection of differentially expressed genes, there has been limited effort to improve clustering algorithms by incorporating repeated measurements. In addition, a biologist who wishes to use cluster analysis faces a plethora of algorithmic options and often has no basis on which to select a methodology for analyzing their data set.

Results: Our main contributions are extensions of clustering techniques that take advantage of repeated measurements and an empirical study comparing the performance of different clustering approaches for array data. We evaluated three approaches: weighting expression levels with variability estimates in similarity measures, assigning repeated measurements to the same subtrees in hierarchical agglomerative clustering algorithms, and an infinite mixture model-based approach with built-in error models for repeated measurements. We used two assessment criteria to evaluate clustering results: accuracy with respect to external knowledge and cluster stability (reproducibility of clustering results on re-measured data).
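
To make the idea of a variability-weighted similarity measure concrete, the following is a minimal sketch and not necessarily the formula used in this study. It assumes each gene is given as a list of experiments, each holding at least two repeated measurements, and it down-weights experiments whose repeated measurements are noisy. The function name weighted_correlation, the inverse-variance weighting scheme, and the eps parameter are all illustrative assumptions.

```python
import math
from statistics import mean, variance


def weighted_correlation(x_reps, y_reps, eps=1e-8):
    """Variability-weighted Pearson correlation between two genes.

    x_reps, y_reps: one inner list per experiment, each containing the
    repeated measurements (at least two) of that gene in that experiment.
    The weighting scheme is an assumption made for illustration:
    experiment j receives weight 1 / (var_x[j] + var_y[j] + eps).
    """
    # Average the repeated measurements within each experiment.
    x_mean = [mean(r) for r in x_reps]
    y_mean = [mean(r) for r in y_reps]

    # Inverse-variance weights, normalized to sum to 1.
    raw = [1.0 / (variance(xr) + variance(yr) + eps)
           for xr, yr in zip(x_reps, y_reps)]
    total = sum(raw)
    w = [v / total for v in raw]

    # Weighted means, covariance, and standard deviations.
    mx = sum(wi * xi for wi, xi in zip(w, x_mean))
    my = sum(wi * yi for wi, yi in zip(w, y_mean))
    cov = sum(wi * (xi - mx) * (yi - my)
              for wi, xi, yi in zip(w, x_mean, y_mean))
    sx = math.sqrt(sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x_mean)))
    sy = math.sqrt(sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y_mean)))

    # Note: a gene with constant average expression gives sx or sy == 0,
    # and this sketch does not handle that case.
    return cov / (sx * sy)


# Hypothetical data: 3 experiments, 2 repeated measurements each.
gene_a = [[1.0, 1.2], [2.0, 2.1], [3.0, 3.3]]
gene_b = [[2 * v + 1 for v in reps] for reps in gene_a]
print(weighted_correlation(gene_a, gene_b))
```

Because the per-experiment averages of gene_b are an exact linear function of those of gene_a, the printed value equals 1.0 up to floating-point rounding. With unequal noise across experiments, the weights shift the correlation toward the experiments that were measured most consistently.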

Conclusions: We show that array data with repeated measurements yield more accurate and more stable clusters. Our study also provides guidance to the user who wishes to select one "good" algorithm for cluster analysis of gene expression data. In particular, we show that the model-based clustering approaches produce superior clusters.