|
Background: Clustering is a widely used methodology for the analysis
of array data, and numerous clustering algorithms are in common use.
In addition, many research laboratories are generating array data with
repeated measurements. While there are proposals in the literature
making use of replicate-derived statistics to improve the
selection of differentially expressed genes, there has been limited
effort to improve clustering algorithms by incorporating repeated
measurements. Moreover, a biologist who wishes to use cluster
analysis faces a plethora of algorithmic options and often has
no basis on which to select a methodology for the analysis of
a given data set.
Results: Our main contributions are extensions of clustering
techniques to take advantage of repeated measurements and an empirical
study comparing the performance of different clustering approaches
for array data. We evaluated three approaches: weighting
expression levels by variability estimates in similarity measures,
assigning repeated measurements to the same subtrees in hierarchical
agglomerative clustering algorithms, and an infinite mixture
model-based approach with built-in error models for repeated
measurements. We used two assessment criteria to evaluate clustering
results: accuracy with respect to external knowledge and cluster
stability (reproducibility of clustering results on re-measured data).
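As an illustration of the first approach, the sketch below shows one
plausible way to weight expression levels by variability estimates in a
distance measure. It is not necessarily the formula used in this study:
the function name, the inverse-variance weighting, and the `eps`
stabilizer are all assumptions made for the example.

```python
import numpy as np

def variability_weighted_distance(x_reps, y_reps, eps=1e-12):
    """Hypothetical variability-weighted Euclidean distance between two genes.

    x_reps, y_reps: arrays of shape (n_replicates, n_conditions), with at
    least two replicates each. Conditions measured with high variability
    contribute less to the distance (an assumed weighting scheme).
    """
    x_reps = np.asarray(x_reps, dtype=float)
    y_reps = np.asarray(y_reps, dtype=float)

    # Average the repeated measurements for each condition.
    x_mean = x_reps.mean(axis=0)
    y_mean = y_reps.mean(axis=0)

    # Estimated variance of each replicate mean (sample variance / n).
    x_var = x_reps.var(axis=0, ddof=1) / x_reps.shape[0]
    y_var = y_reps.var(axis=0, ddof=1) / y_reps.shape[0]

    # Inverse-variance weights, normalized to sum to one; eps avoids
    # division by zero when replicates agree exactly.
    weights = 1.0 / (x_var + y_var + eps)
    weights = weights / weights.sum()

    return float(np.sqrt(np.sum(weights * (x_mean - y_mean) ** 2)))
```

With equal variability in every condition the weights are uniform and
the result reduces to a root-mean-square difference of the replicate
means; a condition with noisy replicates is down-weighted accordingly.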
Conclusions: We show that array data with repeated measurements
yield more accurate and more stable clusters. Our study also provides
guidance to the user who wishes to select one "good" algorithm for
cluster analysis of gene expression data. In particular, we
show that the model-based clustering approaches produce superior clusters.
|