

Type of Document: Dissertation
Author: Yan, Mingjin
Author's Email Address: myan@vt.edu
URN: etd-12062005-153906
Title: Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion
Degree: PhD
Department: Statistics
Advisory Committee:
- Ye, Keying (Committee Chair)
- Prins, Samantha C. Bates (Committee Member)
- Smith, Eric P. (Committee Member)
- Spitzner, Dan J. (Committee Member)
Keywords:
- Gap statistic
- Multi-layer clustering
- DD-weighted gap statistic
- Cluster analysis
- Weighted gap statistic
- Number of clusters
- K-means clustering
Date of Defense: 2005-11-28
Availability: unrestricted
Abstract
In cluster analysis, a fundamental problem is to determine the best estimate of the number of clusters, which has a
decisive effect on the clustering results. However, a
limitation in current applications is that no convincing
solution to the best-number-of-clusters problem is
available, owing to the high complexity of real data sets. In this
dissertation, we tackle the problem of estimating the number of
clusters, with particular attention to very
complicated data that may contain multiple types of cluster
structure. We propose two new methods for choosing the number of clusters;
both have been shown empirically to be highly effective
when a data set has clear and distinct cluster structure. In
addition, we propose a sequential type of clustering approach,
called multi-layer clustering, by combining these two methods.
Multi-layer clustering not only functions as an efficient method
of estimating the number of clusters, but also, by superimposing a
sequential idea, improves the flexibility and effectiveness of any
existing one-layer clustering method. Empirical studies
have shown that multi-layer clustering is more efficient than one-layer clustering approaches, especially in detecting
clusters in complicated data sets. The multi-layer clustering
approach has been successfully applied to clustering the WTCHP
microarray data, and the results are readily interpretable in light
of existing biological knowledge.
Choosing an appropriate clustering method is another
critical step in clustering. K-means clustering is one of the most
popular clustering techniques used in practice. However, the
k-means method tends to generate clusters containing a
nearly equal number of objects, which is referred to as the
"equal-size" problem. We propose a clustering method which
competes with the k-means method. Our newly defined method
is aimed at overcoming the so-called "equal-size" problem
associated with the k-means method, while maintaining its
advantage of computational simplicity. Advantages of the proposed
method over k-means clustering have been demonstrated empirically
using simulated data with low dimensionality.
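
The "equal-size" tendency of standard k-means can be seen on a small toy example. The sketch below is not the method proposed in the dissertation; it is a plain Lloyd's-algorithm k-means on one-dimensional data, and the function name `kmeans_1d`, the toy data (a spread-out group of 100 points and a tight group of 5 points), and the choice of initial centroids are all assumptions made for illustration.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain Lloyd's algorithm on 1-D data; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            for x in points
        ]
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for j, c in enumerate(centroids):
            members = [x for x, lab in zip(points, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else c)
        if new_centroids == centroids:  # assignments are stable: converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: one wide group of 100 points, one tight group of 5.
big = [0.1 * i for i in range(100)]        # 0.0, 0.1, ..., 9.9
small = [12.0 + 0.1 * i for i in range(5)]  # 12.0, ..., 12.4
points = big + small

# Start the two centroids at the extremes of the data (an arbitrary choice).
centroids, labels = kmeans_1d(points, centroids=[min(points), max(points)])
sizes = [labels.count(j) for j in range(2)]
print("true group sizes:   ", [len(big), len(small)])
print("k-means cluster sizes:", sizes)
```

On this toy data the 5-point group is not recovered as its own cluster: k-means splits the wide group and merges part of it with the small one, which gives two clusters of much more similar size than the true groups.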
Files
Filename: Proposal-Face.pdf
Size: 960.69 Kb
Approximate Download Time (Hours:Minutes:Seconds):
- 28.8 Modem: 00:04:26
- 56K Modem: 00:02:17
- ISDN (64 Kb): 00:02:00
- ISDN (128 Kb): 00:01:00
- Higher-speed Access: 00:00:05