Title page for ETD etd-12062005-153906


Type of Document Dissertation
Author Yan, Mingjin
Author's Email Address myan@vt.edu
URN etd-12062005-153906
Title Methods of Determining the Number of Clusters in a Data Set and a New Clustering Criterion
Degree PhD
Department Statistics
Advisory Committee
  Advisor Name               Title
  Ye, Keying                 Committee Chair
  Prins, Samantha C. Bates   Committee Member
  Smith, Eric P.             Committee Member
  Spitzner, Dan J.           Committee Member
Keywords
  • Gap statistic
  • Multi-layer clustering
  • DD-weighted gap statistic
  • Cluster analysis
  • Weighted gap statistic
  • Number of clusters
  • K-means clustering
Date of Defense 2005-11-28
Availability unrestricted
Abstract
In cluster analysis, a fundamental problem is to determine the best
estimate of the number of clusters, which has a decisive effect on
the clustering results. However, a limitation in current applications
is that no convincingly acceptable solution to the
best-number-of-clusters problem is available, owing to the high
complexity of real data sets. In this dissertation, we tackle the
problem of estimating the number of clusters, with particular
attention to very complicated data that may contain multiple types of
cluster structure. We propose two new methods of choosing the number
of clusters, which have been shown empirically to be highly effective
when a data set has clear and distinct cluster structure. In
addition, we combine these two methods into a sequential clustering
approach called multi-layer clustering. Multi-layer clustering not
only functions as an efficient method of estimating the number of
clusters, but also, by imposing a sequential scheme, improves the
flexibility and effectiveness of any existing one-layer clustering
method. Empirical studies have shown that multi-layer clustering is
more efficient than one-layer clustering approaches, especially in
detecting clusters in complicated data sets. The multi-layer
clustering approach has been successfully applied to the WTCHP
microarray data, and the results can be readily interpreted in light
of existing biological knowledge.

Choosing an appropriate clustering method is another critical step in
clustering. K-means clustering is one of the most popular clustering
techniques used in practice. However, the k-means method tends to
generate clusters containing a nearly equal number of objects, which
is referred to as the "equal-size" problem. We propose a competing
clustering method that aims to overcome the "equal-size" problem
associated with the k-means method while maintaining its advantage of
computational simplicity. The advantages of the proposed method over
k-means clustering have been demonstrated empirically using simulated
low-dimensional data.

Files
  Filename            Size        Approximate Download Time (Hours:Minutes:Seconds)
                                  28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  Proposal-Face.pdf   960.69 Kb   00:04:26     00:02:17    00:02:00       00:01:00        00:00:05
