

Type of Document: Dissertation
Author: Lawson, Mark Jon
Author's Email Address: malawso4@vt.edu
URN: etd-11162009-144820
Title: The Search for a Cost Matrix to Solve Rare-Class Biological Problems
Degree: PhD
Department: Computer Science
Advisory Committee
- Zhang, Liqing (Committee Chair)
- Fan, Weiguo Patrick (Committee Member)
- Heath, Lenwood S. (Committee Member)
- Ramakrishnan, Naren (Committee Member)
- Wang, G. Alan (Committee Member)
Keywords
- Local Search
- Bioinformatics
- Machine Learning
- Classification
Date of Defense: 2009-11-02
Availability: unrestricted
Abstract
The rare-class data classification problem is a common one. It occurs when, in a dataset, the class of interest is far outweighed by other classes, thus making it difficult to classify using
typical classification algorithms. These types of problems are found quite often in biological
datasets, where data can be sparse and the class of interest has few representatives. A variety
of solutions to this problem exist with varying degrees of success.
In this paper, we present our solution to the rare-class problem. This solution uses MetaCost,
a cost-sensitive meta-classifier that takes a classification algorithm, training data, and a
cost matrix. The cost matrix adjusts the learning of the classification algorithm so that it classifies
more of the rare-class data, but a suitable matrix is generally unknown for a given dataset and classifier. Our
method uses three optimization techniques (greedy search, simulated annealing, and a
genetic algorithm) to search for an optimal cost matrix. We show that
this method improves classification on a large number of datasets, achieving better
results on a variety of metrics. We also show that it improves on different classification
algorithms, and does so better and more consistently than other rare-class learning techniques
such as oversampling and undersampling. Overall, our method is a robust and effective solution
to the rare-class problem.
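
To make the role of the cost matrix concrete, the following is a minimal Python sketch, not the dissertation's implementation. It applies the minimum-expected-cost decision rule directly to fixed rare-class probability scores, rather than performing MetaCost's relabel-and-retrain procedure, and it tunes only the false-negative cost with a simple greedy hill climb. The toy scores and labels, the F-measure objective, the step size, and all function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def predict_min_cost(p_rare: float, cost: Sequence[Sequence[float]]) -> int:
    """Return the class (0 = common, 1 = rare) with the lower expected cost.

    cost[i][j] is the cost of predicting class i when the true class is j.
    """
    p = (1.0 - p_rare, p_rare)
    expected = [sum(p[j] * cost[i][j] for j in (0, 1)) for i in (0, 1)]
    return 0 if expected[0] <= expected[1] else 1


def f1_rare(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F-measure of the rare class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def greedy_fn_cost(scores: Sequence[float], y_true: Sequence[int],
                   start: float = 1.0, step: float = 1.0,
                   max_steps: int = 50) -> Tuple[float, float]:
    """Hill-climb the false-negative cost cost[0][1]; cost[1][0] stays at 1."""

    def evaluate(c_fn: float) -> float:
        cost = [[0.0, c_fn], [1.0, 0.0]]
        preds: List[int] = [predict_min_cost(s, cost) for s in scores]
        return f1_rare(y_true, preds)

    best_cost, best_f1 = start, evaluate(start)
    for _ in range(max_steps):
        # Try one step down and one step up, keeping the cost positive.
        candidates = [c for c in (best_cost - step, best_cost + step) if c > 0]
        top_f1, top_cost = max((evaluate(c), c) for c in candidates)
        if top_f1 <= best_f1:
            break  # no neighbor improves the objective: local optimum
        best_cost, best_f1 = top_cost, top_f1
    return best_cost, best_f1


# Hypothetical validation data: rare-class probabilities and true labels.
scores = [0.05, 0.08, 0.10, 0.12, 0.15, 0.22, 0.30, 0.30, 0.35, 0.45]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

best_cost, best_f1 = greedy_fn_cost(scores, labels)
print(f"false-negative cost = {best_cost}, rare-class F1 = {best_f1:.3f}")
```

With equal costs, no example in this toy set is assigned to the rare class. Raising the false-negative cost lowers the effective decision threshold until further increases add more false positives than true positives, which is where the greedy search stops. Simulated annealing or a genetic algorithm could replace the hill climb to search the same space less locally.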
Files
Filename: Lawson_MJ_D_2009.pdf
Size: 992.01 Kb
Approximate Download Time (Hours:Minutes:Seconds)
- 28.8 Modem: 00:04:35
- 56K Modem: 00:02:21
- ISDN (64 Kb): 00:02:04
- ISDN (128 Kb): 00:01:02
- Higher-speed Access: 00:00:05