Title page for ETD etd-11162009-144820


Type of Document Dissertation
Author Lawson, Mark Jon
Author's Email Address malawso4@vt.edu
URN etd-11162009-144820
Title The Search for a Cost Matrix to Solve Rare-Class Biological Problems
Degree PhD
Department Computer Science
Advisory Committee
  Advisor Name            Title
  Zhang, Liqing           Committee Chair
  Fan, Weiguo Patrick     Committee Member
  Heath, Lenwood S.       Committee Member
  Ramakrishnan, Naren     Committee Member
  Wang, G. Alan           Committee Member
Keywords
  • Local Search
  • Bioinformatics
  • Machine Learning
  • Classification
Date of Defense 2009-11-02
Availability unrestricted
Abstract
The rare-class data classification problem is a common one. It occurs when the class of
interest in a dataset is far outnumbered by the other classes, making it difficult to classify
with typical classification algorithms. Such problems arise often in biological datasets,
where data can be sparse and the class of interest has few representatives. A variety of
solutions to this problem exist, with varying degrees of success.

In this dissertation, we present our solution to the rare-class problem. The solution uses
MetaCost, a cost-sensitive meta-classifier that takes a classification algorithm, training
data, and a cost matrix. The cost matrix adjusts the learning of the classification algorithm
to classify more of the rare-class data, but the appropriate matrix is generally unknown for a
given dataset and classifier. Our method uses three optimization techniques (greedy search,
simulated annealing, and a genetic algorithm) to search for an optimal cost matrix. We show
that this method improves classification on a large number of datasets, achieving better
results on a variety of metrics. We also show that it improves different classification
algorithms, and does so better and more consistently than other rare-class learning techniques
such as oversampling and undersampling. Overall, our method is a robust and effective solution
to the rare-class problem.
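
The idea of searching for a cost matrix can be illustrated with a small sketch. The code below is not the dissertation's implementation: it assumes a two-class problem, uses made-up class probabilities in place of a trained classifier, applies only the minimum-expected-cost decision rule that cost-sensitive methods such as MetaCost rely on, and tunes a single matrix entry with a simple greedy hill climb scored by F1 on the rare class. All names (min_cost_label, f1_rare, greedy_search, PROBS, Y_TRUE) are hypothetical.

```python
from typing import List, Sequence, Tuple


def min_cost_label(probs: Sequence[float], cost: Sequence[Sequence[float]]) -> int:
    """Return the label i that minimizes sum_j probs[j] * cost[i][j].

    cost[i][j] is the cost of predicting class i when the true class is j.
    Ties go to the lowest label index.
    """
    expected = [sum(p * c for p, c in zip(probs, row)) for row in cost]
    return min(range(len(expected)), key=expected.__getitem__)


def f1_rare(y_true: Sequence[int], y_pred: Sequence[int], rare: int = 1) -> float:
    """F1 score computed for the rare class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == rare and p == rare)
    fp = sum(1 for t, p in pairs if t != rare and p == rare)
    fn = sum(1 for t, p in pairs if t == rare and p != rare)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def score(fn_cost: float, probs: Sequence[Sequence[float]], y_true: Sequence[int]) -> float:
    """Score one candidate matrix; only the cost of missing a rare example varies."""
    cost = [[0.0, fn_cost],  # predict common: costly if the example is rare
            [1.0, 0.0]]      # predict rare: unit cost if the example is common
    preds = [min_cost_label(p, cost) for p in probs]
    return f1_rare(y_true, preds)


def greedy_search(probs: Sequence[Sequence[float]], y_true: Sequence[int],
                  start: float = 1.0, step: float = 1.0,
                  max_iter: int = 50) -> Tuple[float, float]:
    """Raise the rare-class miss cost while the score strictly improves."""
    best_cost, best_score = start, score(start, probs, y_true)
    for _ in range(max_iter):
        candidate = best_cost + step
        candidate_score = score(candidate, probs, y_true)
        if candidate_score <= best_score:
            break  # greedy: stop at the first non-improving step
        best_cost, best_score = candidate, candidate_score
    return best_cost, best_score


# Illustrative (made-up) probabilities as (p_common, p_rare), with true labels.
PROBS: List[Tuple[float, float]] = [
    (0.90, 0.10), (0.80, 0.20), (0.70, 0.30), (0.60, 0.40),
    (0.95, 0.05), (0.85, 0.15), (0.40, 0.60), (0.65, 0.35),
]
Y_TRUE: List[int] = [0, 0, 1, 1, 0, 0, 1, 0]

if __name__ == "__main__":
    cost, f1 = greedy_search(PROBS, Y_TRUE)
    print(f"selected rare-class miss cost: {cost}, rare-class F1: {f1:.3f}")
```

A greedy climb like this can stop at a local optimum, which is one reason a search might also use simulated annealing or a genetic algorithm, as the abstract describes.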

Files
  Filename   Lawson_MJ_D_2009.pdf
  Size       992.01 Kb
  Approximate Download Time (Hours:Minutes:Seconds)
    28.8 Modem            00:04:35
    56K Modem             00:02:21
    ISDN (64 Kb)          00:02:04
    ISDN (128 Kb)         00:01:02
    Higher-speed Access   00:00:05
