Title page for ETD etd-02162007-005107


Type of Document Dissertation
Author Chen, Yuxin
Author's Email Address yuchen@vt.edu
URN etd-02162007-005107
Title A Novel Hybrid Focused Crawling Algorithm to Build Domain-Specific Collections
Degree PhD
Department Computer Science
Advisory Committee
Advisor Name              Title
Fox, Edward Alan          Committee Chair
Fan, Weiguo Patrick       Committee Member
Lu, Chang-Tien            Committee Member
Ramakrishnan, Naren       Committee Member
Torres, Ricardo da Silva  Committee Member
Keywords
  • meta-search
  • digital libraries
  • focused crawler
  • classification
Date of Defense 2007-02-05
Availability unrestricted
Abstract
The Web, which contains a large amount of useful information and resources, is expanding rapidly. Collecting domain-specific documents and information from the Web is one of the most important methods of building digital libraries for the scientific community. Focused crawlers can selectively retrieve Web documents relevant to a specific domain to build collections for domain-specific search engines or digital libraries. Traditional focused crawlers, which normally adopt the simple Vector Space Model and local Web search algorithms, typically find relevant Web pages with low precision. Recall is also often low, since they explore only a limited sub-graph of the Web surrounding the starting URL set and ignore relevant pages outside this sub-graph. In this work, we investigated how to apply an inductive machine learning algorithm and a meta-search technique to the traditional focused crawling process to overcome these problems and improve performance. We proposed a novel hybrid focused crawling framework based on Genetic Programming (GP) and meta-search. We showed that this framework can be applied to traditional focused crawlers to find more relevant Web documents, more accurately, for use in digital libraries and domain-specific search engines. The framework was validated through experiments performed on test documents from the Open Directory Project. Our studies show that introducing genetic programming and meta-search methods into the focused crawling process yields an improvement over the traditional focused crawler.
Files
  Approximate Download Time (Hours:Minutes:Seconds)

  Filename                          Size       28.8 Modem  56K Modem  ISDN (64 Kb)  ISDN (128 Kb)  Higher-speed Access
  YuxinDissertation_etd_final1.pdf  820.23 Kb  00:03:47    00:01:57   00:01:42      00:00:51       00:00:04
