Title page for ETD etd-02062001-114915


Type of Document Dissertation
Author Sornil, Ohm
Author's Email Address osornil@vt.edu
URN etd-02062001-114915
Title Parallel Inverted Indices for Large-Scale, Dynamic Digital Libraries
Degree PhD
Department Computer Science
Advisory Committee
  Advisor Name                 Title
  Fox, Edward Alan             Committee Chair
  Edwards, Stephen H.          Committee Member
  Koelling, Charles Patrick    Committee Member
  Ramakrishnan, Naren          Committee Member
  Varadarajan, Srinidhi        Committee Member
Keywords
  • simulation
  • incremental update
  • information retrieval
  • parallel inverted index
  • hybrid partitioning
  • performance
  • digital library
  • terabyte text collection
Date of Defense 2001-01-25
Availability unrestricted
Abstract
The dramatic increase in the amount of content available in digital form gives rise to large-scale digital libraries, intended to support millions of users and terabytes of data. Retrieving information efficiently from a system of this scale is a challenging task because of the size of both the collection and its index. This research deals with the design and implementation of an inverted index that supports searching for information in a large-scale digital library, implemented atop a massively parallel storage system. Inverted index partitioning is studied in a simulation environment, aiming at a terabyte of text. As a result, a high-performance partitioning scheme is proposed. It combines the best qualities of the term and document partitioning approaches in a new Hybrid Partitioning Scheme. Simulation experiments show that this organization provides good performance over a wide range of conditions. Further, the issues of creation and incremental updates of the index are considered. A disk-based inversion algorithm and an extensible inverted index architecture are described, and experimental results with actual collections are presented. Finally, distributed algorithms to create a parallel inverted index partitioned according to the hybrid scheme are proposed, and performance is measured on a portion of the equipment that normally makes up the 100-node Virginia Tech PetaPlex™ system.


NOTE: (02/2007) An updated copy of this ETD was added after there were patron reports of problems with the file.

Files
  Filename                    Size      Approximate Download Time (Hours:Minutes:Seconds)
                                        28.8 Modem   56K Modem   ISDN (64 Kb)   ISDN (128 Kb)   Higher-speed Access
  dissertation.pdf            1.07 MB   00:04:56     00:02:32    00:02:13       00:01:06        00:00:05
  dissertation_printTo7.pdf   1.13 MB   00:05:13     00:02:41    00:02:21       00:01:10        00:00:06
