Making Digital Libraries Flexible, Scalable and Reliable: Reengineering the MARIAN System in JAVA

Jianxin Zhao

Thesis submitted to the Faculty of
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of

Master of Science
in
Computer Science

APPROVED:
Edward A. Fox, Chair
Sallie M. Henry
Dennis G. Kafura
June 16, 1999

Blacksburg, Virginia

KEYWORDS: Online Public Access Catalog, Digital Library, User Information Layer, Personalization, Project Management, Reengineering

Copyright 1999, Jianxin Zhao


Abstract

There is a great need for digital libraries that are flexible, scalable, and reliable. Few such systems exist. Little is known about how to build them. This thesis addresses these problems by enhancing a prototype digital library system with the aim of making it more flexible, scalable, and reliable.

We hypothesize that: 1) adding a new (user information) layer and maintaining weak coupling in the design of a digital library system can help achieve system flexibility; 2) optimizing network connection usage and facilitating distribution of computation and disk operations in system design can help achieve system scalability; and 3) applying good software processes can help university students produce a very reliable system.

Approaches based on these hypotheses were applied in the project of reengineering the MARIAN system in Java. The results of the project and its experiments verified the hypotheses.

The results of this thesis may help inform future digital library design and implementation projects to produce flexible, scalable, and reliable systems.


Table of Contents

Title page
Acknowledgments
1 Introduction
1.1.Problem Statement
1.2.Hypothesis
1.3.Organization
2 Related Works
2.1.Digital Libraries and Online Public Access Catalog Systems
2.2.Hyper Text Transfer Protocol (HTTP)
2.3.Common Gateway Interface (CGI)
2.4.Transmission Control Protocol (TCP) and User Datagram Protocol (UDP)
2.5.Java
2.6.Capability Maturity Model (CMM)
3 The C/C++ MARIAN System
3.1.MARIAN History
3.2.Major Feature Descriptions
3.3.Top Level Architecture
3.4.Operations
4 Reengineering the "Formit"
4.1.The C/C++ "Formit"
4.2.The Java "Formit"
4.3.Flexibility Considerations
5 Reengineering the "Webgate"
5.1.The C/C++ "Webgate"
5.1.1.Architecture
5.1.2.Operations
5.2.The Java "Webgate"
5.2.1.Architecture
5.2.2.Operations
5.2.2.1.System Startup
5.2.2.2.Handle "Formit" Requests
5.2.2.3.System Shutdown
5.3.Flexibility Considerations
5.3.1.Adding a New Layer in System Design
5.3.2.Maintaining Weak Coupling
5.3.2.1.Top Level Weak Coupling
5.3.2.2.Weak Coupling Inside "User_manager"
5.3.2.3.Weak Coupling Inside "Uip_manager"
5.3.2.4.Weak Coupling Inside "Request_response" Thread
5.3.2.5.Weak Coupling Among "Results", "Request_response", and "Call_back_processor"
5.4.Scalability Considerations
5.5.Java "Webgate" Major Feature Descriptions
6 User Information Layer
6.1.Architecture Description
6.2.Benefits of the User Information Layer
6.2.1.Reducing Workload of Search Engine
6.2.2.Personalization
6.2.3.Active System
6.2.4.Billing Capability
6.2.5.Distance Learning
6.2.6.User Characteristics Analysis
6.2.7.Integrated Service
6.3.Summary
7 Reengineering the MARIAN Server
7.1.The C/C++ MARIAN Server
7.1.1.Architecture
7.2.Operations
7.3.The Java MARIAN Server
7.3.1.Architecture
7.3.1.1."Client_uip" and "Server_uip"
7.3.1.2."Session_manager"
7.3.2.Passing Functions Through "Uip"
7.3.3.Operations
7.4.Flexibility Considerations
7.4.1.Top Level Weak Coupling
7.4.2.Weak Coupling Inside "Server_uip"
7.4.3.Weak Coupling Inside "Session_manager"
7.4.4.Weak Coupling With "Webgate"
7.5.Scalability Considerations
7.5.1.Concurrency Control
7.5.2.Optimize the Usage of Network Connection
7.5.3.Facilitate Computation Distribution
8 Applying Good Software Processes
8.1.Project Development Lifecycle
8.1.1.Requirements Analysis
8.1.2.Original Project Analysis
8.1.3.High Level Design
8.1.4.Detailed Design
8.1.5.Coding
8.1.6.Unit Testing
8.1.7.Integration Testing
8.2.Project Management
8.2.1.Group Assignment
8.2.2.Training
8.2.3.Measurement & Estimation
8.2.4.Process Improvement and Defect Prevention
8.2.5.Software Reuse
8.3.Results
9 System Performance Experiments
9.1.Experimental Design
9.1.1.Measurements
9.1.2.Cases
9.2.Getting the Correct Data
9.3.Experiment Results
9.3.1.Removal of System Bottlenecks
9.3.2.Measuring the Cost of Measurement
9.3.3.Final Experiment Results
9.4.Summary
9.4.1.Scalability
9.4.2.Reliability
9.4.3.Flexibility
10 Conclusions and Future Directions
10.1.Conclusions
10.2.Limitations and Future Directions
Bibliography
Vita

List of Tables

3.1.C/C++ MARIAN system module sizes
5.1.Java "webgate" request types and descriptions
6.1.Database examples
8.1.Measurement and estimation table
8.2.Bug history table
9.1.Modification list
9.2.System configuration recommendations

List of Figures

3.1.Old system search page
3.2.Old system results page
3.3.Old system records detailed description
3.4.Old system top-level architecture
4.1.C/C++ "formit" workflow
4.2.Java "formit" architecture
4.3.Java "formit" workflow
5.1.C/C++ "webgate" top-level architecture
(We use ellipses to represent objects and rectangles to represent threads in all the design diagrams in this thesis.)
5.2.Java "webgate" top-level architecture
5.3."Uip_manager" object architecture
5.4."User_manager" object architecture
5.5."Request_response" thread architecture and workflow
5.6.Relation among "results", "request_response", and "call_back_processor"
5.7.New system beginning page
5.8.New system main menu page
5.9.New system query page
5.10.New system query history page
5.11.New system super user main menu page
5.12.New system log query page
5.13.New system log contents page
5.14.New system user management page
6.1.Digital library architecture
7.1.C/C++ MARIAN server top-level architecture
7.2.Simplified C/C++ MARIAN top-level architecture
7.3.Java MARIAN server top-level architecture
7.4."Client_uip" architecture
7.5."Server_uip" architecture
7.6."Session_manager" architecture
7.7.Passing functions through "uip"
(Note: 1 labels the path of functions from the client, while 2 labels the path of functions back from the server.)
7.8.Java MARIAN server operations (Note: numbers label steps in processing for a given user.)
7.9.Future "session_manager" architecture
7.10.Distributed search engines architecture
8.1.Basic development phases
9.1.Experiment model
9.2.Load generator architecture
9.3.Java server cost of measurement graphs
9.4."Webgate" cost of measurement graphs
9.5.All modules in one machine, performance graphs
9.6.One "webgate", performance graphs
9.7.Two "webgates", performance graphs
9.8.Four "webgates", performance graphs
9.9.Performance comparison graphs


ETD-ML Version 0.9.7a (beta) http://etd.vt.edu/etd-ml/ Mon Jul 19 11:13:10 1999