

Type of Document: Master's Thesis
Author: Rajasimha, Harsha Karur
Author's Email Address: hrajasim@vt.edu
URN: etd-12202004-135546
Title: PathMeld: A Methodology for The Unification of Metabolic Pathway Databases
Degree: Master of Science
Department: Computer Science
Advisory Committee (Advisor Name, Title)
- Dr. Lenwood S. Heath, Committee Chair
- Dr. Naren Ramakrishnan, Committee Member
- Dr. Ruth Grene, Committee Member
Keywords
- EcoCyc
- KEGG
- integration
- unification
- PGDB
- PathMeld
- MetaCyc
- metabolic pathway databases
Date of Defense: 2004-12-15
Availability: unrestricted
Abstract
A biological pathway database is a database that describes biochemical pathways, reactions, enzymes that catalyze the reactions, and the substrates that participate in these reactions. A
pathway genome database (PGDB) integrates pathway information with information about the
complete genome of various sequenced organisms. Two of the popular PGDBs available today
are the Kyoto Encyclopedia of Genes and Genomes (KEGG) and MetaCyc. The proliferation of
biological databases in general raises several questions for the life scientist. Which of these databases is most accurate, most current, or most comprehensive? Do they have a standard format? Do they complement each other? Overall, which database should be used for what purpose? If more than one database is deemed relevant, it is desirable to have a unified database containing information from all the shortlisted databases. While XML-based pathway data exchange standards such as BioPAX and SBML are emerging, these do not address basic problems in the unification of pathway databases, such as inconsistent nomenclature and substrate matching between databases.
Here, we present the PathMeld methodology to unify KEGG and MetaCyc databases starting
from their flat files. Individual PGDBs are transformed into a unified schema that we design. With individual PGDBs in the common unified schema, the key to the PathMeld methodology is to find the entity correspondences between the KEGG and MetaCyc substrates. We present a heuristic-driven approach for one-to-one mapping of the substrates between KEGG and MetaCyc. Using the exact name and chemical formula match criteria, 82.6% of the substrates in MetaCyc were matched accurately to corresponding substrates in KEGG. The substrate names in the MetaCyc database
contain HTML tags and special characters such as <sub>, <sup>, <i>, <l>, &, and $. The MetaCyc chemical formulae are stored in Lisp format in the database, while KEGG stores them as continuous strings. Hence, we transform MetaCyc chemical formulae into KEGG format to make them directly comparable. Applying pre-processing to transform MetaCyc substrate names and formulae improved substrate matching by 2%. To investigate how many of the remaining 17.4% of
substrates are indeed absent from KEGG, we employ a standard UNIX-based approximate string
matching tool called agrep. The resulting matches are curated into four mutually exclusive groups:
3.83% are correct matches, 3.17% are close matches, 7.45% are incorrect matches, and 3.68% of
MetaCyc substrate names are not matched at all. Counting the incorrect matches and the unmatched names together (7.45% + 3.68%), this shows that 11.13% of MetaCyc substrate
names are absent in KEGG. We note some of the implementation issues we solved. First, parsing
only one flat file to populate one database table is not sufficient. Second, intermediate database
tables are needed. Third, transformation of substrate names and chemical formulae from one of the component databases is required for comparison. Fourth, a biochemist's intervention is needed in evaluating the approximate substrate matches from agrep.
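The pre-processing and exact-matching steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis implementation. The function names, the sample records, the Lisp layout `((C 6) (H 12) (O 6))`, the rule that a count of 1 is omitted, and the requirement that both name and formula agree are all assumptions made for the example.

```python
import re

# Hypothetical helpers; none of these names come from the thesis.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")           # tags such as <sub>, </i>
ENTITY_RE = re.compile(r"&([A-Za-z]+);")            # entities such as &alpha;
PAIR_RE = re.compile(r"\(\s*([A-Za-z]+)\s+(\d+)\s*\)")  # (ELEMENT COUNT) pairs


def normalize_name(name):
    """Strip tags and markup characters, then lowercase and collapse spaces."""
    text = TAG_RE.sub("", name)
    text = ENTITY_RE.sub(r"\1", text)   # "&alpha;" becomes "alpha"
    text = re.sub(r"[&$]", "", text)    # drop any stray & or $
    return " ".join(text.lower().split())


def lisp_formula_to_string(formula):
    """Turn an assumed Lisp-style formula into a continuous string.

    Element order is kept as given; a count of 1 is omitted (assumption).
    """
    parts = []
    for symbol, count in PAIR_RE.findall(formula):
        parts.append(symbol.capitalize() + ("" if count == "1" else count))
    return "".join(parts)


def match_exact(metacyc_rows, kegg_rows):
    """Map MetaCyc ids to KEGG ids when both name and formula agree."""
    index = {}
    for kegg_id, name, formula in kegg_rows:
        index[(normalize_name(name), formula)] = kegg_id
    matches = {}
    for metacyc_id, name, lisp_formula in metacyc_rows:
        key = (normalize_name(name), lisp_formula_to_string(lisp_formula))
        if key in index:
            matches[metacyc_id] = index[key]
    return matches


# Illustrative records only, not taken from the thesis data.
metacyc_rows = [
    ("WATER", "H<sub>2</sub>O", "((H 2) (O 1))"),
    ("ALPHA-GLUCOSE", "&alpha;-D-glucose", "((C 6) (H 12) (O 6))"),
]
kegg_rows = [
    ("C00001", "H2O", "H2O"),
    ("C00267", "alpha-D-Glucose", "C6H12O6"),
]
matches = match_exact(metacyc_rows, kegg_rows)
```

Comparing formulae as strings only works if both sources list elements in the same order. A more robust variant would compare element-to-count mappings instead.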
In conclusion, the PathMeld methodology successfully unifies KEGG and MetaCyc flat file
databases into a unified PostgreSQL database. Matching substrates between databases is the key
issue in the unification process. About 83% of the substrate correspondences can be computationally
achieved, while the remaining 17% of substrates require approximate matching and manual curation
by a biochemist. We presented several different techniques for substrate matching and showed that
about 11% of the MetaCyc substrates do not match and hence are absent from KEGG.
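As an illustration of the approximate-matching step, the sketch below proposes candidate KEGG names for MetaCyc names that had no exact match. It does not use agrep. Python's standard `difflib.get_close_matches` is swapped in as a stand-in, and it ranks by a similarity ratio rather than by a bounded number of errors. The function name, the cutoff of 0.85, and the sample names are assumptions. The candidates still need review by a biochemist, as in the thesis.

```python
import difflib


def suggest_matches(unmatched_names, kegg_names, cutoff=0.85, limit=3):
    """Return up to `limit` similar KEGG names for each unmatched name.

    Names are assumed to be normalized already (lowercase, markup removed).
    An empty list means no candidate reached the similarity cutoff.
    """
    suggestions = {}
    for name in unmatched_names:
        suggestions[name] = difflib.get_close_matches(
            name, kegg_names, n=limit, cutoff=cutoff
        )
    return suggestions


# Illustrative names only, not taken from the thesis data.
kegg_names = ["d-glucose-6-phosphate", "pyruvate", "h2o"]
candidates = suggest_matches(["d-glucose 6-phosphate", "xyzzy"], kegg_names)
```

A higher cutoff gives fewer but cleaner candidates for the curator. A lower cutoff surfaces more near misses at the cost of more incorrect suggestions.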
Files
Filename: Rajasimha_MSCS2004.pdf
Size: 756.69 Kb
Approximate Download Time (Hours:Minutes:Seconds)
- 28.8 Modem: 00:03:30
- 56K Modem: 00:01:48
- ISDN (64 Kb): 00:01:34
- ISDN (128 Kb): 00:00:47
- Higher-speed Access: 00:00:04