Notes on the Development of the Syntactic Ontologies

The core of this rule-based mediation strategy for model annotation is the telomere ontology, the core ontology for the Proctor et al. model. Each syntactic ontology is separately mapped to this ontology. Just as the syntactic ontologies provide input data to the telomere ontology, they also can provide an output route. This ability gives them the scope to act as a translation system from any syntactic ontology to any other syntactic ontology. It is through this bi-directionality of the information flow that new knowledge can be returned to the originator of a query. Here we present a summary of each of the syntactic ontologies built for these use cases together with a summary of the telomere ontology itself. There are as many syntactic ontologies as there are data formats, with data sources sharing a common format also sharing a syntactic ontology.
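
As a rough illustration of the hub-and-spoke arrangement described above, the following Python sketch translates a record between two formats by passing it through a shared core vocabulary. The format names, field names, and the functions `to_core`, `from_core`, and `translate` are all hypothetical; the actual methodology uses OWL mappings and reasoning rather than dictionaries.

```python
# Hypothetical sketch: each format maps its own field names onto shared core
# terms. The real system uses OWL mapping rules, not Python dictionaries.
TO_CORE = {
    "format_a": {"partner_1": "participant_a", "partner_2": "participant_b"},
    "format_b": {"species_1": "participant_a", "species_2": "participant_b"},
}


def to_core(fmt, record):
    """Lift a format-specific record into core terms, dropping unmapped fields."""
    mapping = TO_CORE[fmt]
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def from_core(fmt, core_record):
    """Lower a core record into a format by inverting that format's mapping."""
    inverse = {core: field for field, core in TO_CORE[fmt].items()}
    return {inverse[k]: v for k, v in core_record.items() if k in inverse}


def translate(src, dst, record):
    """Translate between any two formats by passing through the core."""
    return from_core(dst, to_core(src, record))


example = translate("format_a", "format_b", {"partner_1": "P1", "partner_2": "P2"})
```

Because every format is mapped only to the core, adding a new format in this sketch requires one new mapping rather than one per existing format, and the same mapping serves as both the input and the output route.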

The data sources used were BioGRID [13], Pathway Commons (http://www.pathwaycommons.org), and UniProtKB [14]. Table 1 provides an overview of the main types of information retrieved from each of the data sources. A check mark (✓) identifies a data type that can always be found in the associated data source; for example, downloading a BioGRID interaction file will always include interactions and interaction types. However, some data types are not always available from a given data source. Such partial associations are shown with a dagger (†). Table 1 describes the information provided by the data sources in the context of the use cases only. For instance, even though interaction data may be present within UniProtKB entries, no mapping rules have yet been written for it, and therefore that column is left blank.


Table 1: Data sources and the types of information they provide with respect to the use cases. Check marks (✓) imply complete presence of that information, while daggers (†) mark data types that are not always available from that data source.

Data Source     | Interaction | Interaction Type | Entity Identification | Localization
----------------+-------------+------------------+-----------------------+-------------
BioGRID         | ✓           | ✓                | †                     |
Pathway Commons | ✓           | †                | ✓                     | †
UniProtKB       |             |                  | ✓                     | †
SBML            | †           | †                | †                     | †


The dagger (†) in BioGRID's entity identification column in Table 1 reflects the lack of a UniProtKB identifier for some interactors. Specifically, for the use cases, the BioGRID entity representing `rad9' does not have a cross-reference to UniProtKB. Localization information is not available from the BioGRID input data at all. For Pathway Commons, all data types are theoretically available because the BioPAX format models them. However, the actual instance data returned from Pathway Commons does not contain information on either localization or interaction type. Retrieved UniProtKB information consists of entity localization and identification, though localization information is not always present. The SBML syntactic ontology is used as an output rather than as an input for these use cases; however, SBML models may provide any of the described data types.

An existing SBML syntactic ontology, MFO, allows both input of user queries and output of rule-based mediation responses [15]. It is used as an input point for all data sources in SBML format. Syntactic ontologies have been deliberately created as direct translations of non-OWL data formats into OWL: the purpose of a syntactic ontology is to act as a literal, syntactic description of the data source. Because the integration and the majority of the inference occur in the core ontology, all of the semantic modelling is performed there rather than in the syntactic ontologies.

Of the four data sources required for the use cases described in the submission, one syntactic ontology had been created by the authors in a previous work, another did not need to be explicitly generated because it was already in OWL-DL, and the other two needed to be written. Those latter two syntactic ontologies were generated using the XMLTab plugin for Protégé 3.4 RC1 (http://protegewiki.stanford.edu/index.php/XML_Tab). This plugin has both advantages and disadvantages, but overall it was a good choice for the initial creation of the new syntactic ontologies. Once generated, the OWL files can be changed at any time, as needed.

The particular advantages of using XMLTab include:

  1. very quick initial creation of each syntactic ontology: each one only took a few seconds to be generated;
  2. if an XML file was provided instead of an XSD, then both classes and instances were generated in the syntactic ontology -- the classes representing structural elements and attributes present in the XML file, and the instances being created from the actual data contained within the XML file;
  3. exact duplication of the XML structure within OWL-DL, which is one of the requirements for each syntactic ontology in the rule-based mediation methodology.
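
The second point in the list above, where classes come from the structure of an XML file and instances from its data, can be sketched in Python as follows. This is only an illustration of the idea using the standard library; it is not how XMLTab itself is implemented, it does not emit OWL, and the sample XML and the function name `xml_to_classes_and_instances` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment; element and attribute names are illustrative only.
SAMPLE_XML = """
<interactions>
  <interaction id="i1" type="physical"/>
  <interaction id="i2" type="genetic"/>
</interactions>
"""


def xml_to_classes_and_instances(xml_text):
    """Derive class names from element tags and one instance per element."""
    root = ET.fromstring(xml_text)
    classes = {}    # element tag -> set of attribute names seen on that tag
    instances = []  # (class name, attribute values) for every element
    for element in root.iter():
        classes.setdefault(element.tag, set()).update(element.attrib)
        instances.append((element.tag, dict(element.attrib)))
    return classes, instances


classes, instances = xml_to_classes_and_instances(SAMPLE_XML)
```

Here `classes` plays the role of the structural part of a syntactic ontology, while `instances` holds the data found in the file, mirroring the XML structure without adding any semantic interpretation.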

However, XMLTab is not a perfect choice. It has two limitations that would need to be addressed by whatever application is used when this work is scaled up for larger data integration tasks:

  1. it cannot build an OWL file from more than one input. Ideally, the XSD would be loaded first to generate a complete OWL file with all possible classes, followed by multiple XML files to supply all necessary instances. XMLTab only allows the import of one file, either XML or XSD, to generate the OWL file; additional files cannot be applied serially to build up an OWL file based on more than one XML file;
  2. it is not clear that it is in active development.
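
The schema-first, multi-file workflow that the first point asks for might look like the following Python sketch. It is an assumption about what such a tool could do, not a description of any existing tool: the schema, the data documents, and the function `build_model` are hypothetical, and the result is a plain dictionary rather than an OWL file.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema and data documents, for illustration only.
SCHEMA = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="entry"/>
  <xs:element name="location"/>
  <xs:element name="keyword"/>
</xs:schema>
"""
DOCS = ['<entry id="e1"/>', '<entry id="e2"><location name="nucleus"/></entry>']


def build_model(schema_text, xml_texts):
    """Declare classes from the schema, then add instances file by file."""
    schema_root = ET.fromstring(schema_text)
    # Every named element declaration becomes a class, even if no data uses it.
    model = {el.get("name"): [] for el in schema_root.iter(XS + "element")
             if el.get("name")}
    for xml_text in xml_texts:
        for element in ET.fromstring(xml_text).iter():
            # Elements absent from the schema are still recorded.
            model.setdefault(element.tag, []).append(dict(element.attrib))
    return model


model = build_model(SCHEMA, DOCS)
```

Loading the schema first means that a class such as `keyword` exists even though none of the data documents contains it, and each further document only appends instances.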

The use of existing tools to implement rule-based mediation both increases its usability for other researchers and decreases the development time. Therefore, wherever possible, existing tools were used.
