Background

Many data integration methodologies are available, and choosing the one best suited to a particular task can be difficult. Misunderstandings and mistakes in data integration are possible if data sources do not describe their information in a semantically equivalent manner [1]. Research into semantic as well as syntactic data integration is of primary importance in the life sciences, where multiple data formats and types flourish. Data integration can be classified in many ways: syntactic versus semantic, federated versus warehousing, encapsulation versus translation.

Comprehensive reviews of the problems of semantic heterogeneity, as well as of the data integration approaches used in the past, have been written [2,3]. A thorough review of ontology mapping is presented in [4, Section 9]. Work on ontology mapping and semantic data integration in bioinformatics includes the mapping of GO to UMLS [5], the creation of RDF databases with S3DB [6], and OntoFusion [3]. OntoFusion, a recent example of ontology mapping within the biomedical domain, uses only a set of syntactic ontologies, thus creating a query system that does not have a single core ontology describing the domain of interest. While multiple data sources can be queried via a single syntactic ontology within OntoFusion, the query is run directly on each syntactic ontology without the further semantic processing that a core ontology can provide.

The methodology presented here, rule-based mediation, is a form of semantic data integration that uses layered ontologies together with rules that define mappings between data formats. These rules are expressed using description logics (DL), a family of knowledge representation formalisms characterised by various levels of expressivity. The more expressive a DL is, the less tractable it is for reasoning purposes. Therefore, a language must be chosen that balances expressivity against tractability. DL are widely used in the biomedical community via the OWL-DL format (http://www.w3.org/TR/owl-ref/). Ontologies written in OWL-DL have access to a number of DL languages, and editors such as Protégé (http://protege.stanford.edu) can determine which DL subset a particular ontology is written in. While other ontology languages such as OBO [7] are commonly used in the life sciences, the reasoners available for OWL-DL language subsets are more powerful [8].

DL can represent complex logic constructs such as number restrictions and relation hierarchies. Research domains can be successfully modelled in OBO, but DL allow more complex reasoning tasks and richer semantics. This modelling capability, together with the logical inferences available when reasoning over a DL-based ontology, makes a DL format such as OWL-DL the best choice for our rule-based mediation strategy. With DL, the implicit knowledge that is present within an ontology, and which is not immediately obvious to a human, can be made explicit through inference and reasoning [9, p. 61].
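As a rough illustration of how reasoning makes implicit knowledge explicit, the following minimal Python sketch computes the subclass relationships that follow from a small set of asserted ones. It is not the reasoning procedure of any OWL-DL reasoner, and it covers only the transitivity of subclass axioms. The class names and the helper functions `inferred_superclasses` and `implicit_only` are hypothetical and chosen purely for illustration.

```python
from collections import deque

# Hypothetical asserted subclass axioms: each class maps to its direct parents.
# This toy hierarchy is an assumption made for illustration only.
ASSERTED_SUBCLASS_OF = {
    "Hexokinase": {"Kinase"},
    "Kinase": {"Enzyme"},
    "Enzyme": {"Protein"},
    "Protein": set(),
}


def inferred_superclasses(cls, asserted):
    """Return every superclass of cls reachable through asserted axioms."""
    seen = set()
    queue = deque(asserted.get(cls, ()))
    while queue:
        parent = queue.popleft()
        if parent in seen:
            continue  # already visited; also guards against cycles
        seen.add(parent)
        queue.extend(asserted.get(parent, ()))
    return seen


def implicit_only(cls, asserted):
    """Return superclasses that are entailed but were never asserted directly."""
    return inferred_superclasses(cls, asserted) - asserted.get(cls, set())


if __name__ == "__main__":
    print(sorted(implicit_only("Hexokinase", ASSERTED_SUBCLASS_OF)))
```

Here only "Hexokinase is a Kinase" is asserted directly; that it is also an Enzyme and a Protein is implicit until the hierarchy is traversed. A DL reasoner performs this kind of inference, along with far richer ones such as those involving number restrictions, over the whole ontology.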
