Ontology Alignment Evaluation Initiative - OAEI-2008 Campaign

Results of the OAEI-2008 Anatomy Track

The data used within the anatomy track (ontologies and reference alignment) has already been described on the 2007 results webpage. Readers not familiar with the data set should first visit http://webrum.uni-mannheim.de/math/lski/align2007/results.html. The same holds for the definition of recall+ and for the description of the label comparison approach used to distinguish between trivial and non-trivial correspondences.

Test Data and Experimental Setting

We divided the task of automatically generating an alignment between the human anatomy and the mouse anatomy into four subtasks. Task #1 is obligatory for participants of the anatomy track, while tasks #2, #3 and #4 are optional. Task #4 is new compared to 2007 and was introduced as a challenging fourth subtask.

For task #1 the matching system has to be applied with standard settings to obtain a result that is as good as possible with respect to the expected f-value. Here we are particularly interested in how far matching systems improved their results compared to last year's evaluation. For task #2 an alignment with increased precision has to be found, whereas for task #3 an alignment with increased recall has to be generated. We believe that systems configurable with respect to these requirements will be much more useful in concrete scenarios than static systems.

While we expect most systems to solve the first three tasks, we expect only a few systems to solve task #4. For this task a part of the reference alignment is available as additional input. Task #4 simulates the following scenario. Suppose that a group of domain experts has already created an incomplete reference alignment by manually validating a set of automatically generated correspondences. As a result a partial reference alignment, in the following referred to as Rp, is available. Given both ontologies as well as Rp, a matching system should be able to exploit the additional information encoded in Rp. We constructed Rp as the union of the correct trivial correspondences and a small set of 54 non-trivial correspondences. Thus Rp consists of 988 correspondences, while the complete reference alignment R contains 1523 correspondences.
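The construction of Rp can be sketched in a few lines of Python. This is a minimal illustration and not the organizers' tooling: the `normalize` function is only an assumed stand-in for the label comparison approach described on the 2007 results page, and the concept identifiers and labels are made up. How the non-trivial correspondences of Rp were selected is not stated here, so the sketch simply takes the first ones in sorted order.

```python
def normalize(label):
    """Assumed normalization: lowercase, underscores to blanks, collapse whitespace."""
    return " ".join(label.lower().replace("_", " ").split())


def split_trivial(reference, labels_a, labels_b):
    """Split a reference alignment into trivial and non-trivial correspondences.

    A correspondence (a, b) counts as trivial here if both concepts carry the
    same label after normalization.
    """
    trivial = {(a, b) for (a, b) in reference
               if normalize(labels_a[a]) == normalize(labels_b[b])}
    return trivial, reference - trivial


def partial_reference(reference, labels_a, labels_b, n_nontrivial):
    """Build Rp as all trivial correspondences plus n_nontrivial non-trivial ones."""
    trivial, nontrivial = split_trivial(reference, labels_a, labels_b)
    # Arbitrary but deterministic choice of the additional correspondences.
    chosen = set(sorted(nontrivial)[:n_nontrivial])
    return trivial | chosen


# Toy data (hypothetical identifiers and labels).
mouse = {"MA:1": "heart", "MA:2": "hind_limb", "MA:3": "eye lens"}
human = {"NCI:1": "Heart", "NCI:2": "Lower Extremity", "NCI:3": "Lens"}
R = {("MA:1", "NCI:1"), ("MA:2", "NCI:2"), ("MA:3", "NCI:3")}

Rp = partial_reference(R, mouse, human, n_nontrivial=1)
print(sorted(Rp))       # the trivial pair plus one non-trivial pair
print(sorted(R - Rp))   # the part of R that stays unknown to the matcher
```

With the figures reported above, Rp contains 988 - 54 = 934 trivial correspondences, and 1523 - 988 = 535 correspondences of R are not part of Rp.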


In total, nine systems participated in the anatomy task (in 2007 there were 11 participants). These systems can be divided into a group of systems using biomedical background knowledge (e.g. UMLS as lexical reference system) and a group of systems that do not exploit domain-specific background knowledge. SAMBO, its extension SAMBOdtf, and ASMOV belong to the first group, while the other systems belong to the second group. Table 1 gives an overview of the participating systems (systems that only participated in 2007 are listed with n.a. for their 2008 results). In 2007 we observed that systems of the first group have a significant advantage in finding non-trivial correspondences; in particular, the best three systems (AOAS, SAMBO, and ASMOV) made use of background knowledge. We will see later whether this observation is confirmed by the 2008 submissions.

| Matching System | Runtime | BK | Precision | Recall | Recall+ | F-measure |
|---|---|---|---|---|---|---|
| AOAS | n.a. | yes | n.a. (0.928) | n.a. (0.815) | n.a. (0.523) | n.a. (0.868) |
| SAMBO | ≈ 12h | yes | 0.869 (0.845) | 0.836 | 0.586 (0.601) | 0.852 (0.821) |
| SAMBOdtf | ≈ 17h | yes | 0.831 | 0.833 | 0.579 | 0.832 |
| RiMOM | ≈ 24min | no | 0.929 (0.377) | 0.735 (0.668) | 0.350 (0.404) | 0.821 (0.482) |
| aflood | 1min 5s | no | 0.874 | 0.682 | 0.275 | 0.766 |
| Label Eq. | - | no | 0.981 (0.981) | 0.613 (0.613) | 0.000 (0.000) | 0.755 (0.755) |
| Lily | ≈ 3h 20min | no | 0.796 (0.481) | 0.693 (0.567) | 0.470 (0.387) | 0.741 (0.520) |
| FalconAO | ≈ 12min | no | n.a. (0.963) | n.a. (0.599) | n.a. (0.127) | n.a. (0.738) |
| ASMOV | ≈ 3h 50min | yes | 0.787 (0.802) | 0.652 (0.711) | 0.246 (0.280) | 0.713 (0.754) |
| AROMA | 3min 50s | no | 0.803 | 0.560 | 0.302 | 0.660 |
| DSSim | ≈ 17min | no | 0.616 (0.208) | 0.624 (0.189) | 0.170 (0.070) | 0.620 (0.198) |
| Prior | ≈ 23min | no | n.a. (0.593) | n.a. (0.598) | n.a. (0.350) | n.a. (0.596) |
| TaxoMap | ≈ 25min | no | 0.460 (0.586) | 0.764 (0.700) | 0.470 (0.234) | 0.574 (0.638) |
| XSOM | ≈ 10h | no | n.a. (0.915) | n.a. (0.212) | n.a. (0.008) | n.a. (0.344) |

Table 1: All participants of the 2007 and 2008 evaluations. Runtime, use of domain-specific background knowledge (BK), precision, recall, recall+ and f-value for task #1. Results of the 2007 evaluation are given in parentheses if available. Notice that the measurements of 2007 have been slightly corrected due to some minor modifications of the reference alignment.

Compliance measures for task #1

Table 1 lists the results of the participants in descending order with respect to the achieved f-value (we ordered the systems according to their 2008 f-value and used the 2007 values for those systems not participating this year). Among the 2008 submissions, the SAMBO system ranks first, followed by its extension SAMBOdtf. SAMBO achieved slightly better results for both precision and recall in 2008 than in its 2007 submission, and it now nearly reaches the f-value that AOAS achieved in 2007 (0.852 compared to 0.868). This is a notable result, since SAMBO was originally designed to generate alignment suggestions that are afterwards presented to a human evaluator in an interactive fashion. While SAMBO and SAMBOdtf make extensive use of biomedical background knowledge, the RiMOM matching system is mainly based on computing label edit distances combined with similarity propagation strategies. Due to a major improvement of its results (f-value of 0.821 compared to 0.482 in 2007), RiMOM is now one of the top matching systems for the anatomy track even though it does not make use of any specific background knowledge. Notice also that RiMOM solves the matching task very efficiently (in about 24 minutes). Nearly all matching systems that also participated in 2007 improved their results; only ASMOV and TaxoMap obtained slightly worse results. The reasons for this decline still have to be clarified.
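The compliance measures can be sketched as follows. This is a minimal illustration on toy data, assuming that alignments are represented as sets of concept pairs, that the f-value is the harmonic mean of precision and recall, and that recall+ is recall restricted to the non-trivial correspondences of the reference (the precise definition is given on the 2007 results page). All function names are chosen for this example only.

```python
def precision(alignment, reference):
    """Fraction of generated correspondences that are correct."""
    return len(alignment & reference) / len(alignment) if alignment else 0.0


def recall(alignment, reference):
    """Fraction of reference correspondences that were found."""
    return len(alignment & reference) / len(reference) if reference else 0.0


def recall_plus(alignment, reference, trivial):
    """Recall restricted to the non-trivial part of the reference (assumed definition)."""
    nontrivial = reference - trivial
    return len(alignment & nontrivial) / len(nontrivial) if nontrivial else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r > 0 else 0.0


# Toy example with hypothetical correspondences.
reference = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
trivial = {("a1", "b1"), ("a2", "b2")}
alignment = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a5", "b9")}

p = precision(alignment, reference)
r = recall(alignment, reference)
print(p, r, recall_plus(alignment, reference, trivial), f_measure(p, r))

# Cross-check: some 2008 f-values of Table 1, recomputed from precision and recall.
table1 = {"SAMBO": (0.869, 0.836, 0.852),
          "RiMOM": (0.929, 0.735, 0.821),
          "Lily": (0.796, 0.693, 0.741)}
for name, (tp, tr, tf) in table1.items():
    print(name, round(f_measure(tp, tr), 3), tf)
```

Under this reading of recall+, a matcher that returns only trivial correspondences cannot score above zero, which is consistent with the recall+ of 0.000 reported for the Label Eq. baseline.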

Task #2 and #3

As explained above, these subtasks show to what extent matching systems can be configured towards a trade-off between precision and recall. To our surprise, only four participants submitted results for tasks #2 and #3, showing that they were able to adapt their system to different application scenarios. These systems were RiMOM, Lily, ASMOV, and DSSim. The results for tasks #2 and #3 can be found in Table 2. The most interesting results concern task #3. Here both Lily and RiMOM were able to detect over 50% of the non-trivial correspondences with acceptable precision (recall+ of 0.613 and 0.538, respectively). In particular, Lily outperforms all participants with respect to the task of finding non-trivial correspondences when taking all results of 2007 and 2008 into account. This is a surprising result, since Lily does not use any biomedical background knowledge.

| Matching System | #1 Prec | #1 Rec | #1 Rec+ | #1 F-Measure | #2 Prec | #2 Rec | #2 F-Measure | #3 Prec | #3 Rec | #3 Rec+ | #3 F-Measure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RiMOM | 0.929 | 0.735 | 0.350 | 0.821 | 0.964 | 0.677 | 0.795 | 0.450 | 0.808 | 0.538 | 0.578 |
| Lily | 0.796 | 0.693 | 0.470 | 0.741 | 0.863 | 0.540 | 0.664 | 0.490 | 0.790 | 0.613 | 0.605 |
| ASMOV | 0.787 | 0.652 | 0.246 | 0.713 | 0.944 | 0.044 | 0.084 | 0.763 | 0.647 | 0.238 | 0.700 |
| DSSim | 0.616 | 0.624 | 0.170 | 0.620 | 0.687 | 0.525 | 0.595 | 0.339 | 0.538 | 0.258 | 0.416 |

Table 2: Precision, recall, recall+ and f-measure, comparing the results of tasks #1, #2 and #3.
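The trade-off between precision and recall can be illustrated with a small sketch. How the participants actually reconfigured their systems is not described here; the example merely assumes a hypothetical matcher that attaches a confidence value to each correspondence, so that a threshold on this value shifts the result towards precision or towards recall. All data is made up.

```python
def select(scored, threshold):
    """Keep the correspondences whose confidence reaches the threshold."""
    return {pair for pair, confidence in scored.items() if confidence >= threshold}


def evaluate(alignment, reference):
    """Return (precision, recall) of an alignment against a reference."""
    correct = len(alignment & reference)
    p = correct / len(alignment) if alignment else 0.0
    r = correct / len(reference) if reference else 0.0
    return p, r


# Hypothetical matcher output: correspondence -> confidence.
scored = {("a1", "b1"): 0.95, ("a2", "b2"): 0.80, ("a3", "b9"): 0.60,
          ("a4", "b4"): 0.40, ("a5", "b7"): 0.30}
reference = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}

# A high threshold favours precision, a low threshold favours recall.
for threshold in (0.9, 0.5, 0.2):
    p, r = evaluate(select(scored, threshold), reference)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

In this toy setting the strict threshold yields perfect precision but finds only a third of the reference, while the loose threshold finds the complete reference at a lower precision.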

Task #4

Four systems participated in task #4: SAMBO, SAMBOdtf, RiMOM, and ASMOV. In the following we refer to the alignment generated by a matcher for task #1 as M1 and to the alignment generated for task #4 as M4. Notice first of all that a direct comparison between M1 and M4 is not appropriate to measure the improvement that results from exploiting Rp, since the correspondences in Rp are already given to the matcher in task #4. We thus compare M1 - Rp and M4 - Rp with the unknown subset of the reference alignment, Ru = R - Rp. The differences between M1 (partial reference alignment not given) and M4 (partial reference alignment given) are presented in Table 3. All participants slightly increased the overall quality of the generated alignments with respect to the unknown part of the reference alignment. SAMBOdtf and ASMOV exploited the partial reference alignment in the most effective way: their f-values increased by approximately 2 percentage points (+0.025 and +0.019). This seems to be only a minor improvement at first sight, but notice that all of the correspondences in Ru are non-trivial due to our choice of the partial reference alignment. The improvement is primarily based on generating an alignment with increased precision. ASMOV, for example, increased its precision from 0.339 to 0.402. Only SAMBOdtf also profited from the partial reference alignment through a slightly increased recall (+0.008). Apparently, the partial reference alignment is mainly used as part of a strategy that filters out incorrect correspondences.

| Matching System | Δ-Precision | Δ-Recall | Δ-F-value |
|---|---|---|---|
| SAMBO | +0.024 (0.636 → 0.660) | -0.002 (0.626 → 0.624) | +0.011 (0.631 → 0.642) |
| SAMBOdtf | +0.040 (0.563 → 0.603) | +0.008 (0.622 → 0.630) | +0.025 (0.591 → 0.616) |
| ASMOV | +0.063 (0.339 → 0.402) | -0.004 (0.258 → 0.254) | +0.019 (0.293 → 0.312) |
| RiMOM | +0.012 (0.700 → 0.712) | +0.000 (0.370 → 0.370) | +0.003 (0.484 → 0.487) |

Table 3: Each row shows the increase in precision, recall and f-measure based on a comparison with the unknown part of the reference alignment. Values for M1 → M4 are given in parentheses.
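The evaluation against the unknown part of the reference alignment can be sketched as follows. The code is a minimal illustration on made-up data, assuming that alignments are sets of correspondences and that Ru consists of the correspondences of R that are not in Rp. Removing Rp from the matcher output before scoring ensures that a system is not rewarded for simply copying the given correspondences.

```python
def evaluate_unknown_part(alignment, reference, partial):
    """Score an alignment only on the part of the reference not given as input.

    Returns (precision, recall, f-value) of `alignment - partial` measured
    against `reference - partial`.
    """
    unknown = reference - partial      # Ru: correspondences the matcher was not given
    remaining = alignment - partial    # matcher output without the given part
    correct = len(remaining & unknown)
    p = correct / len(remaining) if remaining else 0.0
    r = correct / len(unknown) if unknown else 0.0
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f


# Toy data: c* are correct correspondences, x* are incorrect ones (all hypothetical).
R = {"c1", "c2", "c3", "c4", "c5", "c6"}
Rp = {"c1", "c2"}
M1 = {"c1", "c3", "x1", "x2"}    # task #1: Rp not available
M4 = {"c1", "c2", "c3", "x1"}    # task #4: Rp available, one wrong pair filtered out

p1, r1, f1 = evaluate_unknown_part(M1, R, Rp)
p4, r4, f4 = evaluate_unknown_part(M4, R, Rp)
print(f"precision {p1:.3f} -> {p4:.3f}, recall {r1:.3f} -> {r4:.3f}, "
      f"f-value {f1:.3f} -> {f4:.3f}")
```

In this toy case the gain comes from precision alone, since the second alignment drops an incorrect correspondence but finds no additional correct one, which mirrors the pattern observed for most systems in Table 3.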


Even though the submitted alignments have been generated on different machines, we believe that the provided runtimes are nevertheless useful and provide a basis for an approximate comparison. (Notice that most of the runtime information presented in Table 1 has been provided by the participants. Only for the two fastest systems, namely aflood and AROMA, have runtimes been measured by the track organizers on the same machine, a Pentium D 3.4GHz with 2GB RAM.) Compared to the runtimes measured in last year's competition, we observe that systems with a high runtime in 2007, e.g. Lily and ASMOV, managed to decrease their runtime significantly. Amongst all systems AROMA and aflood, both participating for the first time, perform best with respect to runtime efficiency. In particular, the aflood system achieves results of high quality in a very efficient way.


In last year's evaluation we concluded that the use of domain-related background knowledge is a crucial point in matching biomedical ontologies. This observation is supported by claims made by other researchers. The current results partially support this claim, in particular the good results of the SAMBO system. Nevertheless, the results of RiMOM and Lily in tasks #1 and #3 indicate that matching systems are able to detect a significant fraction of non-trivial correspondences even though they do not rely on background knowledge.
In particular, we computed the union of the alignments generated by RiMOM (#3), Lily (#3) and SAMBO (#1). For this union we measured a recall of 92.25% with a recall+ of 80%, while each system on its own detects a significantly lower number of correspondences. Thus, there seems to be significant potential in exploiting the knowledge encoded in the ontologies. A combination of both approaches might result in a hybrid matching strategy that uses both background knowledge and the internal knowledge to its full extent.
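The effect of combining alignments can be sketched as follows. This is a minimal illustration on made-up data, again assuming that alignments are sets of correspondences; it does not reproduce the reported figures.

```python
def union_of(alignments):
    """Combine several alignments by taking the union of their correspondences."""
    return set().union(*alignments)


def recall(alignment, reference):
    """Fraction of reference correspondences that were found."""
    return len(alignment & reference) / len(reference) if reference else 0.0


# Toy data: c* are correct correspondences, x1 is incorrect (all hypothetical).
reference = {"c1", "c2", "c3", "c4", "c5"}
system_a = {"c1", "c2"}
system_b = {"c2", "c3", "x1"}
system_c = {"c4"}

for name, alignment in [("A", system_a), ("B", system_b), ("C", system_c)]:
    print(name, recall(alignment, reference))

combined = union_of([system_a, system_b, system_c])
print("union", recall(combined, reference))
```

The union reaches a higher recall than any single system because the systems find partly different correspondences. It also accumulates the incorrect correspondences of every system, so the precision of such a combination has to be checked separately.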


Most of the results presented on this webpage will be published as part of the "Final Results of the OAEI 2008" paper. If you are one of the participants and detect some incorrect or misleading information, please contact Christian Meilicke (Email: christian (at) informatik (dot) uni-mannheim (dot) de). Comments are welcome!