Ontology Alignment Evaluation Initiative - OAEI-2011 Campaign

Results for OAEI 2011

The following content is (mainly) based on the final version of the anatomy section in the OAEI results paper. If you notice any kind of error (wrong numbers, incorrect information on a matching system), do not hesitate to contact Christian Meilicke (email address below).

Experimental setting

Contrary to previous years, we distinguish only between two evaluation experiments. Subtask #1 is about applying a matcher with its standard settings to the matching task. In previous years we also asked for additional alignments that favor precision over recall and vice versa (subtasks #2 and #3). These subtasks are not part of the Anatomy track in 2011, because the SEALS platform does not support running tools with different configurations. Furthermore, we have proposed a fourth subtask, in which a partial reference alignment has to be used as additional input. In the preliminary version of this paper, we do not conduct evaluation experiments related to this specific matching task. We will report on several experiments - with varying input alignments - in the final version of the paper, analyzing all tools that support this kind of functionality.

In our experiments we compare precision, recall, F-measure, and recall+. We introduced recall+ to measure the number of detected non-trivial correspondences. From 2007 to 2009 we reported runtimes measured by the participants themselves; this survey revealed large differences in runtimes. This year we can compare runtimes directly, since we executed all systems ourselves on the same machine. We used a Windows 2007 machine with 2.4 GHz (2 cores) and 8 GB RAM for generating the alignments.
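To make these measures concrete, here is a minimal sketch (in Python, with illustrative names only; this is not the official evaluation code) that computes precision, recall, F-measure, and recall+ for alignments represented as sets of correspondence pairs:

    # Illustrative sketch: an alignment is a set of (source_iri, target_iri) pairs.

    def precision(system, reference):
        # Fraction of generated correspondences that are correct.
        return len(system & reference) / len(system) if system else 0.0

    def recall(system, reference):
        # Fraction of reference correspondences that were found.
        return len(system & reference) / len(reference) if reference else 0.0

    def f_measure(system, reference):
        # Harmonic mean of precision and recall.
        p, r = precision(system, reference), recall(system, reference)
        return 2 * p * r / (p + r) if p + r > 0 else 0.0

    def recall_plus(system, reference, trivial):
        # Recall restricted to the non-trivial part of the reference alignment,
        # i.e. the reference minus the alignment found by the StringEquiv baseline
        # (see the definition in the results section below).
        non_trivial = reference - trivial
        return len(system & non_trivial) / len(non_trivial) if non_trivial else 0.0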

For the 2011 evaluation we again improved the reference alignment of the data set. We removed doubtful correspondences and included several correct correspondences that had not been included in the past. As a result, we measured a slightly better F-measure (+ ~1%) for the alignments generated in 2010, compared to the computation based on the old reference alignment. For that reason we have also included the top-3 systems of 2010, with recomputed precision/recall scores, in the results presented in the following section.

Results

In the following we analyze the robustness of the submitted systems and their runtimes. Further, we report on the quality of the generated alignments, mainly in terms of precision and recall.

Robustness and Scalability

In 2011 there were 16 participants in the SEALS modality, while in 2010 we had only 9 participants for the anatomy track. However, this comparison is misleading: some of these 16 systems are not really intended to match large biomedical ontologies. For that reason our first interest concerns the question which systems generate a meaningful result within an acceptable time span. Results are shown in the following table. First, we focused on whether systems finish the matching task in less than 24 hours. This is the case for a surprisingly low number of systems. The systems that do not finish in time can be separated into those that throw an exception related to insufficient memory after some time (marked with 'X') and those that were still running when we stopped the experiments after 24 hours (marked with 'T'). We could not execute the two systems OACAS and OMR, not listed in the table, because the required interfaces had not been implemented properly.

Obviously, matching relatively large ontologies is a problem for five out of fourteen executable systems. The two systems MapPSO and MapEVO can cope with ontologies that contain more than 1000 concepts, but have problems finding correct correspondences. Both systems generate comprehensive alignments; however, MapPSO finds only one correct correspondence and MapEVO finds none. This can be related to the way labels are encoded in the ontologies: the ontologies from the anatomy track differ from those of the benchmark and conference tracks in this respect.

For those systems that generate an acceptable result, we observe a high variance in measured runtimes. Clearly ahead is the system LogMap (24 s), followed by Aroma (39 s). Next are Lily and AgreementMaker (approx. 10 min), CODI (30 min), CSA (1 h 15 min), and finally MaasMatch (18 h).

Results for subtask #1

The results of our experiments are also presented in the table above. Since we have improved the reference alignment, we have also included recomputed precision/recall scores for the top-3 alignments submitted in 2010 (marked by the subscript 2010). Keep in mind that the alignment AgreementMaker (AgrMaker) submitted in 2010 was, in terms of F-measure, the best submission to the OAEI anatomy track compared to all previous submissions. Note that we also added the baseline StringEquiv, which refers to a matcher that compares the normalized labels of two concepts; if these labels are identical, a correspondence is generated. Recall+ is defined like recall, with the difference that the reference alignment R is replaced by the set difference R - ASE, where ASE is the alignment generated by StringEquiv.
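As an illustration of this baseline, the following sketch generates the StringEquiv alignment by comparing normalized labels. The exact normalization used in the evaluation may differ; the lowercasing and punctuation stripping shown here are assumptions.

    # Illustrative sketch of the StringEquiv baseline: a correspondence is generated
    # whenever the normalized labels of two concepts are identical.
    import re

    def normalize(label):
        # Lowercase and strip everything except letters and digits (an assumption).
        return re.sub(r"[^a-z0-9]", "", label.lower())

    def string_equiv(source_labels, target_labels):
        # source_labels / target_labels: dicts mapping concept IRIs to their labels.
        by_norm = {}
        for iri, label in target_labels.items():
            by_norm.setdefault(normalize(label), []).append(iri)
        alignment = set()
        for iri, label in source_labels.items():
            for target_iri in by_norm.get(normalize(label), []):
                alignment.add((iri, target_iri))
        return alignment

The alignment ASE produced this way is what the recall_plus sketch above takes as its 'trivial' argument: recall+ is then the recall computed against R - ASE.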

This year we have three systems that generate very good results, namely AgreementMaker, LogMap, and CODI. The results of LogMap and CODI are very similar; both systems manage to generate an alignment with an F-measure close to the 2010 submission of AgreementMaker, with LogMap slightly ahead. However, in 2011 the alignment generated by AgreementMaker is even better than in the previous year. In particular, AgreementMaker finds more correct correspondences, which can be seen in the recall as well as the recall+ scores. At the same time, AgreementMaker increases its precision. Also remarkable are the good results of LogMap, given that the system finishes the matching task in less than half a minute. It is thus 25 times faster than AgreementMaker and more than 75 times faster than CODI.

Lily, Aroma, CSA, and MaasMatch (MaasMtch) achieve weaker results than the three top matching systems; however, they have proved to be applicable to larger matching tasks and can generate acceptable results for a pair of ontologies from the biomedical domain. While these systems cannot (or barely) top the StringEquiv baseline in terms of F-measure, they nevertheless manage to generate many correct non-trivial correspondences. A detailed analysis of the results revealed that, at the same time, they miss many trivial correspondences. This is an uncommon result, which might, for example, be related to some pruning operations performed during the comparison of matchable entities. An exception is the system MaasMatch, which generates results that are highly similar to a subset of the alignment generated by the StringEquiv baseline.

Using an input alignment

This specific task was known as subtask #4 in previous OAEI campaigns. Originally, we planned to study the impact of different input alignments of varying size. The idea is that a partial input alignment, which might have been generated by a human expert, can help the matching system to find missing correspondences. However, taking into account only those systems that could generate a meaningful alignment in time, only AgreementMaker implemented the required interface. Thus, a comparative evaluation is not possible. We may have to put more effort into advertising this specific subtask for the next OAEI.

Alignment coherence

This year we also evaluated alignment coherence. The anatomy dataset contains only a small number of disjointness statements, and the ontologies under discussion are in EL++. Thus, even simple techniques might have an impact on the coherence of the generated alignments. For the anatomy dataset the systems LogMap, CODI, and MaasMatch generate coherent alignments. The first two systems put a focus on alignment coherence and apply special methods to ensure coherence. MaasMatch has generated a small, highly precise, and coherent alignment. The alignments generated by the other systems are incoherent. A more detailed analysis related to alignment coherence is conducted for the alignments of the conference dataset.
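To illustrate the kind of check involved, the following hedged sketch merges the two ontologies, asserts each correspondence as an equivalence axiom, and asks a reasoner for unsatisfiable classes; an alignment is incoherent if any class becomes unsatisfiable. The file names and IRIs are placeholders, and the use of owlready2 with its bundled HermiT reasoner is an assumption about tooling, not the setup actually used in the evaluation.

    # Hedged sketch of a coherence check (not the evaluation tool actually used).
    from owlready2 import get_ontology, default_world, sync_reasoner

    mouse = get_ontology("file://mouse.owl").load()   # placeholder file names
    human = get_ontology("file://human.owl").load()

    # Placeholder correspondences; in practice these come from the alignment file.
    alignment = [("http://example.org/mouse#MA_0000001",
                  "http://example.org/human#NCI_C12219")]

    with mouse:
        for source_iri, target_iri in alignment:
            source, target = default_world[source_iri], default_world[target_iri]
            if source is not None and target is not None:
                source.equivalent_to.append(target)  # assert the correspondence

    sync_reasoner()  # runs the bundled HermiT reasoner (requires Java)
    unsatisfiable = list(default_world.inconsistent_classes())
    print("incoherent" if unsatisfiable else "coherent",
          "-", len(unsatisfiable), "unsatisfiable classes")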

Conclusions

Less than half of the systems generate good, or at least acceptable, results for the matching task of the anatomy track. With respect to those systems that failed on anatomy, we can assume that this track was not in the focus of their developers. This means, at the same time, that many systems are specifically designed or configured for the matching tasks found in the benchmark and conference tracks. Only a few of them are robust 'all-round' matching systems that are capable of solving different tasks without changing their settings or algorithms.

The positive results of 2011 are the top results of AgreementMaker and the runtime performance of LogMap. AgreementMaker generated a very good result by increasing precision and recall compared to its previous year's submission, which was already the best submission in 2010. LogMap clearly outperforms all other systems in terms of runtime and still generates good results. We refer the reader to the OAEI papers of these two systems for details on the algorithms.


Additional remark

* In the overview section of the final version of the OAEI 2011 results paper, we added the following remark: AgreementMaker uses machine learning techniques to choose automatically between one of three settings optimized for the benchmark, anatomy, and conference datasets. It uses a subset of the available reference alignments as input to the training phase and thus clearly uses a specifically tailored setting for passing these tests.

Thus, AgreementMaker did participate in the Anatomy track (implicitly) with a specific setting, while all of the other systems, to our knowledge, used a general setting. This has to be taken into account when interpreting the results of AgreementMaker.

Acknowledgements

Again, we gratefully thank Elena Beisswanger (Jena University Language and Information Engineering Lab) for her thorough support in improving the quality of the data set. Moreover, we would like to thank Dominique Ritze (University of Mannheim Library), who helped with the analysis of the results and generated the results table above, which allows sorting by different criteria.

Contact

This track is organized by Christian Meilicke and Heiner Stuckenschmidt and supported by the SEALS project. If you have any problems working with the ontologies, any questions related to SEALS, or any suggestions related to the anatomy track, feel free to write an email to christian [at] informatik [.] uni-mannheim [.] de.