Ontology Alignment Evaluation Initiative - OAEI-2010 Campaign

Anatomy - Results of 2010 Evaluation

The anatomy track confronts existing matching technology with a specific type of real-world ontology from the biomedical domain. In this domain, a significant number of ontologies have been built covering different aspects of medical research.

Test Data and Experimental Setting

The data set of this track has been used since 2007. For a detailed description we refer the reader to the OAEI 2007 [2] results paper. The ontologies of the anatomy track are the NCI Thesaurus describing the human anatomy, published by the National Cancer Institute (NCI), and the Adult Mouse Anatomical Dictionary, which has been developed as part of the Mouse Gene Expression Database project. Both resources are part of the Open Biomedical Ontologies (OBO). The alignment between these ontologies has been created by experts of the domain [1].

As in the previous years, we divided the matching task into four subtasks. Subtask #1 is obligatory for participants of the anatomy track, while subtasks #2, #3 and #4 are again optional.

Notice that in 2010 we used the SEALS evaluation service for subtask #1. In the course of using the SEALS services we published the complete reference alignment for the first time. In the future we plan to include all subtasks in the SEALS modality. This requires extending the interfaces of the SEALS evaluation service, for example to allow an (incomplete) alignment as an additional input parameter.

The harmonization of the ontologies applied in the process of generating a reference alignment (see [1] and [2]) resulted in a high number of rather trivial correspondences (61%). These correspondences can be found by very simple string comparison techniques. At the same time, we have a good share of non-trivial correspondences (39%). This is an important characteristic of the data set to be taken into account in the following analysis. The partial reference alignment used in subtask #4 is the union of all trivial correspondences and 54 non-trivial correspondences.
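To illustrate what "very simple string comparison techniques" can achieve here, the following sketch finds label pairs that become equal after a naive normalization step. The normalization shown (lowercasing, underscore and whitespace handling) and the toy labels are our assumptions, not the procedure actually used to build the reference alignment:

```python
def normalize(label: str) -> str:
    """Normalize a class label for naive string comparison."""
    return " ".join(label.lower().replace("_", " ").split())

def trivial_matches(labels_a, labels_b):
    """Return pairs of labels from the two ontologies that are
    equal after normalization (the 'trivial' correspondences)."""
    index = {normalize(label): label for label in labels_b}
    return [(a, index[normalize(a)])
            for a in labels_a if normalize(a) in index]

# Toy example with invented labels:
mouse = ["Urinary_Bladder", "heart"]
human = ["urinary bladder", "Liver"]
print(trivial_matches(mouse, human))
# [('Urinary_Bladder', 'urinary bladder')]
```

Correspondences found this way make up the trivial 61%; the remaining 39% require more sophisticated techniques.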

Due to the experiences of the past years, we decided to slightly modify the test data set for the 2010 evaluation. We removed some doubtful subsumption correspondences and added a number of disjointness statements at the top of the hierarchies to increase the expressivity of the data set. Furthermore, we eliminated three incorrect correspondences. The reference alignment is now coherent with respect to the ontologies to be matched.


While the number of participants has been nearly stable over four years, in 2010 more systems participated for the first time (5 systems) than in the previous years (2 systems on average). An overview is presented in the table below. Four of the newcomers also participate in other tracks, while NBJLM participates only in the Anatomy track. NBJLM is thus, together with AgreementMaker (AgrMaker), a system that uses a track-specific parameter setting. Taking part in several tracks with a standard setting obviously makes it much harder to obtain good results in a specific track.

System          2007   2008   2009   2010
Falcon-AO
X-SOM
Avg. f-measure  0.598  0.718  0.764  0.785
Participants of the last years.

In the last row of the table, the average f-measure per year for subtask #1 is shown. We observe significant improvements over time. However, the measured improvements decrease from year to year and seem to be leveling off (+12% from 2007 to 2008, <+5% from 2008 to 2009, +2% from 2009 to 2010). We have marked the participants with an f-measure >= 0.8 with a symbol.


In previous years we reported runtimes measured by the participants themselves. The differences we observed - from several minutes to several days - could not be explained by the use of different hardware alone. However, these differences became less significant over the years, and in 2009 all systems except one required between 2 and 30 minutes. We therefore abstained from an analysis of runtimes this year. In 2011 we plan to execute the matching systems on the SEALS platform to enable an exact measurement of runtimes that is not biased by differences in hardware. Until then, we refer readers interested in runtimes to the results papers of the participants, which will be available here soon.

Main results for subtask #1

The results for subtask #1 are presented in the following table, ordered by achieved f-measure. In 2010, AgreementMaker (AgrMaker) generates the best alignment with respect to f-measure. Moreover, this result is based on a high recall compared to the systems in the following positions. This is a remarkable result, because even the SAMBO system of 2008 could not generate a higher recall with the use of UMLS. However, we have to mention again that AgreementMaker uses a specific setting for the anatomy track.

System Task #1 Task #2 Task #3 Recall+
System Precision Recall F-measure Precision Recall F-measure Precision Recall F-measure Subtask #1 Subtask #3
AgrMaker* 0.903 0.853 0.877 0.962 0.751 0.843 0.771 0.874 0.819 0.630 0.700
Ef2Match 0.955 0.781 0.859 0.968 0.745 0.842 0.954 (0.766) 0.781 (0.838) 0.859 (0.800) 0.440 0.440 (0.588)
NBJLM* 0.920 0.803 0.858 - - - - - - 0.569 -
SOBOM 0.949 0.778 0.855 - - - - - - 0.433 -
BLOOMS 0.954 0.731 0.828 0.967 0.725 0.829 - - - 0.315 -
TaxoMap 0.924 0.743 0.824 0.956 0.689 0.801 0.833 0.774 0.802 0.336 0.414
ASMOV 0.799 0.772 0.785 0.865 0.757 0.808 0.717 0.792 0.753 0.470 0.538
CODI 0.968 0.651 0.779 0.964 0.662 0.785 0.782 0.695 0.736 0.182 0.383
GeRMeSMB 0.884 0.307 0.456 0.883 0.307 0.456 0.080 0.891 0.147 0.249 0.838
Results for subtasks #1, #2, and #3 in terms of precision, recall (in addition recall+ for #1 and #3) and f-measure. Systems marked with a * do not participate in other tracks or have chosen a setting specific to this track. Note that ASMOV modified its standard setting only in a restricted way (activating UMLS as an additional resource); thus, we did not mark this system. Some values for Ef2Match have been added after the final submission deadline (see below).

AgreementMaker is followed by three participants (Ef2Match, NBJLM, SOBOM) that share a very similar characteristic regarding f-measure and observed precision score. All of these systems clearly favor precision over recall. A further analysis has to clarify to what degree the alignments generated by these systems overlap, as indicated by their precision/recall characteristics. Notice that these systems obtained scores better than or similar to the results of the top systems of previous years. One explanation is that the organizers of the track made the reference alignment available to the participants. More precisely, participants could at any time compute precision and recall scores via the SEALS services to test different settings of their algorithms. On the one hand, this allows improving a matching system through constant formative evaluation in a direct feedback cycle; on the other hand, a configuration tuned perfectly to this data set may cause problems on different data sets.

Recall+ and further results

In the following we again use the recall+ measure as defined in [2]. It measures how many non-trivial correct correspondences - those not detectable by string equivalence - can be found in an alignment. The top three systems with respect to recall+ in subtask #1 are AgreementMaker, NBJLM and ASMOV. Only ASMOV participated in several tracks with the same setting. Obviously, it is not easy to find a large number of non-trivial correspondences with a standard setting.
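Treating an alignment as a plain set of correspondence pairs, the measures used throughout this page can be sketched as follows (a minimal sketch; the function names are ours, and the "trivial" set stands for the correspondences detectable by string equivalence):

```python
def precision_recall_f(alignment, reference):
    """Standard set-based precision, recall and f-measure."""
    correct = alignment & reference
    p = len(correct) / len(alignment) if alignment else 0.0
    r = len(correct) / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def recall_plus(alignment, reference, trivial):
    """Share of the non-trivial reference correspondences (reference
    minus the string-detectable ones) contained in the alignment."""
    non_trivial = reference - trivial
    if not non_trivial:
        return 0.0
    return len(alignment & non_trivial) / len(non_trivial)
```

A matcher that returns only the trivial correspondences can thus reach a decent recall while its recall+ stays at zero, which is exactly the distinction the measure is meant to capture.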

In 2010, six systems participated in subtask #3. The top three systems regarding recall+ in this task are GeRoMe-SMB (GeRMeSMB), AgreementMaker and ASMOV. Since a specific instruction about the balance between precision and recall is missing from the task description, the results vary to a large degree. GeRoMe-SMB detected 83.8% of the correspondences marked as non-trivial, but at a precision of 8%. AgreementMaker and ASMOV modified their settings only slightly; they were nevertheless able to detect 70% and 53.8% of all non-trivial correspondences, respectively.

In subtask #2, seven systems participated. It is interesting to see that systems like ASMOV, BLOOMS and CODI generate alignments with slightly higher f-measures for this task than for their subtask #1 submissions. The subtask #2 results of AgreementMaker are similar to the subtask #1 results submitted by other participants. This shows that many systems in 2010 focused on a similar strategy that exploits the specifics of the data set, resulting in a high f-measure.

Only about half of the participants submitted results for subtasks #2 and #3. This may be related to an unclear description of the expected results. In the future we have to think about an alternative description of these subtasks, together with a different kind of evaluation, to increase participation.

Results for subtrack #4

In the following we refer to an alignment generated for subtask #n as An. In our evaluation we again use the method introduced in 2009. We compare both A1 ∪ Rp and A4 ∪ Rp with the reference alignment R. Thus, we compare the situation where the partial reference alignment is added after the matching process against the situation where the partial reference alignment is available as an additional resource exploited within the matching process. Note that a direct comparison of A1 and A4 would not take into account to what extent the partial reference alignment was already included in A1, resulting in a distorted interpretation.
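This comparison can be sketched in a few lines, again with alignments modeled as sets of correspondence pairs (function names are ours; precision, recall and f-measure are computed in the usual set-based way):

```python
def prf(alignment, reference):
    """Set-based precision, recall and f-measure against a reference."""
    correct = alignment & reference
    p = len(correct) / len(alignment) if alignment else 0.0
    r = len(correct) / len(reference) if reference else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def compare_with_partial_reference(a1, a4, rp, r):
    """Evaluate A1 ∪ Rp and A4 ∪ Rp against the full reference R.
    Taking the union with Rp on both sides ensures that correspondences
    already contained in the partial reference alignment do not
    distort the comparison between A1 and A4."""
    return prf(a1 | rp, r), prf(a4 | rp, r)
```

A system that merely copies Rp into its output gains nothing under this scheme, since the same union is applied to the subtask #1 alignment as well.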

Results are presented in the following table. Three systems participated in task #4 in 2010. Additionally, we added a row for the 2008 submission of SAMBOdtf, the system with the best results measured in past years. AgreementMaker and ASMOV use the input alignment to increase the precision of the final result. At the same time, these systems filter out some correct correspondences, finally resulting in a slightly changed f-measure. This fits the tendency we observed in past years (compare with the results for SAMBOdtf in 2008). The effects of this strategy are not very strong. However, as argued in previous years, the input alignment has a characteristic that makes it hard to exploit this information.

System Δ-Precision Δ-Recall Δ-F-measure
AgrMaker +0.025 (0.904 → 0.929) -0.025 (0.876 → 0.851) -0.002 (0.890 → 0.888)
ASMOV +0.029 (0.808 → 0.837) -0.016 (0.824 → 0.808) +0.006 (0.816 → 0.822)
CODI -0.002 (0.970 → 0.968) +0.030 (0.716 → 0.746) +0.019 (0.824 → 0.843)
SAMBOdtf in 2008 +0.021 (0.837 → 0.856) +0.003 (0.867 → 0.870) +0.011 (0.852 → 0.863)
Changes in precision, recall and f-measure when comparing subtask #1 with subtask #4.

CODI has chosen a different strategy. While the change in precision is negligible, recall increases by 3%. Even though the overall effect is still not very strong, the system exploits the input alignment in the most effective way. However, CODI's recall for subtask #1 is relatively low compared to the other systems. It is unclear whether the strategy of CODI would also work for the other systems, where a ceiling effect might prevent the exploitation of the positive effects. We refer the interested reader to the system's results paper for a description of the algorithm.


Overall, we see a clear improvement when comparing this year's results with those of previous years. This holds both for the average participant and for the top performer. A very positive outcome is the increase in recall values. In addition to the evaluation experiments reported above, we computed the union of all submissions to subtask #1. For the resulting alignment we measured a precision of 69.7% and a recall of 92.7%. Adding the correct correspondences generated in subtask #3, we reached a recall of 97.1%. By combining the strategies used by different matching systems it is thus possible to detect nearly all correct correspondences.

The availability of the SEALS evaluation service surely had an effect on the results submitted in 2010; we have already discussed its pros and cons. In the future we plan to extend the data set of the anatomy track with additional ontologies and reference alignments, towards a more comprehensive and general track covering different types of biomedical ontologies. In particular, we will not publish the complete set of reference alignments, in order to conduct part of the evaluation experiment in blind mode. This requires, however, finding and analyzing interesting and well-suited data sets. The strategy of publishing parts of the evaluation material while keeping other parts hidden seems to be the best approach.


The paragraphs above the horizontal rule correspond to 99% to the content presented in the OAEI preliminary results paper. If you detect any errors or misleading interpretations, please report them to Christian Meilicke (address below). It is then possible to fix these issues directly on this webpage and subsequently in the final version of the paper.

2009 vs. 2010 version

Above we reported slight differences between the 2009 and 2010 versions of the data set. To check to what extent these changes might have affected the results, we ran the matchers AROMA and AFlood with their 2009 settings on both versions of the data set, and AgreementMaker (thanks to Cosmin Stroe!) with its 2010 setting on both versions. Results are presented in the following table.

Matcher/Setting          Precision 09 → 10  Recall 09 → 10  F-measure 09 → 10  |A09-A10| + |A10-A09|
Aroma (2009 setting)     0.770 → 0.769      0.678 → 0.683   0.721 → 0.724      33
AFlood (2009 setting)    0.873 → 0.871      0.653 → 0.664   0.747 → 0.753      173
AgrMaker (2010 setting)  0.896 → 0.903      0.848 → 0.853   0.872 → 0.877      128

Note that the differences in terms of precision, recall and f-measure are very small. It seems that the fixes helped matching systems only to a very limited degree (even though the generated alignments themselves show more significant differences in the number of non-overlapping correspondences). However, AgreementMaker would also have generated the alignment with the highest f-measure, with very similar precision/recall characteristics, for the 2009 version of the data set.

Final Submissions of the 2010 participants

Download alignments

Modifications / Updates / Corrections

(1) Chua Wei Khong Watson reported that his submissions of Eff2Match to subtasks #2 and #3 had not been included in the results. This was a mistake by the organizers and will be fixed in the final version of the OAEI results paper. We have already applied the corresponding changes to this webpage.

(2) Chua Wei Khong Watson reported that his submission of Eff2Match to subtask #3 was unfortunately based on a wrong file. The correct file has now been analysed and the corrected values have been added in parentheses.


This track is organized by Christian Meilicke and Heiner Stuckenschmidt. If you notice any errors or misleading remarks on this page, directly contact Christian Meilicke (email to christian [at] informatik [.] uni-mannheim [.] de).


We gratefully thank Elena Beisswanger (Jena University Language and Information Engineering Lab) for her thorough support on improving the quality of the data set. The modifications are documented here.


[1] Bodenreider O., Hayamizu T., Ringwald M., De Coronado S. and Zhang S.: Of Mice and Men: Aligning Mouse and Human Anatomies. Proceedings of the American Medical Informatics Association (AMIA) Annual Symposium, 2005.

[2] Jérôme Euzenat et al.: Results of the Ontology Alignment Evaluation Initiative 2007. In Proceedings of the 2nd International Workshop on Ontology Matching (OM-07), Busan (Korea), 2007.