The following content is (mainly) based on the final version of the library section in the OAEI results paper.
If you notice any kind of error (wrong numbers, incorrect information about a matching system), please do not hesitate to contact us.
Libraries play an important role in the linked data web, and it is widely agreed among them that linked data technologies are well suited to integrate the data of libraries around the world and to foster collaboration on cataloguing. Library data does not only consist of the vast amount of cataloguing data, but also -- and probably more interestingly for other communities -- of authority data, i.e., normed descriptions of locations, events, persons, corporate bodies, and subject concepts. The subject concepts are usually organized in more or less hierarchical knowledge organization systems, together with semantic relations between the concepts. A thesaurus is such a knowledge organization system: it is used for indexing purposes and provides quasi-synonymous, descriptive labels for each concept. Thesauri are sometimes referred to as lightweight ontologies; however, we will see that this characterization can be misleading.
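To make this concrete for readers outside the library community, the following is a minimal, SKOS-inspired sketch of a thesaurus concept in Python. The class layout and the example concept are purely illustrative and are not taken from TheSoz or STW.

```python
from dataclasses import dataclass, field

# Minimal, SKOS-inspired model of a thesaurus concept: one preferred label per
# language, any number of quasi-synonymous alternative labels, and hierarchical
# (broader/narrower) plus associative (related) links to other concepts.
@dataclass
class Concept:
    uri: str
    pref_label: dict                                  # language -> preferred (descriptor) label
    alt_labels: dict = field(default_factory=dict)    # language -> list of non-preferred labels
    broader: list = field(default_factory=list)       # URIs of broader concepts
    narrower: list = field(default_factory=list)      # URIs of narrower concepts
    related: list = field(default_factory=list)       # URIs of associatively related concepts

# Illustrative example only (hypothetical URIs, not actual TheSoz/STW data):
inflation = Concept(
    uri="http://example.org/thesaurus/inflation",
    pref_label={"de": "Inflation", "en": "Inflation"},
    alt_labels={"de": ["Geldentwertung"], "en": ["Price inflation"]},
    broader=["http://example.org/thesaurus/monetary-policy"],
)
```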
Thesauri, and authority data in general, have a long history in libraries and are actively used and maintained by information professionals and domain experts. Due to their high quality and their long-term development, they could function as a "backbone of the Semantic Web".

Most thesauri are domain-dependent and specialized for use within a certain field, e.g., to index publications with an economic focus. During previous experiments, we examined the topical overlap between the two thesauri used in this challenge: TheSoz (social sciences) and STW (economics). Not only do they share a lot of concepts, there is also a manually created alignment between them that can be used as a reference. Many thesauri exist that cover the same or overlapping domains, often in different languages. Multilingual thesauri are an important means to bridge the gap between catalogs in different languages, so that users can search for relevant literature using their own language. Another possibility is the creation of links between concepts across different thesauri, possibly in different languages. Such alignments -- or correspondences or cross-concordances -- can be exploited to mutually add further information to both thesauri and subsequently improve retrieval. Therefore, for many selected thesauri, alignments manually created by domain experts already exist. Nevertheless, the automatic identification of alignments is strongly desired, mainly for two reasons: First, the manual creation of alignments between all existing thesauri is not feasible, so additional alignments have to be created, possibly by exploiting existing alignments (e.g., their transitivity; see the sketch below). Second, automatically created alignments can be used to improve and enhance existing alignments, after approval by a domain expert. This is necessary, as most existing alignments are not complete, and even if they are supposed to be complete, they have to be maintained just like the thesauri themselves, i.e., a constant effort is required to keep them up-to-date.
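As a rough illustration of what exploiting the transitivity of existing alignments could look like, consider composing an alignment A-B with an alignment B-C to obtain candidate correspondences A-C. The data format (sets of URI pairs) and the function name are our own assumptions, not an existing tool.

```python
# Illustrative sketch only: derive candidate correspondences between thesauri
# A and C by composing two existing alignments A<->B and B<->C.
# Alignments are modeled as sets of (source_uri, target_uri) pairs.
def compose_alignments(a_to_b, b_to_c):
    by_b = {}
    for a, b in a_to_b:
        by_b.setdefault(b, []).append(a)
    candidates = set()
    for b, c in b_to_c:
        for a in by_b.get(b, []):
            candidates.add((a, c))   # candidates only; approval by a domain expert is still needed
    return candidates
```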
This library track is a new track within OAEI. However, there has already been a library track from 2007 to 2009 using different thesauri, as well as other thesaurus tracks like the food track and the environment track. A common motivation of these tracks is the use of a real-world scenario, i.e., real thesauri. For us, a further motivation is to develop a better understanding of how thesauri differ from ontologies and how these differences affect state-of-the-art ontology matchers. We hope that the community accepts the challenge and that subsequently significant improvements can be seen that push the quality of automatic alignments between thesauri. Furthermore, we will use the matching results as input for the maintainers of the reference alignment to improve it. While a full manual evaluation of all matching results is certainly not feasible, this way we constantly improve the reference alignment and mitigate possible weaknesses and incompleteness.

All systems listed in the table above are sorted according to their F-measure values. Altogether, 13 of the 21 submitted matching systems were able to create an alignment. Three matching systems (MaasMatch, MEDLEY, Wmatch) did not finish within the time frame of one week, while five threw an exception.
Of all these systems, GOMMA performs best in terms of F-measure, closely followed by ServOMapLt and LogMap. However, the precision and recall measures vary a lot across the top three systems. Depending on the application, an alignment achieving either high precision or high recall is preferable. If recall is the focus, the alignment created by GOMMA is probably the best choice, with a recall of about 90%. Other systems generate alignments with higher precision, e.g., ServOMap with over 70% precision, while mostly having significantly lower recall values (except for Hertuda).
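For reference, this is how precision, recall, and F-measure relate to the reference alignment when both the system alignment and the reference are treated as sets of correspondences. This is a generic sketch of the standard definitions, not code from the evaluation framework.

```python
# Standard set-based evaluation measures for an alignment against a reference:
# both arguments are sets of (source_uri, target_uri) correspondences.
def evaluate(alignment, reference):
    correct = alignment & reference
    precision = len(correct) / len(alignment) if alignment else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```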
From the results obtained by the matching strategies taking the different types of labels into account, we can see that a matching based on preferred labels only outperforms the other matching strategies. MatcherPref achieves the highest F-measure in these tests. The results of MatcherPrefDE and MatcherPrefEN provide an insight into the language characteristics of both thesauri and the reference alignment. MatcherPrefDE achieves the highest precision value (nearly 90%), albeit with a recall of only 60%. Both thesauri as well as the reference alignment have been developed in Germany and focus on German terms. From the results of MatcherPrefEN, we can see the difference: precision and especially recall decrease significantly when only the preferred English labels are used. On the one hand, only about 80% of the found correspondences are correct; on the other hand, less than half of all correspondences can be found this way. This can be a disadvantage for systems that use NLP techniques on English labels or rely on language-specific background knowledge like WordNet.

The high precision values of the pref matchers reflect the fact that the preferred labels are chosen specifically to unambiguously identify the concepts. Our interpretation is that the English translations are partly not as precise as the original German terms (drop in precision) and not consistent regarding the English terminology (drop in recall).
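The following is a minimal sketch of the idea behind the pref matchers, reusing the illustrative Concept structure from above; the actual implementation may differ, and normalization is reduced here to lower-casing. Restricting the languages to German or English only corresponds to the MatcherPrefDE and MatcherPrefEN variants.

```python
# Sketch of the preferred-label baseline: two concepts are matched if they
# share an identical (lower-cased) preferred label in one of the considered
# languages. languages=("de",) or ("en",) yields the language-specific variants.
def match_pref(concepts_a, concepts_b, languages=("de", "en")):
    index = {}
    for c in concepts_b:
        for lang in languages:
            label = c.pref_label.get(lang)
            if label:
                index.setdefault((lang, label.lower()), []).append(c.uri)
    correspondences = set()
    for c in concepts_a:
        for lang in languages:
            label = c.pref_label.get(lang)
            if label:
                for target in index.get((lang, label.lower()), []):
                    correspondences.add((c.uri, target))
    return correspondences
```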
In contrast, MatcherAllLabels achieves a quite high recall (90%) but a rather low precision (54%). This means that most, but not all, of the correspondences can be found by looking only for equivalent labels (a sketch of this idea follows below). However, when following this idea, nearly half of the found correspondences are incorrect. The rather high F-measure of MatcherAllLabels is therefore misleading: at least if the results were used unchecked in a retrieval system, higher precision would clearly be preferred over higher recall. In this respect, matchers like ServOMap show better results. In any case, it can be seen that a matching system using the original SKOS version could achieve a better result; the information loss when converting SKOS to OWL really matters.

Concerning the runtime, LogMap as well as ServOMap are quite fast, with a runtime below 50 seconds. These values are comparable to or even better than (in the case of LogMapLt) those of both strategies computing the equivalence between preferred labels. Thus, they are very efficient in matching large ontologies while still achieving very good results. Other matchers take several hours or even days and do not produce better alignments in terms of F-measure. By computing the correlation between F-measure and runtime, we notice a slightly negative correlation (-0.085), but the small number of samples is not sufficient to make a significant statement. However, we can say for certain that a longer runtime does not necessarily lead to better results.
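Analogously, here is a sketch of the MatcherAllLabels idea discussed above: every label of a concept, preferred or alternative and in any language, is considered for an exact (lower-cased) match. Again, this only illustrates the strategy and reuses the hypothetical Concept structure; it is not the code actually used.

```python
# Collect all labels of a concept, preferred and alternative, in all languages.
def all_labels(concept):
    labels = {label.lower() for label in concept.pref_label.values()}
    for lang_labels in concept.alt_labels.values():
        labels.update(l.lower() for l in lang_labels)
    return labels

# Sketch of MatcherAllLabels: a correspondence is produced whenever any label
# of a concept equals any label of a concept from the other thesaurus. This
# explains the high recall and low precision reported above: quasi-synonymous
# alternative labels are often ambiguous.
def match_all_labels(concepts_a, concepts_b):
    index = {}
    for c in concepts_b:
        for label in all_labels(c):
            index.setdefault(label, []).append(c.uri)
    correspondences = set()
    for c in concepts_a:
        for label in all_labels(c):
            for target in index.get(label, []):
                correspondences.add((c.uri, target))
    return correspondences
```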
We further observe that the n:m reference alignment affects the results, because some matching systems (ServOMap, WeSeE, HotMatch, CODI, MapSSS) only create 1:1 alignments and discard correspondences with entities that already occur in another correspondence (a generic sketch of such a filter is given below). Whenever a system creates a lot of n:m correspondences, e.g., Hertuda and GOMMA, the recall significantly increases. This difference becomes clear when comparing ServOMapLt and ServOMap: both systems are mostly based on the same methods, but ServOMapLt does not use the 1:1 filtering. Consequently, the recall increases and the precision decreases.

Since the reference alignment has not been updated for about six years, it does not cover the updates made to both thesauri in the meantime. Thus, new correct correspondences might be found by matching systems, but they are counted as incorrect because they are not included in the reference alignment. Therefore, we applied a manual evaluation to check whether the matching systems found correct correspondences that are not included in the reference alignment at all. In turn, this information can help to improve the reference alignment.
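The systems' own filtering implementations differ in detail; the following greedy sketch only illustrates the general 1:1 filtering idea mentioned above, with a confidence-based ordering as our own assumption.

```python
# Sketch of a 1:1 filter: correspondences are processed in descending order of
# confidence, and a correspondence is discarded as soon as one of its two
# entities already occurs in an accepted correspondence. With an n:m reference
# alignment, this inevitably costs recall.
def filter_one_to_one(scored_correspondences):
    # scored_correspondences: iterable of (source_uri, target_uri, confidence)
    used_sources, used_targets, accepted = set(), set(), []
    for source, target, confidence in sorted(
            scored_correspondences, key=lambda c: c[2], reverse=True):
        if source in used_sources or target in used_targets:
            continue
        accepted.append((source, target, confidence))
        used_sources.add(source)
        used_targets.add(target)
    return accepted
```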
The manual evaluation has been conducted by domain experts. All newly detected correspondences that are not yet contained in the reference alignment have been considered. Because exact matches have to be 1:1 relationships, only those correspondences have been examined whose terms are descriptors and are not yet involved in an existing correspondence. The other correspondences are considered wrong, as they contain a term to or from which a correspondence already exists.

Since all matching systems delivered correspondences representing exact matches, they have been judged in this specific regard. That means that correspondences whose terms cannot be seen as equivalent, but perhaps as related, broader, or narrower, have been considered wrong for now.
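The following sketch is our own reconstruction of the selection rules just described, not the actual evaluation script: it keeps only new correspondences whose terms are descriptors and are not yet involved in any reference correspondence, and counts all other new correspondences as wrong.

```python
# Select candidates for manual evaluation (illustrative reconstruction):
# found and reference are sets of (source_uri, target_uri) pairs;
# descriptors_a / descriptors_b are the sets of descriptor URIs per thesaurus.
def select_for_manual_evaluation(found, reference, descriptors_a, descriptors_b):
    ref_sources = {s for s, _ in reference}
    ref_targets = {t for _, t in reference}
    candidates, rejected = set(), set()
    for source, target in found - reference:
        if (source in descriptors_a and target in descriptors_b
                and source not in ref_sources and target not in ref_targets):
            candidates.add((source, target))
        else:
            rejected.add((source, target))   # counted as wrong for this evaluation
    return candidates, rejected
```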
The matchers detected between 38 and 251 correspondences that had not been in the reference alignment before. These especially include terms with a strong syntactic similarity or equivalence. However, some matching systems even detected difficult correspondences, e.g., between the German label for "automated production" ("Automatische Produktion") and "CAM", which was identified via their associated non-preferred labels. Furthermore, correspondences between geographical terms have been detected, but some of the matchers were not able to distinguish between the terms for the citizens of a country, their language, or the country itself, although these differences can be derived from the structure of the thesauri.
However, the manual evaluation exposed several issues, which can be explained either by the typical behavior of matching systems or by domain-specific differences between the thesauri. There are similar terms in TheSoz and STW that are used in totally different contexts, e.g., the term "self-assessment". Even when considering the structure of both thesauri, these differences are difficult to identify. In general, term similarities often led to wrong correspondences, which is not surprising at first. At the same time, syntactically equal terms were in some cases not detected. So far, we have not had the opportunity to evaluate the matching systems against the improved reference alignment, but we plan to perform this additional evaluation soon.
This is the first time this track has taken place, so we cannot compare the results with previous ones. As it is also the first time the participating matching systems have faced this track, they do not have any experience with the data. This has to be kept in mind when the results are compared to those of other tracks.
Nevertheless, the newly detected correspondences already constitute a useful result for the maintainers of the two thesauri. The correct correspondences can be added to the existing reference alignment, which is already applied in information portals to support search term recommendation and query expansion services across differently indexed databases. As all matching systems delivered exact matches for the correspondences, some of the wrong correspondences will be examined again in the future to determine whether other relationships like broader, narrower, or related matches can be assigned to them.
We expect further improvements if the matchers are tailored more specifically to the library track, i.e., if they exploit the information found in the original SKOS version. A promising approach is also the use of additional knowledge, e.g., instance data -- resources that are indexed with different thesauri.
This time, we collected the results of the matchers as a first survey and compared them to our simple string-matching strategies that take advantage of the different types of labels. For future evaluations, we expect that better results can be achieved and that these strategies simply form a baseline.
We would like to thank Andreas Oskar Kempf from GESIS for the manual evaluation of the newly detected correspondences.