The following content is (mainly) based on the final version of the interactive section in the OAEI results paper.
If you notice any kind of error (wrong numbers, incorrect information about a matching system), do not hesitate to contact us.
Many ontology matching systems have been developed over the last years. After several years of experience, however, the results can only be slightly improved in terms of alignment quality (precision, recall, and F-measure). Based on this insight, it is clear that fully automatic ontology matching systems are slowly reaching an upper bound on the results they can achieve. By incorporating user interaction, we expect to improve the alignments even further and to push this upper boundary. Semi-automatic ontology matching approaches are quite promising, since humans can effectively support the systems, for example by detecting incorrect correspondences.
Whenever the user is directly involved, the effort required from the human has to be taken into account, and it has to be in appropriate proportion to the result. Thus, besides the quality of the alignment, other measures such as the number of interactions are interesting and meaningful for deciding which matching system is best suited for a certain matching task. Until now, all OAEI tracks have focused on fully automatic matching, and semi-automatic matching has not been evaluated, although such systems already exist, e.g., LogMap2 (Jiménez-Ruiz et al., 2011). As long as the evaluation of such systems is not driven forward, it is hardly possible to systematically compare the quality of interactive matching approaches. With this new track, we want to change this unfavorable situation by explicitly offering a systematic, automated evaluation of matching systems with user interaction.
For the first edition of the interactive track, we use the well-established OAEI Conference data set. This data set comprises 16 ontologies describing the domain of conference organization. We only use the test cases for which a reference alignment is publicly available (altogether 21 alignments). Over the last years, the quality of the generated alignments has constantly increased, but only by a small amount (a few percent). In 2012, the best system according to F-measure (YAM++) achieved a value of 70% (Aguirre et al., 2012). This shows that there is significant room for improvement, which could be filled by interactive means.
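For reference, the F-measure used throughout this page is the harmonic mean of precision and recall, computed against the reference alignment:

\[
\mathrm{Prec} = \frac{|A \cap R|}{|A|}, \qquad
\mathrm{Rec} = \frac{|A \cap R|}{|R|}, \qquad
F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}},
\]

where \(A\) is the alignment produced by a matcher and \(R\) is the reference alignment.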
Moreover, the Conference data set has a suitable size, so that most systems can participate without running into problems concerning run time or memory consumption.
The interactive matching track was evaluated at OAEI 2013 for the first time. The goal of this evaluation is to simulate interactive matching (Paulheim et al., 2013), where a human expert is involved to validate mappings found by the matching system. In the evaluation, we look at how interacting with the user improves the matching results. For the evaluation, we use the conference dataset with the ra1 alignment, where there is quite a bit of room for improvement, with the best fully automatic, i.e., non-interactive, matchers achieving an F-measure of 74%. The SEALS client was modified to allow interactive matchers to ask an oracle, which emulates a (perfect) user. The interactive matcher can present a correspondence to the oracle, which then tells the matcher whether the correspondence is right or wrong. All matchers participating in the interactive track support both interactive and non-interactive matching. This allows us to analyze how much benefit the interaction brings for the individual matchers.
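Conceptually, the oracle interaction can be pictured as in the following minimal Java sketch. This is not the actual SEALS interface; all type and method names (Correspondence, Oracle, isCorrect, ReferenceOracle) are illustrative assumptions.

```java
// Minimal sketch of the oracle interaction; NOT the actual SEALS API.
// All type and method names are illustrative.
import java.util.Objects;
import java.util.Set;

// A correspondence between one entity of each ontology.
record Correspondence(String sourceUri, String targetUri, double confidence) {}

// The oracle emulates a (perfect) user: it answers whether a presented
// correspondence is right or wrong.
interface Oracle {
    boolean isCorrect(Correspondence c);
}

// A perfect oracle simply looks the correspondence up in the reference
// alignment (here: ra1), as done in this track's evaluation.
class ReferenceOracle implements Oracle {
    private final Set<Correspondence> reference;

    ReferenceOracle(Set<Correspondence> reference) {
        this.reference = Objects.requireNonNull(reference);
    }

    @Override
    public boolean isCorrect(Correspondence c) {
        // Only the matched pair of entities counts; the confidence value
        // attached by the matcher is ignored for the lookup.
        return reference.stream().anyMatch(r ->
                r.sourceUri().equals(c.sourceUri())
                && r.targetUri().equals(c.targetUri()));
    }
}
```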
Overall, five matchers participated in the interactive matching track: AML and AML-bk, Hertuda, LogMap, and WeSeE-Match. All of them implement interactive strategies that run entirely as a post-processing step to the automatic matching, i.e., they take the alignment produced by the base matcher and try to refine it by selecting a suitable subset.
AML and AML-bk present all correspondences below a certain confidence threshold to the oracle, starting with the highest confidence values. They stop querying the oracle once the false positive rate exceeds a certain threshold. Similarly, LogMap checks all questionable correspondences using the oracle. Hertuda and WeSeE-Match try to adaptively set an optimal threshold for cutting off mappings. They perform a binary search in the space of possible thresholds, presenting a correspondence of average confidence to the oracle first. If the oracle confirms the correspondence, the search is continued with a lower threshold, otherwise with a higher one.
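The two kinds of strategies can be reconstructed roughly as follows. This is an illustrative sketch based on the descriptions above, not the participants' actual code; all method and parameter names (filterByOracle, maxFalsePositiveRate, searchThreshold) are assumptions, and the Correspondence and Oracle types are reused from the earlier sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustrative reconstructions of the interaction strategies described
// above; names and details are assumptions, not the tools' actual code.
class InteractiveSelection {

    // AML/AML-bk style: walk through the low-confidence candidates in
    // descending confidence order, keep the ones the oracle confirms, and
    // stop once the observed false positive rate exceeds the given limit.
    static List<Correspondence> filterByOracle(List<Correspondence> candidates,
                                               Oracle oracle,
                                               double maxFalsePositiveRate) {
        List<Correspondence> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(Correspondence::confidence).reversed());

        List<Correspondence> kept = new ArrayList<>();
        int asked = 0, rejected = 0;
        for (Correspondence c : sorted) {
            asked++;
            if (oracle.isCorrect(c)) {
                kept.add(c);
            } else {
                rejected++;
            }
            if ((double) rejected / asked > maxFalsePositiveRate) {
                break; // too many wrong candidates: stop asking the oracle
            }
        }
        return kept;
    }

    // Hertuda/WeSeE-Match style: binary search for a confidence cut-off.
    // The probe of "average" confidence is approximated by the median of
    // the current search range; confirmed probes move the cut-off down,
    // rejected probes move it up.
    static double searchThreshold(List<Correspondence> alignment, Oracle oracle) {
        List<Correspondence> sorted = new ArrayList<>(alignment);
        sorted.sort(Comparator.comparingDouble(Correspondence::confidence));

        int lo = 0, hi = sorted.size() - 1;
        double threshold = 0.0;
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            Correspondence probe = sorted.get(mid);
            if (oracle.isCorrect(probe)) {
                hi = mid - 1;                    // correct here: try a lower cut-off
            } else {
                threshold = probe.confidence();  // wrong here: cut-off must be higher
                lo = mid + 1;
            }
        }
        return threshold; // keep correspondences with confidence above this value
    }
}
```

Note that the binary search needs only logarithmically many questions in the size of the alignment, while the filtering strategy may, in the worst case, ask about every low-confidence correspondence.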
The results are depicted in Table 1. The biggest improvement in F-measure, as well as the best overall result (although almost at the same level as AML-bk), is achieved by LogMap, which increases its F-measure by four percentage points compared to its non-interactive results. Furthermore, LogMap, AML, and AML-bk show a statistically significant increase in recall as well as precision, while all the other tools except for Hertuda show a significant increase in precision. However, the increase in precision is in all cases higher than the increase in recall. At the same time, LogMap has the lowest number of interactions with the oracle, which shows that it also makes the most efficient use of the oracle. In a truly interactive setting, this would mean that the manual effort is minimized.
On the other hand, Hertuda and WeSeE even show a decrease in recall, which is not compensated for by the increase in precision. The biggest increase in precision (17 percentage points) is achieved by WeSeE, but on an overall lower level than the other matching systems. Thus, we conclude that their strategy is not as efficient as those of the other participants. Interestingly, those two tools present more negative than positive examples to the oracle, while this relation is reversed for the more successful matching systems.
Compared to the results of the non-interactive conference track, the best interactive matcher (in terms of F-measure) is slightly below the best matcher (YAM++), which has an F-measure of 0.74. Except for YAM++, the interactive versions of AML-bk, AML, and LogMap achieve better F-measure scores than all non-interactive matchers.
The results show that current interactive matching tools mainly use interaction as a means to post-process an alignment found with fully automatic means. There are, however, other conceivable interactive approaches that include interaction at an earlier stage of the process, e.g., using interaction for parameter tuning (Ritze and Paulheim, 2011), or determining anchor elements for structure-based matching approaches using interactive methods. The maximum F-measure of 0.732 achieved shows that there is still room for improvement. Furthermore, different variations of the evaluation method are conceivable, including different noise levels in the oracle's responses (i.e., simulating errors made by the human expert), or allowing other means of interaction than the validation of single correspondences, e.g., providing a random positive example, or providing the corresponding element in one ontology, given an element of the other one.
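As an illustration of the first variation, a noisy oracle could be simulated by wrapping the hypothetical Oracle interface from the sketches above; the class and parameter names here are again assumptions, not part of any existing evaluation code.

```java
import java.util.Random;

// Sketch of a noisy oracle for the suggested evaluation variation: it
// wraps a base oracle and flips its answer with probability errorRate,
// simulating mistakes made by a human expert. Illustrative only.
class NoisyOracle implements Oracle {
    private final Oracle base;
    private final double errorRate;
    private final Random random;

    NoisyOracle(Oracle base, double errorRate, long seed) {
        this.base = base;
        this.errorRate = errorRate;
        this.random = new Random(seed); // seeded, so evaluation runs stay reproducible
    }

    @Override
    public boolean isCorrect(Correspondence c) {
        boolean answer = base.isCorrect(c);
        // With probability errorRate, the simulated expert answers wrongly.
        return random.nextDouble() < errorRate ? !answer : answer;
    }
}
```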