This page describes the MultiFarm dataset, a comprehensive dataset for multilingual ontology matching. The dataset can be downloaded and used for any kind of scientific purpose. Its generation and structure are briefly explained on this page; more details can be found in the following paper.
Christian Meilicke, Raúl García Castro, Fred Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Andrei Tamilin, Cássia Trojahn, Shenghui Wang. MultiFarm: A Benchmark for Multilingual Ontology Matching. Accepted for publication at the Journal of Web Semantics.
Download the authors' version of the paper
The following enumeration describes modifications that have been applied to the dataset after its first publication.
The dataset has been used in the following experiments:
It would be nice if you could inform us (contact below) in case you use the dataset in an experimental evaluation.
The dataset has been generated by translating the existing OntoFarm dataset. The results of this first step are available as simply structured text files and can be downloaded from the following table. Please note that all files are UTF-8 encoded. Some letters might be displayed incorrectly by your browser if it does not detect the encoding correctly.
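Because the translation files are UTF-8 encoded, it is safest to state the encoding explicitly when reading them programmatically instead of relying on a platform default. The following Python sketch illustrates this. The function name `read_translation_lines` is a hypothetical helper, and the code assumes nothing about the internal structure of the files beyond their being line-oriented text.

```python
from pathlib import Path


def read_translation_lines(path):
    """Return the non-empty lines of a UTF-8 encoded text file.

    Hypothetical helper: it only assumes line-oriented text and does not
    interpret the structure of the translation files.
    """
    # Passing the encoding explicitly avoids garbled non-ASCII letters
    # on systems whose default encoding is not UTF-8.
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```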
The results of the translation have been used to generate language-specific variants of the existing ontologies and reference alignments for all pairs of ontologies. These files are bundled in a single zip-file. They can be downloaded and used in any kind of scenario or experiment.
The zip-file is structured as follows:
ont/
  cn/
    cmt-cn.owl
    conference-cn.owl
    [for each ontology cmt, conference, confOf, edas, ekaw, iasted, sigkdd]
  cz/ (contains 7 files)
    cmt-cz.owl
    conference-cz.owl
    ...
  de/ (contains 7 files)
    cmt-de.owl
    conference-de.owl
    ...
  [a directory for each language cn, cz, de, en, es, fr, nl, pt, ru]
ref/
  cn-cz/
    cmt-cmt-cn-cz.rdf
    cmt-conference-cn-cz.rdf
    cmt-conference-cz-cn.rdf
    cmt-confOf-cn-cz.rdf
    cmt-confOf-cz-cn.rdf
    ...
    conference-conference-cn-cz.rdf
    ...
    [overall 21*2 + 7*1 = 49 files]
  [a directory for each language pair cn-cz, cn-de, ...]
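The naming scheme in the listing can be sketched in Python as follows. This is only an illustration inferred from the file names shown above: the function names `ontology_path` and `reference_paths` are hypothetical, and for ontology pairs that the listing elides with "..." the order of the two ontology names is assumed to follow the order of the ontology list.

```python
from itertools import combinations

# Ontology names as given in the listing.
ONTOLOGIES = ["cmt", "conference", "confOf", "edas", "ekaw", "iasted", "sigkdd"]


def ontology_path(ontology, lang):
    """Path of one translated ontology, e.g. ont/cz/cmt-cz.owl."""
    return f"ont/{lang}/{ontology}-{lang}.owl"


def reference_paths(lang1, lang2):
    """Expected reference alignment files for one language pair directory.

    Assumption: in <o1>-<o2>-<la>-<lb>.rdf, ontology o1 is in language la
    and ontology o2 is in language lb.
    """
    directory = f"ref/{lang1}-{lang2}"
    paths = []
    # Same ontology in both languages: 7 files.
    for o in ONTOLOGIES:
        paths.append(f"{directory}/{o}-{o}-{lang1}-{lang2}.rdf")
    # Two different ontologies, in both language directions: 21 * 2 files.
    for o1, o2 in combinations(ONTOLOGIES, 2):
        paths.append(f"{directory}/{o1}-{o2}-{lang1}-{lang2}.rdf")
        paths.append(f"{directory}/{o1}-{o2}-{lang2}-{lang1}.rdf")
    return paths
```

Calling `reference_paths("cn", "cz")` yields 49 paths, matching the file count stated in the listing, and comparing that list against the contents of an unpacked bundle is a quick way to check for missing files.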
>>> Download the zipped bundle (old version)
>>> Download the zipped bundle (new version, used in OAEI 2012)
The dataset can also be used via the SEALS platform, where we have prepared and stored a testsuite for each language pair, resulting in 36 testsuites. You need an account for the SEALS platform to search and retrieve them from the test data repository.
>>> Link to the SEALS platform
You can, for example, find the testsuite for the language pair Czech-German if you just type 'cz-de' in the search field of the test data repository.
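The number of testsuites follows from the nine languages of the dataset: there are 9*8/2 = 36 unordered language pairs. A small Python sketch that enumerates the pair names is given below. It assumes that the two language codes of a pair are always written in alphabetical order, as in 'cn-cz' and 'cz-de'; the function name `language_pairs` is hypothetical.

```python
from itertools import combinations

# Language codes used in the dataset, in alphabetical order.
LANGUAGES = ["cn", "cz", "de", "en", "es", "fr", "nl", "pt", "ru"]


def language_pairs():
    """Return the names of all unordered language pairs, e.g. 'cz-de'."""
    # combinations() keeps the input order, so each pair is alphabetical.
    return [f"{a}-{b}" for a, b in combinations(LANGUAGES, 2)]
```

Each returned name can be used as a search term in the test data repository in the same way as 'cz-de'.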
The dataset has been generated by a collaborative initiative of the following people.
Contact Cassia Trojahn or Christian Meilicke for further information related to this dataset.
Some users of the dataset have already detected some small bugs. We will fix these bugs in the future; for the moment we just list them:
The logo at the top of this page is a modified version of a logo often used to refer to the Semantic Web. We have added the Chinese characters for 'many' and 'language' to the original logo.