About

The BMBF-funded research project Data Search for Data Mining (DS4DM) extends the data mining platform RapidMiner with data search and data integration functionalities which enable analysts to find relevant data in large corpora of tabular data and to semi-automatically integrate discovered data with existing local data. Specifically, the data search functionalities allow users to extend an existing local data table with additional attributes (=columns) of their choice. For example, if you have a table with data about companies loaded into RapidMiner, you can ask the DS4DM RapidMiner Data Search Operator to extend your table with the additional column 'Headquarter location'. The operator will search the corpus of tabular data that is loaded into the DS4DM backend for relevant data, add a new 'Headquarter location' column to the user's table, and populate the column with values from the data corpus.

News

24th July 2018
Final release of the DS4DM Backend Components. This release includes some minor changes and bugfixes. The changes affect the empty-string handling, the column name transformations, the parameter naming, the table fusion, and the result formatting. In addition, new parameters have been added for increased configurability of the data searches. The new version of the DS4DM backend can be downloaded from GitHub.

8th July 2018
A paper on the unconstrained and correlation-based Table Extension was accepted at the LWDA Conference 2018. The work will be presented there on 22-24 August.

19th April 2018
5th release of the DS4DM Backend Components. The DS4DM backend has been extended with two new data search functions: Unconstrained Table Extension, which adds to a query table all attributes that can be filled with data values so that a minimal density is reached; and Correlation-based Table Extension, which adds to a query table all attributes that correlate with a specific attribute of the query table. The new functions have been evaluated using the T2D Gold Standard. The extended DS4DM backend as well as the data that was used for the evaluation can be downloaded from GitHub.

27th November 2017
4th release of the DS4DM Backend Components. The new release provides improved versions of the keyword-based and correspondence-based data search methods as well as a detailed evaluation of the new methods using the T2D Gold Standard. The optimized DS4DM Backend Components as well as the data that was used for the evaluation can be downloaded from GitHub.

20th September 2017
Release of the third version of the DS4DM Backend Components. The new release allows the user to choose between different repositories/corpora of tables and to create their own repositories using the newly introduced UploadTable functionality.

10th July 2017
Release of the second version of the DS4DM Backend Components. The new release includes new functionality for correspondence-based table search as well as new backend components for discovering instance- and schema-level correspondences between tables in pre-processing.

11th August 2016
Set up of a dedicated website for the DS4DM backend technologies and release of initial prototype code.

2nd June 2016
Our demonstration on Extending RapidMiner with Data Search and Integration Capabilities won the best demo award at ESWC2016, out of the 19 demos accepted for presentation at the conference. More information is available in the Data and Web Science group news.

DS4DM Backend Components

The University of Mannheim team within the DS4DM project is developing the backend for the DS4DM RapidMiner Data Search Extension. The backend consists of four components.
The actual webservice:

  • DS4DM Webservice
    This webservice indexes and hosts a corpus of tabular data. Whenever the RapidMiner Data Search Extension requests a local table to be extended with an additional attribute, the webservice searches the corpus of tabular data for relevant tables (in our example: tables with company headquarter information) and returns these tables to the RapidMiner operator.

As well as three pre-processing components:

  • WebtableExtractor
    This component extracts tables from html pages and adds them to the webservice's data corpus.

  • IndexCreator
    This component indexes the data corpus. It relies on Lucene and allows the webservice to quickly search over the data corpus.

  • CorrespondenceCreator
    This pre-processing component discovers instance- and schema-level correspondences between tables in the corpus. These pre-calculated correspondences allow the webservice to return more complete results.


DS4DM Webservice

The webservice offers data search as well as repository maintenance functionality via a REST API.
The API specification is here. The Javadoc for the API methods is here.

The most important methods are described in detail below:






Correspondence-based Search

(API specification, Javadoc)

The correspondence-based search method performs the same task as the keyword-based search: finding tables that are relevant for table expansion within a repository of tabular data. The correspondence-based search, however, aims at achieving more complete results by employing pre-calculated correspondences between the tables in the repository - see the figure below. These schema- and instance-level correspondences between tables in the repository are calculated during the creation of a repository. The method that is used to discover the correspondences is described in the CorrespondenceCreator section below.

The correspondence-based search addresses the following limitation of the keyword-based search: if you want to extend your table with the additional column “population”, the keyword-based search will not return tables containing an "inhabitants" column, as the name of this column is too different from the name “population”. Nevertheless, the column "inhabitants" may contain data that is relevant for extending the query table.

The correspondence-based search method deals with this problem by using pre-calculated correspondences to expand the query result and return a more complete set of relevant tables. Initially, the method employs the same approach as the keyword-based method to find an initial set of tables that contain a column having a name similar to "population". Afterwards, the method searches the pre-calculated schema correspondences for correspondences that connect the discovered population columns with population columns in additional tables which have not yet been identified based on the column name. For instance, as the method for generating the correspondences also considers data values and not only column names, a pre-calculated correspondence could connect a “2012 Estimate” column in another table with a population column in one of the discovered tables, and the table containing the “2012 Estimate” column would consequently also be added to the set of relevant tables.

The pre-calculated correspondences also come into play for finding instance correspondences. In addition to comparing subject column values using string similarity - as in the keyword-based search - the pre-calculated instance correspondences are used to identify rows in the additional tables that describe entities which also appear in the query table. Again, as the instance correspondences are pre-calculated by not only comparing subject column values but also considering the values from other columns, the set of instance correspondences that is returned to RapidMiner is more complete than the set that results from only comparing subject column values.
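To make the expansion step more concrete, the following minimal sketch shows how a set of tables found by the keyword-based search could be enlarged using pre-calculated schema correspondences. The types and method names are illustrative assumptions, not the actual webservice code.

    import java.util.*;

    // Hypothetical, simplified types for illustration only.
    record ColumnRef(String tableId, String columnName) {}
    record SchemaCorrespondence(ColumnRef first, ColumnRef second) {}

    class CorrespondenceBasedExpansion {

        // Expands the set of tables found by the keyword-based search with
        // tables that are reachable via pre-calculated schema correspondences
        // (e.g. a "population" column connected to an "inhabitants" column).
        static Set<String> expand(Set<ColumnRef> keywordHits,
                                  List<SchemaCorrespondence> correspondences) {
            Set<String> relevantTables = new HashSet<>();
            keywordHits.forEach(c -> relevantTables.add(c.tableId()));

            for (SchemaCorrespondence sc : correspondences) {
                // If one side of a correspondence was already found by the
                // keyword-based search, the table on the other side becomes
                // relevant as well, even if its column name is very different.
                if (keywordHits.contains(sc.first())) {
                    relevantTables.add(sc.second().tableId());
                } else if (keywordHits.contains(sc.second())) {
                    relevantTables.add(sc.first().tableId());
                }
            }
            return relevantTables;
        }
    }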

Figure: Correspondence-based search



Upload Table Functionality

Many use cases require not only extending local tables with data from public web tables; users would also like to extend tables using their own data, for example internal data from within a company.
The UploadTable functionality allows users to create their own table repositories on the server and upload tables into these repositories. The backend indexes the uploaded tables using the IndexCreator Component. The backend also uses the CorrespondenceCreator Component to discover instance- and schema-correspondences between uploaded tables and tables that are already contained in a repository. These correspondences are employed afterwards by the correspondence-based search.
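As an illustration of how a client might call the upload functionality over HTTP, the following sketch uses Java's built-in HttpClient. The endpoint path, the repository parameter, and the CSV payload are assumptions made for this example; the actual contract is defined in the API specification linked above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Path;

    public class UploadTableExample {
        public static void main(String[] args) throws Exception {
            // Hypothetical endpoint and parameter names - consult the API
            // specification for the actual request format.
            URI endpoint = URI.create(
                "http://localhost:8080/ds4dm/uploadTable?repository=myRepository");

            HttpRequest request = HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "text/csv")
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("companies.csv")))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

            System.out.println(response.statusCode() + " " + response.body());
        }
    }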


Preprocessing - WebtableExtractor

The WebtableExtractor extracts tables from HTML pages and transforms them into the representation that is used by the other components of the backend.

The Extractor is currently implemented as a batch process. We assume that input data is stored locally, e.g. in the form of HTML pages.

Preprocessing step

The process iterates over the locally stored pages, extracts useful tables and represents the output in a standardised format.
The extraction is performed with the BasicExtraction algorithm, which iterates through the tables in the HTML page using the "table" tag. Heuristics are used to discard noisy tables (a minimal filtering sketch follows the list):

  • tables inside forms
  • tables which contain sub-tables
  • tables with fewer than a certain number of rows (this parameter is currently set to 3)
  • tables with fewer than a certain number of columns (this parameter is currently set to 2)
  • tables with "rowspan" or "colspan"
  • tables that do not contain header cells ("th" element)
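The sketch below approximates these heuristics using the jsoup HTML parser. The two size thresholds are hard-coded to the values mentioned above; the actual BasicExtraction implementation may differ in detail.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.ArrayList;
    import java.util.List;

    public class TableFilter {

        static final int MIN_ROWS = 3;
        static final int MIN_COLUMNS = 2;

        // Returns the tables of an HTML page that pass the noise heuristics.
        static List<Element> extractCandidateTables(String html) {
            Document page = Jsoup.parse(html);
            List<Element> candidates = new ArrayList<>();
            for (Element table : page.select("table")) {
                if (table.parents().is("form")) continue;            // inside a form
                if (table.select("table").size() > 1) continue;       // contains sub-tables
                if (table.select("tr").size() < MIN_ROWS) continue;   // too few rows
                Element firstRow = table.selectFirst("tr");
                if (firstRow == null
                        || firstRow.children().size() < MIN_COLUMNS) continue; // too few columns
                if (!table.select("[rowspan], [colspan]").isEmpty()) continue; // spanning cells
                if (table.select("th").isEmpty()) continue;           // no header cells
                candidates.add(table);
            }
            return candidates;
        }
    }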

The remaining tables are classified as "layout" or "content" tables. The classification is done with the classifyTable method, which uses two models (SimpleCart_P1 and SimpleCart_P2). The first model determines whether the table is a LAYOUT table (used only for visual formatting purposes). If this is not the case, the second model is used to classify the table as:

  • RELATION (containing multiple entities and relations)
  • ENTITY (describing a single entity)
  • MATRIX
  • OTHER (if no type can be decided)

For all retained tables, the method additionally identifies:

  • the key column
  • the header row
  • context information

The key column detection selects the column with the maximal number of unique values. In case of a tie, the left-most column is used.
The header detection identifies a row which has a different content pattern for the majority of its cells, compared with the other rows. Currently, this test is performed only between the first non-empty row of the table and the remaining rows.
As context information for each table, we select the 200 characters before and the 200 characters after the table itself.
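A minimal sketch of the key column heuristic, assuming the table has already been parsed into a list of string columns, could look as follows.

    import java.util.HashSet;
    import java.util.List;

    public class KeyColumnDetector {

        // Returns the index of the column with the maximal number of unique
        // values. On a tie, the left-most of the tied columns wins, because
        // columns are scanned from left to right with a strict ">" comparison.
        static int detectKeyColumn(List<List<String>> columns) {
            int keyColumn = 0;
            int bestUniqueCount = -1;
            for (int i = 0; i < columns.size(); i++) {
                int uniqueCount = new HashSet<>(columns.get(i)).size();
                if (uniqueCount > bestUniqueCount) {
                    bestUniqueCount = uniqueCount;
                    keyColumn = i;
                }
            }
            return keyColumn;
        }
    }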



Preprocessing - IndexCreator

The DS4DM webservice has to be fast at identifying the correct tables to return to the DS4DM RapidMiner operator. To achieve this speed despite the large corpus of data tables it has to search through, indexes are needed. In total, three Lucene indexes are used; they contain information about the data tables in the webservice's corpus.

The IndexCreator creates the following three Lucene indexes (a minimal indexing sketch follows the list):

  • ColumnNameIndex
    This index has an entry for each column of each table (in the corpus).
    Every entry stores the following information: tableHeader (=table name), value (=column name), columnDataType, tableCardinality, columnDistinctValues, columnindex, columnOriginalHeader, fullTablePath (=folder in which the original table is located)

  • TableIndex
    This index has an entry for each distinct value in each column of each table.
    Every entry stores the following information: id (=the distinct-value index for each column), tableHeader, columnHeader, columnDataType, tableCardinality, columnDistinctValues, valueMultiplicity (=how often the distinct value appears in this column), value (=the distinct value), fullTablePath, isPrimaryKey (=true if the column is the primary key), originalValue

  • KeyColumnIndex
    This index has an entry for each table.
    Every entry stores the following information: tableHeader, columnHeader, keyColumnString (=a list of all the values in the table's key column concatenated into one long string), keyColumnIndex.
    This index is used during preprocessing by the CorrespondenceCreator, as well as by the Unconstrained and Correlation-based Table Extension.
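The following sketch shows how one entry of the ColumnNameIndex could be written with Lucene. The field names follow the list above, but the concrete field types, analyzer, and example values are assumptions made for illustration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class ColumnNameIndexSketch {
        public static void main(String[] args) throws Exception {
            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("indexes/columnNameIndex")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                // One entry per column of each table in the corpus.
                Document entry = new Document();
                entry.add(new StringField("tableHeader", "companies", Field.Store.YES));
                entry.add(new TextField("value", "headquarter location", Field.Store.YES));
                entry.add(new StringField("columnDataType", "string", Field.Store.YES));
                entry.add(new StringField("fullTablePath", "corpus/tables/companies.csv", Field.Store.YES));
                writer.addDocument(entry);
            }
        }
    }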

Preprocessing - CorrespondenceCreator

When a repository of data tables is created, the CorrespondenceCreator is automatically run. The CorrespondenceCreator finds correspondences between tables in the repository. These correspondences are later used by the Correspondence-based Search to improve its search results.


The following types of correspondences are generated:

  • Schema Correspondences
    A Schema Correspondence marks two columns from two different tables as containing the same attribute e.g. column1 from table1 and column4 from table3 both contain company sizes.

  • Instance Correspondences
    An Instance Correspondence marks two rows from two different tables as referring to the same real-world-object e.g. row2 from table1 and row11 from table3 both contain information about the company Tesla.


In order to find these correspondences the CorrespondenceCreator executes the following steps:

  1. Blocking
    Looking for correspondences between all tables would be too computationally expensive. In the blocking step, we therefore generate pairs of tables that are likely to have correspondences.
    For a given table, the five tables with the most similar subject columns are chosen as likely corresponding tables. The KeyColumnIndex is used for comparing the subject columns.

  2. Instance-matching
    The instance correspondences between two tables are found by comparing the values of a row in the first table to the values of a row in the second table. Data-type-specific similarity metrics are used for comparing these values. The similarity of the subject column values is given a weight of 0.5, while the remaining 0.5 is divided equally over all other columns. Row combinations with an overall similarity score above 0.55 are considered instance matches; this threshold can be configured in the config file. A sketch of this weighting scheme is shown after this list.

  3. Schema-matching
    Here, corresponding columns in the two tables are found by comparing their column values and column headers. When comparing the column values, the knowledge about the instance matches helps to achieve a better accuracy. Here too, data-type-specific similarity measures are used for comparing column values.

Steps 2 and 3 of the correspondence creation process are implemented using the WInte.r Data Integration Framework.
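The following standalone sketch illustrates only the weighting scheme used during instance matching (0.5 for the subject column, 0.5 spread over the remaining columns, match threshold 0.55). The placeholder valueSimilarity function stands in for the data-type-specific similarity metrics of the actual WInte.r-based implementation.

    import java.util.List;
    import java.util.function.BiFunction;

    public class InstanceMatchingSketch {

        static final double SUBJECT_COLUMN_WEIGHT = 0.5;
        static final double MATCH_THRESHOLD = 0.55; // configurable in the config file

        // Weighted similarity of two rows of equal length: the subject column
        // contributes 0.5, the remaining 0.5 is divided equally over all
        // other columns.
        static double rowSimilarity(List<String> row1, List<String> row2,
                                    int subjectColumn,
                                    BiFunction<String, String, Double> valueSimilarity) {
            int otherColumns = row1.size() - 1;
            double otherWeight = otherColumns > 0
                    ? (1.0 - SUBJECT_COLUMN_WEIGHT) / otherColumns : 0.0;

            double score = 0.0;
            for (int i = 0; i < row1.size(); i++) {
                double weight = (i == subjectColumn) ? SUBJECT_COLUMN_WEIGHT : otherWeight;
                score += weight * valueSimilarity.apply(row1.get(i), row2.get(i));
            }
            return score;
        }

        static boolean isInstanceMatch(double similarity) {
            return similarity > MATCH_THRESHOLD;
        }
    }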

Evaluation

In the following, we evaluate different aspects of the DS4DM backend components. First, the T2D Goldstandard, which was used for many of the evaluations, is presented. Afterwards, the evaluations of the Constrained, Unconstrained, and Correlation-based Table Extension are presented. Finally, the Table Upload functionality and the preprocessing components are evaluated.
The data from all the evaluations can be downloaded here.


T2D Goldstandard
For several of the evaluations, the T2D Goldstandard V2 is used. This is a collection of 779 tables that were extracted from HTML pages and cover a wide range of topics, such as populated places, organizations, people, and music. The tables were manually mapped to the DBpedia knowledge base. For our evaluation, we derive schema- and instance-correspondences between the different tables of the goldstandard by looking for pairs of schemas/instances that have both been mapped to the same DBpedia entry.


Evaluation of the Constrained Table Extension
The evaluation data can be downloaded from here.
We measure the density (completeness) of the attribute that is added to different query tables as well as the quality of the values of the newly added attribute. We use the T2D Goldstandard V2 for the evaluation.

For evaluating the table search methods, we use fifteen different query tables that should each be extended by one specific attribute. Using the correspondences of the T2D Goldstandard, we inferred the best possible population of this extension attribute for the fifteen tables. The evaluation results below were calculated by comparing the values of the extension attributes that were added to the tables by the two search algorithms with the optimal values of the attributes as derived from the goldstandard. The complete details about the evaluation, including the 15 query tables as well as the content of the attributes that were added to the tables, can be found here.

Figure: Evaluation tables for the data search (keyword-based search on the left, correspondence-based search on the right)

The table on the left contains the results obtained using the keyword-based search method. You will notice that, in general, the algorithm was able to populate the extension column with a high density and accuracy. There are two exceptions for which the algorithm was not able to populate the extension column properly: Mountain-MountainRange and Film-ReleaseDate.
The table on the right shows the results of the correspondence-based search method. Especially for Film-ReleaseDate and Country-Code, many more values in the extension column could be populated due to the additionally found data (“missing values filled” is high).
In the special cases of Country-Currency and Game-Developer, some of the additionally found data was wrong. In these cases, extension-column values which had previously been correctly populated were now populated wrongly (negative “difference in correct values after fusion” values).

Evaluation of the Unconstrained Table Extension
The evaluation data can be downloaded from here.
For the evaluation of the Unconstrained Table Extension, 13 different tables with different types of entities (Airports, Currencies, Lakes, etc.) were extended using the tables from the T2D Goldstandard as the repository. The resulting extended tables were compared with ideal solutions to calculate the precision, recall, and F1 scores shown below.

The ideal solutions were generated in the following way:

  1. Identify the subject-columns of the tables in the T2D Goldstandard
  2. Determine groups of corresponding subject columns, using the known schema correspondences of the T2D Goldstandard
  3. Select the 13 largest groups. (This is how the 13 types of entities were determined - Airports, Currencies, Lakes, etc.)
  4. For each group: fuse the tables whose subject column is in this group, using the known schema- and instance-level correspondences between these tables.
    When several values are associated with one position during the fusion, all of them are kept; in the evaluation, all of them are counted as correct.


Figure: Evaluation of the unconstrained table extension

Analysis of the results
The recall is better than the precision for most of the examples. This is because the ideal solutions were generally not as well populated as the results of the Unconstrained Table Extension: the ideal solutions only contained values that have entity- and schema-level correspondences to other tables, whereas the results used all values from the found tables.
Many of the other differences can be explained by the quality and the similarity of the underlying tables: for Lakes and Journals the underlying tables were very different; for Animals and Airlines they were very similar.

Evaluation using Product Data from different E-Shops
Another evaluation of the Unconstrained Table Extension was performed using product data that had been extracted from thirteen e-shops using a focused crawler. The evaluation was performed in much the same way as with the T2D Goldstandard data (see above). The evaluation data as well as a detailed description of the evaluation setup can be found here.


Figure: Evaluation of the unconstrained table extension with product data scraped by the focused crawler


Evaluation of the Correlation-based Table Extension
The evaluation data can be downloaded from here.
The evaluation of the Unconstrained Table Extension was performed with 13 query tables. For the evaluation of the Correlation-based Table Extension, the four of these thirteen tables with the most numerical columns were considered.

For these tables the additional columns found by the Correlation-based Table Extension were compared with the columns found when applying the Correlation-based Filtering to the ideal solutions.
The following results were obtained:

Figure: Precision and recall for the correlation-based table extension

Analysis of the results
The scores are not 100% because the Unconstrained Table Extension does not work perfectly. The recall is better than the precision because the tables generated by the Unconstrained Table Extension generally have more columns, so more columns are found to be correlated. Other causes of wrong correlations are columns with very low density and values with different measurement units (e.g. km2 and m2) in the same column.
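For illustration, the correlation check can be approximated by a plain Pearson correlation computed over the positions where both columns are populated. The sketch below is an assumption about the general approach; it skips missing values (which is why very sparse columns can yield unreliable scores) and ignores details such as unit normalisation.

    public class CorrelationSketch {

        // Pearson correlation over the positions where both columns have a
        // value; null entries represent missing values and are skipped.
        static double pearson(Double[] x, Double[] y) {
            double sumX = 0, sumY = 0, sumXX = 0, sumYY = 0, sumXY = 0;
            int n = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] == null || y[i] == null) continue;
                n++;
                sumX += x[i];
                sumY += y[i];
                sumXX += x[i] * x[i];
                sumYY += y[i] * y[i];
                sumXY += x[i] * y[i];
            }
            if (n < 2) return 0.0;
            double cov = sumXY - sumX * sumY / n;
            double varX = sumXX - sumX * sumX / n;
            double varY = sumYY - sumY * sumY / n;
            if (varX == 0 || varY == 0) return 0.0;
            return cov / Math.sqrt(varX * varY);
        }
    }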

Evaluation of the Table Upload Functionality
The uploadTable functionality allows users to add individual tables to an existing corpus of tables. The uploaded tables are indexed, and correspondences are created between the newly uploaded table and the tables that are already in the repository. Uploading and indexing 1000 tables one by one takes approximately 2 hours (7.2 seconds per table). Alternatively, the bulkTableUpload functionality can be used for uploading larger numbers of tables. Processing the same 1000 tables using the bulkTableUpload only requires 10 minutes (0.6 seconds per table). These times were measured using a machine with 13GB of RAM and four 2.6GHz processors.


Evaluation of the Preprocessing Components

Evaluation of the Correspondence Discovery

The evaluation data can be downloaded from here.
We also use the T2D Goldstandard for evaluating the quality of the instance- and schema-level correspondences between tables that are discovered by the CorrespondenceCreator. The matching method that is employed by the CorrespondenceCreator reaches the following F1 scores:

  • Instance Matching
    F1 = 0.753 (Precision = 0.949; Recall = 0.624)

  • Schema Matching
    F1 = 0.755 (Precision = 0.803; Recall = 0.712)


Evaluation of the Runtime Performance of the Correspondence Discovery

The performance of the backend components was evaluated using the Wikitables dataset. This dataset consists of 1.6 million tables, out of which 541 thousand are relational tables with a subject column and a minimum size of 3x3.
The IndexCreator needs 2 hours to index these 541 thousand tables. The CorrespondenceCreator needs 4 days to process these tables.
These times were measured using a machine with 8GB of RAM and a 3.1GHz processor.


Evaluation of Blocking Step

The evaluation data can be downloaded from here.
The CorrespondenceCreator employs a blocking step to reduce the number of table comparisons that are needed for identifying correspondences. The blocking technique clusters tables using a bag-of-words approach. When used to find likely matching pairs in the T2D Goldstandard, the blocking technique achieves a reduction ratio of 0.992 and a pair completeness of 0.701. The harmonic mean of these two values is 0.822 (2 × 0.992 × 0.701 / (0.992 + 0.701)).


Resources

Official Project Website

The official general website for the DS4DM project.

DS4DM code

The source code of DS4DM on GitHub.
The repository contains the server side components of the DS4DM project.

DS4DM videos

The DS4DM YouTube channel.

Publications

Anna Lisa Gentile, Petar Ristoski, Steffen Eckel, Dominique Ritze, and Heiko Paulheim: Entity Matching on Web Tables: A Table Embeddings Approach for Blocking. 20th International Conference on Extending Database Technology (EDBT 2017), Venice, Italy, March 2017.
Dominique Ritze, Christian Bizer: Matching Web Tables To DBpedia - A Feature Utility Study. 20th International Conference on Extending Database Technology (EDBT 2017), Venice, Italy, March 2017.
Anna Lisa Gentile, Sabrina Kirstein, Heiko Paulheim, and Christian Bizer: Extending RapidMiner with data search and integration capabilities. In The Semantic Web: ESWC 2016 - Satellite Events. Springer, 2016.
Petar Petrovski, Anna Primpeli, Robert Meusel, Christian Bizer: The WDC Gold Standards for Product Feature Extraction and Product Matching. 17th International Conference on Electronic Commerce and Web Technologies (EC-Web 2016), Porto, Portugal, September 2016.
Oliver Lehmberg, Dominique Ritze, Petar Ristoski, Robert Meusel, Heiko Paulheim, Christian Bizer: The Mannheim Search Join Engine. Journal of Web Semantics, 2015.

People


The following people work on the DS4DM project at the University of Mannheim:

Benedikt Kleppmann, Data and Web Science Group, University of Mannheim, Germany

Heiko Paulheim, Data and Web Science Group, University of Mannheim, Germany

Chris Bizer, Data and Web Science Group, University of Mannheim, Germany