News

24th July 2018

Final release of the DS4DM Backend Components. This release includes some minor changes and bugfixes. The changes were made to the empty-string handling, the column name transformations, the parameter naming, the table fusion, the result-formatting. Also new parameters have be added for increased configurability of the data searches. The new version of the DS4DM can be downloaded from Github.

8th July 2018

A paper on the unconstrained and correlation-based Table Extension was accepted for the LWDA Conference 2018. The work will be presented there on the 22-24 August.

19th April 2018

5th release of the DS4DM Backend Components. The DS4DM backend has been extended with two new data search functions: Unconstrained Table Extension which adds all attributes to a query table which can be filled with data values so that a minimal density is reached; and Correlation-based Table Extension which adds all attributes to a query table which correlate with a specific attribute of the query tables. The new functions have been evaluated using the T2D Gold Standard. The extended DS4DM backend as well as the data that was used for the evaluation can be downloaded from Github.

27th November 2017

4th release of the DS4DM Backend Components. The new release provides improved versions of the keyword-based and correspondence-based data search methods as well as a detailed evaluation of the new methods using the T2D Gold Standard. The optimized DS4DM Backend Components as well as the data that was used for the evaluation can be downloaded from Github .

20th September 2017

Release of the third version of the DS4DM Backend Components. The new release allows the user to choose between different repositories/corpuses of tables and to create their own repositories using the newly introduced UploadTable functionality.

10th July 2017

Release of the second version of the DS4DM Backend Components. The new release includes new functionality for correspondence-based table search as well as new backend components for discovering instance- and schema-level correspondences between tables in pre-processing.

11th August 2016

Set up of a dedicated website for the DS4DM backend technologies and release of initial prototype code.

2nd June 2016

Our demonstration on Extending RapidMiner with Data Search and Integration Capabilities won the best demo award at ESWC2016, amongst the 19 demos accepted for presentation at the conference. More info on the Data and Web Science group news.

DS4DM Webservice

The webservice offers data search as well as repository maintainance functionality via a REST API.
The API specification is here. The Javadoc for the API-methods is here.

The most important methods are described in detail below:

Constrained Table Extension
Unconstrained Table Extension
Correlation-based Table Extension
Upload table functionality

Constrained Table Extension

The Constrained Table Expansion allows the user to extend a given table with an additional column. This is done by finding the data necessary for populating this additional column within a repository of data tables.

There are two methods that can be used for finding the right data for populating the additional column: Keyword-based Search and Correspondence-based Search.

Keyword-based Search

(API specification, Javadoc)

The keyword-based search method allows the user to extend a given table with an additional column. This is done by finding the data necessary for populating this additional column within a repository of data tables.

Lets assume that a user has loaded a table describing countries into RapidMiner. The table has the two columns "country" and "GDP". The user now wants to have the additional column "population" added to the table and filled with data from the corpus - see step ⓪ in figure below. In step ①, the keyword-based search algorithm searches for tables in the repository, that have a column with a column name similar to "population". Afterwards (step ②), the algorithm determines correspondences between instances (rows) in the found tables and rows the query table, by comparing the subject-column values of these two tables – in our case: comparing the names of the countries. For the comparison, the subject column values are normalized and compared using Fuzzy-Jaccard with a threshold of 0.75. The threshold can be changed in the config-file. The identification of the subject columns is done by a combination of the same string comparison on column headers and a subject-column-detection algorithm [1].

Correspondence-based Search

(API specification, Javadoc)

The correspondence-based search method performs the same task as the keyword-based search: find tables that are relevant for table expansion within a repository of tabular data. The correspondence-based search however aims at achieving complete results by employing pre-calculated correspondences between the tables in the repository - see ⓪ in the figure below. These schema- and instance- correspondences between tables in the repository are calculated during the creation of a repository. The method that is used to discover the correspondences is described below.

The correspondence-based search addresses the following limitation of the keyword-based search: If you want extend your table with the additional column “population”, the keyword-based search will not return tables containing an "inhabitants" column as the name of this column is too different from the name “population”. Nevertheless, the column "inhabitants" may contain data that is relevant for extending the query table.

The correspondence-based search method deals with this problem by using pre-calculated correspondences to expand the query result and return a more complete set of relevant tables: Initially, the method employs the same approach as the keyword-based method to find an initial set of tables that contain a column having a name similar to "population". Afterwards, the method searches the pre-calculated schema correspondences for correspondences that connect the discovered population columns with population columns in additional tables which have not been identified based on the column name yet. For instance, as the method for generating the correspondences also considers data values and not only column names, a pre-calculated correspondence could connect an “2012 Estimate” column in another table with a population column in one of the discovered tables and the table containing the “2012 Estimate” column would consequently also be added to the set of relevant tables ①. The pre-calculated correspondences also come to play for finding instance correspondences. In addition to comparing subject column values using string similarity - as in the keyword-based search – the pre-calculated instance correspondences are used to identify rows in the additional tables that describe entities which also appear in the query table ②. Again, as the instance correspondences are pre-calculated by not only comparing subject column values but also considering the values from other columns, the set of instance correspondences that is returned to RapidMiner is more complete than the set that results from only comparing subject column values.

Unconstrained Table Extension

(API specification, Javadoc)

The Constrained Table Extension extends a provided table with exactly one additional column. The Unconstrained Table Extension on the other hand extends the provided table with as many columns as possible.
There is the restriction that the new extension columns have to have a minimum density of 10%. Another difference between the Constrained Table Extension and the Unconstrained Table Extension is the following: For the Constrained Table Extension the Backend API functions search for the correct data and the correspondences needed for fusing the data/populating the extension column, the actual fusion and population however happens in the front-end. This allows the user greater control over the fusion process. For the Unconstrained Table Extension on the other hand the fusion and population is done in the backend, as there are just too many variables involved for a user to effectively manage the fusion process.

The Algorithm for Unconstrained Table Extension works in the following way:
(An illustration of the steps is in the figure below.)

The subject column of the provided table is automatically detected.
First, the program checks whether one of the column headers matches one of the fifteen regex-patterns for subject columns headers (such as “.*name”). If no column header was identified as a subject column header, then the string-column with the biggest amount of distinct values is chosen as the subject column.

The KeyColumnIndex is used to find the tables in the repository with similar subject columns.
This Lucene index uses a tf-idf-like similarity score to calculate the amount of overlap between the values of the subject column of the provided table and the subject column of each of the tables in the repository.

The instance matches between the provided table and the tables that were found in step② are determined.
For this a similarity is calculated for each instance pair. This similarity is made up of the similarity between the subject-column values (50%) and the maximal similarity of the non-subject-column values (50%). The similarities are calculated using datatype-specific similarity metrics (string-similarity, number-similarity and date-similarity).

The schema correspondences between all of the tables (the provided table and the matching tables from the repository) are determined. Schema correspondences are found by either
• Label-based Schema Matching - considers the similarity between column headers, or
• Instance-based Schema Matching - considers the similarity of values for the matched entities.

Grouping of Columns according to Schema correspondences.
Corresponding columns/schemas contain information about the same attribute. The columns of the tables are grouped, so that there is one group per attribute.

For each group, the columns of that group are fused into one.
First, the instance correspondences are used to determine at which position of the fused column a value belongs. Next, if there are several values that belong at the same position, then one of them is chosen. This done with the following rules:
• If available, then choose the value from the provided table
• If not, then do similarity-based voting - the value with the greatest similarity to the other values is chosen.

Correlation-based Table Extension

The Unconstrained Table Extension can add over 100 columns to a provided table (when the used repository is large). This amount of columns can be overwhelming for the user. The Correlation-based Table Extension was developed in order to address this issue. Instead of extending a provided table with as many columns as possible, it extends the provided table only with those columns that correlate to a specified attribute in the provided table – the ‘correlation attribute’.

In practice the Correlation-based Table Extension is implemented as an extension of the Unconstrained Table Extension. In the first step the Unconstrained Table Extension is used to add as many columns as possible to the provided table. In the second step ‘Correlation-based Filtering’ is applied to the extended table in order to only keep those columns that correlate with the correlation attribute.

The Unconstrained Table Extension Algorithm is described above. The Correlation-based Filtering Algorithm goes through the following steps:

Columns that should be numbers are converted to numbers.
Often numeric columns are interpreted as strings, this is because of the following issues with the data:
- The values contain additional symbols and measurement units such as ‘%’, ‘€’, etc.
- Individual values are strings such as ‘not available’ or ‘-’
- The decimal separator might be a ‘.’ or a ‘,’ depending on where the data was captured. (the thousands separator can be a ‘,’, ‘.’ or ‘ ’)
There is a function that recognizes which of these issues appear in the columns of the extended table and converts them to numeric columns.

Calculate Correlations
Only correlations between numeric columns are considered. These are calculated with Pearson correlation coefficient. The correlations between categorical variables or between a numeric and a categorical variable are not useful, as mainly surface forms are found as highly correlating variables. Only columns with a correlation bigger than 0.8 are kept.

Upload Table Functionality

Many use cases not only require extending local tables with data from public webtables, but users would like to extend tables using their own data, for example internal data from within a compary.
The UploadTable functionality allows users to create their own table repositories on the server and upload tables into these repositories. The backend indexes the uploaded tables using the IndexCreator Component. The backend also uses the CorrespondenceCreator Component to discover instance- and schema-correspondences between uploaded tables and tables that are already contained in a repository. These correspondences are employed afterwards by the correspondence-based search.

Preprocessing - WebtableExtractor

The WebtableExtractor extracts tables from HTML pages and transforms them into the representation that is used by the other components of the backend.

The Extractor is currently implemented as a batch process. We assume that input data is stored locally, e.g. in the form of HTML pages.

The process iterates over the locally stored pages, extracts useful tables and represent the output in a standardised format.
The extraction is performed with BasicExtraction algorithm which iterates through tables in the HTML page using the "table" tag. Heuristics are used to discard noisy tables:

tables inside forms
tables which contain sub-tables
tables with less than a certain number of rows (this parameter is currently set to 3)
tables with less than a certain number of columns (this parameter is currently set to 2)
tables with "rowspan" or "colspan"
tables that do not contain header cells ("th" element)

The remaining tables are classified as "layout" or "content" tables. The classification is done with the classifyTable method. It uses two models (SimpleCart_P1 and SimpleCart_P2). The first model identifies if the table is a LAYOUT table (only used for visualization formatting purposes). If this is not the case, then the second model is used to classify a table as:

RELATION (containing multiple entities and relations)
ENTITY (describing a single entity)
MATRIX
OTHER (if no type can be decided)

For all retained tables, the method additionally identifies:

the key column
the header row
context information

The key column detection selects the column with the maximal number of unique values. In case of a tie, the left-most column is used.
The header detection identifies a row which has a different content pattern for the majority of its cells, with respect to the other rows. Currently this test is performed only on the first non-empty row of the table against the others.
As context information for each table we select 200 characters before and 200 after the table itself.

Preprocessing - CorrespondenceCreator

When a repository of data tables is created, the CorrespondenceCreator is automatically run. The CorrespondenceCreator finds correspondences between tables in the repository. These correspondences are later used by the Correspondence-based Search to improve its search results.

The following types of correspondences are generated:

Schema Correspondences
A Schema Correspondence marks two columns from two different tables as containing the same attribute e.g. column1 from table1 and column4 from table3 both contain company sizes.

Instance Correspondences
An Instance Correspondence marks two rows from two different tables as referring to the same real-world-object e.g. row2 from table1 and row11 from table3 both contain information about the company Tesla.

In order to find these correspondences the CorrespondenceCreator executes the following steps:

Blocking
Looking for correspondences between all tables would be too computationally expensive. In the blocking step, we therefore generate pairs of tables that are likely to have correspondences.
For a given table the five tables with the most similar subject columns are chosen as likely corresponding tables. The KeyColumnIndex is used for comparing the subject columns.

Instance-matching
The instance correspondences between two tables are found by comparing the values from a row in the first table to the values from a row in the second table. For comparing these values data-type-specific similarity metrics are used. Also, the similarity of the subject column values is given a weight of 0.5. While the remaining 0.5 weight points are equaliy divided over all other columns. Row combinations with an overall similarity score above 0.55 are considered instance matches. This threshold can be configured in the config file.

Schema-matching
Here we find corresponding columns in the two tables, by comparing their column values and column headers. When comparing the column values the knowledge about the instance-matches helps us get a better accuracy. Here too, data type-specific similarity measures are used for comparing column values.

Steps 2. and 3. of the correspondence creation process are implemented using the WInte.r Data Integration Framework.

Evaluation

In the following, we evaluate different aspects of the DS4DM backend components. First, the T2D Goldstandard is presented. This Goldstandard was used for many of the evaluations. Afterwards, the evaluations of the Constrained-, the Unconstrained- and the Correlation-based- Table Extension are presented. Finally, the Table Upload Functionality and the Preprocessing Components are evaluated.
The data from all the evaluations can be downloaded here.

T2D Goldstandard
For multiple of the evaluations, the T2D Goldstandard V2 is used. This is a collection of 779 tables that were extracted from HTML pages and cover a wide range of topics - such as populated places, organizations, people, music, etc. The tables were manually mapped to the DBpedia knowledge base. For our evaluation, we derive schema- and instance-correspondences between the different tables of the goldstandard by looking for pairs of schemas/instances that both have been mapped to the same DBpedia entry.

Evaluation of the Constrained Table Extension
The evaluation data can be downloaded from here.
We measure the density (completeness) of the attribute that is added to different query tables as well as the quality of the values of the newly added attribute. We use the T2D Goldstandard V2 for the evaluation.

For evaluating the table search methods we use fifteen different query tables that each should be extended by one specific attribute. Using the correspondences of the T2D Goldstandard, we inferred the best possible population of this extension attribute for the fifteen tables. The evaluation results below where calculated by comparing the values of the extension attributes that were added to the tables by the two search algorithms with the optimal values of the attributes as derived from the goldstandard. The complete details about the evaluation including the 15 query tables as well as the content of the attributes that were added to the tables can be found here.

Evaluation tables for the Datasearch

The table on the left contains the results obtained using the keyword-based search method. You will notice, that in general the algorithm was able to populate the extension column with a high density and accuracy. There are two exceptions for which the algorithm as not able to populate the extension column properly: Mountain-MountainRange and Film-ReleaseDate.
The table on the right shows the results of the correspondence-based search method. Especially for Film-ReleaseDate and Country-Code, due to the additionally found data, many more values in the extension column could be populated (“missing values filled” is high).
In the special cases of Country-Currency and Game-Developer, some of the additionally found data was wrong. In this case, extension-column values which had previously been correctly populated, now were wrongly populated (negative “difference in correct values after fusion” values).

Evaluation of the Unconstrained Table Extension
The evaluation data can be downloaded from here.
For the evaluation of the Unconstrained Table Extension, 13 different tables with different types of entities (Airports, Currencies, Lakes, etc.) were extended using the tables from the T2D Goldstandard as repository. The resulting extended tables were compared with ideal solutions to calculate the Precision-, Recall- and F1- scores shown below.

The ideal solutions were generated the following way:

Identify the subject-columns of the tables in the T2D Goldstandard
Determine groups of corresponding subject columns, using the known schema correspondences of the T2D Goldstandard
Select the 13 largest groups. (This is how the 13 types of entities were determined - Airports, Currencies, Lakes, etc.)
For each group: Fuse the tables whose subject column is in this group, by using the known schema- and instance- correspondences between these tables.
When, for the fusion, several values are associated with one position, all of them are kept - in the evaluation all of them are counted as correct.

Evaluation of unconstrained table extensioin

Analysis of the results
The Recall is better than the Precision for most of the examples. This is because the ideal solutions were generally not as well populated as the results of the Unconstrained Table Extension. The ideal solutions only contained values that have entity- and schema- correspondences to other tables, whereas the results used all values from the found tables.
A lot of the other differences can be explained by the quality and the similarity of the underlying tables – for Lakes and Journals the underlying tables were very different; for Animals and Airlines they were very similar.

Evaluation using Product Data from different E-Shops
Another evaluation of the Unconstrained Table Extension was performed using product data that had extracted from thirteen e-shops using a focused crawler. The Evaluation was performed in much the same way as with the T2D_Goldstandard data (see above). The evaluation data as well as a detailed description of the evaluation setup is found here.

Evaluation of unconstrained table extension with product data scraped by the focussed crawler

Evaluation of the Correlation-based Table Extension
The evaluation data can be downloaded from here.
The Evaluation of Unconstrained Table Extension was done with 13 query tables. For the evaluation of the Correlation-based Table Extension, from these thirteen tables, the four were considered that have the most numerical columns.

For these tables the additional columns found by the Correlation-based Table Extension were compared with the columns found when applying the Correlation-based Filtering to the ideal solutions.
The following results were obtained:

Precision and Recall for correlation based table extension

Analysis of the results
The scores are not 100%, because the Unconstrained Table Extension doesn’t work perfectly. The recall is better, because the tables generated by the Unconstrained Table Extension generally have more columns. So, more of columns are found to be correlating. Other things that cause wrong correlations are: columns with very low density, values with different measurement units (e.g. km2 and m2) in the same column.

Evaluation of the Table Upload Functionality
The uploadTable functionality allows users to add individual tables to an existing corpus of tables. The uploaded tables are indexes and correspondences are created between the newly uploaded table and the tables that are already in the repository. Uploading and indexing 1000 tables one by one takes approximately 2 hours (7.2 seconds per table). Alternatively, the bulkTableUpload functionality can be used for uploading larger amounts of tables. Processing the same 1000 tables using the bulkTableUpload only requires 10 minutes (0.6 seconds per table). These times were measured using a machine with 13GB of RAM and four 2.6GHz processors.

Evaluation of the Preprocessing Components

Evaluation of the Correspondence Discovery

The evaluation data can be downloaded from here.
We also use the T2D Goldstandard for evaluating the quality of the instance- and schema-level correspondences between tables that are discovered by the CorrespondenceCreator. The matching method that is employed by the CorrespondenceCreator reaches the following F1 scores:

Instance Matching
F1 = 0.753 (Precision = 0.949; Recall = 0.624)

Schema Matching
F1 = 0.755 (Precision = 0.803; Recall = 0.712)

Evaluation of the Runtime Performance of the Correspondence Discovery

The Perfomance of the Backend components was evaluated using the Wikitables dataset. This dataset consists of 1.6 million tables out of which 541 thousand are relational tables with a subject column and a minimum size of 3x3.
The IndexCreatior needs 2 hours to index these 541 thousand tables. The CorrespondenceCreator needs 4 days to process theses tables.
These times were measured using a machine with 8GB of RAM and a 3.1GHz processor.

Evaluation of Blocking Step

The evaluation data can be downloaded from here.
The CorrespondenceCreator employs a blocking step to reduce the number of table comparisons that are needed for identifying correspondences. The blocking technique clusters tables using a bag-of-words approach. When used to find likely matching pairs in the T2D Goldstandard, the blocking technique achieves a reduction ratio of 0.992 and a pair completenes of 0.701. The harmonic mean of these two values is 0.822.

DS4DM

DS4DM

About

News

DS4DM Backend Components

DS4DM Webservice

Constrained Table Extension

Keyword-based Search

Correspondence-based Search

Unconstrained Table Extension

Correlation-based Table Extension

Upload Table Functionality

Preprocessing - WebtableExtractor

Preprocessing - IndexCreator

Preprocessing - CorrespondenceCreator

Evaluation

Evaluation of the Correspondence Discovery

Evaluation of the Runtime Performance of the Correspondence Discovery

Evaluation of Blocking Step

Resources

Official Project Website

DS4DM code

DS4DM videos

Publications

People

Benedikt KleppmannData and Web Science Group, University of Mannheim, Germany

Heiko PaulheimData and Web Science Group, University of Mannheim, Germany

Chris BizerData and Web Science Group, University of Mannheim, Germany