Data Search For Data Mining
Search and Integrate Tabular Data
The BMBF-funded research project Data Search for Data Mining (DS4DM) extends the data mining platform RapidMiner with data search and data integration functionality that enables analysts to find relevant data in large corpora of tabular data and to semi-automatically integrate discovered data with existing local data. Specifically, the data search functionality allows users to extend an existing local data table with additional attributes (columns) of their choice. For example, if you have a table with data about companies loaded into RapidMiner, you can ask the DS4DM RapidMiner Data Search Operator to extend your table with the additional column 'Headquarter location'. The operator will search the corpus of tabular data that is loaded into the DS4DM backend for relevant data, add a new 'Headquarter location' column to the user's table, and populate the column with values from the data corpus.
In addition to the search functionality, the backend provides three pre-processing components: the WebtableExtractor, the IndexCreator, and the CorrespondenceCreator. All of these are described below.
The webservice offers data search as well as repository maintenance functionality via a REST API. The full API specification is here.
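For illustration, a client could query the webservice roughly like this (a minimal sketch using Java's built-in HttpClient; the endpoint path and parameter names are placeholders, not the actual DS4DM API, which is defined in the specification linked above):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Ds4dmSearchClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint and parameters; the real paths and
        // parameters are defined in the DS4DM API specification.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/ds4dm/keywordSearch"
                        + "?repository=webtables&keyword=population"))
                .GET()
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```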
The most important methods are described in detail below:
The keyword-based search method allows the user to extend a given table with an additional column. This is done by finding the data necessary for populating this additional column within a repository of data tables.
Let's assume that a user has loaded a table describing countries into RapidMiner. The table has the two columns "country" and "GDP (millions of US$)". The user now wants the additional column "population" added to the table and filled with data from the corpus - see stage ⓪ in the figure below. In order to retrieve tables ①, the keyword-based search algorithm searches for tables in the repository that have a column with a name similar to "population". Afterwards ②, the algorithm determines correspondences between instances (rows) in the found tables and rows of the query table by comparing the subject-column values of the two tables – in our case, by comparing the names of the countries. For the comparison, the subject-column values are normalized and then compared using Fuzzy-Jaccard with a threshold of 0.75. The threshold can be changed in the config file. The identification of the subject columns is done by a combination of the same string comparison on column headers and a subject-column-detection algorithm.
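The following sketch illustrates the flavor of this comparison (a simplified version that uses plain token Jaccard after normalization; the actual Fuzzy-Jaccard measure additionally counts near-identical tokens as matches, and the normalization details here are assumptions):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SubjectColumnMatcher {

    // Lower-case, strip punctuation, collapse whitespace, tokenize.
    static Set<String> normalize(String value) {
        String cleaned = value.toLowerCase()
                .replaceAll("[^a-z0-9 ]", " ")
                .replaceAll("\\s+", " ")
                .trim();
        return new HashSet<>(Arrays.asList(cleaned.split(" ")));
    }

    // Plain token Jaccard; the real Fuzzy-Jaccard also treats tokens
    // with a small edit distance as matching.
    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }

    static boolean sameEntity(String v1, String v2) {
        return jaccard(normalize(v1), normalize(v2)) >= 0.75; // threshold from the config file
    }

    public static void main(String[] args) {
        System.out.println(sameEntity("United States", "united states")); // true
        System.out.println(sameEntity("Germany", "France"));              // false
    }
}
```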
The correspondence-based search method performs the same task as the keyword-based search: finding tables within a repository of tabular data that are relevant for table expansion. The correspondence-based search, however, aims at achieving more complete results by employing pre-calculated correspondences between the tables in the repository - see ⓪ in the figure below. These schema- and instance-correspondences between tables in the repository are calculated during the creation of a repository. The method that is used to discover the correspondences is described below.
The correspondence-based search addresses the following limitation of the keyword-based search: If you want to extend your table with the additional column “population”, the keyword-based search will not return tables containing an "inhabitants" column, as the name of this column is too different from the name “population”. Nevertheless, the "inhabitants" column may contain data that is relevant for extending the query table.
The correspondence-based search method deals with this problem by using pre-calculated correspondences to expand the query result and return a more complete set of relevant tables: Initially, the method employs the same approach as the keyword-based method to find an initial set of tables that contain a column having a name similar to "population". Afterwards, the method searches the pre-calculated schema correspondences for correspondences that connect the discovered population columns with population columns in additional tables which have not yet been identified based on the column name. For instance, as the method for generating the correspondences also considers data values and not only column names, a pre-calculated correspondence could connect a “2012 Estimate” column in another table with a population column in one of the discovered tables, and the table containing the “2012 Estimate” column would consequently also be added to the set of relevant tables ①. The pre-calculated correspondences also come into play for finding instance correspondences. In addition to comparing subject-column values using string similarity – as in the keyword-based search – the pre-calculated instance correspondences are used to identify rows in the additional tables that describe entities which also appear in the query table ②. Again, as the instance correspondences are pre-calculated by not only comparing subject-column values but also considering the values of other columns, the set of instance correspondences that is returned to RapidMiner is more complete than the set that results from only comparing subject-column values.
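The following sketch illustrates the expansion idea: starting from the columns found by name similarity, the result set is enlarged with every column reachable via pre-calculated schema correspondences (the table and column names here are made up for the example):

```java
import java.util.*;

public class CorrespondenceExpansion {

    // Pre-calculated schema correspondences, e.g. a "population" column
    // connected to a "2012 Estimate" column. In the backend these are
    // computed when the repository is built; here they are hard-coded.
    static Map<String, List<String>> schemaCorrespondences = Map.of(
            "cities.population", List.of("estimates.2012 Estimate", "countries.inhabitants"),
            "estimates.2012 Estimate", List.of("cities.population"),
            "countries.inhabitants", List.of("cities.population"));

    // Expand the columns found by name similarity with all columns
    // reachable through the pre-calculated correspondences.
    static Set<String> expand(Set<String> foundByKeyword) {
        Set<String> result = new HashSet<>(foundByKeyword);
        Deque<String> queue = new ArrayDeque<>(foundByKeyword);
        while (!queue.isEmpty()) {
            String column = queue.poll();
            for (String corresponding :
                    schemaCorrespondences.getOrDefault(column, List.of())) {
                if (result.add(corresponding)) {
                    queue.add(corresponding);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Keyword search only found "cities.population"; expansion adds
        // the "2012 Estimate" and "inhabitants" columns as well.
        System.out.println(expand(Set.of("cities.population")));
    }
}
```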
Many use cases not only require extending local tables with data from public webtables; users would also like to extend tables using their own data, for example internal data from within a company.
The UploadTable functionality allows users to create their own table repositories on the server and upload tables into these repositories. The backend indexes the uploaded tables using the IndexCreator component. The backend also uses the CorrespondenceCreator component to discover instance- and schema-correspondences between uploaded tables and tables that are already contained in a repository. These correspondences are afterwards employed by the correspondence-based search.
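A table upload could look roughly as follows (again a sketch with Java's HttpClient; the endpoint path, parameter names, and content type are placeholders - consult the API specification for the actual upload interface):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class TableUploader {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical endpoint and parameter names for illustration only.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/ds4dm/uploadTable"
                        + "?repository=myRepository&tableName=companies.csv"))
                .header("Content-Type", "text/csv")
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("companies.csv")))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```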
The WebtableExtractor extracts tables from HTML pages and transforms them into the representation that is used by the other components of the backend.
The Extractor is currently implemented as a batch process. We assume that input data is stored locally, e.g. in the form of HTML pages.
The process iterates over the locally stored pages, extracts useful tables, and represents the output in a standardised format. The extraction is performed with the BasicExtraction algorithm, which iterates through the tables in an HTML page using the "table" tag. Heuristics are used to discard noisy tables.
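The basic iteration can be sketched with the jsoup HTML parser as follows (the minimum-size check stands in for the extractor's actual noise heuristics, which are not spelled out here):

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.File;

public class TableExtraction {
    public static void main(String[] args) throws Exception {
        Document page = Jsoup.parse(new File("page.html"), "UTF-8");

        // Iterate over all elements with the "table" tag.
        for (Element table : page.select("table")) {
            int rows = table.select("tr").size();
            int cols = rows == 0 ? 0
                    : table.select("tr").first().select("td, th").size();

            // Illustrative noise heuristic (the extractor's real rules
            // differ): discard very small tables.
            if (rows < 3 || cols < 3) {
                continue;
            }
            System.out.println("Candidate table with " + rows + " rows");
        }
    }
}
```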
The remaining tables are classified as "layout" or "content" tables. The classification is done with the classifyTable method, which uses two models (SimpleCart_P1 and SimpleCart_P2). The first model identifies whether the table is a LAYOUT table (used only for visual formatting purposes). If this is not the case, the second model is used to determine the content type of the table.
For all retained tables, the method additionally identifies:
- the key column: the column with the maximal number of unique values; in case of a tie, the left-most column is used;
- the header row: a row which has a different content pattern for the majority of its cells with respect to the other rows (currently this test is performed only on the first non-empty row of the table against the others);
- the table context: the 200 characters before and the 200 characters after the table itself.
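The key-column detection can be sketched as follows (choosing the column with the most unique values, with the left-most column winning ties):

```java
import java.util.HashSet;
import java.util.List;

public class KeyColumnDetection {

    // Pick the column with the most unique values; on a tie, the
    // left-most column wins because a later column only replaces the
    // current best when it is strictly better.
    static int detectKeyColumn(List<List<String>> columns) {
        int best = 0;
        int bestUnique = -1;
        for (int i = 0; i < columns.size(); i++) {
            int unique = new HashSet<>(columns.get(i)).size();
            if (unique > bestUnique) {
                bestUnique = unique;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<List<String>> columns = List.of(
                List.of("Germany", "France", "France"),   // 2 unique values
                List.of("Berlin", "Paris", "Lyon"));      // 3 unique -> key column
        System.out.println(detectKeyColumn(columns));     // prints 1
    }
}
```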
The DS4DM webservice has to be fast at identifying the correct tables to return to the DS4DM RapidMiner operator. To achieve this speed despite the large corpus of data tables it has to search through, indexes are needed. In total, three Lucene indexes are used; they contain information about the data tables in the webservice's corpus.
The IndexCreator creates these three Lucene indexes over the tables of a repository.
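Conceptually, indexing a table with Lucene looks like this (a minimal sketch; the field names and the exact contents of the three DS4DM indexes are assumptions for illustration):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class IndexCreatorSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());

        try (IndexWriter writer = new IndexWriter(dir, config)) {
            // One document per table; the field names are illustrative.
            Document doc = new Document();
            doc.add(new StringField("tableId", "tables/countries.csv", Field.Store.YES));
            doc.add(new TextField("columnHeaders", "country gdp population", Field.Store.YES));
            writer.addDocument(doc);
        }
    }
}
```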
When a repository of data tables is created, the CorrespondenceCreator is automatically run. The CorrespondenceCreator finds correspondences between tables in the repository. These correspondences are later used by the Correspondence-based Search to improve its search results.
The following types of correspondences are generated:
- schema correspondences, which connect columns in different tables that represent the same attribute;
- instance correspondences, which connect rows in different tables that describe the same entity.
In order to find these correspondences, the CorrespondenceCreator executes the following steps:
1. A blocking step clusters the tables using a bag-of-words approach in order to reduce the number of table pairs that have to be compared (this step is evaluated below).
2. Instance correspondences are generated between the rows of the remaining candidate table pairs.
3. Schema correspondences are generated between the columns of the remaining candidate table pairs.
Steps 2 and 3 of the correspondence creation process are implemented using the WInte.r Data Integration Framework.
In the following, we evaluate different aspects of the DS4DM backend components. First, we evaluate the keyword-based and correspondence-based search methods. For this, we measure the density of the attributes that are added to various query tables as well as the accuracy of the values of the newly added attributes. Afterwards, we turn to the correspondence discovery and evaluate the quality of the generated correspondences, the runtime needed to generate the correspondences, and the effectiveness of the blocking technique that is used as part of the correspondence discovery process.
Evaluation of the Table Search Methods
We measure the density (completeness) of the attributes that are added to different query tables as well as the quality of the values of the newly added attributes. We use the T2D Goldstandard V2 for the evaluation. The gold standard consists of 779 tables that were extracted from HTML pages and cover a wide range of topics, such as populated places, organizations, people, and music. The tables were manually mapped to the DBpedia knowledge base. For our evaluation, we derive schema- and instance-correspondences between different tables within the gold standard by looking for pairs of schemas/instances that have both been mapped to the same DBpedia entry.
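The derivation can be sketched as follows: columns (or, analogously, rows) that were manually mapped to the same DBpedia entry are paired up as correspondences (the mappings in this example are made up):

```java
import java.util.*;

public class GoldstandardCorrespondences {

    public static void main(String[] args) {
        // Column -> DBpedia entry it was manually mapped to
        // (illustrative values, not taken from the actual gold standard).
        Map<String, String> columnToDbpedia = Map.of(
                "tableA.population", "dbo:populationTotal",
                "tableB.inhabitants", "dbo:populationTotal",
                "tableC.gdp", "dbo:gdp");

        // Group columns by their DBpedia entry ...
        Map<String, List<String>> byEntry = new HashMap<>();
        columnToDbpedia.forEach((column, entry) ->
                byEntry.computeIfAbsent(entry, e -> new ArrayList<>()).add(column));

        // ... every pair within a group is a derived schema correspondence.
        for (List<String> group : byEntry.values()) {
            for (int i = 0; i < group.size(); i++) {
                for (int j = i + 1; j < group.size(); j++) {
                    System.out.println(group.get(i) + " <-> " + group.get(j));
                }
            }
        }
    }
}
```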
For evaluating the table search methods, we use 15 different query tables that each should be extended by one specific attribute. Using the correspondences of the T2D Goldstandard, we inferred the best possible population of the extension attributes of these 15 tables. The evaluation results below were calculated by comparing the values of the extension attributes that were added to the tables by the two search algorithms with the optimal values of the attributes as derived from the gold standard. The complete details about the evaluation, including the 15 query tables as well as the content of the attributes that were added to the tables, can be found here.
The table on the left contains the results obtained using the keyword-based search method. You will notice that, in general, the algorithm was able to populate the extension column with high density and accuracy. There are two exceptions for which the algorithm was not able to populate the extension column properly: Mountain-MountainRange and Film-ReleaseDate.
The table on the right shows the results of the correspondence-based search method. Especially for Film-ReleaseDate and Country-Code, many more values in the extension column could be populated due to the additionally found data (“missing values filled” is high).
In the special cases of Country-Currency and Game-Developer, some of the additionally found data was wrong. In these cases, extension-column values which had previously been populated correctly were now populated wrongly (negative “difference in correct values after fusion” values).
Evaluation of the Correspondence Discovery
We also use the T2D Goldstandard for evaluating the quality of the instance- and schema-level correspondences between tables that are discovered by the CorrespondenceCreator. The F1 scores reached by the matching method that is employed by the CorrespondenceCreator are listed below.
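The F1 score is the harmonic mean of precision and recall: F1 = 2 · precision · recall / (precision + recall).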
Evaluation of the Runtime Performance of the Correspondence Discovery
The performance of the backend components was evaluated using the Wikitables dataset. This dataset consists of 1.6 million tables, out of which 541 thousand are relational tables with a subject column and a minimum size of 3x3.
The IndexCreator needs 2 hours to index these 541 thousand tables. The CorrespondenceCreator needs 4 days to process these tables.
These times were measured using a machine with 8GB of RAM and a 3.1GHz processor.
Evaluation of the Blocking Step
The CorrespondenceCreator employs a blocking step to reduce the number of table comparisons that are needed for identifying correspondences. The blocking technique clusters tables using a bag-of-words approach. When used to find likely matching pairs in the Wikitables dataset, the blocking technique achieves a reduction ratio of 0.992 and a pair completeness of 0.701. The harmonic mean of these two values is 0.822.
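For reference, the two measures relate as follows (a small calculation using the reported numbers; the reduction ratio is the fraction of pairwise comparisons avoided by blocking, and the pair completeness is the fraction of true matches that survive it):

```java
public class BlockingMetrics {

    public static void main(String[] args) {
        // Standard blocking-quality measures:
        // reduction ratio   = 1 - comparisonsAfterBlocking / allPossiblePairs
        // pair completeness = trueMatchesSurvivingBlocking / allTrueMatches
        double reductionRatio = 0.992;   // reported for the Wikitables dataset
        double pairCompleteness = 0.701; // reported for the Wikitables dataset

        double harmonicMean = 2 * reductionRatio * pairCompleteness
                / (reductionRatio + pairCompleteness);

        System.out.printf("Harmonic mean: %.3f%n", harmonicMean); // ~0.822
    }
}
```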
Evaluation of the Table Upload Functionality
The uploadTable functionality allows users to add individual tables to an existing corpus of tables. The uploaded tables are indexed, and correspondences are created between the newly uploaded table and the tables that are already in the repository. Uploading and indexing 1000 tables one by one takes approximately 2 hours (7.2 seconds per table). Alternatively, the bulkTableUpload functionality can be used for uploading larger amounts of tables. Processing the same 1000 tables using the bulkTableUpload only requires 10 minutes (0.6 seconds per table). These times were measured using a machine with 13GB of RAM and four 2.6GHz processors.