Project: Multivariate correlations for data cleaning

Description

Correlations are used extensively in all data-intensive disciplines to identify relations in the data (e.g., relations between stocks, or between medical conditions and genetic factors). The 'industry-standard' correlations are pairwise correlations, i.e., correlations between two variables.

Multivariate correlations are correlations between three or more variables. Compared to pairwise correlations, they are more expressive, and in recent years they have repeatedly proven instrumental for understanding the data, extracting key insights, and enabling data-driven innovation. However, discovering these correlations is computationally intensive, and often requires supercomputers, large clusters, or cloud resources. To date, Correlation Detective is the state-of-the-art system for finding multivariate correlations. It is an open-source system, built and maintained at the TU/e, and used in projects with industry.
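
To give a feel for why a multivariate correlation can reveal more than pairwise ones, here is a minimal sketch on synthetic data. It assumes one simple multivariate measure (the Pearson correlation between the sum of two variables and a third). The data, the variable names, and the helper `pearson` are illustrative assumptions; this is plain NumPy and does not use Correlation Detective or its API.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Synthetic data (an assumption for illustration): z depends on x and y jointly.
rng = np.random.default_rng(0)
n = 10_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = x + y + 0.1 * rng.standard_normal(n)

# Pairwise view: each variable alone explains z only partially.
pair_xz = pearson(x, z)
pair_yz = pearson(y, z)

# Multivariate view: x and y combined (here, summed) explain z almost fully.
multi_xy_z = pearson(x + y, z)

print(f"corr(x, z)     = {pair_xz:.3f}")
print(f"corr(y, z)     = {pair_yz:.3f}")
print(f"corr(x + y, z) = {multi_xy_z:.3f}")
```

In this toy setting the two pairwise correlations are moderate, while the three-variable correlation is close to 1. On real data the interesting combinations are not known in advance, which is what makes the search expensive.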

 

In this project, you will use Correlation Detective to discover multivariate correlations on noisy datasets, and explore whether imputation of data and other data cleaning operations can be improved with the use of multivariate correlations.
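
As a rough illustration of the idea, the sketch below imputes missing values of one variable from two correlated variables and compares the result with plain mean imputation. It assumes that a multivariate correlation between x, y, and z has already been found; the synthetic data, the 10% missing rate, and the choice of a least-squares linear model are all assumptions for illustration, not a method prescribed by the project or by Correlation Detective.

```python
import numpy as np

# Synthetic data (an assumption for illustration): z is correlated with x and y jointly.
rng = np.random.default_rng(1)
n = 2_000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z_true = x + y + 0.1 * rng.standard_normal(n)

# Remove roughly 10% of the z values to simulate a noisy, incomplete dataset.
missing = rng.random(n) < 0.1
z_obs = z_true.copy()
z_obs[missing] = np.nan

# Baseline: replace each missing value with the mean of the observed values.
z_mean = np.where(missing, np.nanmean(z_obs), z_obs)

# Correlation-aware alternative: fit z ~ a*x + b*y + c on the observed rows,
# then predict the missing rows from x and y.
A = np.column_stack([x, y, np.ones(n)])
coef, *_ = np.linalg.lstsq(A[~missing], z_obs[~missing], rcond=None)
z_reg = np.where(missing, A @ coef, z_obs)


def rmse(estimate):
    """Root-mean-square error on the imputed entries only."""
    return float(np.sqrt(np.mean((estimate[missing] - z_true[missing]) ** 2)))


rmse_mean = rmse(z_mean)
rmse_reg = rmse(z_reg)
print(f"RMSE, mean imputation:       {rmse_mean:.3f}")
print(f"RMSE, regression imputation: {rmse_reg:.3f}")
```

Whether such gains carry over to real, noisy datasets, and to other cleaning operations, is the kind of question the project sets out to explore.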

 

Prerequisites: ability to write efficient code in Java, Scala, or Python

Recommended: successful completion of 2AMD15.

 

Reading material:

Correlation Detective website: https://corrdetective.github.io

Paper describing Correlation Detective: https://vldb.org/pvldb/vol15/p1266-papapetrou.pdf

Details
Supervisor
Odysseas Papapetrou