
Project: Correlation Detective on streaming data

Description

Correlations are used extensively in all data-intensive disciplines to identify relations in the data (e.g., relations between stocks, or between medical conditions and genetic factors). The 'industry-standard' correlations are pairwise correlations, i.e., correlations between two variables.

Multivariate correlations are correlations between three or more variables. Compared to pairwise correlations, multivariate correlations are more expressive, and in recent years they have repeatedly proven instrumental for understanding the data, extracting key insights, and enabling data-driven innovation. However, discovering these correlations is computationally intensive, and often requires supercomputers, large clusters, or cloud resources. To date, Correlation Detective is the state-of-the-art system for finding multivariate correlations. It is an open-source system, built and maintained at TU/e, and used in projects with industry. The system also supports processing data streams and keeping the discovered correlations up to date as new data arrives.
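To illustrate why multivariate correlations can be more expressive than pairwise ones, here is a minimal Java sketch. It assumes a simplified aggregate-based measure (the Pearson correlation between one series and the element-wise average of several others); this is an illustrative choice and not necessarily the exact definition used by Correlation Detective. The class and method names (CorrelationSketch, pearson, average, aggregateCorrelation) are hypothetical.

```java
import java.util.Arrays;

class CorrelationSketch {

    /** Pearson correlation between two equal-length series. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have equal length >= 2");
        }
        double meanX = Arrays.stream(x).average().orElse(0.0);
        double meanY = Arrays.stream(y).average().orElse(0.0);
        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Element-wise average of several equal-length series. */
    static double[] average(double[]... series) {
        if (series.length == 0) {
            throw new IllegalArgumentException("need at least one series");
        }
        double[] avg = new double[series[0].length];
        for (double[] s : series) {
            if (s.length != avg.length) {
                throw new IllegalArgumentException("series must have equal length");
            }
            for (int i = 0; i < avg.length; i++) {
                avg[i] += s[i] / series.length;
            }
        }
        return avg;
    }

    /**
     * Assumed, simplified multivariate measure: correlation between the
     * target series and the average of the other series.
     */
    static double aggregateCorrelation(double[] target, double[]... others) {
        return pearson(target, average(others));
    }

    public static void main(String[] args) {
        // Toy data: z is the sum of two uncorrelated series x and y.
        double[] x = {1, -1, 1, -1};
        double[] y = {1, 1, -1, -1};
        double[] z = {2, 0, 0, -2};
        System.out.println("corr(z, x)         = " + pearson(z, x));
        System.out.println("corr(z, y)         = " + pearson(z, y));
        System.out.println("corr(z, avg(x, y)) = " + aggregateCorrelation(z, x, y));
    }
}
```

In this toy example, z is only moderately correlated with x and with y individually, but it is perfectly correlated with their aggregate. A pairwise analysis would therefore understate the relation among the three series. The cost is that the number of candidate combinations grows rapidly with the number of variables, which is why discovery is expensive.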

 

In this project, you will develop a scalable version of the streaming Correlation Detective by distributing it over a massively parallel processing (MPP) platform (e.g., Spark or Flink). The goal is a system that can easily scale to handle more streams and higher arrival rates, simply by adding more computers.
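As a rough illustration of the kind of state a streaming, distributed design has to manage, the plain-Java sketch below maintains a pairwise Pearson correlation over a sliding window using running sums, and assigns each pair of streams to a worker by hashing. It deliberately uses no Spark or Flink API, and it is not how Correlation Detective works internally. All names (SlidingCorrelation, workerFor) and the hash-based assignment are assumptions made for this example.

```java
import java.util.ArrayDeque;

class SlidingCorrelation {
    private final int windowSize;
    private final ArrayDeque<double[]> window = new ArrayDeque<>();
    // Running sums over the current window, updated in O(1) per arrival.
    private double sumX, sumY, sumXX, sumYY, sumXY;

    SlidingCorrelation(int windowSize) {
        if (windowSize < 2) {
            throw new IllegalArgumentException("window size must be >= 2");
        }
        this.windowSize = windowSize;
    }

    /** Adds one synchronized observation and evicts the oldest if the window is full. */
    void add(double x, double y) {
        window.addLast(new double[] {x, y});
        sumX += x; sumY += y;
        sumXX += x * x; sumYY += y * y; sumXY += x * y;
        if (window.size() > windowSize) {
            double[] old = window.removeFirst();
            sumX -= old[0]; sumY -= old[1];
            sumXX -= old[0] * old[0]; sumYY -= old[1] * old[1];
            sumXY -= old[0] * old[1];
        }
    }

    /** Pearson correlation over the current window, or NaN if undefined. */
    double correlation() {
        int n = window.size();
        if (n < 2) return Double.NaN;
        double cov = sumXY - sumX * sumY / n;
        double varX = sumXX - sumX * sumX / n;
        double varY = sumYY - sumY * sumY / n;
        if (varX <= 0 || varY <= 0) return Double.NaN;
        return cov / Math.sqrt(varX * varY);
    }

    /**
     * Hypothetical assignment of a stream pair to one of numWorkers workers.
     * The pair is ordered first so that (a, b) and (b, a) map to the same worker.
     */
    static int workerFor(int streamA, int streamB, int numWorkers) {
        int lo = Math.min(streamA, streamB);
        int hi = Math.max(streamA, streamB);
        return Math.floorMod(31 * lo + hi, numWorkers);
    }

    public static void main(String[] args) {
        SlidingCorrelation sc = new SlidingCorrelation(3);
        double[][] arrivals = {{1, 2}, {2, 4}, {3, 6}, {4, -8}};
        for (double[] a : arrivals) {
            sc.add(a[0], a[1]);
            System.out.println("correlation over window: " + sc.correlation());
        }
        System.out.println("pair (3, 7) handled by worker " + workerFor(3, 7, 4));
    }
}
```

Under these assumptions, adding computers increases numWorkers, so each worker tracks fewer pairs. A real design would also need to handle multivariate combinations, skewed load, and the numerical drift that running sums accumulate over long streams.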

 

Prerequisites: ability to write efficient code in Java or Scala, successful completion of 2AMD15.

 

Reading material:

Correlation Detective website: https://corrdetective.github.io

Paper describing Correlation Detective: https://vldb.org/pvldb/vol15/p1266-papapetrou.pdf

Details
Supervisor: Odysseas Papapetrou