Data and AI cluster

Project: Lagged multivariate correlations

Description

Correlations are used extensively in all data-intensive disciplines to identify relations in the data (e.g., relations between stocks, or between medical conditions and genetic factors). The 'industry-standard' correlations are pairwise correlations, i.e., correlations between two variables. Multivariate correlations are correlations between three or more variables. Compared to pairwise correlations, multivariate correlations are more expressive, and in recent years they have repeatedly proven instrumental for understanding and describing the data, extracting key insights, and enabling data-driven innovation.

Existing algorithms for detecting multivariate correlations consider only time-aligned correlations. For instance, in finance, where the data can be time series of stock prices, a time-aligned correlation between two time series (e.g., the closing prices of MSFT and IBM) means that any fluctuation, whether an increase or a decrease, in the first time series is likely to be repeated (almost) immediately in the second time series, and vice versa. In practice, though, many correlations are lagged correlations, i.e., they become visible only once a possible delay between the series is taken into account. In the example above, a fluctuation in the first time series might take a few days to appear in the second time series.
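To make the notion of a lagged pairwise correlation concrete, here is a minimal Java sketch. It assumes Pearson correlation as the measure (the description above does not fix one), equal-length series, and a non-negative lag meaning that the second series follows the first. The class and method names (LaggedCorrelation, laggedPearson, bestLag) are hypothetical and chosen for illustration only; the brute-force scan over lags is a baseline, not an efficient detection algorithm.

```java
final class LaggedCorrelation {

    /** Pearson correlation of x[xFrom .. xFrom+n) and y[yFrom .. yFrom+n). */
    static double pearson(double[] x, int xFrom, double[] y, int yFrom, int n) {
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[xFrom + i];
            meanY += y[yFrom + i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[xFrom + i] - meanX;
            double dy = y[yFrom + i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // The correlation is undefined when either window is constant.
        if (varX == 0.0 || varY == 0.0) {
            return Double.NaN;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /**
     * Correlation between x[t] and y[t + lag], i.e., y is assumed to
     * follow x with a delay of `lag` time steps (assumption: lag >= 0).
     */
    static double laggedPearson(double[] x, double[] y, int lag) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("series must have equal length");
        }
        if (lag < 0 || lag >= x.length - 1) {
            throw new IllegalArgumentException("lag must leave at least two overlapping points");
        }
        int overlap = x.length - lag;
        return pearson(x, 0, y, lag, overlap);
    }

    /** Brute-force scan: the lag in [0, maxLag] with the largest absolute correlation. */
    static int bestLag(double[] x, double[] y, int maxLag) {
        int best = 0;
        double bestAbs = -1.0;
        for (int lag = 0; lag <= maxLag; lag++) {
            double r = laggedPearson(x, y, lag);
            if (!Double.isNaN(r) && Math.abs(r) > bestAbs) {
                bestAbs = Math.abs(r);
                best = lag;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy data: y repeats x two steps later (the first two values of y are filler).
        double[] x = {1, 3, 2, 5, 4, 6, 8, 7, 9, 10};
        double[] y = {0, 0, 1, 3, 2, 5, 4, 6, 8, 7};
        System.out.println("best lag: " + bestLag(x, y, 3));
        System.out.println("correlation at lag 0: " + laggedPearson(x, y, 0));
    }
}
```

The scan costs time proportional to the series length for every candidate lag and every pair of series. Extending this naive approach to three or more series, each with its own lag, multiplies the number of combinations to check, which is why a dedicated method is needed.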

In this thesis: (a) you will study the existing techniques for lagged pairwise correlations, and (b) you will build your own method (possibly by extending or combining the pairwise methods) for lagged multivariate correlations.

Prerequisites: ability to write efficient code in *Java or Scala*, comfort with mathematical proofs, ability to read and understand scientific literature (conference papers and journal articles), successful completion of 2AMD15 with a high grade.

Details

Supervisor: Odysseas Papapetrou