Training ML models over big data is a time-consuming and energy-hungry process. Furthermore, it requires full access to the data, which is challenging in many use cases due to the size of the data. The problem is particularly challenging when the data arrives from (potentially unbounded) streams. To alleviate this high complexity, several approaches have been considered in the past, including random sampling and dimensionality reduction techniques such as random projections, e.g., [1,2,3,4]. These approaches aim to reduce the size of the input data (either vertically, by reducing the number of training samples, or horizontally, by reducing the number of dimensions) to make both storage and training more efficient.
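As a rough illustration of the dimensionality reduction idea (not part of the original description), the following minimal Python sketch compresses d-dimensional samples into k << d dimensions with a Gaussian random projection; scaling the projection entries by 1/sqrt(k) preserves squared Euclidean distances in expectation, in the Johnson-Lindenstrauss spirit. The data and sizes here are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: n samples in d dimensions (placeholder values).
    n, d, k = 1_000, 500, 32
    X = rng.standard_normal((n, d))

    # Gaussian random projection: entries ~ N(0, 1/k), so that
    # E[||P^T x - P^T y||^2] = ||x - y||^2 for any pair x, y.
    P = rng.standard_normal((d, k)) / np.sqrt(k)

    # The n x k matrix X_small is the compact representation that
    # would be stored and used for training in place of X.
    X_small = X @ P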
In this thesis you will build on this line of work by further examining the tradeoff between space complexity (i.e., how the training data is represented) and accuracy. You are expected to investigate the use of different sketches, and different sketch configurations, for training different models. The goal is to understand which types of sketches can help with each type of model.
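A hypothetical experiment of the kind envisioned here, again only a sketch under assumed choices (synthetic data, random projections as the sketch, logistic regression as the model), could sweep the sketch size k and record the resulting accuracy:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.random_projection import GaussianRandomProjection

    # Placeholder data standing in for a real (streaming) dataset.
    X, y = make_classification(n_samples=5_000, n_features=500, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Sweep the sketch size k; the last setting is the full-data baseline.
    for k in (8, 32, 128, 500):
        if k < X.shape[1]:
            rp = GaussianRandomProjection(n_components=k, random_state=0)
            Z_tr, Z_te = rp.fit_transform(X_tr), rp.transform(X_te)
        else:
            Z_tr, Z_te = X_tr, X_te  # no sketching
        acc = LogisticRegression(max_iter=1_000).fit(Z_tr, y_tr).score(Z_te, y_te)
        print(f"k={k:4d}  accuracy={acc:.3f}")

The same loop structure generalizes to other sketches and other model families; the thesis would replace these placeholder choices with the sketches and models under study.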
[1] http://archive.dimacs.rutgers.edu/Research/MMS/PAPERS/rp.pdf
[2] https://ieeexplore.ieee.org/document/7796904
[3] https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10494875
[4] http://proceedings.mlr.press/v99/calandriello19a/calandriello19a.pdf