Minwise sampling (or MinHash) is a family of methods that estimate the similarity between sets, typically their Jaccard similarity, from compact sketches. Most of these methods assume static data. A new method, designed last year in our group, also works with non-static (i.e., streaming) data and supports deletions.
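For readers new to the topic, the classic static variant can be sketched in a few lines of Python. This is textbook MinHash, not the group's dynamic method: it has no support for deletions, and all names (`make_hash_params`, `minhash_signature`, `estimate_jaccard`) as well as the choice of 256 hash functions are illustrative assumptions.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod _PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME))
            for _ in range(num_hashes)]


def _base_hash(item):
    """Stable 64-bit hash of an item (independent of Python's hash seed)."""
    digest = hashlib.blake2b(repr(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, params):
    """Return the MinHash signature: one minimum per hash function."""
    base = [_base_hash(x) for x in items]
    if not base:
        raise ValueError("cannot sketch an empty set")
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(256, seed=42)
    set_a = set(range(0, 1000))
    set_b = set(range(500, 1500))  # true Jaccard similarity is 500 / 1500
    sig_a = minhash_signature(set_a, params)
    sig_b = minhash_signature(set_b, params)
    print(estimate_jaccard(sig_a, sig_b))
```

The printed estimate should be close to the true Jaccard similarity of 1/3, with an error that shrinks as the number of hash functions grows. Note that each signature entry is a minimum over the whole set, so removing an element may invalidate it; this is the kind of limitation that a deletion-capable method has to address.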
This thesis will explore applications where such a dynamic similarity algorithm is valuable (e.g., evolving networks, streaming data, content moderation), adapt the method to the task at hand, and benchmark it against current state-of-the-art alternatives. The project is ideal for students interested in scalable algorithms, experimental evaluation, and bridging theory with impactful applications.
Odysseas Papapetrou