
Project: Semantic Hashing of the Epstein Files

Description

Large investigative document releases often contain tens of thousands of heterogeneous files: transcripts, motions, scanned exhibits, emails, duplicates, partially redacted documents, and large amounts of procedural boilerplate. In practice, sheer volume can become a filtering mechanism. When everything is available, nothing is easily accessible. Redundancy, noise, and irrelevant material make it extremely difficult to identify the important information.

This project uses the publicly available Epstein files to build data structures for efficient information retrieval. The goal is to construct a representation of the corpus in which semantic similarity is made explicit, so that documents discussing similar topics lie close to each other. Instead of relying purely on keyword overlap, the project aims to capture topical relatedness, so that the files located near a document provide context for understanding it.

The resulting structure should enable a visual map of the corpus, where coherent topical regions emerge and browsing becomes topic-driven rather than keyword-driven. A central challenge is separating signal from procedural noise and redundant material, so that the learned structure provides a clear advantage over the existing keyword-based search interface.
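One simple way to approach the redundancy part of this challenge is to filter near-duplicates before any structure is learned. The following is a minimal sketch, not the project's prescribed method: it assumes word-level shingles and Jaccard similarity, and the function names, the shingle size `k`, and the `threshold` value are hypothetical choices made for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle; empty texts yield no shingles.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Return indices of documents kept after greedy near-duplicate removal.

    A document is dropped if its shingle set is at least `threshold`
    similar to an already kept document. The threshold is an assumed value.
    """
    kept = []
    kept_shingles = []
    for i, doc in enumerate(docs):
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(i)
            kept_shingles.append(s)
    return kept


# Made-up example texts, not taken from the corpus.
example_docs = [
    "The court hereby orders the motion sealed.",
    "The court hereby orders the motion sealed.",
    "Flight logs list several passengers on the route.",
]
print(drop_near_duplicates(example_docs))
```

This pairwise comparison is quadratic in the number of documents, so for a corpus of tens of thousands of files it would likely need an approximate scheme on top; procedural boilerplate that is similar but not duplicated would also need separate handling.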

Technically, the project evaluates semantic hashing as a mechanism for constructing such a structure. Semantic hashing maps documents to compact binary codes that preserve semantic similarity. A central component of the project is the design and analysis of the similarity metric itself. The focus will be exclusively on text documents, which already pose a substantial challenge due to the wide diversity in writing style, tone, and content across numerous emails, reports, and legal filings.
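To make the idea of binary codes concrete, here is a minimal sketch under stated assumptions. It does not implement a learned semantic hashing model; as a stand-in it uses random-hyperplane hashing over TF-IDF vectors, which maps each document to a fixed-length bit string and compares documents by Hamming distance. The function names, the 32-bit code length, and the TF-IDF weighting are illustrative choices, not part of the project description.

```python
import math
import random
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build dense TF-IDF vectors (smoothed IDF) over the corpus vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    index = {t: i for i, t in enumerate(vocab)}
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for term, count in Counter(toks).items():
            idf = math.log((1 + n) / (1 + df[term])) + 1.0
            vec[index[term]] = count * idf
        vectors.append(vec)
    return vectors


def binary_codes(vectors, n_bits=32, seed=0):
    """Hash each vector to an n_bits integer code via random hyperplanes.

    Each bit records on which side of a random hyperplane the vector lies,
    so vectors with a small angle between them tend to share more bits.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    codes = []
    for vec in vectors:
        code = 0
        for plane in planes:
            dot = sum(p * v for p, v in zip(plane, vec))
            code = (code << 1) | (1 if dot >= 0 else 0)
        codes.append(code)
    return codes


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


# Made-up example texts, not taken from the corpus.
sample = [
    "motion to seal the deposition transcript",
    "motion to unseal the deposition transcript",
    "email about travel arrangements and flights",
]
codes = binary_codes(tfidf_vectors(sample))
for i in range(len(sample)):
    print([hamming(codes[i], c) for c in codes])
```

A learned approach would replace the random hyperplanes with a trained encoder, and the choice of input representation here (TF-IDF) is exactly the kind of similarity-metric decision the project is meant to analyze.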

Beyond the methodological contribution, the practical goal is to provide a prototype search and exploration tool that could, for example, support investigative journalists in navigating large document dumps more effectively.

Important note: The Epstein files contain disturbing and sensitive material. This project requires emotional resilience and the willingness to manually inspect difficult content. Please consider carefully whether you are comfortable working with such material before choosing this topic.

Details
Supervisor
Sibylle Hess