Recent work in representation learning—especially in interpretability research—frequently refers to monosemanticity: the idea that individual units (neurons, features, or directions in representation space) correspond to a single, well-defined concept. Closely related is the notion of disentanglement, where different latent dimensions are expected to capture independent factors of variation.
However, existing metrics for disentanglement and monosemanticity often measure proxy quantities such as statistical independence, sparsity, predictability, or feature separability. While useful, these metrics do not directly capture what researchers informally mean by “one concept per feature.” In practice, they evaluate structural properties of representations rather than semantic coherence.
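To make the contrast concrete, the following minimal Python sketch computes two such proxy scores, per-unit lifetime sparsity and class selectivity. The function names, the toy interface (an `(n_samples, n_units)` activation matrix plus per-sample labels), and the use of class labels as a stand-in for "concepts" are illustrative assumptions of this sketch, not definitions taken from a specific paper.

```python
import numpy as np

def lifetime_sparsity(acts: np.ndarray) -> np.ndarray:
    """Per-unit Hoyer sparsity across samples, in [0, 1] (1 = maximally sparse).

    acts: (n_samples, n_units) activation matrix.
    """
    n = acts.shape[0]
    l1 = np.abs(acts).sum(axis=0)
    l2 = np.sqrt((acts ** 2).sum(axis=0)) + 1e-12
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def class_selectivity(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per-unit selectivity index (mu_max - mu_rest) / (mu_max + mu_rest).

    mu_max is a unit's mean activation on its preferred class; mu_rest is
    its mean activation averaged over the remaining classes. Assumes
    non-negative (e.g., post-ReLU) activations and at least two classes.
    """
    classes = np.unique(labels)
    # (n_classes, n_units) matrix of class-conditional mean activations.
    means = np.stack([acts[labels == c].mean(axis=0) for c in classes])
    mu_max = means.max(axis=0)
    mu_rest = (means.sum(axis=0) - mu_max) / (len(classes) - 1)
    return (mu_max - mu_rest) / (mu_max + mu_rest + 1e-12)
```

Both scores reward structural properties (sparse or peaked responses). Neither can tell a unit that encodes one coherent concept apart from a unit that responds sparsely to an arbitrary mixture of stimuli, which is exactly the gap between structural proxies and semantic coherence noted above.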
This thesis aims to critically analyze existing monosemanticity and disentanglement metrics and to investigate whether they meaningfully capture the notion they claim to measure. The project will survey the metrics proposed in the literature, examine the structural proxies they rely on, and assess how well those proxies align with the informal notion of "one concept per feature."
The central research question is:
What would it mean, formally, for a feature to be monosemantic, and how could such a property be measured?
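One illustrative way to begin answering this, sketched here as an assumption rather than an established definition, is in terms of mutual information with a hypothesized set of atomic concepts C; both C and the tolerance τ below are modelling choices:

```latex
% One candidate formalization (illustrative, not from the literature):
% a feature f_i is (C, tau)-monosemantic with respect to a hypothesized
% concept set C if it carries information about exactly one concept.
\[
\exists\, c \in C :\quad
I(f_i;\, c) > \tau
\quad\text{and}\quad
\max_{c' \in C \setminus \{c\}} I(f_i;\, c') \le \tau ,
\]
% where I(.;.) denotes mutual information and tau > 0 is a tolerance.
```

Even this sketch surfaces the core difficulties: the definition is only as meaningful as the choice of C, and mutual information must itself be estimated from data.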
Depending on scope and progress, the thesis may also propose and evaluate improved metrics grounded in clearer theoretical criteria. Possible directions include specifying axiomatic properties that a monosemanticity score should satisfy, such as robustness under basis transformations or invariance to reparameterizations of the representation.
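The basis-transformation direction can be illustrated with a small toy experiment (a sketch under hypothetical assumptions; the data are synthetic). An orthogonal rotation of the representation is invertible and preserves all information, yet a per-unit sparsity score changes substantially:

```python
import numpy as np

rng = np.random.default_rng(0)

def lifetime_sparsity(acts: np.ndarray) -> np.ndarray:
    """Per-unit Hoyer sparsity across samples (same score as in the sketch above)."""
    n = acts.shape[0]
    l1 = np.abs(acts).sum(axis=0)
    l2 = np.sqrt((acts ** 2).sum(axis=0)) + 1e-12
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

# Hypothetical ideal case: 10 units, each firing (non-negatively) for
# exactly one independent, sparsely active latent factor.
acts = np.abs(rng.normal(size=(1000, 10))) * rng.binomial(1, 0.1, size=(1000, 10))

# Random orthogonal rotation of the basis: an invertible reparameterization
# that preserves all information but mixes the units together.
q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
rotated = acts @ q

print("mean unit sparsity, original basis:", lifetime_sparsity(acts).mean().round(2))
print("mean unit sparsity, rotated basis: ", lifetime_sparsity(rotated).mean().round(2))
```

Whether a monosemanticity score should be sensitive to such reparameterizations, because monosemanticity is arguably a property of a particular basis of units, or invariant to them, is exactly the kind of commitment an axiomatic treatment would force one to make explicit.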
This project is conceptually demanding and requires mathematical maturity. It is particularly suited for students with a strong interest in representation learning, interpretability, and the foundations of machine learning.
Sibylle Hess
Surja Chaudhuri