Project: Towards Meaningful Metrics for Monosemanticity and Disentanglement

Description

Recent work in representation learning—especially in interpretability research—frequently refers to monosemanticity: the idea that individual units (neurons, features, or directions in representation space) correspond to a single, well-defined concept. Closely related is the notion of disentanglement, where different latent dimensions are expected to capture independent factors of variation.

However, existing metrics for disentanglement and monosemanticity often measure proxy quantities such as statistical independence, sparsity, predictability, or feature separability. While useful, these metrics do not directly capture what researchers informally mean by “one concept per feature.” In practice, they evaluate structural properties of representations rather than semantic coherence.

This thesis aims to critically analyze existing monosemanticity and disentanglement metrics and investigate whether they meaningfully capture the notion they claim to measure.

The project will:

  • Systematically review existing metrics (e.g., independence-based, mutual-information-based, classifier-based scores).
  • Analyze the assumptions these metrics implicitly make about concepts and semantics.
  • Construct counterexamples where current metrics score highly despite clearly polysemantic features.
  • Investigate formal desiderata that a monosemanticity metric should satisfy.
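To make the counterexample idea concrete, here is one way such a case could look, sketched in Python with NumPy. The sparsity proxy (fraction of inputs on which a feature is inactive), the two toy "concepts", and the 5% activation rates are all illustrative assumptions, not part of the project itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two unrelated binary "concepts", each present in 5% of inputs.
cat = rng.random(n) < 0.05
car = rng.random(n) < 0.05

# A polysemantic feature: fires whenever EITHER concept is present.
feature = (cat | car).astype(float)

# A common sparsity proxy: fraction of inputs on which the feature is inactive.
sparsity = np.mean(feature == 0.0)
print(f"sparsity = {sparsity:.2f}")  # high (around 0.9) despite mixing two concepts
```

The feature is clearly polysemantic, yet its sparsity score is nearly as high as that of a feature dedicated to a single rare concept, illustrating why sparsity alone cannot certify "one concept per feature".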

The central research question is:

What would it mean, formally, for a feature to be monosemantic, and how could such a property be measured?


Depending on scope and progress, the thesis may also propose and evaluate improved metrics grounded in clearer theoretical criteria. Possible directions include axiomatic properties for monosemanticity scores, robustness under basis transformations, and invariance to reparameterizations of the representation.
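The reparameterization issue can be illustrated with a small NumPy sketch. The axis-alignment score below (largest absolute correlation between each ground-truth factor and any single code dimension) is a deliberately simple stand-in for classifier- or correlation-based disentanglement metrics; a 45-degree rotation is an invertible linear map that preserves all information in the code but changes the score:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Two independent ground-truth factors; the "perfect" code copies them.
factors = rng.normal(size=(n, 2))
code = factors.copy()

def axis_alignment(code, factors):
    """Toy disentanglement proxy: for each factor, the largest absolute
    correlation with any single code dimension (1.0 = perfectly aligned)."""
    corr = np.corrcoef(code.T, factors.T)[:2, 2:]  # code dims x factors
    return np.abs(corr).max(axis=0).mean()

# An invertible 45-degree rotation of the code: no information is lost.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

aligned = axis_alignment(code, factors)
rotated = axis_alignment(code @ R.T, factors)
print(f"aligned: {aligned:.2f}, rotated: {rotated:.2f}")
```

The score drops from about 1.0 to about 0.71 under the rotation even though the two codes are informationally equivalent, which is exactly the kind of behavior an invariance desideratum would rule in or out by design.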

This project is conceptually demanding and requires mathematical maturity. It is particularly suited for students with a strong interest in representation learning, interpretability, and the foundations of machine learning.

Details
Supervisor: Sibylle Hess
Secondary supervisor: Surja Chaudhuri