Foundation models have recently demonstrated remarkable capabilities across a wide range of domains by learning from large-scale data and generalizing to novel, unseen tasks without fine-tuning. This generalization is primarily enabled by in-context learning: the ability to extract knowledge from a small labelled context and make accurate predictions on new inputs in a single forward pass. Notable examples are TabPFN [1] in the tabular domain and CAML [2] and CAMeLU [3] in the vision domain. These transformer-based models are trained on a large number of synthetic classification tasks and are capable of few-shot inference without gradient updates. TabPFN, in particular, demonstrates the power of this approach: after pretraining on millions of synthetic tabular datasets, it achieves high performance across diverse tasks, including tabular classification and time-series forecasting [4]. The resulting model is capable of zero-shot prediction on small datasets with high accuracy and computational efficiency.
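To illustrate the in-context interface such models expose, here is a minimal numpy sketch. A nearest-class-centroid rule is a deliberately simple stand-in for the learned transformer forward pass of a model like TabPFN; the point is only the interface: labelled context in, query predictions out, no gradient updates.

```python
import numpy as np

def in_context_predict(X_ctx, y_ctx, X_query):
    """Predict labels for X_query from a small labelled context in one pass.

    A nearest-class-centroid rule stands in for the single forward pass of a
    pretrained in-context learner such as TabPFN; no parameters are updated.
    """
    classes = np.unique(y_ctx)
    # One centroid per class, computed from the context set only.
    centroids = np.stack([X_ctx[y_ctx == c].mean(axis=0) for c in classes])
    # Assign each query point to the class of its nearest centroid.
    dists = np.linalg.norm(X_query[:, None, :] - centroids[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

# Tiny synthetic two-class task: a context of 8 points and 2 queries.
rng = np.random.default_rng(0)
X_ctx = np.concatenate([rng.normal(0, 0.1, (4, 2)), rng.normal(3, 0.1, (4, 2))])
y_ctx = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_query = np.array([[0.0, 0.0], [3.0, 3.0]])
print(in_context_predict(X_ctx, y_ctx, X_query))  # → [0 1]
```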
This project explores extending TabPFN’s capabilities beyond the tabular domain to visual tasks. The goal is to investigate whether a foundation model like TabPFN, originally trained on synthetic tabular data, can be adapted, or even directly transferred, to image classification. To support this, the project will incorporate techniques from context-aware meta-learning approaches such as CAML [2] and CAMeLU [3] to generate synthetic visual tasks. Both supervised and unsupervised strategies will be considered, including the development of a vision-specific variant of TabPFN and an evaluation of cross-domain transferability from tabular to visual data.
The main objectives of this project are:
- Model design: Adapt the TabPFN architecture to accept visual inputs, potentially by integrating a visual backbone (e.g., ResNet-18 or CLIP).
- Implementation: Develop both supervised and unsupervised training pipelines for vision tasks using a foundation model trained on tabular data.
- Evaluation: Test the model’s performance on few-shot learning benchmarks. Assess its ability to generalize to vision tasks and analyze the strengths and limitations of this cross-domain application.
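The model-design objective could take a simple form: a frozen visual backbone maps each image to a fixed-size embedding, and the resulting vectors are treated as rows of a tabular dataset for a TabPFN-style model. A minimal numpy sketch, where a fixed random projection is a hypothetical stand-in for the backbone (e.g., ResNet-18 or CLIP features):

```python
import numpy as np

def make_embedder(input_dim, embed_dim, rng):
    """Return a function mapping image batches to tabular feature rows.

    A fixed random projection stands in for a frozen visual backbone;
    only the output shape (n_images, embed_dim) matters for this sketch.
    """
    proj = rng.normal(size=(input_dim, embed_dim)) / np.sqrt(input_dim)
    return lambda images: images.reshape(len(images), -1) @ proj

rng = np.random.default_rng(0)
# Hypothetical 32x32 RGB images embedded into 64 dimensions.
embed = make_embedder(32 * 32 * 3, 64, rng)
support = rng.random((10, 32, 32, 3))  # labelled support images
queries = rng.random((5, 32, 32, 3))   # unlabelled query images
X_support = embed(support)  # shape (10, 64): rows of a "tabular" dataset
X_query = embed(queries)    # shape (5, 64)
print(X_support.shape, X_query.shape)
```

In the actual project, `X_support` with its labels would form the context and `X_query` the test inputs for TabPFN's single forward pass; both sets must pass through the same frozen backbone so their features live in a shared space.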
References:
[1] Hollmann, Noah, et al. "Accurate predictions on small data with a tabular foundation model." Nature 637.8045 (2025): 319-326.
[2] Fifty, Christopher, et al. "Context-Aware Meta-Learning." The Twelfth International Conference on Learning Representations. 2024.
[3] Vettoruzzo, Anna, et al. "Unsupervised meta-learning via in-context learning." The Thirteenth International Conference on Learning Representations. 2025.
[4] Hoo, Shi Bin, et al. "The tabular foundation model TabPFN outperforms specialized time series forecasting models based on simple features." NeurIPS Workshop on Time Series in the Age of Large Models. 2024.
Joaquin Vanschoren
Anna Vettoruzzo