Data and AI cluster

Project: Exploring Prompt Level Distillation as an alternative to fine-tuning for Multi-Modal Scientific Question Answering

Description

Large Language Models (LLMs) use chain-of-thought reasoning to work through complex inputs and patterns. The largest models, in terms of parameter count, excel at reasoning tasks and at generating detailed chain-of-thought traces, but their smaller variants often struggle on complex tasks. Using the much larger models is not always feasible. They require a lot of compute, which is not available to everyone, so they have to be accessed through third-party inference providers or the model makers themselves. This may not be a problem for public-facing products such as chat interfaces (e.g. ChatGPT, Claude), but workflows with strict data security and privacy requirements cannot use such setups. Larger models, being generalists, may also be weak at niche or domain-specific tasks, especially if the task was not well represented in the model's training data. One example is question answering in the medical domain, where factuality and correctness are strict requirements.

One way to adapt smaller LLMs to a specific domain is to fine-tune them on domain-specific data. However, fine-tuning is compute intensive and requires adequate data to produce tangible results. Many domains only have small, well-defined datasets. These are easy for humans to comprehend, but a model fine-tuned on them may not generalise well, because it fails to capture the patterns and nuances in so little data. As an alternative to fine-tuning, we can use knowledge distillation [1], where a larger teacher model teaches a smaller student model the patterns in complex data. Even then, the smaller model still has to be trained, so the compute problem remains. This problem becomes even more acute when moving from text-only domains to multi-modal domains using Vision and Language Models (VLMs).

This project aims to explore knowledge distillation on VLMs from a different perspective. Larger models are remarkably adept at information extraction and structuring data. In the language domain, prompt level distillation [2] has shown promising results: it bypasses fine-tuning and training altogether and instead uses a teacher model to write optimal prompts that a student model then uses to reason over complex data. We would like to build on the methodology in that paper and explore how effective prompt level distillation is for multi-modal tasks. Specifically, we want to evaluate prompt level distillation on multi-modal scientific question answering using two datasets: ScienceQA [3] and SPIQA [4]. Ultimately, we would like to do the following:

1. Explore the efficacy of prompt level distillation on multi-modal scientific question answering.

2. If the experiments give positive indications that prompt level distillation works for multi-modal scientific question answering, build an open source framework around it so that others can build their solutions on top of it. This step is optional: writing usable, properly documented software takes time, and it is only feasible if the experiments yield positive results.
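
To make the idea concrete, the following is a minimal sketch of how a prompt level distillation experiment could be organised. It is not the procedure from [2]; it only illustrates the general idea described above, where a teacher model writes instructions and a student model is evaluated with and without them. All names (Model, Example, distill_prompt, accuracy, toy_teacher, toy_student) are hypothetical, the models are stand-in functions rather than real VLMs, and for simplicity only text is sent to the teacher.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical model interface: (prompt text, optional image path) -> answer text.
Model = Callable[[str, Optional[str]], str]


@dataclass
class Example:
    question: str
    choices: List[str]
    answer: str                   # gold answer text
    image: Optional[str] = None   # path to an associated figure, if any


def format_question(ex: Example) -> str:
    options = "\n".join(f"- {c}" for c in ex.choices)
    return f"Question: {ex.question}\nOptions:\n{options}"


def distill_prompt(teacher: Model, train: List[Example]) -> str:
    """Ask the teacher to turn labelled examples into reusable instructions."""
    solved = "\n\n".join(
        f"{format_question(ex)}\nCorrect answer: {ex.answer}" for ex in train
    )
    request = (
        "Study the solved examples below and write concise, general "
        "instructions that would help a smaller model answer similar "
        "questions.\n\n" + solved
    )
    # Simplification (assumption): images are not passed to the teacher here.
    return teacher(request, None)


def accuracy(student: Model, test: List[Example], instructions: str = "") -> float:
    """Exact-match accuracy of the student, optionally with distilled instructions."""
    if not test:
        return 0.0
    correct = 0
    for ex in test:
        prompt = f"{instructions}\n\n{format_question(ex)}".strip()
        prediction = student(prompt, ex.image)
        correct += prediction.strip().lower() == ex.answer.strip().lower()
    return correct / len(test)


# Toy stand-ins so the sketch runs without any model; real VLM calls would go here.
def toy_teacher(prompt: str, image: Optional[str]) -> str:
    return "Hint: pick the last option."


def toy_student(prompt: str, image: Optional[str]) -> str:
    options = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    # The toy student only does well when the hint is present in its prompt.
    return options[-1] if "pick the last option" in prompt else options[0]


train = [Example("Which is a planet?", ["Sirius", "Mars"], "Mars")]
test = [
    Example("Which is a mammal?", ["trout", "whale"], "whale"),
    Example("Which is a metal?", ["oxygen", "iron"], "iron"),
]

instructions = distill_prompt(toy_teacher, train)
print("baseline :", accuracy(toy_student, test))
print("distilled:", accuracy(toy_student, test, instructions))
```

Keeping the teacher and student behind the same callable interface means the comparison between the baseline and the distilled prompt stays identical when the toy functions are swapped for real models. No parameter of the student is updated at any point.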

References:

[1] Hinton, Geoffrey E., Oriol Vinyals and Jeffrey Dean. “Distilling the Knowledge in a Neural Network.” ArXiv abs/1503.02531 (2015).
[2] Badhe, Sanket, and Deep Shah. “Prompt-Level Distillation: A Non-Parametric Alternative to Model Fine-Tuning for Efficient Reasoning.” arXiv:2602.21103. Version 1. Preprint, arXiv, February 24, 2026. https://doi.org/10.48550/arXiv.2602.21103.
[3] Lu, Pan, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark and A. Kalyan. “Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering.” ArXiv abs/2209.09513 (2022).
[4] Pramanick, Shraman, Rama Chellappa and Subhashini Venugopalan. “SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers.” ArXiv abs/2407.09413 (2024).

Details

Supervisor: Joaquin Vanschoren
Secondary supervisor: Shawon Ashraf
Interested?: Get in contact