The field of artificial intelligence has seen unprecedented growth in recent years, particularly with the advent of foundation models and large language models (LLMs). These models have showcased remarkable capabilities across a broad spectrum of applications, including natural language processing and multimodal tasks. Traditionally, vision models have relied solely on visual data for training. However, the human brain’s ability to learn and generalize effectively is inherently multimodal, with language playing a pivotal role in enhancing cognitive processes [1]. This thesis aims to explore ways of leveraging rich language semantics in the training and analysis of vision models [2, 3, 4]. By drawing on the capabilities of LLMs, we hypothesize that we can gain in-depth insights into network behavior and training schemas, and use this knowledge to develop better AI systems. Furthermore, we aim to integrate this understanding into training, leading to models that exhibit improved generalization, robustness, and the ability to learn continually.
We will explore several directions. First, we will study how integrating language information during the training of vision models can enhance their performance [5, 6]. Specifically, we will examine the impact of using LLMs to infuse semantic richness into the visual learning process, for instance by aligning visual features with corresponding language descriptions or captions (a minimal sketch is given below), or by exploiting the similarity of samples in the language domain to guide training. The goal is to obtain networks that are more context-aware and capable of deeper semantic understanding. Second, we will investigate whether the integration of language semantics can make networks more robust against adversarial attacks and shortcut learning. The hypothesis is that grounding visual features in language provides additional context and redundancy, which helps the model maintain high performance even when faced with challenging or misleading inputs. Third, beyond enhancing the training process, LLMs can be leveraged to gain deeper insights into how models function. By using LLMs to analyze training dynamics and network behavior, we can uncover patterns and dependencies that are not easily observable with traditional methods; this understanding can in turn inform the design of new network architectures and training paradigms.
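To make the first direction concrete, below is a minimal sketch of a CLIP-style contrastive alignment loss [2] between a batch of image embeddings and the embeddings of their paired captions. This is an illustration rather than the proposed method: the function name, embedding dimension, batch size, and temperature value are assumptions made for the example, and in practice the text embeddings would come from a pretrained language model while the image embeddings come from the vision backbone being trained.

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss that pulls each image embedding toward the
    embedding of its paired caption and pushes it away from other captions."""
    # Normalize so that dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity between every image and every caption in the batch.
    logits = image_features @ text_features.t() / temperature

    # The matching caption for image i sits at index i of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage: a batch of 8 image and caption embeddings of dimension 512.
image_features = torch.randn(8, 512)   # e.g. output of the vision backbone
text_features = torch.randn(8, 512)    # e.g. output of a frozen text encoder
loss = clip_style_alignment_loss(image_features, text_features)
print(loss.item())
```

The symmetric cross-entropy over the image-caption similarity matrix is what encourages the vision backbone to organize its feature space according to language semantics rather than purely visual statistics, which is the kind of grounding this thesis sets out to study.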
By the end of this thesis, we aim to provide comprehensive insights into the benefits and challenges of leveraging LLMs both to understand the inner workings of networks and to design new training paradigms. This research could pave the way for more versatile and resilient AI models that better mimic the human ability to learn and generalize.
Primary contact: Shruthi Gowda (s.gowda@tue.nl)
References
[1] Anirudh Goyal and Yoshua Bengio. Inductive biases for deep learning of higher-level cognition. Proceedings of the Royal Society A, 478(2266):20210068, 2022.
[2] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[3] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
[4] Karan Desai and Justin Johnson. VirTex: Learning visual representations from textual annotations. arXiv preprint arXiv:2006.06666, 2021.
[5] Mohamed El Banani, Karan Desai, and Justin Johnson. Learning visual representations via language-guided sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19208–19220, 2023.
[6] Jack Merullo, Louis Castricato, Carsten Eickhoff, and Ellie Pavlick. Linearly mapping from image to text space. In The Eleventh International Conference on Learning Representations, 2023.