With the recent success of LLMs and the strong potential of multi-modal learning from text and vision, several works have framed images as sequences so that they conform to generative sequence-to-sequence encoder-decoder or decoder-based transformers [1]. Such formulations offer advantages such as a unified architecture and a unified inference procedure across several vision and language tasks [7].
In self-supervised text representation learning, text is naturally represented as a sequence, and conditioning on already predicted words to predict the next word makes intuitive sense. Similarly, self-supervised image representation learning with sequence-to-sequence models requires predicting image patch tokens in a sequence, and patches can simply be sequenced by their grid positions. However, unlike text, image tokens have no inherent order: while each token is related to every other token in the image, the sequence can start at any grid position, and a patch need not be predicted immediately before or after an adjacent patch.
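As a concrete sketch of this, the snippet below (plain NumPy; the 224x224 image and 16x16 patch size are arbitrary illustrative choices) splits an image into grid patches, flattens them into a raster-order sequence, and then applies one of the many equally valid permutations of that sequence.

    import numpy as np

    def patchify(image: np.ndarray, patch: int) -> np.ndarray:
        """Split an (H, W, C) image into non-overlapping patch tiles,
        returned in raster (row-major) order as an (N, patch*patch*C) sequence."""
        H, W, C = image.shape
        assert H % patch == 0 and W % patch == 0
        tiles = image.reshape(H // patch, patch, W // patch, patch, C)
        tiles = tiles.transpose(0, 2, 1, 3, 4)       # (H/p, W/p, p, p, C)
        return tiles.reshape(-1, patch * patch * C)  # raster-order sequence

    rng = np.random.default_rng(0)
    img = rng.random((224, 224, 3))
    tokens = patchify(img, 16)                       # (196, 768)

    # Raster order is only one choice: any permutation of the same patch
    # tokens is an equally valid "sentence" over the image.
    order = rng.permutation(len(tokens))
    permuted_tokens = tokens[order]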
When looking at an image, humans likely do not scan row by row from the top left to the bottom right. Instead, we tend to focus first on the object of interest [6]. In human vision, foveation directs visual attention to points in the scene called fixation points, reached through rapid eye movements called saccades. A method that draws inspiration from foveation to determine the decoding order of image tokens could be interesting to explore.
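Purely as a hypothetical illustration of such an ordering (not a method from the cited works), the sketch below orders patch indices by their distance from a chosen fixation point on the patch grid, so that decoding starts at the "fovea" and moves outward; the grid size and fixation point are arbitrary.

    import numpy as np

    def fixation_order(grid_h: int, grid_w: int, fy: int, fx: int) -> np.ndarray:
        """Return flat patch indices sorted by squared distance from a
        hypothetical fixation point (fy, fx) on the patch grid."""
        ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
        dist = (ys - fy) ** 2 + (xs - fx) ** 2
        return np.argsort(dist.ravel())  # fixation point first, periphery last

    order = fixation_order(14, 14, fy=6, fx=9)  # e.g. fixate near an object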
Recent works explore autoregressively predicting permutations of image patches [4, 2, 6, 5]. Qi et al. [4] sample a random permutation order that determines the autoregressive prediction sequence and use the reconstruction loss over all patches as the learning objective. Hua et al. [2] group patches into segments: patches within a segment are predicted in parallel, while the segments themselves are predicted in a random order. Song et al. [6] aim to predict semantically rich patches first, followed by patches with low semantic information.
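The following PyTorch sketch illustrates a permuted autoregressive objective loosely in the spirit of the random-permutation setup of [4]; the architecture, sizes, and loss are illustrative simplifications rather than the authors' implementation, and in particular the positional information telling the model which patch to predict next (essential in the actual methods) is omitted for brevity.

    import torch
    import torch.nn as nn

    num_patches, dim = 196, 256
    encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
        num_layers=4,
    )
    to_pixels = nn.Linear(dim, 16 * 16 * 3)  # reconstruct raw patch pixels

    tokens = torch.randn(1, num_patches, dim)          # embedded patches
    target = torch.randn(1, num_patches, 16 * 16 * 3)  # raw patch pixels

    perm = torch.randperm(num_patches)
    # Causal mask in the *permuted* order: position t may attend only to
    # itself and earlier positions of the permuted sequence.
    mask = torch.triu(torch.ones(num_patches, num_patches, dtype=torch.bool),
                      diagonal=1)
    hidden = encoder(tokens[:, perm], mask=mask)

    # Next-patch prediction: output at position t reconstructs patch t+1
    # of the permuted order; reconstruction loss over all predicted patches.
    pred = to_pixels(hidden[:, :-1])
    loss = nn.functional.mse_loss(pred, target[:, perm][:, 1:])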
Other works explore non-autoregressive transformers (NATs), which offer the advantage of reduced inference time [3]. These works could serve as a useful point of comparison against autoregressive inference (relevant works directly addressing image representation learning, if any, need to be identified).
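To make the contrast with autoregressive decoding concrete, below is a minimal sketch of MaskGIT-style iterative parallel decoding, the family of NATs revisited in [3]: all positions start masked, and the most confident predictions are committed over a few parallel steps instead of one position per step. The random predictor stands in for a trained model, and all sizes are illustrative.

    import torch

    def nat_decode(predict, num_tokens: int, steps: int = 4) -> torch.Tensor:
        """Iterative parallel decoding: start fully masked, re-predict all
        positions each step, and commit the most confident predictions."""
        MASK = -1
        tokens = torch.full((num_tokens,), MASK)
        for step in range(steps):
            logits = predict(tokens)                  # (num_tokens, vocab)
            conf, cand = logits.softmax(-1).max(-1)
            conf[tokens != MASK] = float("inf")       # never revisit committed
            k = (num_tokens * (step + 1)) // steps    # commit k most confident
            keep = conf.topk(k).indices
            tokens[keep] = torch.where(tokens[keep] == MASK,
                                       cand[keep], tokens[keep])
        return tokens

    # Toy predictor with random logits, standing in for a trained NAT:
    # 196 tokens are produced in 4 parallel steps rather than 196 serial ones.
    out = nat_decode(lambda t: torch.randn(t.shape[0], 1024), num_tokens=196)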
The thesis work can center on autoregressive prediction of unordered/permuted image patches for representation learning and/or other methods. To formulate changes or improvements to existing works, the literature on how the human brain decodes images could serve as a source of inspiration. Additionally, the thesis could study multi-modal image and text autoregressive representation learning.
Primary contact - Naresh Kumar Gurulingan (n.k.gurulingan@tue.nl)
References:
[1] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
[2] Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, and Leonid Sigal. Self-supervision through random segments with autoregressive coding (RandSAC). arXiv preprint arXiv:2203.12054, 2022.
[3] Zanlin Ni, Yulin Wang, Renping Zhou, Jiayi Guo, Jinyi Hu, Zhiyuan Liu, Shiji Song, Yuan Yao, and Gao Huang. Revisiting non-autoregressive transformers for efficient image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7007–7016, 2024.
[4] Yu Qi, Fan Yang, Yousong Zhu, Yufei Liu, Liwei Wu, Rui Zhao, and Wei Li. Exploring stochastic autoregressive image modeling for visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 2074–2081, 2023.
[5] Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, and Cihang Xie. Rejuvenating image-GPT as strong visual representation learners. In Forty-first International Conference on Machine Learning, 2024.
[6] Kaiyou Song, Shan Zhang, and Tong Wang. Semantic-aware autoregressive image modeling for visual representation learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4925–4933, 2024.
[7] Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, and Liwei Wang. GiT: Towards generalist vision transformer through universal language interface. arXiv preprint arXiv:2403.09394, 2024.