
Project: Vision-centric image tokenization in the generative transformer era

Description

Generative autoregressive next-token prediction has shown impressive success in large language models (LLMs). Several works attempt to extend this success to vision-language tasks with vision-language models (VLMs). While a VLM can be designed specifically for image-to-text tasks such as visual question answering, many works also represent visual tokens analogously to text tokens to enable tasks that require image generation.
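
To make the setup concrete, the sketch below shows a single next-token prediction training step over a unified sequence in which image token ids share a (offset) vocabulary with text token ids. The vocabulary sizes and the random logits stand-in for a decoder-only transformer are illustrative assumptions, not the setup of any specific paper.

```python
import torch
import torch.nn.functional as F

# Illustrative vocabulary sizes; image ids are offset into a joint vocabulary.
TEXT_VOCAB, IMAGE_VOCAB = 32000, 8192
vocab_size = TEXT_VOCAB + IMAGE_VOCAB

text_ids = torch.randint(0, TEXT_VOCAB, (1, 12))                  # prompt tokens
image_ids = torch.randint(0, IMAGE_VOCAB, (1, 16)) + TEXT_VOCAB   # visual tokens
sequence = torch.cat([text_ids, image_ids], dim=1)                # (1, 28)

# Stand-in for any decoder-only transformer that returns per-position logits
# of shape (batch, length, vocab_size); random here purely for illustration.
logits = torch.randn(1, sequence.size(1), vocab_size)

# Standard shifted cross-entropy: predict token t+1 from tokens up to t.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),
    sequence[:, 1:].reshape(-1))
```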

Text can naturally be seen as a sequence of discrete semantic words, which lends itself to representing text tokens with discrete vocabulary embeddings. To carry this success of discrete vocabularies over to vision, using the learned latent codebook of a trained VQ-VAE has emerged as a tokenization approach [5]. However, learned VQ-VAE codebooks can over-represent low-level details at the expense of semantic information [3], and the reconstruction objective might lead to uninformative features [1]. Beyond these drawbacks, there may simply be no need to represent inherently continuous-valued images as discrete tokens.
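
For reference, VQ-VAE-style tokenization boils down to replacing each continuous encoder latent with the index of its nearest codebook vector. The following is a minimal sketch of that lookup, assuming a pretrained encoder has already produced the latents; the shapes and names are illustrative.

```python
import torch

def vq_tokenize(latents, codebook):
    """Quantize continuous latents to discrete token ids.

    latents:  (N, D) continuous feature vectors from a VQ-VAE encoder.
    codebook: (K, D) learned codebook embeddings.
    Returns (N,) token ids, one codebook index per latent.
    """
    # Pairwise Euclidean distances between latents and codebook entries.
    dists = torch.cdist(latents, codebook)   # (N, K)
    # Each latent becomes the index of its nearest codebook vector.
    return dists.argmin(dim=-1)

# Toy usage: 256 latent vectors of dimension 16, codebook of size 1024.
latents = torch.randn(256, 16)
codebook = torch.randn(1024, 16)
token_ids = vq_tokenize(latents, codebook)   # discrete "visual words"
```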

Recent works have shown that discrete tokens are not a necessity for autoregressive transformers and that images can be represented using continuous-valued tokens. Li et al. [6] show that a continuous-valued latent vector can be used as a condition in a diffusion network to generate an image token; this diffusion network is trained jointly with the autoregressive model. Tschannen et al. [7] propose GIVT and show that the output distribution can be modeled as a Gaussian mixture model instead of a categorical distribution, thereby removing the need for quantization in the latent space. These approaches provide evidence that continuous-valued image tokens are feasible.
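
The second idea can be illustrated with a small output head that predicts a Gaussian mixture over a continuous token instead of a softmax over a discrete vocabulary. This is a sketch of the general GMM-head idea, not the authors' implementation; the module names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class GMMHead(nn.Module):
    """Predicts a K-component Gaussian mixture over a continuous D-dim token,
    replacing the categorical softmax over a discrete vocabulary."""

    def __init__(self, hidden_dim, token_dim, num_components=8):
        super().__init__()
        self.k, self.d = num_components, token_dim
        # Per component: one mixture logit, a mean vector, and a log-std vector.
        self.proj = nn.Linear(hidden_dim, num_components * (1 + 2 * token_dim))

    def forward(self, h):
        p = self.proj(h)                                          # (B, K*(1+2D))
        logits, mu, log_std = p.split(
            [self.k, self.k * self.d, self.k * self.d], dim=-1)
        mu = mu.view(-1, self.k, self.d)
        std = log_std.view(-1, self.k, self.d).exp()
        mix = torch.distributions.Categorical(logits=logits)
        comp = torch.distributions.Independent(
            torch.distributions.Normal(mu, std), 1)
        return torch.distributions.MixtureSameFamily(mix, comp)

# Toy usage: negative log-likelihood of the next continuous image token.
head = GMMHead(hidden_dim=512, token_dim=16)
h = torch.randn(4, 512)        # transformer hidden states
target = torch.randn(4, 16)    # ground-truth continuous image tokens
nll = -head(h).log_prob(target).mean()
```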

In addition to using continuous-valued tokens, the semantic information in each token can be enriched. Jin et al. [4] propose a method where important image tokens are selected and the information from the remaining tokens is merged into them. The number of retained image tokens is dynamic (it changes per image), which could reduce inference cost and potentially improve training. Other works also discuss extracting semantic tokens from images [2, 8].
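
A minimal sketch of this select-and-merge idea follows, under simplifying assumptions: importance scores are assumed to come from some scoring network, and the dynamic token count is reduced to a fixed keep ratio for illustration. It is not the exact algorithm of Jin et al.

```python
import torch
import torch.nn.functional as F

def select_and_merge(tokens, scores, keep_ratio=0.25):
    """Keep the highest-scoring tokens and fold the remaining tokens
    into them via similarity-weighted averaging.

    tokens: (N, D) image token features.
    scores: (N,) importance scores (e.g. from a small scoring network).
    Returns (K, D) merged tokens with K = round(keep_ratio * N).
    """
    n = tokens.size(0)
    k = max(1, int(round(keep_ratio * n)))
    keep_idx = scores.topk(k).indices
    mask = torch.ones(n, dtype=torch.bool)
    mask[keep_idx] = False

    kept, dropped = tokens[keep_idx], tokens[mask]
    if dropped.numel() == 0:
        return kept
    # Assign each dropped token to the kept tokens by cosine similarity,
    # then add the aggregated information back to the kept tokens.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (N-K, K)
    weights = sim.softmax(dim=0)              # normalize over dropped tokens
    return kept + weights.T @ dropped         # (K, D)

# Toy usage: 196 patch tokens reduced to ~49 merged tokens.
tokens = torch.randn(196, 64)
scores = torch.randn(196)
merged = select_and_merge(tokens, scores)
```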

The thesis can survey these diverse types of image tokenizers and catalogue their known strengths and pitfalls. Based on this survey, the thesis could propose and evaluate specific improvements to one or more tokenizers. Alternatively, all tokenizers could be compared empirically in a unified experimental setting, which could in turn yield insights such as the adversarial robustness of image + text generative models [9].

Primary contact - Naresh Kumar Gurulingan (n.k.gurulingan@tue.nl)


References:

[1] Randall Balestriero and Yann LeCun. How learning by reconstruction produces uninformative features for perception. In Forty-first International Conference on Machine Learning, 2024. 

[2] Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, and Ying Shan. Making LLaMA SEE and draw with SEED tokenizer. In The Twelfth International Conference on Learning Representations, 2024.

[3] Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, and Mike Zheng Shou. Rethinking the objectives of vector-quantized tokenizers for image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7631–7640, 2024.

[4] Yang Jin, Kun Xu, Kun Xu, Liwei Chen, Chao Liao, Jianchao Tan, Quzhe Huang, Bin Chen, Chengru Song, Dai Meng, Di Zhang, Wenwu Ou, Kun Gai, and Yadong Mu. Unified language-vision pretraining in LLM with dynamic discrete visual tokenization. In The Twelfth International Conference on Learning Representations, 2024.

[5] Alexander Kolesnikov, André Susano Pinto, Lucas Beyer, Xiaohua Zhai, Jeremiah Harmsen, and Neil Houlsby. UViM: A unified modeling approach for vision with learned guiding codes. Advances in Neural Information Processing Systems, 35:26295–26308, 2022.

[6] Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. arXiv preprint arXiv:2406.11838, 2024.

[7] Michael Tschannen, Cian Eastwood, and Fabian Mentzer. GIVT: Generative infinite-vocabulary transformers. arXiv preprint arXiv:2312.02116, 2023.

[8] Tao Yang, Yuwang Wang, Yan Lu, and Nanning Zheng. Visual concepts tokenization. Advances in Neural Information Processing Systems, 35:31571–31582, 2022.

[9] Yunqing Zhao, Tianyu Pang, Chao Du, Xiao Yang, Chongxuan Li, Ngai man Cheung, and Min Lin. On evaluating adversarial robustness of large vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 

Details

Student: Naresh Kumar Gurulingan
Supervisor: Bahram Zonooz
Secondary supervisor: Elahe Arani