
Project: AutoML for Vision Transformers

Recently, the vision transformer (ViT) architecture has excelled at many computer vision tasks, such as image recognition [1], image segmentation [2], image retrieval [3], image generation [4], visual object tracking [5], and object detection [6]. However, each of these sub-tasks requires domain expertise to choose the type, arrangement, and size of layers and filters. For example, coarse-level filters may suffice for simple recognition, whereas dense prediction tasks such as object detection or segmentation may require finer filters. Designing such architectures manually is very time-consuming and often leads to sub-par performance on the target task.

One can instead resort to neural architecture search to design neural architectures automatically, a direction known as Transformer Architecture Search (TAS) [7–10]. TAS trains several ViTs with different configurations and then selects the best one for the target task(s). However, TAS has a huge compute demand: more than 24 GPU days, even on an efficient distributed compute server with 8 GPUs, rendering it infeasible in an academic environment.
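To make the cost concrete, the vanilla TAS procedure can be sketched as a sample-train-select loop. The sketch below is illustrative only: the search space and the train_and_evaluate stand-in are hypothetical, and in real TAS that single call is full ViT training, i.e. exactly the step that takes GPU-days.

```python
import random

# Hypothetical ViT search space: depth, embedding dimension, attention heads.
SEARCH_SPACE = {
    "depth": [6, 8, 12],
    "embed_dim": [192, 384, 768],
    "num_heads": [3, 6, 12],
}

def sample_config(rng):
    """Draw one architecture configuration uniformly from the space."""
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def train_and_evaluate(config, rng):
    """Toy stand-in for full ViT training + validation accuracy.
    In real TAS this one call can cost GPU-days per candidate."""
    size = config["depth"] * config["embed_dim"]
    return size / 10_000 + rng.random()  # fake 'accuracy' with noise

def vanilla_tas(num_candidates=10, seed=0):
    """Sample candidates, 'train' each, return the best configuration."""
    rng = random.Random(seed)
    candidates = [sample_config(rng) for _ in range(num_candidates)]
    scored = [(train_and_evaluate(c, rng), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

best = vanilla_tas()
print(best)
```

Since the total cost is (number of candidates) x (cost of one training run), the only two levers are sampling fewer candidates or making each evaluation cheaper; training-free search attacks the second lever.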

To that end, in this project we will perform training-free transformer architecture search, which reduces the compute requirement from more than 24 GPU days to just 0.5 GPU days. More specifically, we will use the synaptic diversity and synaptic saliency indicators proposed in [11] as a starting point.
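The key idea behind such training-free indicators is to score an untrained network from its weights and gradients alone, e.g. a synaptic-saliency-style score of the form sum |theta * dL/dtheta|, so that candidates can be ranked without any training. The sketch below is a minimal pure-Python illustration of that idea on a single linear layer with an analytic squared-error gradient; it is not the actual DSS indicator of [11], and all names and dimensions are invented for the example.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def synaptic_saliency(W, x, y):
    """Score an *untrained* weight matrix W by sum |theta * dL/dtheta|
    for L = ||W x - y||^2. The gradient is analytic: dL/dW[i][j] = 2*r[i]*x[j],
    where r is the residual W x - y, so no training step is needed."""
    r = [o - t for o, t in zip(matvec(W, x), y)]
    return sum(abs(W[i][j] * 2 * r[i] * x[j])
               for i in range(len(W)) for j in range(len(x)))

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(16)]  # toy input
y = [rng.gauss(0, 1) for _ in range(8)]   # toy target

# Three random initialisations stand in for three candidate architectures;
# they are ranked purely by the proxy score, with zero training.
candidates = [[[rng.gauss(0, s) for _ in range(16)] for _ in range(8)]
              for s in (0.1, 0.5, 1.0)]
scores = [synaptic_saliency(W, x, y) for W in candidates]
print(scores.index(max(scores)))
```

Because each score is a single forward/backward-style computation per candidate, the per-candidate cost drops from GPU-days of training to seconds, which is where the 24 to 0.5 GPU-day reduction comes from.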

Later, we plan to incorporate better measures of saliency and diversity that reflect the performance of a candidate ViT architecture. A potential improvement would be to leverage unsupervised performance predictors to discover more transferable vision transformers [12].

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[2] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.

[3] Shiv Ram Dubey, Satish Kumar Singh, and Wei-Ta Chu. Vision transformer hashing for image retrieval. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2022.

[4] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. Vitgan: Training gans with vision transformers. arXiv preprint arXiv:2107.04589, 2021.

[5] Fangao Zeng, Bin Dong, Tiancai Wang, Xiangyu Zhang, and Yichen Wei. Motr: End-to-end multiple-object tracking with transformer. arXiv preprint arXiv:2105.03247, 2021.

[6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.

[7] David So, Quoc Le, and Chen Liang. The evolved transformer. In International Conference on Machine Learning, pages 5877–5886. PMLR, 2019.

[8] Xiu Su, Shan You, Jiyang Xie, Mingkai Zheng, Fei Wang, Chen Qian, Changshui Zhang, Xiaogang Wang, and Chang Xu. Vision transformer architecture search. arXiv e-prints, pages arXiv–2106, 2021.

[9] Boyu Chen, Peixia Li, Chuming Li, Baopu Li, Lei Bai, Chen Lin, Ming Sun, Junjie Yan, and Wanli Ouyang. Glit: Neural architecture search for global and local image transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–21, 2021.

[10] Minghao Chen, Houwen Peng, Jianlong Fu, and Haibin Ling. AutoFormer: Searching transformers for visual recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12270–12280, 2021.

[11] Qinqin Zhou, Kekai Sheng, Xiawu Zheng, Ke Li, Xing Sun, Yonghong Tian, Jie Chen, and Rongrong Ji. Training-free transformer architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10894–10903, 2022.

[12] Anonymous. Unsupervised performance predictor for architecture search. In ICLR Submission, 2023.
Joaquin Vanschoren
Secondary supervisor
Mert Kilickaya