Self-supervised learning [1, 2] learns feature representations by solving pretext prediction tasks that require no annotations. Recent empirical research has shown that deeper and wider models benefit far more from task-agnostic use of unlabeled data than their smaller counterparts; smaller models trained with self-supervised learning fail to close the gap to supervised training. Existing techniques such as contrastive learning perform poorly in small networks, largely because models with fewer parameters cannot effectively learn instance-level discriminative representations from large amounts of data.
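To make the instance-discrimination objective concrete, the following is a minimal PyTorch sketch of an NT-Xent contrastive loss in the style of SimCLR [1]; the function name and the temperature value are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (normalized temperature-scaled cross entropy) loss sketch.

    z1, z2: (N, D) projections of two augmented views of the same N images.
    Each sample's positive is its counterpart in the other view; the
    remaining 2N - 2 samples in the batch act as negatives.
    """
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # (2N, 2N) similarity logits
    sim.fill_diagonal_(float('-inf'))                    # mask self-similarity
    # positive index for row i is i + N (first half) or i - N (second half)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```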
Knowledge distillation [7] is an effective technique for improving the performance of compact models, either by using the supervision of a larger pre-trained teacher or by training a cohort of smaller models collaboratively. Prior work has distilled knowledge with a variety of learning objectives, including consistency between feature maps, consistency between class probability distributions, and maximization of mutual information. To avoid the computational cost of pre-training a teacher, deep mutual learning (DML) [6] proposed online knowledge distillation: in addition to a primary supervised loss, each participant model is trained with a distillation loss that aligns its class posterior probabilities with those of the other models in the cohort. However, these methods emphasize task-specific distillation (e.g., for image classification) during the fine-tuning phase, as opposed to task-agnostic distillation during the pre-training phase for representation learning.
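A minimal sketch of the per-model DML objective [6] for a two-model cohort is shown below: a supervised cross-entropy term plus a KL term that pulls one model's class posterior toward its peer's. The function name, the detach-the-peer choice, and the temperature handling are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def dml_loss(logits_a, logits_b, labels, temperature=1.0):
    """Loss for model A in a two-model DML cohort.

    Primary supervised cross-entropy plus a KL divergence that aligns A's
    class posterior with that of peer B. B's logits are detached so this
    step only updates A; B receives the symmetric loss in its own step.
    """
    ce = F.cross_entropy(logits_a, labels)
    log_p_a = F.log_softmax(logits_a / temperature, dim=1)
    p_b = F.softmax(logits_b.detach() / temperature, dim=1)
    kl = F.kl_div(log_p_a, p_b, reduction='batchmean') * temperature ** 2
    return ce + kl
```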
To address self-supervised pre-training of smaller models, Distill on the Go (DoGo) [5] proposes a self-supervised learning paradigm that uses single-stage online knowledge distillation to improve the representation quality of smaller models. DoGo employs a deep mutual learning strategy in which two models collaboratively learn from and improve each other. However, there is still room for improvement. Future work on self-supervised pre-training of smaller models could address the remaining shortcomings of this line of work while retaining effective online knowledge distillation. Specifically, joint training with more than two peers, online knowledge distillation on larger datasets such as ImageNet, and evaluation of downstream performance on other tasks such as object detection and segmentation are all promising research directions.
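To illustrate how contrastive pre-training and online distillation can be combined in a single stage, here is a rough sketch of a DoGo-style objective for one peer: its own contrastive loss plus a KL term aligning its cross-view similarity distribution with that of the other peer. The function names, temperature, and loss weight are assumptions; the actual DoGo formulation [5] may differ in its details. The sketch reuses `nt_xent_loss` from the earlier code block.

```python
import torch
import torch.nn.functional as F

def similarity_logits(z1, z2, temperature=0.1):
    """Cosine-similarity logits between two sets of (N, D) projections."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    return z1 @ z2.t() / temperature                      # (N, N)

def dogo_style_loss(za_v1, za_v2, zb_v1, zb_v2, distill_weight=1.0):
    """Objective for peer A: its own contrastive loss plus a KL term that
    aligns A's cross-view similarity distribution with peer B's. B is
    treated as a fixed target here; B gets the symmetric objective."""
    contrastive = nt_xent_loss(za_v1, za_v2)              # from the earlier sketch
    log_p_a = F.log_softmax(similarity_logits(za_v1, za_v2), dim=1)
    with torch.no_grad():
        p_b = F.softmax(similarity_logits(zb_v1, zb_v2), dim=1)
    distill = F.kl_div(log_p_a, p_b, reduction='batchmean')
    return contrastive + distill_weight * distill
```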
References:
[1] Chen, Ting, et al. "A simple framework for contrastive learning of visual representations." International conference on machine learning. PMLR, 2020.
[2] Grill, Jean-Bastien, et al. "Bootstrap your own latent-a new approach to self-supervised learning." Advances in neural information processing systems 33 (2020): 21271-21284.
[3] Chen, Ting, et al. "Big self-supervised models are strong semi-supervised learners." Advances in neural information processing systems 33 (2020): 22243-22255.
[4] Fang, Zhiyuan, et al. "Seed: Self-supervised distillation for visual representation." arXiv preprint arXiv:2101.04731 (2021).
[5] Bhat, Prashant, Elahe Arani, and Bahram Zonooz. "Distill on the Go: Online Knowledge Distillation in Self-Supervised Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[6] Zhang, Ying, et al. "Deep mutual learning." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
[7] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015).