Every second, roughly $10^7$ to $10^8$ bits of information reach the human visual system (HVS) [IK01]. Because biological hardware has limited computational capacity, processing this massive sensory stream in full would be impossible. The HVS has therefore evolved two mechanisms, foveation and fixation, that preserve perceptual performance while yielding substantial computational savings.
The structure of the human eye is distinctive. Photoreceptors, the specialized cells that respond to light, are not evenly distributed across the retina at the rear of the eye: their density is highest at the center of the retina, the fovea, and progressively decreases with eccentricity [CSKH90]. As a result, when a human observer looks at a location in a real-world scene, the eye transmits a variable-resolution image to the high-level processing units of the brain. This is known as foveation.
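To make the idea concrete, the following minimal sketch foveates a grayscale image by blending progressively blurred copies of it, keeping the region around a given fixation point sharp and the periphery coarse. The pyramid depth, the blur schedule, and the `foveate` function itself are illustrative assumptions, not a model from the cited literature.

```python
# A minimal sketch of artificial foveation, assuming a grayscale image as a
# 2-D NumPy array. Resolution (here: blur level) degrades with eccentricity,
# i.e. distance from the fixation point, loosely mimicking the retina.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(image, fixation, num_levels=4):
    """Blend progressively blurred copies of `image`, keeping the region
    around `fixation` (row, col) sharp and the periphery coarse."""
    h, w = image.shape
    rows, cols = np.mgrid[0:h, 0:w]
    # Eccentricity: normalized distance of each pixel from the fixation point.
    dist = np.hypot(rows - fixation[0], cols - fixation[1])
    dist /= dist.max()
    # Map eccentricity to a pyramid level: 0 = full resolution at the fovea.
    level = np.minimum((dist * num_levels).astype(int), num_levels - 1)
    # Precompute blurred copies; the blur sigma grows with the level index.
    pyramid = [gaussian_filter(image.astype(float), sigma=2.0 * k)
               for k in range(num_levels)]
    out = np.zeros_like(pyramid[0])
    for k in range(num_levels):
        out[level == k] = pyramid[k][level == k]
    return out

# Example: fixate on the centre of a random 256x256 image.
img = np.random.rand(256, 256)
foveated = foveate(img, fixation=(128, 128))
```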
Foveation functions much like image compression, heavily compressing the regions away from the center of gaze. Foveation alone, however, would impair perception, since details in the periphery could be missed. To compensate, the HVS employs a repertoire of eye movements that direct the gaze to specific points in a scene. The eye is drawn to salient regions as candidate points of attention and further processing, allowing the brain to assemble a precise map of the scene from a sequence of variable-resolution images [IK01, BT09]. This scanning of the scene through eye movements is called fixation.
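As a rough illustration of a fixation mechanism, the sketch below selects a sequence of fixation points by repeatedly picking the most salient location and then suppressing its neighborhood (inhibition of return), loosely in the spirit of the winner-take-all scheme of [IK01]. The gradient-magnitude saliency map is a crude stand-in for a full saliency model, and the function name and radius are illustrative assumptions.

```python
# A minimal sketch of saliency-driven fixation with inhibition of return.
import numpy as np

def fixation_sequence(image, num_fixations=5, suppress_radius=20):
    """Return a list of (row, col) fixation points, suppressing each
    selected region so the gaze moves on to the next most salient spot."""
    gy, gx = np.gradient(image.astype(float))
    saliency = np.hypot(gx, gy)       # crude stand-in for a saliency model
    rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    fixations = []
    for _ in range(num_fixations):
        r, c = np.unravel_index(np.argmax(saliency), saliency.shape)
        fixations.append((r, c))
        # Inhibition of return: zero out saliency around the visited point.
        saliency[np.hypot(rows - r, cols - c) < suppress_radius] = 0.0
    return fixations
```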
Motivated by this biological intuition, several works have leveraged foveation and fixation mechanisms in machine learning approaches [VBPP20, AE17, TLCR21, JWE21]. These studies show that artificial foveation and fixation not only improve efficiency but also make neural network architectures more robust to noise and adversarial attacks. Furthermore, explicitly learning the fixation points enables visualization techniques that help interpret the model's decisions.
Video action recognition and prediction have grown in importance in recent years owing to practical applications in areas such as visual surveillance and autonomous driving. Action recognition aims to identify the action being performed from a fully observed execution, whereas action prediction aims to infer the ongoing or future action from a partial, incomplete observation [KF22].
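The two tasks differ chiefly in how much of the sequence the model observes, which the hypothetical sketch below makes explicit: recognition consumes all frames, while prediction sees only a prefix controlled by an observation ratio. The `classify` callable, the helper names, and the feature shapes are illustrative assumptions.

```python
# An illustrative sketch of the two task settings with a stand-in classifier.
import numpy as np

def recognize(classify, frames):
    """Action recognition: label the action after observing all T frames."""
    return classify(frames)                      # shape (T, D) -> label

def predict(classify, frames, observation_ratio=0.3):
    """Action prediction: label the action from its first part only."""
    t = max(1, int(len(frames) * observation_ratio))
    return classify(frames[:t])                  # shape (t, D) -> label

# Toy usage: a stand-in classifier that ignores its input.
toy_classify = lambda x: "walking"
frames = np.random.rand(60, 128)                 # 60 frames, 128-D features
print(recognize(toy_classify, frames), predict(toy_classify, frames))
```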
The spatiotemporal structure of a video sequence provides important information for these tasks, and several studies exploit it to achieve better performance [CPR21]. This structure is especially useful for learning fixation points in models inspired by the human visual system, because fixation points are likely to be correlated in the temporal domain as well: where the gaze lands in one frame is a strong prior for where it should land in the next. HVS-inspired models may therefore benefit from training on video sequences. As a result, in addition to the advantages described above, applying such models to action recognition tasks may yield even greater computational efficiency thanks to the spatiotemporal relationship that will be investigated in this thesis.
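A minimal sketch of this intuition: if fixation points are temporally correlated, the search for the next fixation can be restricted to a window around the previous one, shrinking the per-frame computation. The windowed-search heuristic and the function below are illustrative assumptions, not a method from the cited works.

```python
# A minimal sketch of exploiting temporal correlation between fixations:
# search each frame's saliency map only near the previous fixation point.
import numpy as np

def track_fixations(saliency_maps, window=32):
    """Given per-frame saliency maps of shape (T, H, W), pick one fixation
    per frame, constraining each search to a window around the last one."""
    h, w = saliency_maps.shape[1:]
    # First frame: global argmax over the full map.
    r, c = np.unravel_index(np.argmax(saliency_maps[0]), (h, w))
    fixations = [(r, c)]
    for s in saliency_maps[1:]:
        r0, r1 = max(0, r - window), min(h, r + window)
        c0, c1 = max(0, c - window), min(w, c + window)
        local = s[r0:r1, c0:c1]          # much smaller search region
        dr, dc = np.unravel_index(np.argmax(local), local.shape)
        r, c = r0 + dr, c0 + dc
        fixations.append((r, c))
    return fixations
```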
References:
[AE17] Emre Akbas and Miguel P. Eckstein. Object detection through search with a foveated visual system. PLoS Computational Biology, 13(10):e1005743, 2017.
[BT09] Neil D. B. Bruce and John K. Tsotsos. Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3):5–5, 2009.
[CPR21] Chun-Fu Richard Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, and Quanfu Fan. Deep analysis of CNN-based spatio-temporal representations for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6165–6175, 2021.
[CSKH90] Christine A. Curcio, Kenneth R. Sloan, Robert E. Kalina, and Anita E. Hendrickson. Human photoreceptor topography. Journal of Comparative Neurology, 292(4):497–523, 1990.
[IK01] Laurent Itti and Christof Koch. Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3):194–203, 2001.
[JWE21] Aditya Jonnalagadda, William Wang, and Miguel P. Eckstein. FoveaTer: Foveated transformer for image classification. arXiv preprint arXiv:2105.14173, 2021.
[KF22] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401, 2022.
[TLCR21] Chittesh Thavamani, Mengtian Li, Nicolas Cebron, and Deva Ramanan. FOVEA: Foveated image magnification for autonomous navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15539–15548, 2021.
[VBPP20] Manish Reddy Vuyyuru, Andrzej Banburski, Nishka Pant, and Tomaso Poggio. Biologically inspired mechanisms for adversarial robustness. Advances in Neural Information Processing Systems, 33:2135–2146, 2020.