28.04.2025
Zhengyang Yu
Toddlers learn to recognize objects from different viewpoints with almost no supervision. Recent works argue that toddlers develop this ability by mapping close-in-time visual inputs to similar representations while interacting with objects. High-acuity vision is only available in the central visual field, which may explain why toddlers (much like adults) constantly move around their gaze during such interactions. It is unclear whether/how much toddlers curate their visual experience through these eye movements to support their learning of object representations. In this work, we explore whether a bio-inspired visual learning model can harness toddlers’ gaze behavior during a play session to develop view-invariant object recognition. Exploiting head-mounted eye tracking during dyadic play, we simulate toddlers’ central visual field experience by cropping image regions centered on the gaze location. This visual stream feeds time-based self-supervised learning algorithms. Our experiments demonstrate that toddlers’ gaze strategy supports the learning of invariant object representations. Our analysis also reveals that the limited size of the central visual field where acuity is high is crucial for this. Overall, our work reveals how toddlers’ gaze behavior supports self-supervised learning of view-invariant object recognition.