Multimodal context in multilingual referential communication: Integrating deep learning and human perception

New project funded by AiNed XS Europe in the area of Natural Language Processing

When we use language to refer to objects in our surroundings, our choice of what to say about them emerges from a close interplay between linguistic constraints and perceptual processes. If we want to refer to a car, for instance, what we say about it may depend not only on the distinguishing features of the car itself, but also on the visual context: whether the car is in a showroom or on a busy road. Human attention is driven not only by salient features, but also by our knowledge of everyday scenes. How does the outcome of this interplay between salience and scene knowledge shape the formulation of a linguistic message in a specific language, such as Dutch or Turkish?

A new collaboration between the NLP group in the Department of Information and Computing Sciences and the Computational Linguistics group at Bielefeld University (Germany) will model the interaction between language production and visual attention in real-world scenes. The goal of the project, funded by an AiNed XS Europe grant, is to model this interplay between perception and language in order to develop human-centred, cognitively plausible AI models of referential communication. In part, this will be achieved by incorporating human behavioural signals from eye-tracking experiments and integrating them with deep models of visual and linguistic processing, so that the referential process in real-world scenes can be modelled in a cognitively plausible manner.
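
To give a flavour of what integrating eye-tracking signals with model-internal salience might look like, the sketch below blends per-object fixation counts with scores from a vision model to decide which object, and which of its attributes, a toy referring-expression generator mentions. All names and numbers here (the gaze_weighted_salience function, the example scene, the interpolation weight) are hypothetical illustrations, not part of the project's actual models or data.

```python
# Hypothetical sketch: fusing human gaze evidence with model-predicted
# salience before choosing what to say about an object. Illustrative only.

import numpy as np

def gaze_weighted_salience(model_salience, fixation_counts, alpha=0.5):
    """Blend model-predicted salience with normalised human fixation counts.

    model_salience  : per-object salience scores, e.g. from a vision backbone
    fixation_counts : per-object fixation counts from an eye-tracking study
    alpha           : interpolation weight between model and human signal
    """
    model = model_salience / model_salience.sum()
    human = fixation_counts / fixation_counts.sum()
    return alpha * model + (1 - alpha) * human

def choose_referent_attributes(objects, blended_salience, n_attributes=2):
    """Pick the most salient object and attributes that set it apart."""
    target = int(np.argmax(blended_salience))
    distractors = [o for i, o in enumerate(objects) if i != target]
    # Keep only attribute values that no distractor shares with the target.
    distinguishing = [
        (attr, value)
        for attr, value in objects[target]["attributes"].items()
        if all(o["attributes"].get(attr) != value for o in distractors)
    ]
    return objects[target]["name"], distinguishing[:n_attributes]

if __name__ == "__main__":
    # Toy scene: two cars, one of which attracted more fixations.
    objects = [
        {"name": "car", "attributes": {"colour": "red", "setting": "showroom"}},
        {"name": "car", "attributes": {"colour": "blue", "setting": "road"}},
    ]
    model_salience = np.array([0.4, 0.6])    # visual model scores (made up)
    fixation_counts = np.array([12.0, 3.0])  # eye-tracking counts (made up)

    blended = gaze_weighted_salience(model_salience, fixation_counts)
    name, attrs = choose_referent_attributes(objects, blended)
    print(f"Refer to: the {' '.join(v for _, v in attrs)} {name}")
    # -> "Refer to: the red showroom car"
```

In this toy example, the human gaze signal overrides the vision model's preference, and the generator mentions only the attributes that distinguish the fixated car from its competitor; the project's actual models will of course be far richer, but the sketch shows the general shape of combining behavioural and model-internal signals.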