Vision and Language Modelling

When we communicate, we use linguistic form to convey meanings. Meanings in part arise from a relationship between linguistic strings (words and phrases) and the world of perception and experience. Vision-Language models address one part of this relationship, namely, the grounding of linguistic expressions in visual data. Such models are multimodal, in that they are trained on combinations of images (or videos) and corresponding texts, such as captions or short narratives.

In the Vision and Language lab, we explore several properties of the Vision-Language interface, including:

  • Visually grounded generation: we develop datasets and models to generate text from images, from descriptive captions to narrative texts (a minimal captioning sketch follows this list).
  • Grounded natural language understanding: using evaluation techniques such as foil-based tasks and probing methods, we seek to understand the scope and limits of the linguistic grounding capabilities of multimodal models (see the foil-scoring sketch below).
  • Multimodal Explainable AI (XAI) frameworks: Deep neural models tend to be black boxes. We are interested in developing XAI techniques to understand how models for generation and understanding ground their choices in visual inputs (see the saliency sketch below).
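
To make the first theme concrete, here is a minimal sketch of visually grounded generation: producing a caption for a single image with an off-the-shelf pretrained vision-language model. It assumes the Hugging Face transformers library, the public Salesforce/blip-image-captioning-base checkpoint, and a local example.jpg; it is an illustration only, not the lab's own models or datasets.

```python
# Minimal captioning sketch: generate a descriptive caption for one image
# using a publicly available pretrained vision-language model (BLIP).
# Assumes: pip install transformers pillow torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("example.jpg").convert("RGB")  # hypothetical local image file

# Encode the image, then decode a caption token by token.
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```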
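
A foil-based evaluation asks whether a model prefers a correct caption over a "foil" in which a single word has been changed. The sketch below scores such a pair against an image with a pretrained CLIP checkpoint; the image, captions, and checkpoint name are illustrative assumptions, not the lab's benchmark setup.

```python
# Foil-style check sketch: does a pretrained image-text model prefer the
# correct caption over a foil caption with one word swapped?
# Assumes: pip install transformers pillow torch
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")  # hypothetical image of a dog on a sofa
captions = [
    "a dog sleeping on a sofa",  # correct caption
    "a cat sleeping on a sofa",  # foil: one word changed
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]  # image-text similarity scores

print("correct:", scores[0].item(), "foil:", scores[1].item())
print("model prefers correct caption:", bool(scores[0] > scores[1]))
```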
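
Finally, one very simple way to ask how a model grounds its choices in the visual input is vanilla gradient saliency: backpropagate the image-text score to the pixels and inspect the gradient magnitude. The sketch below does this for the same CLIP checkpoint; it is only a toy stand-in for richer multimodal XAI methods, and the image and caption are again placeholders.

```python
# Saliency sketch: which image regions drive the image-text match?
# Vanilla gradient attribution over CLIP pixel inputs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # hypothetical image
inputs = processor(text=["a dog sleeping on a sofa"], images=image,
                   return_tensors="pt", padding=True)
inputs["pixel_values"].requires_grad_(True)

score = model(**inputs).logits_per_image[0, 0]  # image-text similarity
score.backward()

# Per-pixel importance: gradient magnitude, aggregated over colour channels.
saliency = inputs["pixel_values"].grad.abs().sum(dim=1)[0]  # shape (H, W)
print(saliency.shape, saliency.max())
```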