Computer vision and natural language processing, traditionally separate subfields of machine learning, have started to converge over the past year. Researchers are applying the Transformer architecture, which previously achieved state-of-the-art performance in NLP, to computer vision tasks and are developing unified architectures that handle both modalities.
This project aims to study the inner workings of such vision-and-language neural network models. Research questions include: Do these models process language and images the way humans do? How do their representations differ from those of pure language or pure vision models? We adapt methods traditionally used to study human language processing to the study of these unified models.
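One illustrative method in this family is representational similarity analysis (RSA), borrowed from cognitive neuroscience, which compares the representational geometry of two systems rather than their raw activations. The sketch below is a minimal, self-contained illustration (not this project's actual pipeline): the embeddings are synthetic stand-ins for activations a vision-language model and a text-only model would produce for the same set of stimuli.

```python
import numpy as np

def rdm(embeddings):
    """Representational dissimilarity matrix: correlation distance
    between every pair of stimulus embeddings (upper triangle only)."""
    c = np.corrcoef(embeddings)
    iu = np.triu_indices_from(c, k=1)
    return 1.0 - c[iu]

def spearman(a, b):
    """Spearman rank correlation between two flattened RDMs."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

def rsa_score(emb_a, emb_b):
    """Similarity of two models' representational geometries."""
    return spearman(rdm(emb_a), rdm(emb_b))

# Synthetic embeddings for 20 shared stimuli; in practice these would
# come from the hidden states of the models under comparison.
rng = np.random.default_rng(0)
vl_emb = rng.normal(size=(20, 64))                      # "vision-language model"
text_emb = vl_emb + 0.1 * rng.normal(size=(20, 64))     # "text-only model", perturbed

print(rsa_score(vl_emb, vl_emb))   # identical geometry -> 1.0
print(rsa_score(vl_emb, text_emb)) # high score: similar geometry
```

A high RSA score means the two models organize the stimuli similarly, even if their embedding spaces differ in dimensionality or scale; the same machinery can compare model representations against human behavioral or neural similarity judgments.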