We present a multimodal system for aligning scholarly documents to their corresponding presentations at a fine-grained level (i.e., per presentation slide and per paper section). Our method improves upon a state-of-the-art baseline that employs only textual similarity. Based on an analysis of baseline errors, we propose a three-pronged alignment system that combines textual, image, and ordering information to establish alignments. Our results show a statistically significant improvement of 25%, confirming the importance of visual content for alignment accuracy.
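The three-pronged combination described above can be sketched as a weighted score over the three signals. This is a minimal illustration, not the paper's actual model: the function names, the greedy per-slide assignment, and the weights are all hypothetical, and the precomputed similarity values stand in for whatever text and image similarity measures the system uses.

```python
def combined_score(text_sim, image_sim, slide_pos, section_pos,
                   w_text=0.5, w_image=0.3, w_order=0.2):
    """Weighted combination of the three signals; weights are illustrative."""
    # Ordering prior: slides and sections tend to progress in the same
    # order, so penalize pairs whose normalized positions differ.
    order_sim = 1.0 - abs(slide_pos - section_pos)
    return w_text * text_sim + w_image * image_sim + w_order * order_sim

def align(slides, sections):
    """Greedily assign each slide to its best-scoring section.

    Each slide is a dict with precomputed per-section similarity lists
    under the (hypothetical) keys "text_sim" and "image_sim".
    """
    n, m = len(slides), len(sections)
    alignment = []
    for i, slide in enumerate(slides):
        scores = [
            combined_score(slide["text_sim"][j], slide["image_sim"][j],
                           i / max(n - 1, 1), j / max(m - 1, 1))
            for j in range(m)
        ]
        alignment.append(max(range(m), key=scores.__getitem__))
    return alignment
```

A greedy per-slide assignment is only one possible design; a global method (e.g., dynamic programming over the ordering constraint) would be a natural alternative.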