We present a multimodal system for aligning scholarly documents to corresponding presentations in a fine-grained manner (i.e., per presentation slide
and per paper section). Our method improves upon a state-of-the-art baseline
that employs only textual similarity. Based on an analysis of errors made by the baseline, we propose a three-pronged alignment system that combines textual, image, and ordering information to establish alignment. Our results show a
statistically significant improvement of 25% over the baseline. Our results confirm
the importance of leveraging visual content to improve document alignment accuracy.
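The three-pronged combination described above can be illustrated with a minimal sketch. This is not the paper's implementation: the weights, the toy cosine similarity, and the linear ordering prior are all assumptions chosen for clarity. Each slide is assigned to the section maximizing a weighted sum of textual similarity, image similarity, and positional agreement.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors; 0.0 for zero vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align(slides, sections, w_text=0.5, w_img=0.3, w_order=0.2):
    """Align each slide to one paper section.

    slides/sections: lists of dicts with 'text' and 'img' feature vectors
    (hypothetical representations; any embedding would do).
    Returns a list giving, per slide, the index of the best section.
    """
    n_sl, n_se = len(slides), len(sections)
    alignment = []
    for i, sl in enumerate(slides):
        best, best_score = 0, -1.0
        for j, se in enumerate(sections):
            # Ordering prior: slides and sections tend to progress together,
            # so reward matches at similar relative positions.
            order = 1.0 - abs(i / max(n_sl - 1, 1) - j / max(n_se - 1, 1))
            score = (w_text * cosine(sl["text"], se["text"])
                     + w_img * cosine(sl["img"], se["img"])
                     + w_order * order)
            if score > best_score:
                best, best_score = j, score
        alignment.append(best)
    return alignment
```

In practice the text and image features would come from learned encoders, and the weights would be tuned on held-out alignments; the sketch only shows how the three signals combine into a single alignment decision.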