While efforts have been made to bridge the semantic gap in image understanding, the in situ understanding of social media images is arguably more important but has seen less progress. In this work, we enrich the representation of images in image tweets by considering their social context. We argue that in the microblog context, traditional image features, e.g., low-level SIFT or high-level detected objects, are far from adequate for interpreting the semantics latent in image tweets. To bridge this gap, we move from the images’ pixels to their context and propose a context-aware image tweet modelling (CITING) framework that mines and fuses contextual text to model the semantics of such social media images. We start with a tweet’s intrinsic contexts, namely, 1) text within the image itself and 2) its accompanying text; we then turn to the extrinsic contexts: 3) the external web page linked to by the tweet’s embedded URL, and 4) the Web as a whole. These contexts can be leveraged to benefit many fundamental applications. To demonstrate the effectiveness of our framework, we focus on the task of personalized image tweet recommendation, developing a feature-aware matrix factorization framework that encodes the contexts as part of user interest modelling. Extensive experiments on a large Twitter dataset show that our proposed method significantly improves recommendation performance. Finally, to spur future studies, we have released both the code of our recommendation model and our image tweet dataset.