NewsStories:
Illustrating articles with visual summaries

Reuben Tan    Bryan Plummer    Kate Saenko    JP Lewis    Avneesh Sud    Thomas Leung

Boston University    Google Research   

In ECCV 2022

Paper | Code and dataset




Abstract


Recent self-supervised approaches have used large-scale image-text datasets to learn powerful representations that transfer to many tasks without finetuning. These methods often assume a one-to-one correspondence between images and their short captions. However, many tasks require reasoning about multiple images and long text narratives, such as describing news articles with visual summaries. Thus, we explore a novel setting where the goal is to learn a self-supervised visual-language representation that is robust to varying text length and the number of images. In addition, unlike prior work, which assumed captions have a literal relation to the image, we assume images bear only a loose, illustrative correspondence to the text. To explore this problem, we introduce a large-scale multimodal dataset containing over 900,000 articles grouped into 350,000 stories and more than 750,000 images. We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images. Finally, we introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
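
To make the task concrete, the sketch below shows one simple way to score zero-shot image-set retrieval: each candidate set of images is matched to an article by mean-pooling per-image cosine similarities between unit-normalized embeddings. This is a minimal illustration assuming generic pretrained encoders produce the features; it is not the paper's baseline, and all names are placeholders.

    import torch
    import torch.nn.functional as F

    def image_set_score(image_feats, article_feat):
        # image_feats: (N, D) unit-normalized embeddings of one image set;
        # article_feat: (D,) unit-normalized embedding of one article.
        # Mean-pool the per-image cosine similarities so that sets with
        # different numbers of images are scored on the same scale.
        return (image_feats @ article_feat).mean()

    def retrieve(image_sets, article_feat):
        # Rank candidate image sets for a query article, best match first.
        scores = torch.stack([image_set_score(s, article_feat) for s in image_sets])
        return scores.argsort(descending=True)

    # Toy usage with random features standing in for real encoder outputs.
    torch.manual_seed(0)
    article = F.normalize(torch.randn(512), dim=-1)
    image_sets = [F.normalize(torch.randn(n, 512), dim=-1) for n in (3, 5, 2)]
    print(retrieve(image_sets, article))  # indices of image sets, best first

Mean pooling is one natural choice here because it keeps the score independent of set size, which matters when narratives are illustrated by varying numbers of images.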


NewsStories Dataset
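
The dataset groups multiple articles covering the same news story together with their images. Below is a hypothetical layout of one story record to make that grouping concrete; the field names (story_id, articles, image_urls) are illustrative assumptions, not the released schema.

    # Hypothetical record for one story; field names are assumptions,
    # not the released schema. A story groups several articles, and
    # each article carries its own illustrative images.
    story = {
        "story_id": "story_000001",
        "articles": [
            {
                "text": "Full body of one news article covering the story ...",
                "image_urls": [
                    "https://example.com/img_0.jpg",
                    "https://example.com/img_1.jpg",
                ],
            },
            # ... more articles about the same story
        ],
    }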







Citation


@inproceedings{TanNews2022,
    author    = {Reuben Tan and Bryan A. Plummer and Kate Saenko and J. P. Lewis and Avneesh Sud and Thomas Leung},
    title     = {NewsStories: Illustrating articles with visual summaries},
    booktitle = {The European Conference on Computer Vision (ECCV)},
    year      = {2022}
}