Towards Video Captioning with Naming: a Novel Dataset and a Multi-Modal Approach

Stefano Pini, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
In International Conference on Image Analysis and Processing (ICIAP), 2017
DOI: 10.1007/978-3-319-68548-9_36

Abstract

Current approaches for movie description cannot refer to characters by their proper names, and can only indicate people with a generic “someone” tag. In this paper, we present two contributions towards the development of video description architectures with naming capabilities. First, we collect and release an extension of the popular Montreal Video Annotation Dataset in which the visual appearance of each character is linked both through time and to textual mentions in captions. We annotate, in a semi-automatic manner, a total of 53k face tracks and 29k textual mentions across 92 movies. Second, to underline and quantify the challenges of generating captions with names, we present different multi-modal approaches that solve the problem on already generated captions.

@inproceedings{pini2017towards,
  title={Towards video captioning with naming: a novel dataset and a multi-modal approach},
  author={Pini, Stefano and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={International Conference on Image Analysis and Processing},
  pages={384--395},
  year={2017},
  organization={Springer}
}