From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions

We provide below the data files that make up the denotation graph we have created from the Flickr 30k Dataset. The graph consists of three parts: the strings that define its nodes (dog, running, grass, etc.), the edges that connect those nodes (dog running can be created from running by adding the subject dog, or from dog by adding the verb running), and the images that depict each string in the graph.
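To make the three parts concrete, here is a minimal in-memory sketch of such a graph. It is not the format of the released data files: the class name `DenotationGraph`, its fields, and the image ids are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class DenotationGraph:
    # node string -> ids of the images that depict it (toy ids, not real data)
    images: Dict[str, Set[int]] = field(default_factory=dict)
    # node string -> strings it can be created from by adding material
    parents: Dict[str, Set[str]] = field(default_factory=dict)

    def add_node(self, string: str, image_ids: Set[int]) -> None:
        self.images.setdefault(string, set()).update(image_ids)
        self.parents.setdefault(string, set())

    def add_edge(self, parent: str, child: str) -> None:
        # Both endpoints must already be nodes of the graph.
        if parent not in self.images or child not in self.images:
            raise KeyError("add both nodes before connecting them")
        self.parents[child].add(parent)


graph = DenotationGraph()
graph.add_node("dog", {1, 2, 3})
graph.add_node("running", {2, 3, 4})
graph.add_node("dog running", {2, 3})
# "dog running" can be created from "dog" (add the verb) or "running" (add the subject).
graph.add_edge("dog", "dog running")
graph.add_edge("running", "dog running")
```

In this toy example the images of dog running are a subset of the images of each string it was created from, which is the property the edges are meant to capture.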

Additionally, we have computed two kinds of denotational similarity metrics on the nodes of the graph: (normalized) pointwise mutual information, PMI(s, t), and conditional probabilities, P(s | t). Our paper shows that these similarity metrics are at least as beneficial as distributional similarities for two tasks that require semantic inference. For example:

Approximate Textual Entailment

The approximate textual entailment task generates textual entailment items using the Flickr 30k Dataset and our denotation graph. We use captions from the Flickr 30k Dataset as premises, and try to determine if they entail strings from the denotation graph.

Premises:	A woman with dark hair in bending, open mouthed, towards the back of a dark headed toddler's head.
	A dark-haired woman has her mouth open and is hugging a little girl while sitting on a red blanket.
	A grown lady is snuggling on the couch with a young girl and the lady has a frightened look.
	A mom holding her child on a red sofa while they are both having fun.
Hypothesis:	make face
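
An item like the one above can be represented as a small record. The sketch below is only an illustration: the names `EntailmentItem` and `make_item` are hypothetical, the captions and image ids are toy values, and the labeling rule (an item is positive when the image belongs to the set of images depicting the hypothesis string) is an assumption, not a statement of how the released items were built.

```python
from typing import List, NamedTuple, Set


class EntailmentItem(NamedTuple):
    premises: List[str]   # captions of one image
    hypothesis: str       # a string from the denotation graph
    label: bool           # assumed rule: True if the image depicts the hypothesis


def make_item(image_id: int, captions: List[str],
              hypothesis: str, hypothesis_images: Set[int]) -> EntailmentItem:
    # Assumption: the image's membership in the hypothesis' image set decides the label.
    return EntailmentItem(list(captions), hypothesis, image_id in hypothesis_images)


# Toy data, not taken from the dataset.
captions = ["A woman hugs a little girl.", "A mom holds her child on a sofa."]
positive = make_item(7, captions, "make face", {3, 7, 9})
negative = make_item(7, captions, "play baseball", {1, 2})
```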

Downloads

Please fill in the following form to request access to the Flickr 30k Dataset and the Denotation Graph. Note that the Flickr 30k Dataset includes images obtained from Flickr. Use of the images must abide by the Flickr Terms of Use. We do not own the copyright of the images. They are solely provided at the link below for researchers and educators who wish to use the dataset for non-commercial research and/or educational purposes.

1. In our actual denotation graph, words are lemmatized, so two dogs running becomes two dog run.


s	PMI(s, play baseball)	P(play baseball | s)
tag him	0.673	0.600
hold bat	0.627	0.368
try to tag	0.616	0.517
slide into base	0.569	0.278
pitch	0.561	0.200
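
As a rough sketch of how scores like those in the table above could be derived from the images attached to two strings, the code below estimates probabilities as fractions of the image collection. The function name `denotational_scores`, the toy image ids, and the choice to normalize PMI by -log P(s, t) are assumptions made for illustration; the paper gives the exact definitions.

```python
import math
from typing import Set, Tuple


def denotational_scores(images_s: Set[int], images_t: Set[int],
                        n_images: int) -> Tuple[float, float]:
    """Return (normalized PMI(s, t), P(t | s)) from two image-id sets."""
    p_s = len(images_s) / n_images
    p_t = len(images_t) / n_images
    p_joint = len(images_s & images_t) / n_images
    if p_joint == 0:
        # No shared image: lowest normalized PMI, zero conditional probability.
        return -1.0, 0.0
    cond = p_joint / p_s
    if p_joint == 1.0:
        # Every image depicts both strings; -log(1) = 0, so use the limit value.
        return 1.0, cond
    pmi = math.log(p_joint / (p_s * p_t))
    # Assumed normalization: divide by -log P(s, t) so the score lies in [-1, 1].
    return pmi / -math.log(p_joint), cond


# Toy example with 10 images (made-up ids, not from the dataset).
npmi, cond = denotational_scores({1, 2, 3, 4, 5}, {1, 2, 3, 6}, 10)
```

In this toy example, three of the five images of s also depict t, so P(t | s) is 0.6, and the normalized PMI is positive because the two strings share more images than independence would predict.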
