
Neural Image Captioning

The goal of this project was to tackle the problem of automatic caption generation for images of real world scenes. The work consisted of reimplementing the Neural Image Captioning (NIC) model proposed by Vinyals et al. and running appropriate experiments to test its performance.

The project was carried out as part of the ID2223 "Scalable Machine Learning and Deep Learning" course at KTH Royal Institute of Technology.

To run

Install pip packages and cocoapi:

pip install -r requirements.txt
git clone https://github.com/cocodataset/cocoapi
cd cocoapi/PythonAPI/; make install; cd ../..

Install PyTorch for Python 3.5 with CUDA 8.0 (check pytorch.org for other options):

pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp35-cp35m-linux_x86_64.whl

Fetch the data (also builds a vocabulary):

python fetch_data.py # (to also download the test set, run python fetch_data.py --test)

Start training with default arguments (check train.py for more arguments):

python train.py

Evaluate a trained model from a checkpoint:

python train.py --sample --checkpoint_file <your-checkpoint-file>

Contributors

  • Martin Hwasser (github: hwaxxer)
  • Wojciech Kryściński (github: muggin)
  • Amund Vedal (github: amundv)

References

The implemented architecture is based on the publication by Vinyals et al. [3]; see the References list at the end of this document.

Datasets

Experiments were conducted using the Common Objects in Context (MS COCO) dataset. The following subsets were used (a caption-loading sketch follows the list):

  • Training: 2014 Contest Train images [83K images/13GB]
  • Validation: 2014 Contest Val images [41K images/6GB]
  • Test: 2014 Contest Test images [41K images/6GB]
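
Captions are read through the COCO API installed in the setup step. The following is a minimal loading sketch, assuming the 2014 caption annotations sit under an annotations/ directory; the exact paths produced by fetch_data.py may differ:

from pycocotools.coco import COCO

# Load the 2014 training captions (the path is an assumption).
coco = COCO('annotations/captions_train2014.json')

img_id = coco.getImgIds()[0]              # pick the first training image
ann_ids = coco.getAnnIds(imgIds=img_id)   # caption annotations for that image
for ann in coco.loadAnns(ann_ids):        # typically 5 captions per image
    print(ann['caption'])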

Architecture

The NIC architecture consists of two models: an Encoder and a Decoder. The Encoder, a Convolutional Neural Network, creates a (semantic) summary of the image in the form of a fixed-size vector. The Decoder, a Recurrent Neural Network, generates the caption in natural language based on the summary vector created by the Encoder.
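
As a rough sketch of this data flow at caption-generation time, with small stand-in modules rather than the actual classes from this repository (all names and sizes below are illustrative):

import torch
import torch.nn as nn

vocab_size, embed_size, hidden_size = 15000, 512, 512
encoder_head = nn.Linear(3 * 224 * 224, embed_size)     # stand-in for the CNN encoder
embed = nn.Embedding(vocab_size, embed_size)
rnn = nn.LSTM(embed_size, hidden_size, batch_first=True)
to_vocab = nn.Linear(hidden_size, vocab_size)

image = torch.randn(1, 3, 224, 224)
summary = encoder_head(image.view(1, -1)).unsqueeze(1)  # fixed-size image summary

# Greedy decoding: the summary vector is the first RNN input; afterwards the
# embedding of the previously generated word is fed back in at each step.
inputs, states, caption = summary, None, []
for _ in range(20):                                     # cap the caption length
    output, states = rnn(inputs, states)
    word = to_vocab(output.squeeze(1)).argmax(dim=1)    # most likely next word id
    caption.append(word.item())
    inputs = embed(word).unsqueeze(1)
print(caption)                                          # word ids; the vocabulary maps them to text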

Experiments

Goals

The goal of the project was to implement and train a NIC architecture and evaluate its performance. A secondary goal was to check how the type of recurrent unit and the size of the hidden state in the Decoder (the language generator) affect the overall performance of the NIC model.

Setup

The Encoder was a ResNet-34 architecture with weights pre-trained on the ImageNet dataset. The output layer of the network was replaced with a new layer whose size is definable by the user. All weights except those of the new output layer were frozen during training.
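
A minimal sketch of that setup, assuming torchvision's pretrained ResNet-34 (the function name and embedding size below are illustrative; train.py holds the actual implementation):

import torch.nn as nn
from torchvision import models

def build_encoder(embed_size=512):
    resnet = models.resnet34(pretrained=True)   # ImageNet weights
    for param in resnet.parameters():
        param.requires_grad = False             # freeze all pretrained weights
    # Replace the output layer; the new layer is trainable by default.
    resnet.fc = nn.Linear(resnet.fc.in_features, embed_size)
    return resnet

encoder = build_encoder(embed_size=512)
print([name for name, p in encoder.named_parameters() if p.requires_grad])  # only fc.weight, fc.bias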

The Decoder was a single-layer recurrent neural network. Three different recurrent units were tested: Elman, GRU, and LSTM, where Elman refers to the basic (vanilla) RNN architecture.
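
A sketch of how the three cell types could be selected behind a common interface (class and argument names are illustrative, not taken from the repository):

import torch
import torch.nn as nn

RNN_CELLS = {'elman': nn.RNN, 'gru': nn.GRU, 'lstm': nn.LSTM}

class DecoderRNN(nn.Module):
    def __init__(self, cell='lstm', vocab_size=15000, embed_size=512, hidden_size=512):
        super(DecoderRNN, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = RNN_CELLS[cell](embed_size, hidden_size, num_layers=1, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image summary vector to the embedded caption tokens.
        inputs = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.rnn(inputs)
        return self.linear(hiddens)             # word scores at every time step

decoder = DecoderRNN(cell='gru', hidden_size=1024)
scores = decoder(torch.randn(2, 512), torch.randint(0, 15000, (2, 12)))  # -> (2, 13, 15000)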

Training parameters:

  • Number of epochs: 3
  • Batch size: 128 (3236 batches per epoch)
  • Vocabulary size: 15,000 most frequent words
  • Embedding size: 512 (image summary vector, word embeddings)
  • RNN hidden state size: 512 and 1024
  • Learning rate: 1e-3, with LR decay every 2000 batches

The models were implemented in Python using the PyTorch library and were trained either locally or on rented AWS instances, in both cases on GPUs.
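
A hedged sketch of how the learning-rate schedule above could be wired up; the optimizer type and decay factor are assumptions, and train.py is the authoritative source:

import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

model = nn.Linear(512, 15000)                             # stand-in for the trainable parameters
optimizer = optim.Adam(model.parameters(), lr=1e-3)       # Adam is an assumption
scheduler = StepLR(optimizer, step_size=2000, gamma=0.5)  # gamma (decay factor) is an assumption

for batch in range(3236):                                 # 3236 batches per epoch at batch size 128
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()                                      # stepping per batch decays the LR every 2000 batches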

Evaluation Methods

Experiments were evaluated in a qualitative and a quantitative manner. The qualitative evaluation, done manually by us, assessed the coherence of the generated sequences and their relevance to the input image. The quantitative evaluation enabled comparison of the trained models with the reference models of the original authors, using the following metrics: BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L, METEOR, and CIDEr.
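
The full metric suite was computed with the MS COCO evaluation tooling via CodaLab (see the notes below the test-set table). As a small, self-contained illustration, a corpus-level BLEU score can be sketched with NLTK; the tokenization and smoothing choices here are assumptions:

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One generated caption with its reference captions (MS COCO provides several per image).
references = [[
    'a man riding a wave on a surfboard'.split(),
    'a surfer rides a large wave in the ocean'.split(),
]]
hypotheses = ['a man riding a wave on top of a surfboard'.split()]

smooth = SmoothingFunction().method1
bleu1 = corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0), smoothing_function=smooth)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth)
print('BLEU-1 %.3f  BLEU-4 %.3f' % (bleu1, bleu4))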

Results

Training Progress

Quantitative

Quantitative results are presented on the Validation and Test sets. Results obtained with the reimplemented model are compared with the results reported by the authors of the original article.

Validation Data
Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Vinyals et al. (4k subset) N/A N/A N/A 27.7 23.7 N/A 85.5
elman_512 62.5 43.2 29.1 19.8 19.5 45.6 57.7
elman_1024 61.9 42.9 28.8 19.6 19.9 45.9 58.7
gru_512 63.9 44.9 30.5 20.8 20.4 46.6 62.9
gru_1024 64.0 45.3 31.2 21.5 21.1 47.1 66.1
lstm_512 62.9 44.3 29.8 20.3 19.9 46.1 60.2
lstm_1024 63.4 45.0 31.0 21.4 20.8 47.1 64.4
Test Data (each metric reported as c5 / c40)
Model BLEU-1 BLEU-2 BLEU-3 BLEU-4 METEOR ROUGE-L CIDEr
Vinyals et al. 71.3 / 89.5 54.2 / 80.2 40.7 / 69.4 30.9 / 58.7 25.4 / 34.6 53.0 / 68.2 94.3 / 94.6
elman_1024 61.8 / 79.9 42.8 / 66.2 28.7 / 51.9 19.5 / 39.8 19.9 / 26.7 45.7 / 58.4 58.0 / 60.0
gru_1024 63.8 / 81.2 45.0 / 68.1 30.1 / 54.4 21.3 / 42.5 21.0 / 27.8 47.0 / 59.5 65.4 / 66.4
lstm_1024 63.3 / 81.0 44.8 / 67.9 30.7 / 54.0 21.1 / 42.0 20.7 / 27.4 46.9 / 59.2 63.7 / 64.8

Note: The "MSCOCO c5" dataset contains five reference captions for every image in the MS COCO training, validation, and test sets. "MSCOCO c40" contains 40 reference sentences for 5,000 randomly chosen images from the MS COCO test set [2].

Note 2: We assume the "Vinyals et al." score is the top score of the first author of our main reference paper [3]. It comes from the MS COCO leaderboard at www.codalab.org, where we also evaluated our own scores. We mapped the scores from the range [0, 1] displayed on the website to [0, 100] for consistency with the previous table and with Vinyals et al. Our GRU and LSTM results rank 84th and 85th on the CodaLab leaderboard, respectively.

Qualitative

Captions without errors (left-to-right: Elman, GRU, LSTM)

Captions with minor errors (left-to-right: Elman, GRU, LSTM)

Captions somewhat related to images (left-to-right: Elman, GRU, LSTM)

Captions unrelated to images (left-to-right: Elman, GRU, LSTM)

Discussion

Studying the results of our experiments, we noted that increasing the number of hidden units in the RNN state improved performance across all models, which matched our expectations. However, it was interesting to see the GRU cell outperform the LSTM in both experiments. A possible explanation is that, for generating relatively short sequences (most captions were at most 20 words long), the LSTM cell may be overly complex, and the sequences may simply be too short for its advantages to show. Since the LSTM has more trainable parameters than the GRU, it would be interesting to see whether extending the training procedure would allow LSTM-based networks to match or exceed the performance of GRU-based networks.
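
The parameter gap is easy to verify: at the same sizes, an LSTM has four gate weight blocks to the GRU's three, i.e. roughly a third more trainable parameters. A quick check with the 1024-unit configuration used here:

import torch.nn as nn

def num_params(module):
    return sum(p.numel() for p in module.parameters())

gru = nn.GRU(input_size=512, hidden_size=1024, num_layers=1, batch_first=True)
lstm = nn.LSTM(input_size=512, hidden_size=1024, num_layers=1, batch_first=True)
print('GRU  parameters:', num_params(gru))    # 3 gate blocks
print('LSTM parameters:', num_params(lstm))   # 4 gate blocks, roughly 33% more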

References:

[1] J. Chung et al. (2014) Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555.
[2] X. Chen et al. (2015) Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv:1504.00325.
[3] O. Vinyals et al. (2014) Show and Tell: A Neural Image Caption Generator. arXiv:1411.4555.
