Teaching an AI to Read and Comprehend

Teaching a chatbot to read and comprehend was elusive until recently. DeepMind appears to have worked out how to search documents for answers, while Microsoft Research has a chatbot that can answer questions in a number of languages.

Machine Reading Using Neural Machines

Teaching a chatbot to read, process and comprehend natural language documents and images is a coveted goal in modern AI. Interest in machine reading comprehension (MRC) is growing because of its potential industrial applications and because of technological advances, especially in deep learning and in the availability of MRC datasets that can benchmark different MRC systems. Despite the progress, many fundamental questions remain unanswered: Is question answering (QA) the proper task to test whether a machine can read? What is the right QA dataset to evaluate the reading capability of a machine? For speech recognition, the Switchboard dataset was a research goal for 20 years; why is there such a proliferation of datasets for machine reading? How important is model interpretability and how can it be measured? This session will bring together experts at the intersection of deep learning and natural language processing to explore these topics.

In the October 2015 paper, Teaching Machines to Read and Comprehend, the authors from Google DeepMind and the University of Oxford proposed a supervised learning method using deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.

Teaching Machines to Read and Comprehend

Teaching machines to read natural language documents remains an elusive challenge. Machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen, but until now large-scale training and test datasets have been missing for this type of evaluation. In this work, we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.


  • Build a supervised reading comprehension dataset using a news corpus.
  • Compare the performance of neural models and state-of-the-art natural language processing models on the reading comprehension task.

Reading Comprehension

  • Estimate conditional probability p(a|c, q), where c is a context document, q is a query related to the document, and a is the answer to that query.
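
As a minimal sketch of what estimating p(a|c, q) can look like in practice, the snippet below turns one raw score per candidate answer into a probability distribution with a softmax and picks the most probable candidate. The scores are hypothetical stand-ins for whatever a trained model would produce for a given (c, q) pair, and the function names are illustrative only.

```python
import math


def answer_distribution(scores):
    """Convert raw per-candidate scores into probabilities p(a | c, q).

    `scores` maps each candidate answer to a real-valued score that a model
    is assumed to have computed from the context c and the query q.
    """
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - top) for a, s in scores.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def predict_answer(scores):
    """Return the candidate with the highest estimated probability."""
    probs = answer_distribution(scores)
    return max(probs, key=probs.get)


# Hypothetical scores for three anonymised candidate entities.
example_scores = {"@entity1": 2.0, "@entity2": 0.5, "@entity3": -1.0}
example_probs = answer_distribution(example_scores)
```

Restricting the candidates to entities that occur in the context document keeps the output space small, which is one reason the Cloze formulation is convenient.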

Dataset Generation

Question answering dataset featured in “Teaching Machines to Read and Comprehend”: https://github.com/deepmind/rc-data/

  • Use online newspapers (CNN and DailyMail) and their matching summaries.
  • Parse summaries and bullet points into Cloze style questions.
  • Generate corpus of document-query-answer triplets by replacing one entity at a time with a placeholder.
  • Data anonymized and randomised using coreference systems, abstract entity markers and random permutation of the entity markers.
  • The processed dataset is more focused on evaluating reading comprehension, as models cannot exploit co-occurrence.
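
The steps above can be sketched roughly as follows. This is a simplified illustration, not the pipeline from the rc-data repository: it assumes the entity strings are already known (the real pipeline relies on coreference and entity detection), uses plain string replacement, and the helper names `anonymize` and `make_cloze_triples` are made up for this example.

```python
import random

PLACEHOLDER = "@placeholder"


def anonymize(document, summary, entities, rng):
    """Replace each entity name with an abstract marker, in a random order."""
    ids = list(range(len(entities)))
    rng.shuffle(ids)  # random permutation of the entity markers
    mapping = {name: "@entity%d" % i for name, i in zip(entities, ids)}
    # Replace longer names first so a short name cannot clobber a longer one.
    for name in sorted(mapping, key=len, reverse=True):
        document = document.replace(name, mapping[name])
        summary = summary.replace(name, mapping[name])
    return document, summary, mapping


def make_cloze_triples(document, summary):
    """Build (document, query, answer) triples, hiding one entity at a time."""
    tokens = summary.split()
    triples = []
    for i, tok in enumerate(tokens):
        if tok.startswith("@entity"):
            query = tokens[:i] + [PLACEHOLDER] + tokens[i + 1:]
            triples.append((document, " ".join(query), tok))
    return triples


doc, summ, mapping = anonymize(
    "Tom Hanks visited Paris . Fans in Paris cheered Tom Hanks .",
    "Tom Hanks greets fans in Paris",
    ["Tom Hanks", "Paris"],
    random.Random(0),
)
triples = make_cloze_triples(doc, summ)
```

Because the markers are permuted per example, a model cannot memorise facts about a particular name and has to rely on the context document to recover the answer.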

Cloze style questions

Cloze style questions are fill-in-the-blank questions. Words may be deleted from the text in question either mechanically (every nth word) or selectively, depending on exactly which aspect the test is intended to cover. The methodology is the subject of an extensive academic literature; nonetheless, teachers commonly devise ad hoc tests.

A language teacher may give the following passage to students:

Today, I went to the ________ and bought some milk and eggs. I knew it was going to rain, but I forgot to take my ________, and ended up getting wet on the way.
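
The mechanical variant (deleting every nth word) is easy to sketch in a few lines. The function name and the blank string below are illustrative choices; selective deletion, as in the passage above, would need a human or a model to choose which words to remove.

```python
def cloze_every_nth(text, n, blank="________"):
    """Blank out every nth word and return the gapped text plus the answers."""
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])  # remember the deleted word as the answer
        words[i] = blank
    return " ".join(words), answers


gapped, answers = cloze_every_nth("a b c d e f", 3)
# gapped  -> "a b ________ d e ________"
# answers -> ["c", "f"]
```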


Baseline Models

  • Majority Baseline
    • Picks the most frequently observed entity in the context document.
  • Exclusive Majority
    • Picks the most frequently observed entity in the context document which is not observed in the query.
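
Both baselines reduce to counting entity markers. The sketch below assumes documents and queries are already tokenised and anonymised so that entities look like `@entity7`; that marker convention and the function names are assumptions made for this example.

```python
from collections import Counter


def is_entity(token):
    """Entity markers are assumed to look like '@entity12'."""
    return token.startswith("@entity")


def majority(document_tokens):
    """Most frequent entity in the context document."""
    counts = Counter(t for t in document_tokens if is_entity(t))
    return counts.most_common(1)[0][0] if counts else None


def exclusive_majority(document_tokens, query_tokens):
    """Most frequent document entity that does not appear in the query."""
    in_query = set(query_tokens)
    counts = Counter(
        t for t in document_tokens if is_entity(t) and t not in in_query
    )
    return counts.most_common(1)[0][0] if counts else None


document = "@entity1 met @entity2 . @entity1 thanked @entity2 and @entity1".split()
query = "@placeholder met @entity1".split()
```

The exclusive variant encodes the intuition that the placeholder is unlikely to be an entity already named in the query.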

Symbolic Matching Models

  • Frame-Semantic Parsing
    • Parse the sentence to find predicates to answer questions like “who did what to whom”.
    • Extract entity-predicate triples (e1, V, e2) from query q and context document d.
    • Resolve queries using rules such as exact match, matching entity, etc.
  • Word Distance Benchmark
    • Align placeholder of Cloze form questions with each possible entity in the context document and calculate the distance between the question and the context around the aligned entity.
    • Sum the distance of every word in q to their nearest aligned word in d
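
One simplified reading of the word distance benchmark is sketched below. It aligns the placeholder with each candidate entity position, then, for every other query word, measures how far the nearest matching document word is from where the alignment says it should be. The per-word cap `max_penalty`, the exact-match alignment (no coreference), and the function names are assumptions of this sketch rather than details taken from the paper.

```python
def word_distance_score(doc, query, entity_pos,
                        placeholder="@placeholder", max_penalty=8):
    """Sum of capped distances from each query word to its nearest match in doc."""
    ph = query.index(placeholder)
    total = 0
    for qi, word in enumerate(query):
        if qi == ph:
            continue
        expected = entity_pos + (qi - ph)  # where the word would sit if aligned
        dists = [abs(di - expected) for di, dw in enumerate(doc) if dw == word]
        total += min(min(dists), max_penalty) if dists else max_penalty
    return total


def word_distance_answer(doc, query):
    """Pick the entity whose surrounding context is closest to the query."""
    candidates = [(i, t) for i, t in enumerate(doc) if t.startswith("@entity")]
    best_pos, best_entity = min(
        candidates, key=lambda c: word_distance_score(doc, query, c[0])
    )
    return best_entity


doc = "@entity1 was born in @entity2 . later @entity3 moved to @entity4".split()
query = "@entity3 moved to @placeholder".split()
```

A lower score means the query words sit closer to the candidate entity, so the benchmark benefits directly from lexical overlap between query and document.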

Figure: 2 Layer Deep LSTM Reader

Neural Network Models

  • Deep LSTM Reader
    • Test the ability of Deep LSTM encoders to handle significantly longer sequences.
    • Feed the document query pair as a single large document, one word at a time.
    • Use Deep LSTM cell with skip connections from input to hidden layers and hidden layer to output.
  • Attentive Reader
    • Employ attention model to overcome the bottleneck of fixed width hidden vector.
    • Encode the document and the query using separate bidirectional single layer LSTM.
    • Query encoding is obtained by concatenating the final forward and backwards outputs.
    • Document encoding is obtained by a weighted sum of output vectors (obtained by concatenating the forward and backwards outputs).
    • The weights can be interpreted as the degree to which the network attends to a particular token in the document.
    • Model completed by defining a non-linear combination of document and query embedding.
  • Impatient Reader
    • As an add-on to the attentive reader, the model can re-read the document as each query token is read.
    • Model accumulates the information from the document as each query token is seen and finally outputs a joint document query representation in the form of a non-linear combination of document embedding and query embedding.
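
The attention step of the Attentive Reader can be sketched in plain Python as below. The bidirectional LSTM encoders are not implemented: `doc_outputs` and `query_vec` stand in for their outputs, all weights are random, and the parameter names (`W_ym`, `W_um`, `w_ms`, `W_rg`, `W_ug`) are illustrative labels for this sketch rather than a faithful copy of any implementation.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_read(doc_outputs, query_vec, params):
    """Return attention weights, document encoding r and joint embedding g."""
    q_proj = matvec(params["W_um"], query_vec)
    scores = []
    for y in doc_outputs:
        # Score each document token against the query encoding.
        m = [math.tanh(a + b) for a, b in zip(matvec(params["W_ym"], y), q_proj)]
        scores.append(sum(w * mi for w, mi in zip(params["w_ms"], m)))
    weights = softmax(scores)  # how strongly each token is attended to
    dim = len(doc_outputs[0])
    # Document encoding: attention-weighted sum of token output vectors.
    r = [sum(wt * y[k] for wt, y in zip(weights, doc_outputs)) for k in range(dim)]
    # Joint embedding: non-linear combination of document and query encodings.
    g = [math.tanh(a + b) for a, b in
         zip(matvec(params["W_rg"], r), matvec(params["W_ug"], query_vec))]
    return weights, r, g


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


dim, hidden, n_tokens = 4, 3, 5
params = {
    "W_ym": rand_matrix(hidden, dim),
    "W_um": rand_matrix(hidden, dim),
    "w_ms": [rng.uniform(-1, 1) for _ in range(hidden)],
    "W_rg": rand_matrix(hidden, dim),
    "W_ug": rand_matrix(hidden, dim),
}
doc_outputs = rand_matrix(n_tokens, dim)  # stand-in for BiLSTM token outputs
query_vec = [rng.uniform(-1, 1) for _ in range(dim)]  # stand-in query encoding
weights, r, g = attentive_read(doc_outputs, query_vec, params)
```

The Impatient Reader would, roughly speaking, repeat the attention step once per query token and carry the document encoding forward between steps; that recurrence is left out here.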


  • Attentive and Impatient Readers outperform all other models, highlighting the benefits of attention modelling.
  • The Frame-Semantic pipeline does not scale to cases where several methods are needed to answer a query.
  • Moreover, it provides poor coverage, as many relations do not adhere to the default predicate-argument structure.
  • The Word Distance approach outperformed the Frame-Semantic approach because there was significant lexical overlap between the query and the document.
  • The paper also includes heat maps over the context documents to visualise the attention mechanism.



Implementation of Teaching Machines to Read and Comprehend code on GitHub:


This repository contains an implementation of the two models (the Deep LSTM and the Attentive Reader) described in Teaching Machines to Read and Comprehend by Karl Moritz Hermann et al., NIPS, 2015. This repository also contains an implementation of a Deep Bidirectional LSTM.

The three models implemented in this repository are:

  • deepmind_deep_lstm reproduces the experimental settings of the DeepMind paper for the LSTM reader
  • deepmind_attentive_reader reproduces the experimental settings of the DeepMind paper for the Attentive reader
  • deep_bidir_lstm_2x128 implements a two-layer bidirectional LSTM reader

Our results

We trained the three models for 2 to 4 days on a Titan Black GPU.


Thomas Mesnard

Alex Auvolat

Étienne Simon


We would like to thank the developers of Theano, Blocks and Fuel at MILA for their excellent work.

We thank Simon Lacoste-Julien from SIERRA team at INRIA, for providing us access to two Titan Black GPUs.

Theano implementation of Deep LSTM Reader & Attentive Reader from Google DeepMind’s paper Teaching Machines to Read and Comprehend – Hermann et al. (2015):



  • Python 2.7
  • Numpy
  • Theano
  • Scikit-learn (for computing F1 score)

Acknowledgment: This code uses a portion of Data reading interface written by Danqi Chen.


Learning to Learn

In the 2016 paper Learning to learn by gradient descent by gradient descent, authors from Google DeepMind, the University of Oxford, and the Canadian Institute for Advanced Research cast the design of an optimizer as a learning problem. The accompanying code was released as Learning to Learn in TensorFlow:

Learning to Learn Abstract

The move from hand-designed features to learned features in machine learning has been wildly successful. In spite of this, optimization algorithms are still designed by hand. In this paper we show how the design of an optimization algorithm can be cast as a learning problem, allowing the algorithm to learn to exploit structure in the problems of interest in an automatic way. Our learned algorithms, implemented by LSTMs, outperform generic, hand-designed competitors on the tasks for which they are trained, and also generalize well to new tasks with similar structure. We demonstrate this on a number of tasks, including simple convex problems, training neural networks, and styling images with neural art.
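
The idea of treating the optimizer as something with its own parameters can be illustrated with a toy sketch. Here the update rule is a simple momentum step with parameters `phi = (lr, beta)`, used purely as a stand-in for the LSTM mentioned in the abstract, and the summed loss along the unrolled trajectory plays the role of the quantity a meta-learner would try to reduce. All names are hypothetical and nothing here comes from the released code.

```python
def momentum_rule(grad, state, phi):
    """Stand-in update rule; a learned optimizer would replace this function."""
    lr, beta = phi
    velocity = beta * state - lr * grad
    return velocity, velocity  # (parameter update, new optimizer state)


def unrolled_loss(loss_fn, grad_fn, theta, phi, rule, num_steps):
    """Run `rule` for num_steps and sum the losses seen along the way."""
    state, total = 0.0, 0.0
    for _ in range(num_steps):
        update, state = rule(grad_fn(theta), state, phi)
        theta = theta + update
        total += loss_fn(theta)
    return total, theta


def loss_fn(theta):
    return theta * theta  # one-variable quadratic


def grad_fn(theta):
    return 2.0 * theta


# Two settings of the optimizer parameters on the same problem.
slow_total, slow_theta = unrolled_loss(loss_fn, grad_fn, 1.0, (0.1, 0.0), momentum_rule, 5)
fast_total, fast_theta = unrolled_loss(loss_fn, grad_fn, 1.0, (0.3, 0.0), momentum_rule, 5)
```

Comparing `slow_total` and `fast_total` shows that the summed loss depends on `phi`, which is what makes it possible to improve the optimizer itself by gradient descent on that quantity.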



python train.py --problem=mnist --save_path=./mnist

Command-line flags:

  • save_path: If present, the optimizer will be saved to the specified path every time the evaluation performance is improved.
  • num_epochs: Number of training epochs.
  • log_period: Epochs before mean performance and time is reported.
  • evaluation_period: Epochs before the optimizer is evaluated.
  • evaluation_epochs: Number of evaluation epochs.
  • problem: Problem to train on. See Problems section below.
  • num_steps: Number of optimization steps.
  • unroll_length: Number of unroll steps for the optimizer.
  • learning_rate: Learning rate.
  • second_derivatives: If true, the optimizer will try to compute second derivatives through the loss function specified by the problem.


python evaluate.py --problem=mnist --optimizer=L2L --path=./mnist

Command-line flags:

  • optimizer: Adam or L2L.
  • path: Path to saved optimizer, only relevant if using the L2L optimizer.
  • learning_rate: Learning rate, only relevant if using Adam optimizer.
  • num_epochs: Number of evaluation epochs.
  • seed: Seed for random number generation.
  • problem: Problem to evaluate on. See Problems section below.
  • num_steps: Number of optimization steps.


The training and evaluation scripts support the following problems (see util.py for more details):

  • simple: One-variable quadratic function.
  • simple-multi: Two-variable quadratic function, where one of the variables is optimized using a learned optimizer and the other one using Adam.
  • quadratic: Batched ten-variable quadratic function.
  • mnist: MNIST classification using a two-layer fully connected network.
  • cifar: CIFAR-10 classification using a convolutional neural network.
  • cifar-multi: CIFAR-10 classification using a convolutional neural network, where two independent learned optimizers are used: one to optimize parameters from convolutional layers and the other for parameters from fully connected layers.

New problems can be implemented very easily. You can see in train.py that the meta_minimize method from the MetaOptimizer class is given a function that returns the TensorFlow operation that generates the loss function we want to minimize (see problems.py for an example).

All operations with Python side effects (e.g. queue creation) must be done outside of the function passed to meta_minimize. The cifar10 function in problems.py is a good example of a loss function that uses TensorFlow queues.

Disclaimer: This is not an official Google product.
