This post is the first in a series about im2latex: its goal is to cover the concepts of Sequence-to-Sequence models with Attention and Beam search.

If you’re already familiar with Seq2Seq and want to go straight to the Tensorflow code

> Go to part II


In my last post about named entity recognition, I explained how to predict a tag for a word, which can be considered as a relatively simple task. However, some tasks like translation require more complicated systems. You may have heard from some recent breakthroughs in Neural Machine Translation that led to (almost) human-level performance systems (used in real-life by Google Translation, see for instance this paper enabling zero-shot translation). These new architectures rely on a common paradigm called encoder-decoder (or sequence to sequence), whose goal is to produce an entire sequence of tokens.

In this post, we’ll assume basic knowledge about Deep Learning (Convolutions, LSTMs, etc.). For readers new to Computer Vision and Natural Language Processing, have a look at the famous Stanford classes cs231n and cs224n.

Sequence to Sequence basics

Let’s explain the sequence to sequence framework as we’ll rely on it for our model. Let’s start with the simplest version on the translation task.

As an example, let’s translate how are you in French comment vas tu.

Vanilla Seq2Seq

The Seq2Seq framework relies on the encoder-decoder paradigm. The encoder encodes the input sequence, while the decoder produces the target sequence


Our input sequence is how are you. Each word from the input sequence is associated to a vector $ w \in \mathbb{R}^d $ (via a lookup table). In our case, we have 3 words, thus our input will be transformed into $ [w_0, w_1, w_2] \in \mathbb{R}^{d \times 3} $. Then, we simply run an LSTM over this sequence of vectors and store the last hidden state outputed by the LSTM: this will be our encoder representation $ e $. Let’s write the hidden states $ [e_0, e_1, e_2] $ (and thus $ e = e_2 $)

Vanilla Encoder
Vanilla Encoder


Now that we have a vector $ e $ that captures the meaning of the input sequence, we’ll use it to generate the target sequence word by word. Feed to another LSTM cell: $ e $ as hidden state and a special start of sentence vector $ w_{sos} $ as input. The LSTM computes the next hidden state $ h_0 \in \mathbb{R}^h $. Then, we apply some function $ g : \mathbb{R}^h \mapsto \mathbb{R}^V $ so that $ s_0 := g(h_0) \in \mathbb{R}^V $ is a vector of the same size as the vocabulary.

Then, apply a softmax to $ s_0 $ to normalize it into a vector of probabilities $ p_0 \in \mathbb{R}^V $ . Now, each entry of $ p_0 $ will measure how likely is each word in the vocabulary. Let’s say that the word “comment” has the highest probability (and thus $ i_0 = \operatorname{argmax}(p_0) $ corresponds to the index of “comment”). Get a corresponding vector $ w_{i_0} = w_{comment} $ and repeat the procedure: the LSTM will take $ h_0 $ as hidden state and $ w_{comment} $ as input and will output a probability vector $ p_1 $ over the second word, etc.

The decoding stops when the predicted word is a special end of sentence token.

Vanilla Decoder
Vanilla Decoder

Intuitively, the hidden vector represents the “amount of meaning” that has not been decoded yet.

The above method aims at modelling the distribution of the next word conditionned on the beginning of the sentence

by writing

Seq2Seq with Attention

The previous model has been refined over the past few years and greatly benefited from what is known as attention. Attention is a mechanism that forces the model to learn to focus (=to attend) on specific parts of the input sequence when decoding, instead of relying only on the hidden vector of the decoder’s LSTM. One way of performing attention is explained by Bahdanau et al.. We slightly modify the reccurrence formula that we defined above by adding a new vector $ c_t $ to the input of the LSTM

The vector $ c_t $ is the attention (or context) vector. We compute a new context vector at each decoding step. First, with a function $ f (h_{t-1}, e_{t’}) \mapsto \alpha_{t’} \in \mathbb{R} $, compute a score for each hidden state $ e_{t’} $ of the encoder. Then, normalize the sequence of $ \alpha_{t’} $ using a softmax and compute $ c_t $ as the weighted average of the $ e_{t’} $.

Attention Mechanism
Attention Mechanism

The choice of the function $ f $ varies, but is usually one of the following

It turns out that the attention weighs $ \bar{\alpha} $ can be easily interpreted. When generating the word vas (corresponding to are in English), we expect $ \bar{\alpha} _ {\text{are}} $ to be close to $ 1 $ while $ \bar{\alpha} _ {\text{how}} $ and $ \bar{\alpha} _ {\text{you}} $ to be close to $ 0 $. Intuitively, the context vector $ c $ will be roughly equal to the hidden vector of are and it will help to generate the French word vas.

By putting the attention weights into a matrix (rows = input sequence, columns = output sequence), we would have access to the alignment between the words from the English and French sentences… (see page 6) There is still a lot of things to say about sequence to sequence models (for instance, it works better if the encoder processes the input sequence backwards…).


What happens if the first time step is not sure about wether it should generate comment or vas (most likely case at the beginning of the training)? Then it would mess up the entire sequence, and the model will hardly learn anything…

If we use the predicted token as input to the next step during training (as explained above), errors would accumulate and the model would rarely be exposed to the correct distribution of inputs, making training slow or impossible. To speedup things, one trick is to feed the actual output sequence (<sos> comment vas tu) into the decoder’s LSTM and predict the next token at every position (comment vas tu <eos>).


The decoder outputs vectors of probability over the vocabulary $ p_i \in \mathbb{R}^V $ for each time step. Then, for a given target sequence $ y_1, \dots, y_n $, we can compute its probability as the product of the probabilities of each token being produced at each relevant time step:

where $ p_i [y_i] $ means that we extract the $ y_i $-th entry of the probability vector $ p_i $ from the $i$-th decoding step. In particular, we can compute the probability of the actual target sequence. A perfect system would give a probabilty of 1 to this target sequence, so we are going to train our network to maximize the probability of the target sequence, which is the same as minimizing

in our example, this is equal to

and you recognize the standard cross entropy: we actually are minimizing the cross entropy between the target distribution (all one-hot vectors) and the predicted distribution outputed by our model (our vectors $ p_i $).


The main takeaway from the discussion above is that for the same model, we can define different behaviors. In particular, we defined a specific behavior that speeds up training.

What about inference/testing time then? Is there an other way to decode a sentence?

There indeed are 2 main ways of performing decoding at testing time (translating a sentence for which we don’t have a translation). The first of these methods is the one covered at the beginning of the article: greedy decoding. It is the most natural way and it consists in feeding to the next step the most likely word predicted at the previous step.

Greedy Decoder - feeds the best token to the next step
Greedy Decoder - feeds the best token to the next step

But didn’t we say that this behavior is likely to accumulate errors?

Even after having trained the model, it can happen that the model makes a small error (and gives a small advantage to vas over comment for the first step of the decoding). This would mess up the entire decoding…

There is a better way of performing decoding, called Beam Search. Instead of only predicting the token with the best score, we keep track of $ k $ hypotheses (for example $ k = 5 $, we refer to $ k $ as the beam size). At each new time step, for these 5 hypotheses we have $ V $ new possible tokens. It makes a total of $ 5 V $ new hypotheses. Then, only keep the $ 5 $ best ones, and so on… Formally, define the set of hypotheses decoded at time step .

For instance if , one possible would be

Now we consider all the possible candidates , produced from by adding all possible new tokens

and keep the highest scores (probability of the sequence). If we keep our example

and for instance we can imagine that the 2 best ones would be

Once every hypothesis reached the <eos> token, we return the hypothesis with the highest score.

If we use beam search, a small error at the first step might be rectified at the next step, as we keep the gold hypthesis in the beam!


In this article we covered the seq2seq concepts. We showed that training is different than decoding. We covered two methods for decoding: greedy and beam search. While beam search generally achieves better results, it is not perfect and still suffers from exposure bias. During training, the model is never exposed to its errors! It also suffers from Loss-Evaluation Mismatch. The model is optimized w.r.t. token-level cross entropy, while we are interested about the reconstruction of the whole sentence…

Now, let’s apply Seq2Seq for LaTeX generation from images!

> Go to part II

Classic papers about seq2seq

More advanced papers trying to address some limitations