Motivation

For the motivating example of a sequence-to-sequence task (such as machine translation), we have an input sequence $x_{1},x_{2},\ldots,x_{n}$ and an output sequence $y_{1},y_{2},\ldots,y_{m}$ (the two sequences need not have the same length).

One way of performing the machine translation task is to use a recurrent encoder $f_{encoder}$ which takes in the input sequence and generates a summary of the sequence, $c_{n}$. A recurrent decoder then predicts the $i^{th}$ word by taking in $c_n$ as well as the previous prediction $\hat{y}_{i-1}$ to produce $\hat{y}_{i}$.
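As a rough sketch of this baseline (the GRU choice, the dimensions, token id 0 as the start symbol, and greedy decoding are all illustrative assumptions, not part of the description above):

```python
import torch
import torch.nn as nn

# Illustrative sizes; any recurrent cell would do, a GRU is assumed here.
vocab_size, emb_dim, hid_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, emb_dim)
encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
decoder_cell = nn.GRUCell(emb_dim + hid_dim, hid_dim)  # sees [y_{i-1}; c_n]
output_proj = nn.Linear(hid_dim, vocab_size)

x = torch.randint(0, vocab_size, (1, 10))   # input token ids x_1..x_10
_, c_n = encoder(embed(x))                  # c_n summarises the whole input
c_n = c_n.squeeze(0)                        # (1, hid_dim)

s = c_n                                     # initial decoder state
y_prev = torch.zeros(1, dtype=torch.long)   # assumed start-of-sequence id 0
for _ in range(5):                          # greedily decode 5 tokens
    inp = torch.cat([embed(y_prev), c_n], dim=-1)
    s = decoder_cell(inp, s)                # state from c_n and y_{i-1}
    y_prev = output_proj(s).argmax(dim=-1)  # prediction of the i-th word
```

Note that the decoder sees only the fixed summary $c_n$ at every step, which is exactly the bottleneck discussed next.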

An issue with this approach is that $c_n$ summarises the whole input sequence, which creates an information bottleneck. Furthermore, predictions of different words in the output would benefit from focusing on particular parts of the input sequence. Attention is a way for the decoder to look at the entire input sequence (rather than a single summary of it) and decide, at each decoding step, which parts of the input are important for the current prediction.

General attention mechanism

The encoder produces a sequence of hidden states $h_1, h_2, \ldots, h_n$, one per input timestep.

We want a recurrence for the decoder where its hidden state at the $i^{th}$ timestep is based on the hidden state at the previous timestep, the previous prediction, and a context vector:

$s_i = f(s_{i-1}, y_{i-1}, c_i)$
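A minimal sketch of one way this recurrence might be realised ($f$ is left abstract in the text; the GRU cell and the dimensions are assumptions):

```python
import torch
import torch.nn as nn

emb_dim, enc_dim, dec_dim = 32, 64, 64   # assumed dimensions
# One way to realise f: a GRU cell whose input concatenates the
# embedding of y_{i-1} with the context vector c_i.
f = nn.GRUCell(emb_dim + enc_dim, dec_dim)

s_prev = torch.randn(1, dec_dim)         # s_{i-1}
y_emb = torch.randn(1, emb_dim)          # embedding of y_{i-1}
c_i = torch.randn(1, enc_dim)            # context vector (derived below)
s_i = f(torch.cat([y_emb, c_i], dim=-1), s_prev)
```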

How the context vector is obtained

The context vector for the $i^{th}$ step is obtained by examining which parts of the input sequence are most relevant to the current decoding step. This is done by computing a similarity between the decoder hidden state $s_{i-1}$ and each encoder hidden state $h_j$:

$e_{i,j} = a(s_{i-1},h_{j})$

where $a$ is a function that measures the similarity and outputs a scalar value.
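The text leaves $a$ unspecified. A simple dot product $a(s_{i-1}, h_j) = s_{i-1}^{\top} h_j$ works when the dimensions match; the additive (MLP) score of Bahdanau et al. is another standard choice, sketched below:

```python
import torch
import torch.nn as nn

class AdditiveScore(nn.Module):
    """Additive attention score: a(s, h) = v^T tanh(W_s s + W_h h)."""
    def __init__(self, dec_dim, enc_dim, attn_dim):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, h_j):
        # s_prev: (batch, dec_dim), h_j: (batch, enc_dim) -> (batch,) scores
        return self.v(torch.tanh(self.W_s(s_prev) + self.W_h(h_j))).squeeze(-1)
```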

The similarity is obtained for all $n$ timesteps, giving $e_{i,1}, \ldots, e_{i,n}$, and these are then normalized to $\alpha_{i,1}, \ldots, \alpha_{i,n}$ using a softmax:

$$ \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{n}\exp(e_{i,k})} $$
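As a quick numeric check with made-up scores:

```python
import torch

e_i = torch.tensor([2.0, 0.5, -1.0])   # made-up scores e_{i,1}, e_{i,2}, e_{i,3}
alpha_i = torch.softmax(e_i, dim=0)    # ~[0.786, 0.175, 0.039]; sums to 1
```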

The context vector $c_i$ is then obtained as a weighted average of the encoder hidden states, using the normalized scores $\alpha_{i,1}, \ldots, \alpha_{i,n}$ as the weights:

$$ c_i = \sum_{j=1}^{n} \alpha_{i,j} h_j $$
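Putting the pieces together, one decoding step's context vector might be computed as follows (a dot-product score is assumed for brevity, which requires matching encoder and decoder dimensions):

```python
import torch

def context_vector(s_prev, H):
    # s_prev: (dec_dim,) decoder state s_{i-1}
    # H: (n, enc_dim) encoder hidden states h_1..h_n
    e = H @ s_prev                     # e_{i,j} = s_{i-1}^T h_j, shape (n,)
    alpha = torch.softmax(e, dim=0)    # normalized weights alpha_{i,1..n}
    return alpha @ H                   # c_i = sum_j alpha_{i,j} h_j

H = torch.randn(7, 64)                 # n = 7 encoder states
s_prev = torch.randn(64)
c_i = context_vector(s_prev, H)        # (64,) context vector for step i
```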