1. Neural Machine Translation with RNNs
G

Effect of the masks on attention computation:
- The masks ensure that, during attention calculation, any encoder hidden state corresponding to a padding token is assigned an attention score of negative infinity (-inf).
- This means that, after applying the softmax, the attention weights for these positions become zero, effectively preventing the decoder from attending to padded (non-informative) positions in the source sequence.
- As a result, the attention mechanism only distributes probability mass over actual (non-pad) source tokens, making the context vector meaningful.
Why it is necessary:
- Without masking, the attention mechanism could assign nonzero weights to padding positions, which do not contain any real information and would corrupt the context vector.
- Masking ensures that only valid source tokens contribute to the attention output, maintaining the integrity of the translation process (see the sketch below).
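
As a concrete illustration, here is a minimal sketch in PyTorch (with illustrative shapes and variable names such as `dec_hidden`, `enc_hiddens_proj`, and `enc_masks`; not the assignment's exact code) of how setting pad positions to -inf before the softmax zeroes out their attention weights:

```python
# Minimal sketch: masking pad positions before softmax.
# Shapes and names are illustrative assumptions, not the assignment's code.
import torch
import torch.nn.functional as F

batch, src_len, h = 2, 5, 4
dec_hidden = torch.randn(batch, h)                 # decoder state at one time step
enc_hiddens_proj = torch.randn(batch, src_len, h)  # projected encoder hidden states
enc_masks = torch.tensor([[0, 0, 0, 1, 1],         # 1 marks a <pad> position
                          [0, 0, 1, 1, 1]], dtype=torch.bool)

# Raw attention scores e_t: (batch, src_len)
e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)

# Set scores at pad positions to -inf so softmax assigns them zero weight
e_t = e_t.masked_fill(enc_masks, -float('inf'))

alpha_t = F.softmax(e_t, dim=1)   # attention weights; exactly 0 on pad columns
print(alpha_t)
```

Running this prints attention weights whose pad columns are exactly zero, so the context vector is a weighted sum over real source tokens only.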
H

I

i. Dot product attention vs. multiplicative attention
- Advantage: Dot product attention is computationally efficient because it only requires a simple inner product between vectors, without any learned parameters.
- Disadvantage: Dot product attention requires the query and key vectors to have the same dimension and offers no way to learn a transformation between them (which multiplicative attention provides via the weight matrix W), so it can perform poorly when their scales or learned representations differ; see the sketch below.
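
To make the contrast concrete, the following is a minimal sketch (with assumed names q, k, W; not the assignment's code) of the two scoring functions: the dot product needs no parameters but forces the query and key dimensions to match, while the multiplicative form inserts a learned matrix W between them:

```python
# Hedged sketch of dot-product vs. multiplicative attention scores.
# Variable names and dimensions are illustrative assumptions.
import torch

d_q, d_k = 6, 6          # dot-product attention requires d_q == d_k
q = torch.randn(d_q)     # query (e.g., decoder state)
k = torch.randn(d_k)     # key   (e.g., encoder hidden state)

# Dot-product attention: no learned parameters, just an inner product
score_dot = q @ k

# Multiplicative attention: a learned W lets q and k live in different
# spaces (even different dimensions), at the cost of extra parameters
d_q2, d_k2 = 6, 8
q2, k2 = torch.randn(d_q2), torch.randn(d_k2)
W = torch.randn(d_q2, d_k2)       # an nn.Parameter in a real model
score_mult = q2 @ W @ k2
```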
ii. Additive attention vs. multiplicative attention
- Advantage: Additive attention is more flexible and can better model complex relationships between query and key vectors, since it scores them with a small feed-forward network and can handle query and key vectors of different dimensions (a sketch follows below).
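
A minimal sketch of the additive (Bahdanau-style) score, again with assumed parameter names (W1, W2, v) and illustrative dimensions, showing how the small feed-forward net accommodates query and key vectors of different sizes:

```python
# Hedged sketch of additive attention scoring.
# Names and dimensions are illustrative assumptions, not the assignment's code.
import torch

d_q, d_k, d_a = 6, 8, 10            # query, key, and attention dims may all differ
q, k = torch.randn(d_q), torch.randn(d_k)

W1 = torch.randn(d_a, d_q)          # learned projections (nn.Parameter in practice)
W2 = torch.randn(d_a, d_k)
v = torch.randn(d_a)

# score = v^T tanh(W1 q + W2 k): a one-hidden-layer feed-forward net over q and k
score_add = v @ torch.tanh(W1 @ q + W2 @ k)
```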