1. Neural Machine Translation with RNNs
G

Effect of the masks on attention computation:
- The masks ensure that, during attention calculation, any encoder hidden state corresponding to a padding token is assigned an attention score of negative infinity (-inf).
- This means that, after applying the softmax, the attention weights for these positions become zero, effectively preventing the decoder from attending to padded (non-informative) positions in the source sequence.
- As a result, the attention mechanism only distributes probability mass over actual (non-pad) source tokens, making the context vector meaningful.
Why it is necessary:
- Without masking, the attention mechanism could assign nonzero weights to padding positions, which do not contain any real information and would corrupt the context vector.
- Masking ensures that only valid source tokens contribute to the attention output, maintaining the integrity of the translation process (see the sketch below).
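
As a concrete illustration, here is a minimal sketch in PyTorch (with illustrative shapes and variable names such as `dec_hidden`, `enc_hiddens_proj`, and `enc_masks`; not the assignment's exact code) of how setting pad positions to -inf before the softmax zeroes out their attention weights:

```python
# Minimal sketch: masking pad positions before softmax.
# Shapes and names are illustrative assumptions, not the assignment's code.
import torch
import torch.nn.functional as F

batch, src_len, h = 2, 5, 4
dec_hidden = torch.randn(batch, h)                 # decoder state at one time step
enc_hiddens_proj = torch.randn(batch, src_len, h)  # projected encoder hidden states
enc_masks = torch.tensor([[0, 0, 0, 1, 1],         # 1 marks a <pad> position
                          [0, 0, 1, 1, 1]], dtype=torch.bool)

# Raw attention scores e_t: (batch, src_len)
e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)

# Set scores at pad positions to -inf so softmax assigns them zero weight
e_t = e_t.masked_fill(enc_masks, -float('inf'))

alpha_t = F.softmax(e_t, dim=1)   # attention weights; exactly 0 on pad columns
print(alpha_t)
```

Running this prints attention weights whose pad columns are exactly zero, so the context vector is a weighted sum over real source tokens only.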
H

I

i. Dot product attention vs. multiplicative attention
- Advantage: Dot product attention is computationally efficient because it only requires a simple inner product between vectors, without any learned parameters.
- Disadvantage: Dot product attention requires the query and key vectors to have the same dimension and offers no way to learn a transformation between them (which multiplicative attention provides via the weight matrix W), so it can perform poorly when their scales or learned representations differ; see the sketch below.
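
To make the contrast concrete, the following is a minimal sketch (with assumed names q, k, W; not the assignment's code) of the two scoring functions: the dot product needs no parameters but forces the query and key dimensions to match, while the multiplicative form inserts a learned matrix W between them:

```python
# Hedged sketch of dot-product vs. multiplicative attention scores.
# Variable names and dimensions are illustrative assumptions.
import torch

d_q, d_k = 6, 6          # dot-product attention requires d_q == d_k
q = torch.randn(d_q)     # query (e.g., decoder state)
k = torch.randn(d_k)     # key   (e.g., encoder hidden state)

# Dot-product attention: no learned parameters, just an inner product
score_dot = q @ k

# Multiplicative attention: a learned W lets q and k live in different
# spaces (even different dimensions), at the cost of extra parameters
d_q2, d_k2 = 6, 8
q2, k2 = torch.randn(d_q2), torch.randn(d_k2)
W = torch.randn(d_q2, d_k2)       # an nn.Parameter in a real model
score_mult = q2 @ W @ k2
```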
ii. Additive attention vs. multiplicative attention
- Advantage: Additive attention is more flexible and can better model complex relationships between query and key vectors, since it scores them with a small feed-forward network and can handle query and key vectors of different dimensions (a sketch follows below).
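
A minimal sketch of the additive (Bahdanau-style) score, again with assumed parameter names (W1, W2, v) and illustrative dimensions, showing how the small feed-forward net accommodates query and key vectors of different sizes:

```python
# Hedged sketch of additive attention scoring.
# Names and dimensions are illustrative assumptions, not the assignment's code.
import torch

d_q, d_k, d_a = 6, 8, 10            # query, key, and attention dims may all differ
q, k = torch.randn(d_q), torch.randn(d_k)

W1 = torch.randn(d_a, d_q)          # learned projections (nn.Parameter in practice)
W2 = torch.randn(d_a, d_k)
v = torch.randn(d_a)

# score = v^T tanh(W1 q + W2 k): a one-hidden-layer feed-forward net over q and k
score_add = v @ torch.tanh(W1 @ q + W2 @ k)
```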