Assignment: a2.pdf

Part 1
B


C

- L2 normalization takes away useful information for the downstream task when the magnitude (length) of the word vectors encodes information important for classification. For example, if two word vectors $\mathbf{u}_x$ and $\mathbf{u}_y$ point in the same direction but have different lengths (i.e., $\mathbf{u}_x = \alpha \mathbf{u}_y$ for some $\alpha > 0$), then after normalization both become the same unit vector, and their original difference in magnitude is lost. Any information carried by the vector norms is discarded, which can change the classification result if the magnitude of the sum of the raw vectors matters for determining the sign of the prediction (see the sketch after this list).
- L2 normalization does not take away useful information if only the direction of the word vectors matters for the downstream task, i.e., if the classification depends only on the directions and not on the magnitudes of the vectors. In this case, normalization preserves all the relevant information.
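A minimal NumPy sketch of the first point: two vectors that differ only in scale become indistinguishable after L2 normalization, so any signal carried by their norms is lost (the specific vectors and the scale factor here are illustrative, not taken from the assignment).

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

u_y = np.array([0.3, -1.2, 0.5])
u_x = 4.0 * u_y  # same direction, different magnitude (alpha = 4)

print(np.linalg.norm(u_x), np.linalg.norm(u_y))           # norms differ before normalization
print(np.allclose(l2_normalize(u_x), l2_normalize(u_y)))  # True: the magnitude difference is gone
```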

D

E


Part 2
A
In Adam, $m$ is a moving average of the gradients (i.e., "momentum"). It smooths out the noise in the gradients, so that the direction of each parameter update is driven mainly by the average gradient over the recent past rather than by the random fluctuation of a single minibatch. This prevents the update direction from changing frequently and drastically (i.e., "jitter"), making learning more stable. Lower-variance updates help the model converge faster and reduce the risk of getting stuck in local minima or saddle points.
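A minimal sketch of the first-moment update described above, following the standard Adam formulation $m \leftarrow \beta_1 m + (1 - \beta_1)\,\nabla_\theta J$; the variable names and $\beta_1 = 0.9$ are the conventional defaults, assumed here rather than taken from the assignment.

```python
import numpy as np

def update_first_moment(m: np.ndarray, grad: np.ndarray, beta1: float = 0.9) -> np.ndarray:
    """Exponential moving average of gradients: the momentum term m in Adam."""
    return beta1 * m + (1.0 - beta1) * grad

# Simulate noisy minibatch gradients around a true gradient of +1.0:
rng = np.random.default_rng(0)
m = np.zeros(1)
for _ in range(100):
    grad = 1.0 + rng.normal(scale=2.0, size=1)  # high-variance single-batch gradient
    m = update_first_moment(m, grad)

print(m)  # close to 1.0: the noise is averaged out, so the update direction stays stable
```

Because each minibatch gradient contributes only a $(1 - \beta_1)$ fraction to $m$, a single noisy batch can shift the update direction only slightly, which is exactly the low-variance behavior the answer above describes.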