RNNs have several drawbacks:
Linear interaction distance: distant word pairs can only interact through O(sequence length) recurrent steps.
Lack of parallelizability: the forward and backward passes require O(sequence length) sequential operations that cannot be parallelized.
An intuitive picture of attention: a soft, averaging lookup table. The query matches every key to some degree, and the output is a weighted average of all values rather than the single best-matching one.
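A minimal sketch of this "soft lookup" view (the array sizes and random data are just illustrative assumptions):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

d = 4                                        # key/query/value dimension (illustrative)
keys   = np.random.randn(5, d)               # 5 stored keys
values = np.random.randn(5, d)               # 5 stored values
query  = np.random.randn(d)                  # one lookup query

scores  = keys @ query                       # similarity of the query to every key
weights = softmax(scores)                    # soft "which entry matches" distribution
output  = weights @ values                   # weighted average of all values
```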
Self-attention: keys, queries, and values all come from the same sequence.
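A sketch of single-head self-attention, assuming learned projection matrices W_q, W_k, W_v (the names and dimensions here are illustrative). All three are computed from the same input X:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

T, d = 6, 8                                   # sequence length, model dimension
X = np.random.randn(T, d)                     # one input sequence (e.g. word embeddings)
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values from the same X
scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product scores, shape (T, T)
A = softmax(scores)                           # each row: attention over all positions
output = A @ V                                # shape (T, d)
```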
The first problem with self-attention: it is order-invariant and has no notion of sequence order. We introduce position representations (two options below).
Position representation vectors from sinusoids (sketched below, together with the learned option).
Position representation vectors learned from scratch
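A minimal sketch of both options (the sinusoid formula follows the standard sin/cos construction; the "learned" matrix here is just random data standing in for a trainable parameter):

```python
import numpy as np

def sinusoidal_positions(T, d):
    """Fixed sinusoidal position vectors: sin/cos pairs at different frequencies."""
    pos = np.arange(T)[:, None]                       # (T, 1) positions
    i = np.arange(d // 2)[None, :]                    # (1, d/2) frequency indices
    angles = pos / (10000 ** (2 * i / d))             # (T, d/2)
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)                       # even dimensions: sin
    P[:, 1::2] = np.cos(angles)                       # odd dimensions: cos
    return P

T, d = 6, 8
X = np.random.randn(T, d)                             # word embeddings for one sequence

# Option 1: fixed sinusoids, no parameters to learn
X_sin = X + sinusoidal_positions(T, d)

# Option 2: positions learned from scratch; in practice a trainable embedding matrix
P_learned = np.random.randn(T, d) * 0.02              # stand-in for a learned matrix
X_learned = X + P_learned
```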
The second problem with self-attention: it has no elementwise nonlinearities; stacking it just re-averages value vectors. Fix: apply a position-wise feed-forward network to each output (sketched below).
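A sketch of that fix, assuming a two-layer ReLU feed-forward network applied independently at every position (the hidden size d_ff is an illustrative choice):

```python
import numpy as np

T, d, d_ff = 6, 8, 32
W1, b1 = np.random.randn(d, d_ff), np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d), np.zeros(d)

attn_output = np.random.randn(T, d)           # stands in for the self-attention output

hidden = np.maximum(0, attn_output @ W1 + b1) # ReLU adds the nonlinearity
ffn_output = hidden @ W2 + b2                 # back to model dimension, shape (T, d)
```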
The third problem with self-attention: when predicting a sequence, a position must not look at future tokens. Fix: mask out the future in the attention scores (sketched below).
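A sketch of causal masking, reusing the same Q, K, V setup as above: scores for positions j > i are set to minus infinity before the softmax, so position i attends only to positions ≤ i.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

T, d = 6, 8
Q, K, V = (np.random.randn(T, d) for _ in range(3))

scores = Q @ K.T / np.sqrt(d)                       # (T, T) raw attention scores
mask = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal = future
scores = np.where(mask, -np.inf, scores)            # block attention to future positions
A = softmax(scores)                                 # each row sums to 1 over positions <= i
output = A @ V
```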
Summary: