Rotating The Way We View Position Embeddings
Written by Shirley Wang. A discussion of the paper titled “RoFormer: Enhanced Transformer with Rotary Position Embedding”.
If you have taken any linear algebra class, you may remember this familiar formula: the dot product of two vectors q and k can be calculated as the product of their norms multiplied by the cosine of the angle between them. This concept is the basis of one of the coolest new innovations in position embedding: Rotary Position Embeddings, named after how the positions can be represented as rotations.
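In symbols, that familiar formula for two vectors q and k with angle θ between them is:

q \cdot k = \|q\| \, \|k\| \cos\theta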
Position embeddings are used in transformers to give the model a sense of word position. I'll give a brief overview of transformers and attention for those unfamiliar with the concepts.
Transformers have been one of the hottest models of the past few years, popularized by the original paper 'Attention Is All You Need'. They were developed initially for NLP but are quickly spreading to many other fields as well. They make use of many attention layers, in contrast to the recurrent neural networks that originally dominated NLP. While recurrent networks have to look at each word one by one, and as a result have issues with long-term memory, transformers let the entire sentence look at itself simultaneously to determine what each word should “pay attention to” within the sentence. The exact formula for how attention is calculated is given below, where the matrices Q, K, and V are created from the word tokens, and d_k is the number of dimensions.
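That formula, the scaled dot-product attention from 'Attention Is All You Need', is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V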
This idea of “attention”, where each word token looks at the entire phrase and determines for itself which words are the most important for it to pay attention to, has been shown to be extremely powerful. One component of making the calculated attention useful is ensuring that the tokens carry a notion of position. Thinking at a high level, the first word and last word of a paragraph are usually quite unrelated, while two words that are next to each other are probably very connected in their meaning. This is where position embeddings come into play.
There are two main types of position embeddings: absolute and relative. Absolute position embeddings encode the absolute position of a word in the input phrase: the first word has position 1, the 50th word has position 50. Relative position embeddings encode the relative position two words have to each other, so the relative position between words 7 and 10 in a phrase would be 3. Usually these are incorporated into the transformer by adding them to the word embeddings, or concatenating them to the word embeddings, before everything is sent through the transformer.
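As a rough sketch of the “add to the word embeddings” approach (my own toy example, not code from any of the papers; the helper name sinusoidal_position_embeddings and the toy sizes are made up for illustration), here are the sinusoidal absolute position embeddings from 'Attention Is All You Need' added to a sentence's word embeddings:

import numpy as np

def sinusoidal_position_embeddings(seq_len, dim):
    # Sinusoidal absolute position embeddings: even dimensions use sine,
    # odd dimensions use cosine, with geometrically increasing wavelengths.
    positions = np.arange(seq_len)[:, None]                 # (seq_len, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    angles = positions * freqs                              # (seq_len, dim/2)
    emb = np.zeros((seq_len, dim))
    emb[:, 0::2] = np.sin(angles)
    emb[:, 1::2] = np.cos(angles)
    return emb

# Toy word embeddings for a 10-token sentence with embedding dimension 16.
word_emb = np.random.randn(10, 16)
# Absolute positions are injected by simple addition before the transformer layers.
x = word_emb + sinusoidal_position_embeddings(10, 16)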
The motivation for rotary position embeddings is simple: for vectors q and k at positions m and n, we would like the inner product of the two vectors to depend only on q, k, and their relative distance m − n. Sparing you the entire derivation process, the position embedding that fits this criterion is a rotation matrix whose angle depends on the vector's position, and this rotation matrix is then applied to the original vector by matrix multiplication instead of addition.
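In the simplest two-dimensional case from the paper, a vector q at position m is multiplied by a rotation matrix whose angle is m times a fixed frequency θ:

f(q, m) = R_m q, \qquad R_m = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}

In higher dimensions the paper pairs up the coordinates and applies a 2×2 rotation like this to each pair, with a different frequency θ_i per pair.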
In the original attention, the matrix multiplication between the query and key matrices only involves the weight matrices W and the input embeddings x. With rotary embeddings, the rotation matrices R are multiplied into both the query and the key, and the two rotations combine into one once the query is transposed and multiplied with the key.
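Written out, with query q_m = R_m W_q x_m and key k_n = R_n W_k x_n, the attention score becomes:

q_m^{\top} k_n = (R_m W_q x_m)^{\top} (R_n W_k x_n) = x_m^{\top} W_q^{\top} R_m^{\top} R_n W_k x_n = x_m^{\top} W_q^{\top} R_{n-m} W_k x_n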
This works because the product of the transpose of one rotation matrix with another is simply the rotation matrix of the difference in their angles (you can confirm this with the trigonometric identities learned in high school), which fulfils our criterion. The rotation is applied directly during the attention calculation, so the relative distance between words is taken into account naturally, instead of relying on position information that was added to the vectors beforehand.
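A minimal numpy sketch of this property in two dimensions (a toy check, not the paper's implementation; the function name rotate and the frequency 0.1 are arbitrary choices):

import numpy as np

def rotate(v, pos, theta=0.1):
    # Rotate a 2D vector by an angle proportional to its position.
    angle = pos * theta
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ v

q = np.array([1.0, 2.0])
k = np.array([0.5, -1.0])

# Score for q at position 3 attending to k at position 7 ...
a = rotate(q, 3) @ rotate(k, 7)
# ... matches the score when both positions are shifted by 10 (13 and 17):
b = rotate(q, 13) @ rotate(k, 17)
print(np.isclose(a, b))  # True: only the relative distance n - m matters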
The model the original paper uses to support these position embeddings is a transformer they call RoFormer. It is simply the WoBERT model, but with rotary position embeddings instead of the absolute embeddings the original model used. The paper claims that RoFormer achieved around 2% higher accuracy than the original on the validation and test sets, from just this change in position embeddings. The authors also report that RoFormer's accuracy holds up well for long sequence lengths. So far, rotary position embeddings appear to match or surpass existing methods for injecting position information into transformers.
Rotary position embeddings first became popular through some Chinese NLP blog posts, and with the paper published in English as of April 2021, I personally feel they will become a bigger part of transformer architectures in the future. This simple concept of a rotation, used in a creative way, captures positional information more effectively than before.
References
[1] Alammar, J. (2018, June 27). The illustrated transformer. Retrieved November 27, 2021, from https://jalammar.github.io/illustrated-transformer/.
[2] Biderman, S., Black, S., Foster, C., Gao, L., Hallahan, E., He, H., Wang, B., & Wang, P. (2021, August 17). Rotary embeddings: A relative revolution. EleutherAI Blog. Retrieved November 27, 2021, from https://blog.eleuther.ai/rotary-embeddings/.
[3] Su, J., Lu, Y., Pan, S., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. ArXiv. Retrieved from https://arxiv.org/abs/2104.09864.
[4] Weng, L. (2018, June 24). Attention? attention! Lil’Log. Retrieved November 27, 2021, from https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.