Chit Thu Shine

Towards Understanding the Transformer Neural Network
Co-author: Khin Radanar Pyae Phyo  

      The Transformer is a kind of deep neural network based on an encoder-decoder architecture. Its primary application area is Natural Language Processing (NLP). These days, Transformer-based methods are an active research topic in many other areas as well, such as computer vision, information retrieval, recommender systems, time-series forecasting and many more. Why has it attracted so much interest in current and future research? Which component or design idea is significant? What does its architecture look like? The intention of this article is to enable those with some technical background to understand how the mechanisms work and what the business use cases are.

      The core component of the Transformer Neural Network is the attention mechanism. Unlike Recurrent Neural Networks (RNNs, such as LSTM and GRU), the Transformer can process a sequence in parallel and handle long-range dependencies. An RNN works like an autoregressive model, which is sequential: each timestep is encoded using the features of the previous timestep. It therefore has limited capacity to memorize long-range dependencies, and it suffers from the vanishing/exploding gradient problem when the sequence to look back over is long. For that reason, many researchers have tried to solve this problem with convolution-based methods. Although convolution-based methods can improve the handling of long-range dependencies, they come at a significant cost in the number of parameters, which affects computational performance.


Attention Mechanism

      The Transformer Neural Network with its self-attention mechanism was introduced in 2017 by a team at Google in the widely cited paper Attention Is All You Need. The attention mechanism determines which parts of the input the model should focus on when generating each part of the output.

      To illustrate this with an example, let’s look at the machine translation challenge. Consider the following sentences and their French translations:


      In these sentences, the word that the pronoun ‘it’ refers to is not the same. It refers to `animal` in the first sentence and to `street` in the second. To handle this case, a mechanism is needed to learn which words deserve more weight. The following figure shows how `it` attends to each word of the sentence using the attention mechanism. The higher the attention weight, the more intense the color.



The encoder self-attention distribution for the word “it” from the 5th to the 6th layer of a Transformer trained on English to French translation (one of eight attention heads).

      Image source: Google AI Blog, Transformer: A Novel Neural Network Architecture for Language Understanding.


Scaled dot-product attention

      The primary logic of the attention mechanism is scaled dot-product attention. Its input consists of three sets of vectors: queries (Q), keys (K) and values (V). So, what do these represent? A query describes what we are looking for. A key is an encoded representation of a value, and a value is the original feature representation. The concept comes from retrieval systems. In some cases, the values themselves are used as the keys.



      Another concept is masking, which means ignoring values that are not relevant to the query. For example, for the query “ferry”, the attention should focus on the waiting lines and the ticket sign, as in the following figure.



      Now that the query, key and value concepts are in place, let’s move on to the mathematical formulation.


$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

      The first step is to take the dot product of the queries and the keys, which produces the score matrix. The score matrix determines how much each word focuses on the other words in the sentence. The higher the score, the stronger the focus. This is how the queries are mapped to the keys.
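      As a minimal sketch of this first step, the NumPy snippet below builds a score matrix for a toy sentence of three tokens. The query and key values are made up for illustration and do not come from a trained model.

```python
import numpy as np

# Toy example: 3 tokens, each with a 4-dimensional query and key.
# The numbers are invented for illustration only.
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
K = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])

# scores[i, j] is the dot product of query i with key j,
# i.e. how strongly token i matches token j.
scores = Q @ K.T
print(scores)
# [[2. 0. 2.]
#  [0. 2. 2.]
#  [1. 1. 2.]]
```

      Row i of the matrix belongs to query i, so reading across a row shows how much that token focuses on every token in the sentence.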



Attention score matrix from the dot product


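      To keep the gradients well behaved during training, the score matrix is scaled by dividing it by the square root of the key dimension dk. Without this scaling, large dot products push the softmax into regions where its gradients become extremely small.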
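      To see why the scaling matters, here is a small numerical sketch. It assumes the components of the queries and keys are independent random values with zero mean and unit variance, which is an idealisation and not a property of any particular trained model. Under that assumption the raw dot products spread out as dk grows, and dividing by the square root of dk brings the spread back to about 1.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64  # assumed key dimension for this illustration

# 10,000 random query/key pairs with unit-variance components.
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)      # unscaled dot products
scaled = raw / np.sqrt(d_k)      # scaled as in scaled dot-product attention

# The raw spread is roughly sqrt(d_k) = 8, the scaled spread roughly 1.
print(raw.std(), scaled.std())
```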


      Next, the attention weights are calculated by applying a softmax function to the scaled score matrix. The softmax function produces values between 0 and 1 that sum to 1 for each query, so a higher score results in a higher attention weight.



      Next, the attention weights are multiplied by the value vectors to get the output vector. The higher the attention weight, the more the corresponding value contributes to the output. This allows the model to learn which words to attend to more and which ones less.
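      The steps described so far (dot product, scaling, softmax, multiplication with the values) can be put together in a few lines of NumPy. This is a minimal sketch for a single sequence without masking. The function name `scaled_dot_product_attention` and the random inputs are illustrative choices, not part of any specific library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                     # step 1: score matrix
    scaled = scores / np.sqrt(d_k)       # step 2: scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)   # step 3: attention weights
    output = weights @ V                 # step 4: weighted sum of the values
    return output, weights

# Random stand-ins for the queries, keys and values of 3 tokens.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)            # (3, 2)
print(weights.sum(axis=-1))    # each row sums to 1, up to floating point error
```

      One way to check the intuition: if all keys are identical, every score is the same, the weights become uniform, and each output row is simply the average of the value vectors.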



      That is all for the foundations of the attention mechanism. In the Transformer Neural Network, this is a single attention head, and several such heads operate in parallel in what is called Multi-Head Attention. Instead of using one single attention head, Q, K and V are split into multiple heads, because this allows the model to jointly attend to information from different representation subspaces at different positions.
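      The following is a hedged sketch of how the split into heads can be implemented. It assumes the common formulation in which the input is projected with learned matrices, split into `num_heads` slices, attended per head, then concatenated and projected again. The weight matrices here (`W_q`, `W_k`, `W_v`, `W_o`) are random placeholders rather than trained parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads  # assumes d_model is divisible by num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)

    # Scaled dot-product attention, computed for all heads at once.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (heads, seq, d_head)

    # Concatenate the heads back to (seq_len, d_model) and mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

      Each head has its own 5 x 5 attention pattern, which is what lets different heads look at different positions.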


Transformer Architecture Overview

      The Transformer uses masking and attention layers as its core building blocks. Since its introduction in 2017 by Vaswani et al. at Google, in a paper entitled Attention Is All You Need, it has rapidly yielded exciting results in the field of NLP and sequence-to-sequence processing.

      Let’s see how the Transformer works with the image below, which is worth a thousand words.


 Transformer architecture


Transformer Components

      As you have seen, a Transformer has two main parts. The first is an encoder that operates on the input sequence. The second is a decoder that operates on the target output sequence during training and predicts the next item in the sequence.

1. Encoder Block


     We already know that a machine doesn’t understand words; it works on vectors, matrices and numbers. So, we need to turn words into vectors. Is this possible? Of course: this is the concept of embedding. An embedding assigns a particular numeric vector to each word. But another issue is that the same word can have a different meaning depending on the sentence and on where it appears in it. Positional encoding addresses this: it is a vector that gives context according to the position of the word in a sentence. The embedding combined with its positional encoding is then passed to the attention block.
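      To make this concrete, here is a small sketch of an embedding lookup followed by positional encoding. It assumes the sinusoidal encoding from the original Transformer paper, which is one option among several (learned position embeddings are also common). The three-word vocabulary and the random embedding table are invented for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes d_model is even. Even columns use sine, odd columns use cosine.
    positions = np.arange(seq_len)[:, None]       # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Hypothetical toy vocabulary and a randomly initialised embedding table.
vocab = {"i": 0, "am": 1, "fine": 2}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((len(vocab), d_model))

token_ids = [vocab[w] for w in ["i", "am", "fine"]]
word_vectors = embedding_table[token_ids]                 # embedding lookup
x = word_vectors + sinusoidal_positional_encoding(len(token_ids), d_model)
print(x.shape)  # (3, 8)
```

      The same word at two different positions gets the same embedding but a different positional vector, so the sum passed to the attention block differs.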

      The attention block’s main task is to decide which parts of the text we should focus on.



 Self-attention’s step by step computing

      For every word, an attention vector is generated that captures the contextual relationships between that word and the other words in the sentence. To achieve self-attention, we feed the input into 3 distinct fully connected layers to create the query, key and value vectors. The step-by-step processing of these 3 vectors is already shown in the previous ‘Scaled dot-product attention’ section.
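      As a rough sketch, the three fully connected layers can be written as three matrix multiplications applied to the same input. The weight matrices below are random placeholders (a trained model would learn them), and the bias terms are omitted to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8

# One row per word: embedding plus positional encoding (random stand-in here).
X = rng.standard_normal((seq_len, d_model))

# Three separate weight matrices, one per fully connected layer.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

# The same input X is projected three different ways.
Q = X @ W_q
K = X @ W_k
V = X @ W_v
print(Q.shape, K.shape, V.shape)  # (4, 8) (4, 8) (4, 8)
```

      Because Q, K and V all come from the same sequence, the resulting attention is called self-attention.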


 Self-attention vector in an example sentence

      The problem with a single-head attention mechanism is that each word tends to give most of the weight to itself (the focus word) in the sentence. This leaves less interaction with the other words and hurts accuracy. That’s why multiple attention heads are used for each word, and their outputs are combined through a learned weighted combination to compute the final attention vector. Different heads can capture different linguistic elements such as part of speech, tense, nouns, verbs and so on. The positional encoding mentioned earlier is important for this step.

 Encoder in Transformer architecture

    The multi-headed attention output vector is added to the original input embedding with positional encoding. This is called a residual connection, shown as (X + Z) in the figure above. The output of the residual connection then goes through layer normalization. LayerNorm normalizes each vector across its features and learns an affine transformation at the feature level.
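      The residual connection and layer normalization can be sketched as follows. Here X stands for the input to the attention block and Z for its output, both filled with random numbers for illustration. The names `gamma` and `beta` are the usual labels for the learned scale and shift of the affine transformation, initialised to ones and zeros in this sketch.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each row (one word vector) across its features,
    # then apply the learned affine transformation.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))   # input embedding with positional encoding
Z = rng.standard_normal((3, 8))   # multi-head attention output

gamma, beta = np.ones(8), np.zeros(8)
out = layer_norm(X + Z, gamma, beta)   # residual connection, then LayerNorm

print(out.mean(axis=-1))  # close to 0 for every row
print(out.std(axis=-1))   # close to 1 for every row
```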

    A simple feed-forward neural network is applied to every attention vector to transform it into a form that is acceptable to the next encoder or decoder layer. Unlike in RNNs, the attention vectors are processed independently of each other, so parallelization can be applied here, and this makes all the difference.
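      A minimal sketch of such a feed-forward unit is shown below. It assumes two linear layers with a ReLU in between, as in the original Transformer paper. The sizes (`d_model = 8`, `d_ff = 32`) and the random weights are illustrative assumptions.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    hidden = np.maximum(0.0, x @ W1 + b1)   # first linear layer + ReLU
    return hidden @ W2 + b2                 # second linear layer

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32
X = rng.standard_normal((seq_len, d_model))   # one attention vector per word

W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model)

# All positions are transformed in one matrix operation.
out = feed_forward(X, W1, b1, W2, b2)

# Processing a single position on its own gives the same result,
# which shows that the positions do not depend on each other here.
single = feed_forward(X[1], W1, b1, W2, b2)
print(np.allclose(out[1], single))  # True
```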

2. Decoder Block 


Decoding in Transformer architecture

      First, we have the embedding layer and the positional encoder, which turn the words into their respective vectors. This is similar to what we have seen in the encoder part.

      Masking is important in decoding. To understand more, let me explain with a simple example.


      For example, when computing attention scores for the word “am”, you should not have access to the word “fine”, because that word is a future word. The word “am” should only have access to itself and the words before it ( <start>, I, am ). How can we prevent attention scores from being computed on future words? To prevent the decoder from looking at future tokens, look-ahead masking is applied.


      The mask sets the attention scores of future tokens to negative infinity. Once you take the softmax of the masked scores, those tokens get an attention weight of zero, so they have no influence on the output.
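      Here is a small sketch of look-ahead masking for the tokens <start>, I, am, fine. The scores are random placeholders. The point is only to show that positions above the diagonal (the future tokens) end up with zero weight after the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

tokens = ["<start>", "I", "am", "fine"]
n = len(tokens)

rng = np.random.default_rng(0)
scores = rng.standard_normal((n, n))   # placeholder scaled attention scores

# True above the diagonal: column j is a future token for row i when j > i.
future = np.triu(np.ones((n, n), dtype=bool), k=1)

# Future positions get negative infinity, so exp(-inf) = 0 in the softmax.
masked_scores = np.where(future, -np.inf, scores)
weights = softmax(masked_scores, axis=-1)

row_am = weights[tokens.index("am")]
print(row_am)  # the last entry (the weight on "fine") is exactly 0
```

      Each row still sums to 1, but the weight is shared only among the token itself and the tokens before it.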

      Now each attention vector is passed into a feed-forward unit, which transforms the output vectors into a form that is acceptable to another decoder block or a linear layer.

Final Linear and Softmax layer


From decoder’s output into softmax layer


       The decoder outputs a vector of floats. How can we turn that vector into a word? That’s the job of the final Linear layer which is followed by a Softmax Layer.   

      The Linear layer is a simple fully connected neural network that projects the vector produced by the stack of decoders ( decoder matrix ) into a much larger vector called a logits vector, with one score for each word in the output vocabulary.

      The softmax layer turns those scores into probabilities (all positive, all add up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output.
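      The final two layers can be sketched in a few lines. The four-word vocabulary, the decoder output and the weights of the Linear layer are all made-up placeholders, so the chosen word has no meaning here. The sketch only shows the mechanics of going from a vector of floats to a word.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

# Hypothetical tiny output vocabulary.
vocab = ["<end>", "I", "am", "fine"]
d_model = 8

rng = np.random.default_rng(0)
decoder_output = rng.standard_normal(d_model)     # vector of floats from the decoder

# Linear layer: one column of weights (and one bias) per vocabulary word.
W = rng.standard_normal((d_model, len(vocab)))
b = np.zeros(len(vocab))

logits = decoder_output @ W + b                   # one score per word
probs = softmax(logits)                           # all positive, sum to 1.0
next_word = vocab[int(np.argmax(probs))]          # pick the most probable word
print(probs, next_word)
```

      Always picking the highest probability is the simplest choice. Other decoding strategies exist, but they are outside the scope of this sketch.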


Application Areas

      The Transformer Neural Network is widely used in many applications. Starting from the original research on Natural Language Processing, it is now applied in various areas such as computer vision, music generation, information retrieval and augmented reality.

      The Transformer has now become the primary neural network architecture for Natural Language Processing applications such as speech recognition, translation, sentiment analysis and text to speech. Google Translate, which is based on the Transformer architecture, can currently translate many languages well.


      Recently, Microsoft used the Transformer in one of its research works on holograms. In their hologram, they combine computer vision (CNNs) with object detection, and speech recognition with machine translation Transformers, working side by side to enable a highly personalised experience (see video below as an example).

      An example of research by Microsoft and others that uses Transformers for neural machine translation is Xia et al., Tied Transformers: Neural Machine Translation with Shared Encoder and Decoder.

      In the future, with 5G networks, language will no longer be a barrier for us thanks to Transformer-based Neural Machine Translation (NMT). Speech in any other language could be translated into our mother language as we listen through headphones. AR glasses could also translate any text, such as billboards, into our mother language wherever we are.

      In computer vision tasks, the Image Transformer is also efficient compared with state-of-the-art Convolutional Neural Networks (CNNs). Although CNNs are great at extracting visual features, they are not able to learn the dependencies between them. Therefore, Transformers are becoming popular in image reconstruction and enhancement applications such as super-resolution, visual painting and denoising, as well as in synthesizing images from natural language descriptions.


Microsoft AI Research: Image Enhancement using Transformer


      Using the Transformer Neural Network, Google has been doing amazing work in music AI, and recently they posted demos created by their Music Transformer. The goal was to generate longer pieces of music with more coherence, which the model achieves by using relative attention.

      The following three music examples are generated by Music Transformer. They sound like human creations. In fact, the model was trained on MIDI files from the e-competition, recorded on Yamaha Disklaviers.


      Transformer-based methods are also used in the medical domain for drug discovery, which is a very laborious, long and costly process. Another example of a Transformer research application is gaming: DeepMind’s AlphaStar, an agent for the strategy game StarCraft II, uses a Transformer alongside an LSTM and deep reinforcement learning.

Evaluating AlphaStar against professional players


        AI researchers are also studying and publishing papers on Transformers combined with LSTM models for time-series prediction, with applications in stock prediction, disease forecasting and other forecasting problems with complex patterns. Time-delay embeddings of the historical data are used as input, and one-step-ahead prediction is performed by a trained Transformer model.
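      As an illustration of the input preparation only, the sketch below turns a series into time-delay embeddings with one-step-ahead targets. The helper name `time_delay_embedding` and the window length of 3 are assumptions made for this example, and the model that would consume these inputs is not shown.

```python
import numpy as np

def time_delay_embedding(series, window):
    # Each input row holds `window` consecutive past values;
    # the target is the value that comes right after that window.
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

series = [1, 2, 3, 4, 5, 6]
X, y = time_delay_embedding(series, window=3)
print(X)
# [[1. 2. 3.]
#  [2. 3. 4.]
#  [3. 4. 5.]]
print(y)
# [4. 5. 6.]
```

      Each row of X would be fed to the model as a short sequence, and the matching entry of y is the value it should predict.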

      Alibaba’s Taobao, China’s largest Consumer-to-Consumer (C2C) platform, proposed a Behaviour Sequence Transformer (BST) recommendation system in its mobile application. By applying filtering methods together with a Transformer layer, it delivers a wide range of relevant and satisfying recommendations to customers. BST is able to capture the sequential signals in user behaviour along with the context information of item features.



      To recap, Transformers were originally developed for NLP applications. But now they give favourable results in other application areas such as recommendation, time-series forecasting, computer vision, music generation, augmented reality and many more. Our research team has also tried using a Transformer-based sentence encoder model in a recommendation system. In the future, we plan to try Transformer-based methods in our time-series data analysis research. We have shared our understanding of the Transformer as simply as we can. We hope that it will help you get ideas for research related to the Transformer Neural Network. We also welcome collaboration with our team on future research projects.

Chit Thu Shine

AI Research Manager