Discovering the Transformer architecture

György Balázsi
Beneficial Intelligence
16 min read · Aug 17, 2023


Many Large Language Models or LLMs (like ChatGPT, BERT, LLaMA etc.) are built on the Transformer architecture, invented in 2017 (originally for machine translation) by researchers working with Google Brain and Google Research. This blog post will walk you through three steps to discover it: 1) Ramp-up: understand language models in general with a toy model; 2) RTFM: understand the original paper; 3) Talk is cheap, show me the code: play with the Google Trax library, which contains the Transformer source code.

Introduction and methodology

What are LLMs for?

From the perspective of practical applications, LLMs can be thought of as the generalization and amalgamation of machine translation, grammar/style suggestion and web search.

Despite the fact that they are referred to as “generative” AI, LLMs don’t only excel at generating new content, but also at interpreting and restructuring existing text. (I don’t even like to use them for text generation; see my earlier post I want to use ChatGPT as “interpreter” rather than “generative” AI.)

As an illustration, I asked ChatGPT to explain what the phrase “Attention Is All You Need” (the title of the original Transformer paper, as you will see later) means outside of AI research. It gave the following, pretty impressive (though to my taste a little bit too verbose) answer:

Outside of the realm of AI research, the phrase “Attention is all you need” can be interpreted more broadly as a statement about the importance of focusing one’s attention or directing one’s awareness in various contexts. While it was originally coined in the context of the “Transformer” model in AI research, it can be applied to convey the idea that by giving careful attention and consideration to a particular subject or task, you can achieve significant results or understanding.

In everyday life, the phrase might be used to emphasize the idea that paying close attention to details, relationships, or specific aspects of a situation can lead to effective problem-solving, insightful comprehension, or successful outcomes. It could also be used as a motivational statement, encouraging individuals to prioritize what truly matters and focus their efforts on the most essential aspects.

For example, in a personal or professional context:

1. Learning and Studying: “Attention is all you need” could be a motto for effective study habits, encouraging students to concentrate fully on their studies to grasp complex concepts and excel academically.

2. Relationships: The phrase might be used to convey the idea that genuine attention, active listening, and empathy are crucial for building and maintaining strong relationships.

3. Productivity: In a work setting, it could suggest that concentrating on the most critical tasks and avoiding distractions can lead to higher productivity and accomplishments.

4. Problem-Solving: The phrase could remind individuals that focusing on the root causes or essential factors of a problem can lead to more effective and efficient solutions.

5. Mindfulness: In the context of mindfulness and meditation, the phrase might underscore the importance of being fully present in the moment and directing one’s attention to the here and now.

Overall, while the phrase originates from AI research, its underlying message about the significance of attention can be applied to various aspects of life where concentration, awareness, and focus are key to achieving meaningful results.

Thesis, antithesis, synthesis

I find the arc of development that led up to the invention of the Transformer architecture pretty interesting.

I can imagine that Hegel, the influential 19th-century German philosopher, would describe this path as an example of the “thesis”, “antithesis”, “synthesis” triad he liked to use to describe the “dialectical” evolution of the “Spirit”. Using this triad pattern, the stages of the development would be the following:

Thesis: Convolutional Neural Networks or CNNs are suitable for processing singleton inputs, e.g. in the case of image recognition. CNNs scan images using filters to extract features and pooling layers to distill information.

Antithesis: Recurrent Neural Networks or RNNs can process streams of information (called sequences in the AI literature) by applying the same function to all elements of the stream one by one, usually maintaining some inner state and also taking the previously output predictions as input.

Synthesis: The Transformer architecture can process streams of information without the constraint of sequentiality, applying clever tricks called “positional encoding” and “self-attention”.

The “thesis”, “antithesis”, “synthesis” pattern is, of course, an abstraction, and “all non-trivial abstractions, to some degree, are leaky”. The real story is more of a big dependency graph than a clean sequence of these steps. The related Wikipedia article gives a good overview of the related work before and after the Transformer architecture.

Why stick to the original paper and code?

There are quite a few publications out there which offer alternative explanations and visualizations for the Transformer architecture rather than sticking to the original paper.

Granted, the paper is quite dense, and some background knowledge helps to understand it. But to my taste, “innovative” explanations and visualizations are distractions from the original intellectual achievement without real added value.

Additionally, wrestling with the primary source flexes your muscles for digesting difficult AI papers more than taking the shortcut.

I want to offer you some help to find your way to the original concept. Paradoxically, my help will become useless as soon as you get there.

Prerequisite knowledge

In the following, I suppose that you understand, at least in broad strokes, how neural networks are trained with backpropagation. If this is not the case, you can catch up by reading e.g. Chris Olah’s blog post Calculus on Computational Graphs: Backpropagation or any material you find with a web search.

1) Ramp-up: a toy language model

Before the Transformer architecture and LLMs became cool, in 2015 Andrej Karpathy wrote a blog post with the title The Unreasonable Effectiveness of Recurrent Neural Networks.

(The title is a playful allusion to the paper The Unreasonable Effectiveness of Mathematics in the Natural Sciences written by physicist Eugene Wigner. As Google suggestions indicate, some other authors also used this formula for a catchy title.)

Karpathy’s blog post is not a singular milestone in the AI literature; rather, it references a lot of papers containing prior work he draws on, and indicates trends which he finds promising, e.g. that “The concept of attention is the most interesting recent architectural innovation in neural networks.”


It’s also an enjoyable read, because it reflects the author’s passion for playing with tools the deep learning community develops. Apart from stop words, the single most frequent word in the post is “fun” (occurs ten times).

The author is also not the first researcher who experimented with character-level language models, but his joy in doing it easily infects the reader, to the point of forgetting to question their usefulness. (A language model is an application which predicts the next element of a sequence, based on the previously predicted elements. A character-level language model takes a character stream as input and outputs another character stream. The author of another paper mentions text compression as a “serious” application of character-level language models, and makes it clear that other, more effective methods already exist for this.)


The following diagram, taken from the post, depicts the various patterns of input and output of an RNN:

Input vectors (which encode words, characters, images etc. as elements of an input stream) are represented by red rectangles; output vectors (also encoding some output, either a singleton element or elements of an output stream) are represented by blue rectangles; and the RNN’s hidden state (also a vector, distinct from the input and output) is represented by green rectangles.

The arrows in the diagram represent learned functions (in fact, matrices are learned, and multiplying the vectors by these matrices is what makes them act as functions) which don’t change during the entire prediction process.

The columns of the diagram don’t represent different network elements, but different stages of the prediction process, during which only the contents of the data represented by rectangles change.

The use cases of machine translation and language modeling, collectively also called transduction, are represented by the fourth and fifth patterns. (The difference is that in the fourth pattern the input sequence length can be different from the output sequence length.)

This representation of the RNN structure is quite sketchy. Another diagram from another paper (Effective Approaches to Attention-based Neural Machine Translation) displays the fourth pattern (machine translation) in more detail:

Here it is properly emphasized that the prediction process consists of two phases, an encoding phase (blue rectangles) and a decoding phase (burgundy rectangles). Again, the columns having the same color do not represent different network elements but different stages of the prediction process. The different colors indicate that the encoder component of the network (one set of blue rectangles) can be different from the decoder component of the network (one set of burgundy rectangles), which is the case, as we will see later, with the Transformer architecture.

The structure and working mechanism of Karpathy’s toy example, the character-level language model, is represented by the following diagram:

The output layer vectors contain a probability distribution over the possible continuations of the output stream. In the toy example the input vectors are “one-hot” vectors, instead of the word embedding vectors of realistic applications.

The diagram displays the inner state and output vectors of the network before training, after the “W_xh”, “W_hh” and “W_hy” “functions” (in fact matrices) have been initialized with random values. It is expected that after training the network, the probabilities marked with green will be the highest in the output vectors, because this is how the vectors will represent the expected continuations of the character stream.

To be clear: this toy example is totally useless, its only purpose is to illustrate in general how a language model is expected to work.
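If you prefer code to diagrams, here is a minimal numpy sketch in the spirit of Karpathy’s min-char-rnn (this is not his actual code): the matrices W_xh, W_hh and W_hy mirror the diagram above, while the tiny vocabulary and the layer sizes are made up for illustration.

import numpy as np

# One prediction step of a toy character-level RNN, with untrained (random) weights.
vocab = ['h', 'e', 'l', 'o']
vocab_size, hidden_size = len(vocab), 8

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(hidden_size, vocab_size))   # input  -> hidden
W_hh = rng.normal(scale=0.01, size=(hidden_size, hidden_size))  # hidden -> hidden
W_hy = rng.normal(scale=0.01, size=(vocab_size, hidden_size))   # hidden -> output

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

def rnn_step(x, h):
    """Update the hidden state and return the next-character probabilities."""
    h = np.tanh(W_xh @ x + W_hh @ h)               # new hidden state (green rectangle)
    logits = W_hy @ h                              # unnormalized scores (blue rectangle)
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return probs, h

h = np.zeros(hidden_size)
for ch in "hell":                                  # feed the prefix one character at a time
    probs, h = rnn_step(one_hot(vocab.index(ch), vocab_size), h)
print({c: round(float(p), 2) for c, p in zip(vocab, probs)})   # before training: roughly uniform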

In the remainder of the blog post, in the section “Fun with RNNs”, the author shows examples of actually useless but interesting “gibberish generators” (this is how I characterize them, not the author) trained on

  • Paul Graham essays (Paul Graham is a computer scientist, essayist, founder of Y Combinator and Hacker News),
  • works of Shakespeare,
  • the Wikipedia,
  • an algebraic geometry text book,
  • the Linux source code and
  • baby names.

All these generators spit out nonsense text which amusingly reminds one of their training datasets.

2) RTFM: Attention Is All You Need

I wrote about Karpathy’s toy character-level language model in such detail to prepare you for the real thing: the Transformer architecture, invented by Google researchers and published two years after Karpathy’s blog post.

The Transformer architecture was described in the paper Attention Is All You Need.

At one level, the title is a playful allusion to pieces of everyday wisdom.

At the factual level of the paper’s topic, its meaning is summed up in the abstract like this (structured as a bulleted list by me):

  • [Prior art:] The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.
  • [Innovation:] We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
  • [Benefit:] Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

So the attention mechanism in itself is not the authors’ innovation; the innovation is in the “all you need” part. In the following paper walkthrough we will focus on how this “all you need” innovation leads to better quality, more parallelizable computation and reduced training time.

(Parallelization is a technical detail, but an important one: a higher degree of parallelization means horizontal scalability, which means that the computations can be made faster by applying more computing power to them and utilizing the parallel computation capabilities of GPUs. This translates into larger training sets, which translates into higher model quality.)

“Attention” has a special meaning

In the context of Natural Language Processing or NLP research in AI, the term “attention” has a special meaning, different from the everyday use of the word.

The usefulness of attention follows from the fact that natural language texts are not signal streams, but rather serialized graph structures which are represented by linguists as phrase structures or dependency grammars.

Attention can be thought of as the “deserialization” mechanism for the serialized text graphs.


The key technical term which explains the special meaning of “attention” is “dependency”, which is a key word in this paper too. (To be honest, I cannot tell whether the word “dependency” has the same meaning here as in linguistics, but it is certainly something similar.)

In the context of the challenge at hand, “dependency” and “attention” mean the following:

The network is supposed to spit out the result of the computation one token at a time (roughly word by word). For computing the next word of the output sentence, the relevant word is often not the next unused word of the input sentence (this is the “dependency”), but one or more words before or after it, possibly at a longer distance. The attention mechanism helps the model focus on the word or words relevant for generating the next output word. Moreover, different words can be relevant from different perspectives; that’s why the Transformer model operates with several attention “heads”. One illustration of this is the following diagram:

Why is the architecture called “Transformer”?

I have no idea. I couldn’t find any clue to this either in the paper or in any online source.

When I asked ChatGPT, it pretended to give a definitive answer, but the answer is, unfortunately, not convincing:

The Transformer model architecture is called “transformer” because it utilizes a specific type of attention mechanism called “self-attention” to transform and process input data in a highly parallelizable and efficient manner. (…) The name “transformer” reflects the transformative nature of this attention mechanism in capturing complex patterns and dependencies within sequences. (… Blah, blah, blah…)

If you manage to find out the real answer, please let me know.

The model architecture

This is the famous diagram illustrating the Transformer model architecture. I will structure the remainder of this section by questions you may have about this diagram.

What do the two parts of the diagram illustrate?

The left-hand module is the encoder, the right-hand module is the decoder.

The encoder takes the input sequence and transforms it into a representation generally called the context (the expression is not used in this paper). The context vector is then passed to the decoder, which generates the output.

The relationship between encoder and decoder can be understood from the diagram above, which I also insert here. The blue rectangles represent the encoder, the burgundy rectangles represent the decoder. Please note that the diagram shows an RNN, which processes information sequentially; this is not true for the Transformer, so the diagram does not fully fit the Transformer architecture:

What is the input to the encoder?

The input is a representation of the entire input sentence. This is the major innovation of the Transformer architecture as opposed to RNNs, where the input is a stream of words, one word at a time.

Why is the input to the decoder “Outputs (shifted right)”?

This can also be understood, mutatis mutandis, from the above diagram. The decoder takes as input not only the context vector from the encoder but also the previously generated output, shifted right so that the first element is a special start token and each position only sees the outputs generated before it.
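As a toy illustration of the shifting (the token ids and the id of the start token below are made up; they don’t come from the paper or from Trax):

# Teacher forcing during training: the decoder input is the target sentence
# shifted right by one position.
START = 1                              # assumed id of the special start token
target = [17, 23, 42, 8]               # token ids of the expected output sentence

decoder_input = [START] + target[:-1]  # what the decoder sees:   [1, 17, 23, 42]
decoder_labels = target                # what it has to predict:  [17, 23, 42, 8]
# At position i the decoder may only look at decoder_input[0..i] (see masking below)
# and must predict decoder_labels[i].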

What is the output of the decoder?

It’s a sequence of probability distributions over the words of the dictionary, one distribution for each element of the output.

What is input and output embedding?

In NLP applications, words are represented by vectors, which magically encode the meaning of the words. By “magically” I mean that if you look at the numbers contained in the vectors, you cannot attach any meaning to them; still, vectors for words with similar meanings cluster together in the vector space, and differences of vector pairs can express analogies.

Word embedding is a complex topic in itself, but it has been in use for much longer than the Transformer architecture. This blog post by Chris Olah, for example, explains what word embeddings are for and how they can be created: Deep Learning, NLP, and Representations.
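To make the “magic” a bit more concrete, here is a toy sketch with hand-made two-dimensional vectors; the numbers are pure assumptions, and real embeddings are learned and have hundreds of dimensions.

import numpy as np

# Hand-made toy "embeddings", only to show the vector arithmetic behind analogies.
emb = {
    'king':  np.array([0.9, 0.8]),
    'queen': np.array([0.9, 0.2]),
    'man':   np.array([0.1, 0.8]),
    'woman': np.array([0.1, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land near queen.
target = emb['king'] - emb['man'] + emb['woman']
print(max(emb, key=lambda w: cosine(emb[w], target)))   # prints: queen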

What is positional encoding?

Positional encoding is a representation of the position of a word in the sentence. It’s needed because the model doesn’t contain any recurrence or convolution.

Positional encoding is a vector having the same length as the word embedding. The vector elements are computed by sine and cosine functions of different wavelengths, corresponding to the word position and the vector element index, as described in section 3.5 of the paper.

The positional encoding vector is added to the word embedding vector of the inputs of the encoder and decoder module.
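A minimal numpy sketch of the formulas in section 3.5 could look like this (the dimensions below are toy-sized; the paper uses d_model = 512):

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding as in section 3.5 of the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))"""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    div_terms = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one term per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)                # even vector indices
    pe[:, 1::2] = np.cos(positions / div_terms)                # odd vector indices
    return pe

# The encoding is simply added to the embedded input sequence.
d_model, sentence_length = 16, 5
embedded = np.random.default_rng(0).normal(size=(sentence_length, d_model))
encoder_input = embedded + positional_encoding(sentence_length, d_model)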

Why do Multi-Head Attention layers have three-way fork shaped inputs?

The three branches of the fork represent three learned matrices. The matrices are multiplied with the input vectors, and the products magically encode a Query vector and a set of Key-Value pair vectors. By “magically” I mean, similarly to word embeddings, that the numbers in the vectors don’t convey any meaning in themselves, but the Query vector behaves as if it were expressing a question (e.g. who, where, when, why etc.), and the Key-Value pairs behave as if they were expressing potential answers to the question.

The computation of attention for a certain word mimics a query against a key-value store in a differentiable way (differentiability is needed for backpropagation). First the query vector is combined with the key vectors, which returns a probability distribution over the relevance of the keys to the query (this mimics picking the right key corresponding to the query); then this probability distribution is multiplied with the value vectors, which returns a weighted combination of the possible values (this mimics returning the value corresponding to the chosen key).

The computation of one attention “head” and the ensemble of heads as a multi-head attention layer are represented by these diagrams:

The scaled dot-product attention is almost the same as described in the earlier paper Neural Machine Translation by Jointly Learning to Align and Translate; only the scaling factor is added. (This is also a difficult paper, and I don’t want to pretend that I fully understand it, but the diagrams can help you understand what’s going on.)
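For reference, here is a minimal numpy sketch of the scaled dot-product attention formula from the paper, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, for a single head. In the real model Q, K and V come from learned projections of the inputs, and several heads are computed in parallel and concatenated; the shapes below are made up.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V -- one attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how relevant each key is to each query
    weights = softmax(scores, axis=-1)  # probability distribution over the keys
    return weights @ V                  # weighted combination of the values

# Toy sizes: 4 query positions, 6 key/value positions, d_k = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)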

Which components come from where in the decoder multi-head attention layer?

I mean the right side of this part of the diagram:

The answer is that the Query component comes from the decoder itself, and the Key and Value components come from the encoder.

The intuition for this is that a “question” is asked by the decoder as to what the next word of the output should be, and it tries to find the answer to this question in the context output by the encoder.
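A toy sketch of this encoder-decoder (“cross”) attention, with made-up shapes and random matrices standing in for the learned projections, could look like this (softmax and attention as in the sketch above):

import numpy as np

rng = np.random.default_rng(0)
d_model = 16
encoder_output = rng.normal(size=(6, d_model))   # 6 source positions (the "context")
decoder_states = rng.normal(size=(4, d_model))   # 4 target positions generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = decoder_states @ W_q    # "questions" asked by the decoder
K = encoder_output @ W_k    # keys describing the encoded source sentence
V = encoder_output @ W_v    # values carrying the source-side information

scores = Q @ K.T / np.sqrt(d_model)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
context = weights @ V       # one context vector per target position, shape (4, 16)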

Why does one of the multi-head attention layers in the decoder say “Masked Multi-Head Attention”?

Masking is used during training to pretend that the sentence has been translated correctly but only partially: the remainder of the target sentence is masked out, so the prediction at each position can depend only on the positions before it.
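A toy illustration of such a mask (a lower-triangular “look-ahead” mask; the sizes and scores below are made up):

import numpy as np

# Position i may only attend to positions 0..i; future positions get -inf
# before the softmax, so they end up with zero attention weight.
seq_len = 4
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # raw attention scores
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)     # True above the diagonal
masked_scores = np.where(future, -np.inf, scores)
# Applying the softmax row-wise now gives weight 0 to every future position.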

What do the “Add & Norm” and the “Feed Forward” layers do?

The “Add & Norm” layers add residual connections and layer normalization, which improve learning efficiency; the “Feed Forward” layers are position-wise fully connected networks applied to each position separately.
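As a rough numpy sketch of what these two kinds of layers compute (simplified; the toy sizes below are assumptions, the paper uses d_model = 512 and d_ff = 2048):

import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's vector to zero mean and unit variance."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: FFN(x) = max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

d_model, d_ff, seq_len = 16, 64, 5
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = add_and_norm(x, feed_forward(x, W1, b1, W2, b2))   # same shape as x: (5, 16)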

What does “Nx” mean next to the encoder and decoder block?

This means that the same structure is repeated N times. In the case of the Transformer paper N=6.

The first layer combines words, the second layer combines pairs of words, the third layer pairs of pairs of words etc.

[You can ask me:] Where can I find additional insights and material?

You can find additional insights and material all over the Internet, e.g. in this StackExchange thread: What exactly are keys, queries, and values in attention mechanisms? I especially recommend this lecture by Prof. Pascal Poupart, mentioned in one of the answers: CS480/680 Lecture 19: Attention and Transformer Networks.

3) Talk is cheap, show me the code!

You can play around with the Transformer architecture using its source code. The code is now available as part of the Trax library which is maintained by the Google Brain team.

Translation with varying temperature

You can try the English-German translation with a pre-trained Transformer network in a Google Colab notebook. There is one parameter you can play with: “temperature”. Temperature is actually a parameter of the softmax layer; a higher temperature yields more diverse results.
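To see what temperature does to a softmax output, here is a toy numpy illustration (the logits and the tiny vocabulary size are made up):

import numpy as np

def softmax_with_temperature(logits, temperature):
    if temperature == 0.0:                 # greedy decoding: all mass on the argmax
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = np.array([2.0, 1.5, 0.3])         # made-up scores over a three-word vocabulary
for t in (0.0, 0.5, 1.0):
    print(t, softmax_with_temperature(logits, t).round(2))
# Higher temperature flattens the distribution, so sampling becomes more diverse.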

This English input sentence e.g.: “It is nice to learn new things today!” is translated differently into German with different temperatures:

  • With the default temperature=0.0 value it will produce: “Es ist schön, heute neue Dinge zu lernen!”
  • With temperature=0.5, the translation will be: “Es ist schön, heute Neues zu lernen!”

As you can see, the translation with the higher temperature follows the wording of the original less strictly (it doesn’t contain “Dinge”, the equivalent of “things”) but sounds more German.
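For reference, the example above can be reproduced with something like the following, which I adapted from the Trax quickstart; the checkpoint path, vocabulary files and exact parameter values are assumptions that may differ between Trax versions:

import trax

# Build the Transformer in inference mode and load pre-trained EN->DE weights.
model = trax.models.Transformer(
    input_vocab_size=33300,
    d_model=512, d_ff=2048,
    n_heads=8, n_encoder_layers=6, n_decoder_layers=6,
    max_len=2048, mode='predict')
model.init_from_file('gs://trax-ml/models/translation/ende_wmt32k.pkl.gz',
                     weights_only=True)

# Tokenize the input sentence, translate, then detokenize the result.
sentence = 'It is nice to learn new things today!'
tokenized = list(trax.data.tokenize(iter([sentence]),
                                    vocab_dir='gs://trax-ml/vocabs/',
                                    vocab_file='ende_32k.subword'))[0]
tokenized = tokenized[None, :]   # add a batch dimension
tokenized_translation = trax.supervised.decoding.autoregressive_sample(
    model, tokenized, temperature=0.0)   # try temperature=0.5 for the second variant
tokenized_translation = tokenized_translation[0][:-1]   # drop batch dimension and EOS
print(trax.data.detokenize(tokenized_translation,
                           vocab_dir='gs://trax-ml/vocabs/',
                           vocab_file='ende_32k.subword'))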

Implementation source code

The source code of the Transformer model can be found in this file in the Trax Github repo.

The Transformer function returns the following model, where the comments mark the encoder and decoder blocks:

return tl.Serial(
    tl.Select([0, 1, 1]),  # Copies decoder tokens for use in loss.

    # Encode.
    tl.Branch([], tl.PaddingMask()),  # tok_e masks tok_d tok_d
    _Encoder(),

    # Decode.
    tl.Select([2, 1, 0]),  # Re-orders inputs: tok_d masks vec_e .....
    tl.ShiftRight(mode=mode),
    out_embedder,
    _Dropout(),
    tl.PositionalEncoding(max_len=max_len, mode=mode),
    tl.Branch([], tl.EncoderDecoderMask()),  # vec_d masks ..... .....
    [_EncDecBlock() for _ in range(n_decoder_layers)],
    tl.LayerNorm(),
    tl.Select([0], n_in=3),  # Drops masks and encoding vectors.

    # Map vectors to match output vocab size.
    tl.Dense(output_vocab_size),
)

The Transformer layers are combined with the Serial and Branch combinators for sequential and parallel computations respectively.

You can dig deep into the source code to understand the layers. E.g. the encoder block is defined like this (but as you can see, you have to dig even deeper to find out what _EncBlock is, etc.):

def _Encoder():
  encoder = tl.Serial(
      in_embedder,
      _Dropout(),
      tl.PositionalEncoding(max_len=max_len, mode=encoder_mode),
      [_EncBlock() for _ in range(n_encoder_layers)],
      tl.LayerNorm(),
  )
  return tl.Cache(encoder) if mode == 'predict' else encoder

Documentation for the Trax layers can be found here.
