By Nikita Agarwal

Transformers: Heart of Generative AI - Simply Explained



Generative AI models are mesmerizing silicon creators that are going to take over the world - at least, that is how it feels when you ask ChatGPT to write a haiku, which even some of the smarter adults can't manage (at least not a decent one). But what are they? Are they silicon versions of human brains with incredibly big memories? Or is it someone sitting behind the screen making a fool of the world? Let's try to answer that question today.


For a long time, humans have been trying to build artificial intelligence that performs specific tasks: classifying an image as a cat or a dog, marking a review as positive or negative, summarizing text, and so on. GPT-3 was the first (good) general artificial intelligence that could do all of the above tasks and more.



But how does it work? Remember the fill-in-the-blank questions we used to get in school?

Ram loves to ski on snow _____ 

Our task was to read the sentence and figure out the missing word. Gen AI does exactly the same thing: given some piece of text, it figures out the next word.

Let’s take an example,



[Figure: The principle behind generative AI - next-word prediction]


Here the Gen-AI model just keeps filling in the blank again and again and again (until it decides that this much text is enough).
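To make this concrete, here is a minimal sketch of that loop in Python. The predict_next_word function is a stand-in for a real model - here it just replays a canned continuation so the example runs end to end:

```python
# Toy sketch of the fill-in-the-blank loop behind generative AI.
# predict_next_word is a stand-in for a real model; here it simply
# replays a canned continuation so the example is runnable.
CANNED = iter(["and", "began", "to", "colonize", "<end>"])

def predict_next_word(words):
    return next(CANNED)

def generate(prompt, max_words=20):
    words = prompt.split()
    for _ in range(max_words):
        next_word = predict_next_word(words)
        if next_word == "<end>":  # the model decides this much text is enough
            break
        words.append(next_word)
    return " ".join(words)

print(generate("As aliens entered our planet"))
# -> As aliens entered our planet and began to colonize
```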


Now the question arises - how does the model know what the next word should be? This is where, my friends, the Transformers come in. The Transformer is a type of model specifically designed for next-word prediction. Transformers were born in 2017, and since then they have been used for various tasks like language translation, image generation, and text summarization. Gen AI is nothing but a Transformer that has been trained on terabytes of data - or, to put it simply, one that has scrolled through the whole of the internet.



 


What is a Transformer?


A Transformer is a type of deep neural network, initially developed to solve problems like language-to-language translation. It takes a sequence of tokens as input and produces another sequence as output. The specialty of Transformers is their ability to let far-away words (technically called tokens) in the input affect the output. The Transformer does this using a mechanism called "Attention" - put simply, how much attention should be paid to each input word while creating the output.


Other models, like the Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM), also have the ability to let faraway words influence the output, but the Transformer does this much better.


Some common types of models built with Transformers are -

BERT - Bidirectional Encoder Representations from Transformers

GPT - Generative Pre-trained Transformer
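If you want to see a GPT-family model fill in the blanks yourself, here is a quick demo using the Hugging Face transformers library (assuming it and PyTorch are installed); GPT-2 is a small, freely available ancestor of ChatGPT:

```python
# Quick demo: GPT-2 continuing a prompt one predicted token at a time.
# Requires: pip install transformers torch
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("As aliens entered our planet", max_new_tokens=10)
print(result[0]["generated_text"])
```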




Attention


Let's now try to understand Attention. Attention is a mechanism that lets Transformer models reference the input and the already-generated output to produce the next output token. Pay attention to "already seen input and output": it means that both the input and the output produced up to a given time influence the new output.


For example,

Input -> "As aliens entered our planet"

Time | Influential tokens      | New output token | Output sequence
-----|-------------------------|------------------|-----------------------
t=1  | Input: planet           | and              | and
t=2  | Input: entered, planet  | began            | and began
t=3  | Output: began           | to               | and began to
t=4  | Input: aliens, planet   | colonize         | and began to colonize


Here we see that at time t=1, the token "planet" influences the model and the output token produced is "and", so the output is now "and". At t=2, the tokens "entered" and "planet" influence the model, the token "began" is produced, and the output becomes "and began". At t=3, the already-generated output token "began" influences the model and the token "to" is produced, and so on.

How the model knows which tokens to reference is learned during training with back-propagation.
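To make the idea concrete, here is a toy illustration with made-up numbers: attention turns relevance scores into weights (via softmax) that say how much each earlier token influences the next output token:

```python
import numpy as np

# Toy illustration with made-up relevance scores: softmax turns the
# scores into attention weights that sum to 1.
tokens = ["As", "aliens", "entered", "our", "planet"]
scores = np.array([0.1, 2.0, 1.5, 0.2, 1.8])  # hypothetical relevance scores

weights = np.exp(scores) / np.exp(scores).sum()  # softmax
for token, weight in zip(tokens, weights):
    print(f"{token:>8}: {weight:.2f}")
# "aliens", "entered" and "planet" get most of the attention,
# no matter how far back in the sequence they sit.
```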



 


Why Transformers?


There were many models before Transformers that had abilities similar to Attention. Then why did Transformers become so popular? What's special about them?


The models that came before Transformers all suffered from the same problem: they processed the input sequence token by token, i.e. they would take the first token of the input and create the first token of the output, then take the second token of the input plus the existing output and create the second token of the output, and so on.


As a token's distance from the current token increases, it becomes less influential in generating the output token; i.e. the influence of a token decreases as its distance increases. Therefore a relevant token far back in the sequence will not be able to affect the output significantly.

For example, take "As aliens entered our planet and began to colonize ...". The influence of "aliens" on generating "colonize" will be small, and probably insufficient, because it sits far away in the sequence.


With the Attention mechanism, the distance of a token from the current token plays no role in its influence, so a far-away but relevant token can still significantly influence the generation of the current token (in theory, with infinite resources).
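To see why the older models struggle, here is a toy sketch (random numbers, tiny sizes) of the recurrence inside an RNN: every step squashes the entire history into one fixed-size hidden state, so information about early tokens is diluted again and again, while attention reads all token vectors directly:

```python
import numpy as np

# Toy sketch of an RNN's recurrence: the whole history is squeezed
# into a single fixed-size hidden state at every step.
rng = np.random.default_rng(0)
W_h = rng.normal(size=(4, 4)) * 0.5   # random stand-ins for learned weights
W_x = rng.normal(size=(4, 4)) * 0.5

hidden = np.zeros(4)
for token_vec in rng.normal(size=(9, 4)):  # 9 tokens, processed one by one
    hidden = np.tanh(W_h @ hidden + W_x @ token_vec)
# "aliens" (the 2nd token) now reaches the output only through 7 more
# squashing steps; attention would access its vector directly instead.
```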


Content warning: beyond this point the content contains technical details. Viewer discretion is advised. Read at your own risk if you want to be among the select few who understand AI.


 


How Transformers work


Although Gen AI, or the Transformer, feels like an artificial brain created by some geek computer scientists, it's actually just manipulation of numerically represented text, done billions of times. Let's dive deep into how Transformers actually work. Note that although we can understand how they work, everyone in the world has only a limited understanding of why they work. I will try to explain the why as much as possible.


This diagram has been everywhere since ChatGPT - there is even a t-shirt with it. We will break down this diagram in this article.


[Figure: Transformer architecture]



A Transformer has two parts - an Encoder and a Decoder.

  • Encoder - Takes the input sequence and outputs an abstract, continuous representation that contains all the learned information about that input. In simple words: during training, the encoder learns how words relate to each other, and it uses this to create an interim representation of the input that captures which words should influence which words. Multiple encoders are stacked up to improve performance.

  • Decoder - Takes the encoder's output as input and generates output tokens one by one, while also using the output generated so far as input. To put it simply: during training, the decoder learns how words in the output relate to each other and to the input words, and it uses this to generate the output word by word. Multiple decoders are stacked up, each taking the encoder's output and the previous decoder's output as input; this lets them learn different combinations of attention and makes the predictions better.


[Figure: Transformer architecture, simplified]

Encoder


[Figure: Transformer encoder, simplified architecture]

An encoder first converts the input into its numerical representation (Input Embedding), since models only understand numbers, not text. It then decides how much influence each input word has on every other input word (Multi-Headed Attention), and finally refines these numbers (Enrich Attention).

Now we will look at each of these steps in detail.



[Figure: example attention weights - indicating that when the model processes the token "planet", the token "entered" should be 3 times more influential than the token "As"]



Step 1: Input Embedding

Step 2: Self-Attention

Step 3: Enrich Attention
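Here is a minimal numeric sketch of Steps 1 and 2, with toy sizes and random matrices standing in for the parameters a real model would learn during training:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
sentence = ["As", "aliens", "entered", "our", "planet"]
vocab = {word: i for i, word in enumerate(sentence)}
d_model = 8

# Step 1: input embedding - each token becomes a vector of numbers.
embeddings = rng.normal(size=(len(vocab), d_model))
X = embeddings[[vocab[w] for w in sentence]]

# Step 2: self-attention - every token attends to every other token.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
weights = softmax(Q @ K.T / np.sqrt(d_model))  # scaled dot-product attention
attended = weights @ V                         # attention-enriched token vectors

print(weights.round(2))  # row i: how much token i attends to every token
```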


Why residual connections and normalization? Residual connections help the network train, and normalization reduces the training time.
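A minimal sketch of this "Add & Norm" step, assuming x is a token's vector and sublayer_out is what the attention (or feed-forward) sub-layer produced for it:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    y = x + sublayer_out                # residual connection: keep a shortcut to the input
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)     # layer normalization: zero mean, unit variance

token_vec = np.arange(8.0)              # pretend vector for one token
print(add_and_norm(token_vec, token_vec * 0.1).round(2))
```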

Decoder


[Figure: Transformer decoder architecture]

The decoder generates the output one token at a time and keeps generating until a special token called end-of-sequence is produced. It ties together the input and the output generated so far to produce each new output token.


It first creates the numerical representation of the output generated so far, then computes how much influence each output token has on every other output token. Next, it calculates how much attention each output token should pay to every input token, using the encoder's output. Finally, feed-forward neural networks produce a score for every candidate output token, and softmax turns those scores into probabilities for picking the new output token.

Let’s look at each step in detail.



Step 1: Output Embedding

Step 2: Multi-Headed Attention for Output

Step 3 : Layer Normalization and addition

Step 4 : Multi-headed attention combining input and output

Step 5 : Sequence generation
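Here is a toy sketch of Step 5 with made-up numbers: the decoder's final layer produces one score per vocabulary word, softmax turns the scores into probabilities, and the most likely word becomes the next output token:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["and", "began", "to", "colonize", "<end>"]
logits = np.array([0.3, 0.1, 0.2, 2.5, 0.4])  # hypothetical decoder scores

probs = softmax(logits)                        # scores -> probabilities
print(vocab[int(np.argmax(probs))])            # -> colonize
# The chosen token is appended to the output and fed back into the
# decoder; generation stops once "<end>" (end-of-sequence) wins.
```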



 


Conclusion


Transformers can be summarised as mathematical models that learn the relationships between words and use them to connect input and output in order to predict the next token.

Now that you have learned how Transformers work, you can take a stab at understanding Generative AI.


And hopefully wear the t-shirt with pride now 👕🖖
