Anya Learns Transformers From Scratch!

”Waku waku! Papa says I need to learn about transformers to be a better student at Eden Academy!”

Hey there, fellow code enthusiasts! Today, we’re joining Anya Forger on her most challenging mission yet: understanding how transformers work from scratch. No, not the robots — the neural networks that power ChatGPT, BERT, and all those fancy AI models!

”But Anya doesn’t understand big words!”

Don’t worry, Anya. We’ll break it down together, piece by piece. Let’s start our adventure!

Chapter 1: Input Embeddings — Teaching Anya Her First Words 📚

”What’s an embedding? Is it like hiding under the bed?”

Not quite, Anya! Think of embeddings like this: imagine you have a secret codebook where each word gets a special spy code. Instead of saying “peanuts,” you might say “512-dimensional vector #42”!

So here’s the deal with input embeddings: they allow us to convert our original sentence into input IDs (which are just the positions of each token inside our vocabulary). Then these input IDs get converted to embeddings using an embedding layer — typically 512 dimensions — and this is a trainable layer that learns during training!

import torch 
import torch.nn as nn
import math
import torch.nn.functional as F

class InputEmbeddings(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model  # dimension of embeddings, typically 512
        self.vocab_size = vocab_size  # size of vocabulary
        # The embedding layer is a weight matrix of dimension (vocab_size, d_model)
        # Think of it as a giant lookup table where each word has its own row of 512 numbers
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # x contains token ids like [101, 2023, 456] where each number represents a word
        return self.embedding(x)*math.sqrt(self.d_model)

Notice that we multiply the embeddings by the square root of the dimension? That’s to normalize them! Without this scaling, the values might be too small and training could become unstable.

The key formula for input embeddings:

”So each word becomes a bunch of numbers? Like my test scores but bigger?”

Exactly! Each word becomes a vector of 512 numbers. And we multiply by the square root to make sure the numbers aren’t too big or too small — like how Bond-man always uses just the right amount of spy gadgets!

Chapter 2: Positional Encoding — Teaching Anya About Order 🎯

”But wait! How does the transformer know if ‘Anya loves peanuts’ is different from ‘Peanuts love Anya’?”

Great question! That’s where positional encoding comes in. It’s like giving each word a secret mission number!

Here’s the thing: our original sentence doesn’t have any order information, and we need to tell the model about the order of tokens. We do this by adding positional encoding to the input embeddings. The positional encoding is a vector of the same dimension as the input embeddings, and we use sinusoidal functions to generate it!

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len  # maximum sequence length we can handle
        self.dropout = nn.Dropout(dropout)

        # create a matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        
        # create a vector of shape (seq_len, 1)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)

Now here comes the fun math part! The denominator should be 10000^(2i/d_model). We use exponentials to compute this. When we do torch.arange(0, d_model, 2), we get [0, 2, 4, 6, …] for the even dimensions.

Let me walk you through the mathematical magic happening here:

We start with the expression exp(2i*(-log(10000.0)/d_model))
Rearranging: exp((-2i/d_model)*log(10000.0))
Using the log power rule b*log(a) = log(a^b): exp(log(10000.0)^(-2i/d_model))
Simplifying: 10000.0^(-2i/d_model) = 1/10000.0^(2i/d_model)

        div_term = torch.exp(torch.arange(0, d_model, 2).float()* math.log(10000.0)/d_model)
        
        # apply sin to even indices (2i)
        pe[:, 0::2] = torch.sin(position*div_term)
        # apply cos to odd indices (2i+1)
        pe[:, 1::2] = torch.cos(position*div_term)
        
        # add batch dimension for batch processing
        pe = pe.unsqueeze(0)
        # register as buffer - saved with model but not trained
        self.register_buffer('pe', pe)

    def forward(self, x):
        # add positional encoding to input embeddings
        # requires_grad_(False) ensures it's not part of gradient computation
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)

The magical positional encoding formulas:

For even dimensions (2i):

For odd dimensions (2i+1):

”Sine and cosine? Like the wavy lines Papa draws when he’s doing his psychiatrist work?”

Yes! The sine and cosine waves help the model understand position in a special way. Even positions get sine waves, odd positions get cosine waves. It’s like giving each word a unique spy badge!

sine and cosine graph — heat maps

Chapter 3: Layer Normalization — Keeping Anya’s Thoughts Balanced ⚖️

”Sometimes Anya’s brain gets too excited about peanuts and forgets everything else!”

That’s exactly why we need layer normalization! It helps keep all the numbers balanced.

Layer normalization is a technique that normalizes inputs across features (the last dimension) instead of across the batch like batch norm does. The goal? Stabilize training and ensure inputs to the next layer have zero mean and unit variance. Unlike batchnorm, layernorm works better for NLP models where batch size might vary or be small.

class LayerNormalization(nn.Module):
    def __init__(self, eps: float = 1e-6):
        super().__init__()
        # epsilon prevents division by zero if std is very small
        self.eps = eps

        # learnable parameters for scaling and shifting
        self.alpha = nn.Parameter(torch.ones(1))  # scale
        self.bias = nn.Parameter(torch.zeros(1))  # shift

    def forward(self, x):
        # for each token in the sequence, normalize its feature vector
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)

        # standard normalization with learnable parameters
        normalized = (x - mean) / torch.sqrt(std + self.eps)
        return self.alpha * normalized + self.bias

The layer normalization formula:

Chapter 4: Multi-Head Attention — Anya’s Mind-Reading Powers! 🧠

”Anya can read minds! Is this like that?”

Actually, yes! Multi-head attention is like having multiple Anyas, each reading different thoughts at the same time!

Multi-head attention is THE mechanism that allows the model to jointly attend to information from different positions in the input sequence. It’s a key component in the transformer architecture, capturing relationships between different parts of the input. How? By splitting the input embeddings into multiple heads, applying attention to each head, then concatenating the results!

class MultiHeadAttention(nn.Module): 
    def __init__(self, d_model: int, num_heads: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads

        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        # each head gets this many dimensions
        self.d_k = d_model // num_heads
        
        # linear projections for Q, K, V
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)

        # final linear layer after concatenating heads
        self.linear = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(dropout)

”So Query is like asking a question, Key is like the secret code, and Value is the answer?”

attention-mechanism

Brilliant deduction, Anya! Now let’s see how we split these into multiple heads:

Here’s what happens next: we have our Q, K, and V vectors, and they need to be split into multiple heads. Let’s say we have 4 heads — each head would have dimension (seq_len, d_k) where d_k = d_model/num_heads = 512/4 = 128.

multi head attention

The attention formula for each head is:

Once we compute attention for each head, we concatenate all the results. Why multiply by W_O at the end? Because we want to project the multi-head attention output back to the same dimension as the input embeddings!

def forward(self, x, mask=None, kv_input=None):
        # for self-attention: x is used for Q, K, V
        # for cross-attention: x is used for Q, kv_input is used for K, V
        
        if kv_input is None:
            kv_input = x  # self-attention
        
        batch_size, seq_len, d_model = x.shape
        kv_batch_size, kv_seq_len, kv_d_model = kv_input.shape

        # project to Q, K, V
        q_prime = self.w_q(x)
        k_prime = self.w_k(kv_input)
        v_prime = self.w_v(kv_input)

Now we’re multiplying weight matrices of dimension (d_model, d_model) with input embeddings of dimension (batch_size, seq_len, d_model). We could have used a single weight matrix of dimension (d_model, 3*d_model) and split the output into 3 parts, but using individual matrices makes it clearer!

Time to split into multiple heads! We reshape from (batch_size, seq_len, d_model) to (batch_size, seq_len, num_heads, d_k), then rearrange to (batch_size, num_heads, seq_len, d_k). Why? Because we want to compute attention for each head separately!

        # split into multiple heads and rearrange dimensions
        q_heads = q_prime.view(batch_size, seq_len, self.num_heads, self.d_k).permute(0, 2, 1, 3)
        k_heads = k_prime.view(kv_batch_size, kv_seq_len, self.num_heads, self.d_k).permute(0, 2, 1, 3)
        v_heads = v_prime.view(kv_batch_size, kv_seq_len, self.num_heads, self.d_k).permute(0, 2, 1, 3)

        # compute attention scores
        # scale by sqrt(d_k) to prevent softmax from becoming too peaked
        attention_scores = torch.matmul(q_heads, k_heads.transpose(-2, -1))/math.sqrt(self.d_k)

        # apply mask if provided
        if mask is not None:
            if mask.dim() == 3:
                mask = mask.unsqueeze(1)  # add head dimension
            attention_scores = attention_scores.masked_fill(mask == 0, -1e9)

        # convert to probabilities
        attention_weights = torch.softmax(attention_scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # apply attention to values
        attention_output = torch.matmul(attention_weights, v_heads)

        # concatenate heads and project
        attention_output = attention_output.permute(0, 2, 1, 3).contiguous().view(batch_size, seq_len, d_model)
        attention_output = self.linear(attention_output)

        return attention_output

The attention formula:

And for multi-head attention:

Where each head is:

”Waku waku! So it’s like having 8 mini-Anyas all reading different thoughts and then combining what they learned!”

Chapter 5: Feed Forward Network — Anya’s Brain Processing 🧮

”After reading all those minds, Anya needs to think about what she learned!”

That’s exactly what the feed forward network does!

The feed forward network is a simple two-layer neural network applied to each position in the sequence. It has a hidden layer of dimension d_ff (typically 4 times d_model) with ReLU activation.

class FeedForwardNetwork(nn.Module):
    def __init__(self, d_model: int, d_ff: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.d_ff = d_ff  # typically 4 * d_model
        self.dropout = nn.Dropout(dropout)

        # first linear layer expands from d_model to d_ff
        self.linear_1 = nn.Linear(d_model, d_ff)
        
        # second linear layer compresses back to d_model
        self.linear_2 = nn.Linear(d_ff, d_model)

        self.layer_norm = LayerNormalization()
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand, apply ReLU, compress back
        linear_1 = self.linear_1(x)
        linear_1 = F.relu(linear_1)  # ReLU(x) = max(0, x)
        linear_1 = self.dropout(linear_1)
        linear_2 = self.linear_2(linear_1)
        return linear_2

The feed-forward network formula:

Chapter 6: The Encoder — Papa Loid’s Secret Mission 🕵️

”The encoder is like Papa when he’s on a spy mission?”

Yes! The encoder takes information and encodes it into secret spy messages!

The encoder block combines multi-head attention with a feed forward network, using residual connections and layer normalization after each sub-layer.

encoder block

class EncoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForwardNetwork(d_model, d_ff, dropout)

        # two layer norms - one after attention, one after FFN
        self.layer_norm_1= LayerNormalization()
        self.layer_norm_2 = LayerNormalization()

        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x, mask=None):
        # multi-head attention with residual connection
        # pattern: output = LayerNorm(x + Sublayer(x))
        attention_output = self.attention(x, mask)
        x = self.layer_norm_1(x + self.dropout(attention_output))

        # feed forward with residual connection
        feed_forward_output = self.feed_forward(x)
        x = self.layer_norm_2(x + self.dropout(feed_forward_output))

        return x

The encoder is a stack of encoder blocks (typically 6 in the original transformer). Each block refines the representation further!

class Encoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_heads: int, d_ff: int, dropout: float, num_encoder_blocks: int, seq_len: int):
        super().__init__()
        self.embedding = InputEmbeddings(d_model, vocab_size)
        self.positional_encoding = PositionalEncoding(d_model, seq_len, dropout)
        
        # stack of encoder blocks
        self.encoder_blocks = nn.ModuleList(
            [EncoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_encoder_blocks)]
        )
        
        self.layer_norm = LayerNormalization()

    def forward(self, x, mask=None):
        # convert tokens to embeddings and add position info
        x = self.embedding(x)
        x = self.positional_encoding(x)
        
        # pass through each encoder block
        for encoder_block in self.encoder_blocks:
            x = encoder_block(x, mask)
            
        x = self.layer_norm(x)
        return x

encoder-decoder stack

Chapter 7: The Decoder — Mama Yor’s Translation Skills 🗡️

”The decoder is like Mama when she’s translating assassin orders into normal words?”

Perfect analogy, Anya! The decoder takes the encoded spy messages and turns them into something we can understand!

The decoder block is more complex than the encoder — it has THREE sub-layers! First, masked self-attention (so we can’t peek at future tokens), then cross-attention with the encoder output (to focus on relevant parts of the input), and finally a feed forward network.

decoder block

class DecoderBlock(nn.Module):
    def __init__(self, d_model: int, num_heads: int, d_ff: int, dropout: float):
        super().__init__()
        # three attention/FFN layers
        self.self_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.cross_attention = MultiHeadAttention(d_model, num_heads, dropout)
        self.feed_forward = FeedForwardNetwork(d_model, d_ff, dropout)

        # three layer norms
        self.layer_norm_1 = LayerNormalization()
        self.layer_norm_2 = LayerNormalization()
        self.layer_norm_3 = LayerNormalization()

        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        # 1. masked self-attention
        attention_output = self.self_attention(x, tgt_mask)
        x = self.layer_norm_1(x + self.dropout(attention_output))

        # 2. cross-attention with encoder
        # Q from decoder, K and V from encoder
        attention_output = self.cross_attention(x, src_mask, encoder_output)
        x = self.layer_norm_2(x + self.dropout(attention_output))

        # 3. feed forward
        feed_forward_output = self.feed_forward(x)
        x = self.layer_norm_3(x + self.dropout(feed_forward_output))
        
        return x

The decoder stacks these blocks just like the encoder:

class Decoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, num_heads: int, d_ff: int, dropout: float, num_decoder_blocks: int, seq_len: int):
        super().__init__()
        self.embedding = InputEmbeddings(d_model, vocab_size)
        self.positional_encoding = PositionalEncoding(d_model, seq_len, dropout)
        
        self.decoder_blocks = nn.ModuleList(
            [DecoderBlock(d_model, num_heads, d_ff, dropout) for _ in range(num_decoder_blocks)]
        )   
        
        self.layer_norm = LayerNormalization()

    def forward(self, x, encoder_output, src_mask=None, tgt_mask=None):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        
        for decoder_block in self.decoder_blocks:
            x = decoder_block(x, encoder_output, src_mask, tgt_mask)
            
        x = self.layer_norm(x)
        return x

Chapter 8: The Complete Transformer — Operation Strix! ✨

”So we combine everything like Operation Strix?”

transformer architecture

Exactly! Just like how your family came together for Operation Strix, all these components work together to create the mighty Transformer!

First, let’s create our masking functions. The padding mask ensures we don’t attend to padding tokens, while the look-ahead mask prevents the decoder from cheating by looking at future tokens:

def create_padding_mask(seq, pad_idx=0):
    # returns True for real tokens, False for padding
    return (seq != pad_idx).unsqueeze(1).unsqueeze(2)

def create_look_ahead_mask(size):
    # upper triangular matrix - can't look at future positions
    mask = torch.triu(torch.ones(size, size), diagonal=1)
    return mask == 0

Now for the main transformer class that brings it all together:

class Transformer(nn.Module):
    def __init__(self, 
                 src_vocab_size: int,
                 tgt_vocab_size: int,
                 d_model: int = 512,
                 num_heads: int = 8,
                 d_ff: int = 2048,
                 num_encoder_blocks: int = 6,
                 num_decoder_blocks: int = 6,
                 dropout: float = 0.1,
                 seq_len: int = 512):
        super().__init__()
        
        self.encoder = Encoder(src_vocab_size, d_model, num_heads, d_ff, dropout, num_encoder_blocks, seq_len)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_heads, d_ff, dropout, num_decoder_blocks, seq_len)
        
        # output projection to vocabulary size
        self.output_projection = nn.Linear(d_model, tgt_vocab_size)
        
    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # create masks if not provided
        if src_mask is None:
            src_mask = create_padding_mask(src)
        if tgt_mask is None:
            tgt_mask = create_padding_mask(tgt) & create_look_ahead_mask(tgt.size(1))
        
        # encode source
        encoder_output = self.encoder(src, src_mask)
        
        # decode target
        decoder_output = self.decoder(tgt, encoder_output, src_mask, tgt_mask)
        
        # project to vocabulary
        output = self.output_projection(decoder_output)
        
        return output
    
    def forward_with_softmax(self, src, tgt, src_mask=None, tgt_mask=None):
        # for when you want probabilities instead of logits
        logits = self.forward(src, tgt, src_mask, tgt_mask)
        probabilities = torch.softmax(logits, dim=-1)
        return probabilities

Let’s See It In Action! 🎬

”Show Anya how it works with peanuts!”

Here’s a simple example showing how to use our transformer:

def simple_translation_example():
    # create vocabulary mappings -> english to hindi
    src_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "hello": 3, "world": 4, "how": 5, "are": 6, "you": 7}
    tgt_vocab = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "namaste": 3, "vishva": 4, "aap": 5, "kaise": 6, "ho": 7}
    
    # create model
    model = Transformer(
        src_vocab_size=len(src_vocab),
        tgt_vocab_size=len(tgt_vocab),
        d_model=128,
        num_heads=4,
        d_ff=512,
        num_encoder_blocks=2,
        num_decoder_blocks=2,
        dropout=0.1,
        seq_len=20
    )
    
    # example sentences
    src_sentences = [
        ["hello", "world"],
        ["how", "are", "you"]
    ]
    
    # convert to token IDs with special tokens
    src_tokens = []
    for sentence in src_sentences:
        tokens = [src_vocab["<SOS>"]] + [src_vocab[word] for word in sentence] + [src_vocab["<EOS>"]]
        src_tokens.append(tokens)
    
    # pad sequences for batch processing
    max_len = max(len(tokens) for tokens in src_tokens)
    src_padded = []
    for tokens in src_tokens:
        padded = tokens + [src_vocab["<PAD>"]] * (max_len - len(tokens))
        src_padded.append(padded)
    
    src_tensor = torch.tensor(src_padded)
    
    # model forward pass
    model.eval()
    with torch.no_grad():
        tgt_dummy = torch.randint(0, len(tgt_vocab), (len(src_sentences), max_len))
        output = model(src_tensor, tgt_dummy)
        print(f"Model output shape: {output.shape}")

Training Our Spy Network 🎯

”How do we train it to be a better spy?”

Just like how you practice at Eden Academy, we train our transformer with lots of examples!

def training_example():
    model = Transformer(
        src_vocab_size=1000,
        tgt_vocab_size=1000,
        d_model=256,
        num_heads=4,
        d_ff=1024,
        num_encoder_blocks=3,
        num_decoder_blocks=3,
        dropout=0.1,
        seq_len=50
    )
    
    # adam optimizer works well for transformers
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # ignore_index=0 means don't calculate loss on padding tokens
    criterion = nn.CrossEntropyLoss(ignore_index=0)
    
    model.train()
    num_epochs = 5
    
    for epoch in range(num_epochs):
        total_loss = 0
        num_batches = 10
        
        for batch in range(num_batches):
            batch_size = 4
            src_seq_len = 20
            tgt_seq_len = 15
            
            src = torch.randint(1, 1000, (batch_size, src_seq_len))
            tgt_input = torch.randint(1, 1000, (batch_size, tgt_seq_len))
            tgt_output = torch.randint(1, 1000, (batch_size, tgt_seq_len))
            
            optimizer.zero_grad()
            output = model(src, tgt_input)
            
            # reshape for cross entropy loss
            loss = criterion(output.view(-1, 1000), tgt_output.view(-1))
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / num_batches
        print(f"Epoch {epoch+1}/{num_epochs}, Average Loss: {avg_loss:.4f}")

Inference: Generating Translations 🔮

”How does the transformer actually translate?”

During inference, we generate one token at a time!

def inference_example():
    model = Transformer(
        src_vocab_size=1000,
        tgt_vocab_size=1000,
        d_model=256,
        num_heads=4,
        d_ff=1024,
        num_encoder_blocks=3,
        num_decoder_blocks=3,
        dropout=0.0,  # no dropout during inference
        seq_len=50
    )
    
    model.eval()
    
    src = torch.tensor([[1, 2, 3, 4, 5]])  # Single sequence
    
    # start with <SOS> token
    tgt = torch.tensor([[1]])
    
    with torch.no_grad():
        # encode source once
        src_mask = create_padding_mask(src)
        encoder_output = model.encoder(src, src_mask)
        
        # generate token by token
        for _ in range(10):
            tgt_mask = create_padding_mask(tgt) & create_look_ahead_mask(tgt.size(1))
            
            decoder_output = model.decoder(tgt, encoder_output, src_mask, tgt_mask)
            output = model.output_projection(decoder_output)
            
            # take highest probability token
            next_token = output[:, -1, :].argmax(dim=-1).unsqueeze(1)
            tgt = torch.cat([tgt, next_token], dim=1)
            
            # stop at <EOS>
            if next_token.item() == 2:
                break

The generation process formula:

The Grand Finale: Anya’s Understanding! 🎉

”Anya understands now! The transformer is like a big spy organization where:

Input embeddings are secret codes for words
Positional encoding tells us the mission order
Attention is like mind-reading to understand connections
Feed forward is thinking about what we learned
Encoder is Papa encoding spy messages
Decoder is Mama translating them back!”

You’ve got it, Anya! The transformer architecture has revolutionized how we process language, and now you understand it from scratch!

Key Takeaways 🌟

The Transformer architecture revolutionized NLP by:

1. Parallelization: Unlike RNNs, all positions can be processed simultaneously

2. Long-range dependencies: Attention can connect distant words directly

3. Interpretability: Attention weights show what the model is “looking at”

The complete flow:

Conclusion: Mission Complete! 🌟

”Waku waku! Anya learned transformers! Now Anya can be an AI spy like Papa!”

Congratulations! You’ve just learned how transformers work from scratch, with Anya as your guide. From understanding embeddings to building the complete architecture, you now have the knowledge to create your own transformer models!

Now go forth and build amazing AI models! And don’t forget your peanuts! 🥜