<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://karlosmid.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://karlosmid.com/" rel="alternate" type="text/html" /><updated>2026-03-29T17:32:51+00:00</updated><id>https://karlosmid.com/feed.xml</id><title type="html">Karlo Šmid Blog</title><subtitle>karlosmid.github.io</subtitle><entry><title type="html">An unexpected display of art</title><link href="https://karlosmid.com/2026/03/an-unexpected-display-of-art/" rel="alternate" type="text/html" title="An unexpected display of art" /><published>2026-03-29T08:00:00+00:00</published><updated>2026-03-29T08:00:00+00:00</updated><id>https://karlosmid.com/2026/03/an-unexpected-display-of-art</id><content type="html" xml:base="https://karlosmid.com/2026/03/an-unexpected-display-of-art/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>How we had the privilege of attending a Testival meetup and seeing an unexpected display of art.</p>

<h2 id="the-meetup">The meetup</h2>

<p>Zeljko organized this Testival #79 at short notice. It was our second discussion about the book Taking Testing Seriously. There were three of us, and Bojan was the third attendee. Zeljko was the only one who finished the book. He had 25 pages of Kindle notes. He noticed one interesting thing: all of his highlights were from the first two chapters.</p>

<p>The book also contains interview chapters, and with all respect to the interviewees, Zeljko noticed that this was the weakest part of the book. It seems like James and Michael were missing some pages and then decided to add interviews with people who helped shape the Rapid Software Testing methodology.</p>

<p>Bojan was asking for advice on which MacBook he should buy. For him, the important feature is display size, because he uses a laptop instead of a TV set. The problem is that each extra inch increases the price significantly. My only advice was not to buy a Mac with a Touch Bar, which is no longer in production.</p>

<p>I also learned about a free app <a href="https://1001albumsgenerator.com/">One Album A Day</a> that regularly suggests an essential music album that you should listen to before you die. A very interesting idea presented by Bojan.</p>

<p>The summer concert season in Zagreb is coming on pretty strong. I’ll be attending shows by Sepultura, The Offspring, A Perfect Circle, Anthrax, Voivod, Biohazard, Slaughter to Prevail, and Fear Factory.</p>

<p>And a shout-out to Zeljko, who moved our <a href="https://testival.eu">testival.eu</a> WordPress export to Jekyll on GitHub Pages, without images, which was my mistake. This is a much better option because it can be easily moved to any provider, and hosting on GitHub is free.</p>

<h2 id="the-art">The art</h2>

<p>And now the most interesting part. Bojan had to go home because his wife sent him a message saying that he had a surprise waiting for Father’s Day the next day. Here is the art created by two six-year-olds, in their original font, with some help from their mother.</p>

<p><img src="/assets/fathersday_01.jpeg" alt="Father's Day art 1" /></p>

<p><img src="/assets/fathersday02.jpeg" alt="Father's Day art 2" /></p>

<p><img src="/assets/fathersday03.jpeg" alt="Father's Day art 3" /></p>

<p><img src="/assets/fathersday04.jpeg" alt="Father's Day art 4" /></p>

<h2 id="next-meetup">Next meetup</h2>

<p>The next meetup, <a href="https://www.meetup.com/testival/events/313923239">Testival #80</a>, is scheduled for April 15, 2026, at Super, where they will talk about their QA process and how it is smartly supported by LLMs.</p>]]></content><author><name></name></author><category term="meetup" /><category term="testival" /></entry><entry><title type="html">The Heart of an LLM: Attention Mechanism in Elixir</title><link href="https://karlosmid.com/2026/03/coding-attention-mechanism-in-elixir/" rel="alternate" type="text/html" title="The Heart of an LLM: Attention Mechanism in Elixir" /><published>2026-03-28T09:00:00+00:00</published><updated>2026-03-28T09:00:00+00:00</updated><id>https://karlosmid.com/2026/03/coding-attention-mechanism-in-elixir</id><content type="html" xml:base="https://karlosmid.com/2026/03/coding-attention-mechanism-in-elixir/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>This post is based on Chapter 3 of <em>Build a LLM from Scratch</em> by Sebastian Raschka, with one twist: all Python examples are rewritten in Elixir. We are building the LLM's attention mechanism, which is the second part of stage 1. We have already prepared the input text data; the attention mechanism helps the LLM predict the next token. We will implement four attention mechanisms:</p>
<ul>
  <li>simplified self-attention</li>
  <li>self-attention</li>
  <li>causal attention</li>
  <li>multi-head attention</li>
</ul>

<h2 id="the-problem-with-modeling-long-sequences">The problem with modeling long sequences</h2>

<p>The heart of an LLM is its attention mechanism. As an introduction, we should first go back in time to when there was no attention mechanism. Let’s try to translate Croatian to English word by word:</p>

<table>
  <tbody>
    <tr>
      <td>Karlo</td>
      <td>u</td>
      <td>potpunosti</td>
      <td>razumije</td>
      <td>kako</td>
      <td>napraviti</td>
      <td>LLM</td>
      <td>model</td>
      <td>iz</td>
      <td>nule</td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Karlo</td>
      <td>in</td>
      <td>completely</td>
      <td>understands</td>
      <td>how</td>
      <td>make</td>
      <td>LLM</td>
      <td>model</td>
      <td>from</td>
      <td>scratch</td>
    </tr>
  </tbody>
</table>

<p>The words are in the same order in Croatian and English, so that is not the issue. The issue is sentence context, where <code class="language-plaintext highlighter-rouge">u potpunosti</code> translates to one word, <code class="language-plaintext highlighter-rouge">completely</code>:</p>

<table>
  <tbody>
    <tr>
      <td>Karlo</td>
      <td>u potpunosti</td>
      <td>razumije</td>
      <td>kako</td>
      <td>napraviti</td>
      <td>LLM</td>
      <td>model</td>
      <td>iz</td>
      <td>nule</td>
      <td> </td>
    </tr>
    <tr>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
      <td> </td>
    </tr>
    <tr>
      <td>Karlo</td>
      <td>completely</td>
      <td>understands</td>
      <td>how</td>
      <td>to make</td>
      <td>LLM</td>
      <td>model</td>
      <td>from</td>
      <td>scratch</td>
      <td> </td>
    </tr>
  </tbody>
</table>

<p>Now let’s bring back neural networks with an encoder/decoder architecture built from RNNs (recurrent neural networks). The encoder first processes the input text, and its output becomes the input for the decoder module. The encoder processes text token by token, and at each step it updates its hidden state. The hidden state holds the input context. When it is done, the final hidden state goes to the decoder, which translates token by token using that encoder hidden state, updating its own hidden state on each token. This is the core idea; we do not need full RNN details here. We can think of the hidden state as the encoding vectors from Chapter 2.<br />
Then the encoder/decoder architecture hit a problem that eventually gave birth to LLM transformers. The decoder has access only to the final hidden state, so it does not know anything about earlier parts of the input (or previous sentences). The result is a loss of context during translation, especially in sentences where dependencies span long distances.</p>
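
<p>To make the “updates its hidden state” part concrete, here is a minimal sketch of a recurrent update in Nx. It is my own toy example, not the book’s code: the weight matrices <code class="language-plaintext highlighter-rouge">w_xh</code> and <code class="language-plaintext highlighter-rouge">w_hh</code> and their values are made up purely to illustrate h_t = tanh(x_t * W_xh + h_prev * W_hh).</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a toy recurrent update: fold over token embeddings, carrying a hidden state
# w_xh and w_hh are made-up weight matrices, only here to make the shapes concrete
w_xh = Nx.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]], type: {:f, 32})
w_hh = Nx.tensor([[0.1, 0.0], [0.0, 0.1]], type: {:f, 32})
h0 = Nx.broadcast(Nx.tensor(0.0, type: {:f, 32}), {2})

tokens = [[0.43, 0.15, 0.89], [0.55, 0.87, 0.66], [0.57, 0.85, 0.64]]

final_hidden_state =
  Enum.reduce(tokens, h0, fn x_t, h_prev -&gt;
    x_t = Nx.tensor(x_t, type: {:f, 32})
    # h_t = tanh(x_t . W_xh + h_prev . W_hh)
    Nx.tanh(Nx.add(Nx.dot(x_t, w_xh), Nx.dot(h_prev, w_hh)))
  end)
</code></pre></div></div>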

<h2 id="capturing-data-dependencies-with-attention-mechanisms">Capturing data dependencies with attention mechanisms</h2>

<p>The shortcoming of the encoder/decoder architecture was that we had to store all of the encoded input in a single hidden state. To overcome that, the Bahdanau attention mechanism was developed. The idea was to give the decoder selective access to the encoded input during the decoding step. Selective means that some encoded input tokens are more important for the current output token. This inspired the self-attention mechanism in the transformer architecture. So, the self-attention mechanism helps an LLM determine which part of the input sequence is more relevant for a particular input position.</p>

<h2 id="attending-to-different-parts-of-the-input-with-self-attention">Attending to different parts of the input with self-attention</h2>

<p>This is one of the toughest parts of the book.</p>

<blockquote>
  <p>What does “self” in self-attention mean? It means that a transformer creates attention weights using only the input sequence; the input sequence attends to itself. In encoder-decoder architectures, the output depends on information from the input.</p>
</blockquote>

<p>Let’s break this down into chunks!</p>

<h3 id="a-simple-self-attention-mechanism">A simple self-attention mechanism</h3>

<p>First, we will use no trainable weights; this is our first simplification. Let’s work through this test:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">test</span> <span class="s2">"simple self-attention mechanism"</span> <span class="k">do</span>

    <span class="c1"># our input has 6 tokens: "Your journey starts with one step". Each token has 3 dimensions.</span>

    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="c1"># Your (x^1)</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="c1"># journey (x^2)</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="c1"># starts (x^3)</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="c1"># with (x^4)</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="c1"># one (x^5)</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="c1"># step (x^6)</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">}</span>

    <span class="c1"># In Nx we read this like this: slice along axis 0 (rows), starting from index 1 with length of 1</span>
    <span class="c1"># which gives us query x2: [0.55, 0.87, 0.66] </span>

    <span class="n">query</span> <span class="o">=</span>
      <span class="n">inputs</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice_along_axis</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">squeeze</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">query</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">3</span><span class="p">}</span>

    <span class="c1"># we are dotting {6, 3} with {3}; the condition is that we can only dot over axes of the same size, so 3 with 3</span>
    <span class="c1"># so each row from inputs is multiplied with query: multiply corresponding elements and then sum those products</span>
    <span class="c1"># for first row: score_0 = 0.43*0.55 + 0.15*0.87 + 0.89*0.66</span>

    <span class="n">attn_scores_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">query</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="c1"># result is vector with 6 values</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">}</span>

    <span class="n">expected_attn_scores_2</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="mf">0.9544000029563904</span><span class="p">,</span>
          <span class="mf">1.4950000047683716</span><span class="p">,</span>
          <span class="mf">1.4754000902175903</span><span class="p">,</span>
          <span class="mf">0.8434000015258789</span><span class="p">,</span>
          <span class="mf">0.7070000171661377</span><span class="p">,</span>
          <span class="mf">1.0865000486373901</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="ss">:f32</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">,</span> <span class="n">expected_attn_scores_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span>
             <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Attention scores should match expected values exactly"</span>

    <span class="c1"># simple normalization: divide each element by the sum of all elements</span>
    
    <span class="n">attn_scores_2_normalized</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">,</span> <span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">]))</span>

    <span class="n">expected_attn_scores_2_normalized</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="mf">0.14545010030269623</span><span class="p">,</span>
          <span class="mf">0.22783729434013367</span><span class="p">,</span>
          <span class="mf">0.22485026717185974</span><span class="p">,</span>
          <span class="mf">0.1285337507724762</span><span class="p">,</span>
          <span class="mf">0.10774646699428558</span><span class="p">,</span>
          <span class="mf">0.1655820906162262</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_2_normalized</span><span class="p">,</span> <span class="n">expected_attn_scores_2_normalized</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Normalized attention scores should match expected values exactly"</span>

    <span class="n">attn_scores_2_normalized_sum</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">attn_scores_2_normalized</span><span class="p">,</span> <span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="c1"># so when we sum the normalized elements, the value is 1.0 (apparently the LLM likes this normalization)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_2_normalized_sum</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.0</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}),</span>
             <span class="ss">atol:</span> <span class="mf">1.0e-6</span>
           <span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Sum of normalized attention scores should be 1.0"</span>

    <span class="c1"># naive softmax (the real, optimized one comes next) is a more advanced way of normalization that copes with really big numbers</span>

    <span class="n">attn_scores_2_softmax_naive</span> <span class="o">=</span> <span class="n">softmax_naive</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">)</span>

    <span class="n">expected_attn_scores_2_softmax</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="mf">0.13854756951332092</span><span class="p">,</span>
          <span class="mf">0.237891286611557</span><span class="p">,</span>
          <span class="mf">0.23327402770519257</span><span class="p">,</span>
          <span class="mf">0.12399158626794815</span><span class="p">,</span>
          <span class="mf">0.10818187147378922</span><span class="p">,</span>
          <span class="mf">0.15811361372470856</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span>
             <span class="n">attn_scores_2_softmax_naive</span><span class="p">,</span>
             <span class="n">expected_attn_scores_2_softmax</span><span class="p">,</span>
             <span class="ss">atol:</span> <span class="mf">1.0e-6</span>
           <span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Softmax of attention scores should match expected values exactly"</span>

    <span class="c1"># this is Axon softmax with all optimizations</span>

    <span class="n">attn_scores_2_softmax_axon</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">)</span>

    <span class="n">expected_attn_scores_2_softmax_axon</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="mf">0.13854756951332092</span><span class="p">,</span>
          <span class="mf">0.237891286611557</span><span class="p">,</span>
          <span class="mf">0.23327402770519257</span><span class="p">,</span>
          <span class="mf">0.12399158626794815</span><span class="p">,</span>
          <span class="mf">0.10818187147378922</span><span class="p">,</span>
          <span class="mf">0.15811361372470856</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_2_softmax_axon</span><span class="p">,</span> <span class="n">expected_attn_scores_2_softmax_axon</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Axon Softmax of attention scores should match expected values exactly"</span>

    <span class="c1"># {6} dot {6, 3} = {3}</span>
    <span class="c1"># here we multiply the weight vector (a single row) with the first column of inputs, and repeat that for all columns (there are three)</span>
    <span class="c1"># remember that the dot product of two one-dimensional vectors is the multiplication of corresponding elements and then summing them</span>

    <span class="n">context_vec_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">attn_scores_2_softmax_axon</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">inputs</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="n">expected_context_vec_2</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span><span class="mf">0.4418657422065735</span><span class="p">,</span> <span class="mf">0.6514819860458374</span><span class="p">,</span> <span class="mf">0.5683088898658752</span><span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vec_2</span><span class="p">,</span> <span class="n">expected_context_vec_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span>
             <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Context vector should match expected values exactly"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">softmax_naive</span><span class="p">(%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">x</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">exp_x</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">exp_x</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">exp_x</span><span class="p">,</span> <span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">]))</span>
  <span class="k">end</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">inputs</code> is a fictional token embedding tensor, the last thing we built in the post <a href="https://karlosmid.com/2026/01/build-llm-from-scratch-chapter-2-working-with-text-data/">Working with text data</a>. It contains vectors for six tokens (six rows in the inputs Tensor), each with dimension 3 (small enough for demonstration purposes, but a big number in practice). These are vectors x1 to x6. Each vector is the embedding of one word, so we have six words of dimension 3.
<code class="language-plaintext highlighter-rouge">query</code> is the second row of <code class="language-plaintext highlighter-rouge">inputs</code>. It could be any row, but Sebastian picks the second one as the row for which we will calculate simple self-attention.<br />
Then we calculate the <code class="language-plaintext highlighter-rouge">dot</code> product of <code class="language-plaintext highlighter-rouge">inputs</code> with shape <code class="language-plaintext highlighter-rouge">{6,3}</code> and <code class="language-plaintext highlighter-rouge">query</code> with shape <code class="language-plaintext highlighter-rouge">{3}</code>, and the result is, ta-da, a simple attention score Tensor with shape <code class="language-plaintext highlighter-rouge">{6}</code>. Each value is an attention score. But what is the <code class="language-plaintext highlighter-rouge">dot</code> product?</p>

<p><img src="/assets/2026/03/matrix-dot-product-6x3-by-3x1.jpg" alt="ASCII diagram of 6x3 matrix dot product with 3x1 vector" />
<em>Matrix dot product: <code class="language-plaintext highlighter-rouge">A(6x3) * b(3x1) = c(6x1)</code>.</em><br />
An important condition for the <code class="language-plaintext highlighter-rouge">dot</code> product is that the number of columns in the first Tensor must match the number of rows in the second Tensor.</p>

<p>Here is how to apply this with the <code class="language-plaintext highlighter-rouge">Nx.dot</code> function:<br />
<code class="language-plaintext highlighter-rouge">attn_scores_2 = Nx.dot(inputs, [1], query, [0])</code><br />
The parameters are: first tensor, contraction axis of the first tensor, second tensor, contraction axis of the second tensor. So what is this axis? We need to tell <code class="language-plaintext highlighter-rouge">Nx.dot</code> over which axis to multiply and sum. In <code class="language-plaintext highlighter-rouge">{6,3}</code> the axes are 0 and 1; in <code class="language-plaintext highlighter-rouge">{3}</code> the only axis is 0. To satisfy the dot condition (the contracted axes must have the same size), we use the 3 from the first shape and the 3 from the second shape.</p>
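
<p>As a quick sanity check (my own addition, not from the book), the first attention score can be computed by hand and compared with what <code class="language-plaintext highlighter-rouge">Nx.dot</code> returns:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># first row of inputs dotted with the query, element by element
x_1 = Nx.tensor([0.43, 0.15, 0.89], type: {:f, 32})
query = Nx.tensor([0.55, 0.87, 0.66], type: {:f, 32})

by_hand = 0.43 * 0.55 + 0.15 * 0.87 + 0.89 * 0.66
with_nx = Nx.dot(x_1, query) |&gt; Nx.to_number()

# both are ~0.9544, the first value of attn_scores_2
Nx.all_close(Nx.tensor(by_hand), Nx.tensor(with_nx), atol: 1.0e-6)
</code></pre></div></div>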

<p>As you will soon learn, the <code class="language-plaintext highlighter-rouge">dot</code> product is the heart of the attention mechanism! This is all that GPUs are doing, over and over: dot, dot, dot, …</p>

<p>But why the <code class="language-plaintext highlighter-rouge">dot</code> product? Because the result measures the similarity of the vectors that we <code class="language-plaintext highlighter-rouge">dot</code>: the higher the number, the greater the similarity. In terms of words (as each vector is the embedding of a word), that roughly means how often those words appear together in text.</p>
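
<p>A tiny illustration with my own toy vectors (not the book’s data): a vector has a bigger dot product with a vector that points roughly the same way than with one that does not.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = Nx.tensor([1.0, 0.0, 1.0], type: {:f, 32})
similar = Nx.tensor([0.9, 0.1, 0.8], type: {:f, 32})
different = Nx.tensor([0.0, 1.0, 0.0], type: {:f, 32})

Nx.dot(a, similar) |&gt; Nx.to_number()    # 1.7, high similarity
Nx.dot(a, different) |&gt; Nx.to_number()  # 0.0, no similarity
</code></pre></div></div>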

<p>We are preparing those Tensors for the LLM, so we must please it. It turns out that, to achieve training stability, the LLM likes weights to be &gt; 0 and &lt; 1, so the next step is weight normalization. <code class="language-plaintext highlighter-rouge">attn_scores_2_normalized</code> is that step in the code. In this normalization we first sum all the scores, and then divide each score by that sum. So when we sum the normalized weights, the sum is 1.</p>

<p>But not so fast! Instead of this simple normalization we need to use the <code class="language-plaintext highlighter-rouge">softmax</code> function for normalization. Why? Because it helps to tame big values. <code class="language-plaintext highlighter-rouge">softmax_naive</code> only illustrates the idea and could easily overflow or underflow, while <code class="language-plaintext highlighter-rouge">Axon.Activations.softmax</code> is a well-optimized <code class="language-plaintext highlighter-rouge">softmax</code> that handles overflow and underflow.</p>
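
<p>The standard trick for a numerically stable softmax is to subtract the maximum before exponentiating; the result is mathematically the same, but <code class="language-plaintext highlighter-rouge">Nx.exp</code> never sees a huge argument. Here is a small helper you could drop next to <code class="language-plaintext highlighter-rouge">softmax_naive</code> in the test module; it is my own sketch, and I have not checked how Axon implements it internally.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>defp softmax_stable(%Nx.Tensor{} = x) do
  # subtracting the max keeps the exponents small, avoiding overflow
  shifted = Nx.subtract(x, Nx.reduce_max(x))
  exp_x = Nx.exp(shifted)
  Nx.divide(exp_x, Nx.sum(exp_x))
end

# softmax_naive(Nx.tensor([1000.0, 1000.0])) overflows to NaN,
# while softmax_stable(Nx.tensor([1000.0, 1000.0])) returns [0.5, 0.5]
</code></pre></div></div>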

<p>The last step is to calculate the context vector for the query token from the input. For that query we calculated attention weights. Now comes the hard part: explaining this <code class="language-plaintext highlighter-rouge">Nx.dot</code> calculation. We must sum multiplications over the same axis. As the first Tensor shape is <code class="language-plaintext highlighter-rouge">{6}</code> and the second is <code class="language-plaintext highlighter-rouge">{6, 3}</code>, they match over the first axis (index 0).</p>
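
<p>Put differently, the context vector is just a weighted sum of the input rows. A tiny two-token example with my own numbers makes that visible:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>weights = Nx.tensor([0.25, 0.75], type: {:f, 32})
rows = Nx.tensor([[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]], type: {:f, 32})

# contract over the shared axis of size 2:
# 0.25 * [1, 2, 3] + 0.75 * [5, 6, 7] = [4.0, 5.0, 6.0]
Nx.dot(weights, [0], rows, [0])
</code></pre></div></div>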

<h3 id="computing-attention-weights-for-all-input-tokens">Computing attention weights for all input tokens</h3>

<p>Now we will calculate the context vector for all input words, not just the second one. The code is very similar to the previous one, with a few Nx-specific twists:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"attention weights for all tokens"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="c1"># Your (x^1)</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="c1"># journey (x^2)</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="c1"># starts (x^3)</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="c1"># with (x^4)</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="c1"># one (x^5)</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="c1"># step (x^6)</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="c1"># Scores for every query against every key: Q @ K^T</span>
    <span class="c1"># {6, 3} dot {6, 3} = {6, 3} dot {3, 6} (the second Tensor transposed) = {6, 6}</span>
    <span class="c1"># result is a Tensor {6 x 6} where the first row holds the attention scores of the first token against all tokens, and so on for the second, third, ...</span>
    <span class="c1"># element {1, 1} is special because it is the attention of a token with itself (I apologize for personalizing the token, but I am so excited that I started to understand these calculations!)</span>

    <span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">inputs</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">}</span>

    <span class="n">expected_attn_scores</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="mf">0.9994999766349792</span><span class="p">,</span>
            <span class="mf">0.9544000029563904</span><span class="p">,</span>
            <span class="mf">0.9422000050544739</span><span class="p">,</span>
            <span class="mf">0.47530001401901245</span><span class="p">,</span>
            <span class="mf">0.4575999975204468</span><span class="p">,</span>
            <span class="mf">0.6309999823570251</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.9544000029563904</span><span class="p">,</span>
            <span class="mf">1.4950000047683716</span><span class="p">,</span>
            <span class="mf">1.4754000902175903</span><span class="p">,</span>
            <span class="mf">0.8434000015258789</span><span class="p">,</span>
            <span class="mf">0.7070000171661377</span><span class="p">,</span>
            <span class="mf">1.0865000486373901</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.9422000050544739</span><span class="p">,</span>
            <span class="mf">1.4754000902175903</span><span class="p">,</span>
            <span class="mf">1.4570000171661377</span><span class="p">,</span>
            <span class="mf">0.8295999765396118</span><span class="p">,</span>
            <span class="mf">0.715399980545044</span><span class="p">,</span>
            <span class="mf">1.0605000257492065</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.47530001401901245</span><span class="p">,</span>
            <span class="mf">0.8434000015258789</span><span class="p">,</span>
            <span class="mf">0.8295999765396118</span><span class="p">,</span>
            <span class="mf">0.4936999976634979</span><span class="p">,</span>
            <span class="mf">0.3473999798297882</span><span class="p">,</span>
            <span class="mf">0.656499981880188</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.4575999975204468</span><span class="p">,</span>
            <span class="mf">0.7070000171661377</span><span class="p">,</span>
            <span class="mf">0.715399980545044</span><span class="p">,</span>
            <span class="mf">0.3473999798297882</span><span class="p">,</span>
            <span class="mf">0.665399968624115</span><span class="p">,</span>
            <span class="mf">0.29350000619888306</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.6309999823570251</span><span class="p">,</span>
            <span class="mf">1.0865000486373901</span><span class="p">,</span>
            <span class="mf">1.0605000257492065</span><span class="p">,</span>
            <span class="mf">0.656499981880188</span><span class="p">,</span>
            <span class="mf">0.29350000619888306</span><span class="p">,</span>
            <span class="mf">0.9450000524520874</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">,</span> <span class="n">expected_attn_scores</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Attention scores should match expected values exactly"</span>

    <span class="c1"># Row-wise softmax gives attention weights per query token</span>
    <span class="c1"># do softmax row by row, i.e. over the last axis (the columns)</span>

    <span class="n">attn_weights_softmax</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">,</span> <span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">expected_attn_weights_softmax</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="mf">0.2098347693681717</span><span class="p">,</span>
            <span class="mf">0.20058146119117737</span><span class="p">,</span>
            <span class="mf">0.19814923405647278</span><span class="p">,</span>
            <span class="mf">0.12422822415828705</span><span class="p">,</span>
            <span class="mf">0.12204873561859131</span><span class="p">,</span>
            <span class="mf">0.14515765011310577</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.13854758441448212</span><span class="p">,</span>
            <span class="mf">0.237891286611557</span><span class="p">,</span>
            <span class="mf">0.23327404260635376</span><span class="p">,</span>
            <span class="mf">0.12399159371852875</span><span class="p">,</span>
            <span class="mf">0.10818187147378922</span><span class="p">,</span>
            <span class="mf">0.15811361372470856</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.1390075981616974</span><span class="p">,</span>
            <span class="mf">0.23692145943641663</span><span class="p">,</span>
            <span class="mf">0.23260195553302765</span><span class="p">,</span>
            <span class="mf">0.12420440465211868</span><span class="p">,</span>
            <span class="mf">0.11080020666122437</span><span class="p">,</span>
            <span class="mf">0.15646442770957947</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.14352688193321228</span><span class="p">,</span>
            <span class="mf">0.20739442110061646</span><span class="p">,</span>
            <span class="mf">0.20455202460289001</span><span class="p">,</span>
            <span class="mf">0.14619223773479462</span><span class="p">,</span>
            <span class="mf">0.12629525363445282</span><span class="p">,</span>
            <span class="mf">0.172039195895195</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15261085331439972</span><span class="p">,</span>
            <span class="mf">0.19583867490291595</span><span class="p">,</span>
            <span class="mf">0.1974906474351883</span><span class="p">,</span>
            <span class="mf">0.13668666779994965</span><span class="p">,</span>
            <span class="mf">0.18785890936851501</span><span class="p">,</span>
            <span class="mf">0.12951429188251495</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.13847115635871887</span><span class="p">,</span>
            <span class="mf">0.2183637171983719</span><span class="p">,</span>
            <span class="mf">0.21275943517684937</span><span class="p">,</span>
            <span class="mf">0.14204756915569305</span><span class="p">,</span>
            <span class="mf">0.09880637377500534</span><span class="p">,</span>
            <span class="mf">0.18955175578594208</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_weights_softmax</span><span class="p">,</span> <span class="n">expected_attn_weights_softmax</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span>
             <span class="mi">1</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_weights_softmax</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">}</span>

    <span class="c1"># [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]</span>

    <span class="n">row_sums</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">attn_weights_softmax</span><span class="p">,</span> <span class="ss">axes:</span> <span class="p">[</span><span class="mi">1</span><span class="p">])</span>

    <span class="n">ones</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}),</span> <span class="p">{</span><span class="mi">6</span><span class="p">})</span>

    <span class="c1"># believe it or not, the sum of each row (over columns) is 1!</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">row_sums</span><span class="p">,</span> <span class="n">ones</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span>

    <span class="c1"># {6, 6} dot {6, 3} = {6, 3}</span>
    <span class="c1"># result is context vector for all tokens!</span>

    <span class="n">context_vecs</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">attn_weights_softmax</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">inputs</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">expected_context_vecs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.4420594274997711</span><span class="p">,</span> <span class="mf">0.5930986404418945</span><span class="p">,</span> <span class="mf">0.5789890885353088</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.4418657422065735</span><span class="p">,</span> <span class="mf">0.6514819860458374</span><span class="p">,</span> <span class="mf">0.5683088898658752</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.4431275427341461</span><span class="p">,</span> <span class="mf">0.6495946049690247</span><span class="p">,</span> <span class="mf">0.5670731067657471</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.43038973212242126</span><span class="p">,</span> <span class="mf">0.6298280954360962</span><span class="p">,</span> <span class="mf">0.5510270595550537</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.4671017527580261</span><span class="p">,</span> <span class="mf">0.5909927487373352</span><span class="p">,</span> <span class="mf">0.5265965461730957</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.41772449016571045</span><span class="p">,</span> <span class="mf">0.650323212146759</span><span class="p">,</span> <span class="mf">0.5645352005958557</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">,</span> <span class="n">expected_context_vecs</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span>
             <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"Context vectors should match expected values exactly"</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">3</span><span class="p">}</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We start with the same token embeddings for six words in three dimensions. <code class="language-plaintext highlighter-rouge">Nx.dot</code> is a powerful function, so there is no need for any <code class="language-plaintext highlighter-rouge">Enum</code> looping. With this:</p>

<p><code class="language-plaintext highlighter-rouge">attn_scores = Nx.dot(inputs, [1], inputs, [1])</code><br />
<code class="language-plaintext highlighter-rouge">assert Nx.shape(attn_scores) == {6, 6}</code></p>

<p>We multiply <code class="language-plaintext highlighter-rouge">inputs</code> by itself, contracting both tensors over their columns (axis 1). This is the same as multiplying <code class="language-plaintext highlighter-rouge">inputs</code> by its own transpose, which keeps the usual rule that the number of columns in the first tensor must equal the number of rows in the second one.</p>
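
<p>A quick equivalence check (my own addition, reusing <code class="language-plaintext highlighter-rouge">inputs</code> from the test above): contracting both tensors over axis 1 gives the same result as multiplying <code class="language-plaintext highlighter-rouge">inputs</code> by its transpose.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># both spellings produce the same {6, 6} score matrix
scores_a = Nx.dot(inputs, [1], inputs, [1])
scores_b = Nx.dot(inputs, Nx.transpose(inputs))

Nx.all_close(scores_a, scores_b) |&gt; Nx.to_number()  # 1
</code></pre></div></div>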

<p>Then we calculate the <code class="language-plaintext highlighter-rouge">softmax</code>:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>attn_weights_softmax = Axon.Activations.softmax(attn_scores, axis: -1)
</code></pre></div></div>

<p>Here we tell softmax to sum over the last axis. For shape <code class="language-plaintext highlighter-rouge">{6, 6}</code>, that means each row is normalized across columns.</p>
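
<p>To see what the <code class="language-plaintext highlighter-rouge">axis</code> option changes, here is a toy 2x2 example with my own numbers: with <code class="language-plaintext highlighter-rouge">axis: -1</code> each row sums to 1, with <code class="language-plaintext highlighter-rouge">axis: 0</code> each column does.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>t = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], type: {:f, 32})

row_wise = Axon.Activations.softmax(t, axis: -1)
Nx.sum(row_wise, axes: [1])  # ~[1.0, 1.0], each row sums to 1

col_wise = Axon.Activations.softmax(t, axis: 0)
Nx.sum(col_wise, axes: [0])  # ~[1.0, 1.0], each column sums to 1
</code></pre></div></div>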

<p>To calculate the context vector:</p>

<p><code class="language-plaintext highlighter-rouge">context_vecs = Nx.dot(attn_weights_softmax, [1], inputs, [0])</code></p>

<p>We compute the dot product of weights with shape <code class="language-plaintext highlighter-rouge">{6, 6}</code> and inputs with shape <code class="language-plaintext highlighter-rouge">{6, 3}</code> over the second and first axes (columns and rows). You can check that row 2 in <code class="language-plaintext highlighter-rouge">expected_context_vecs</code> is the same as <code class="language-plaintext highlighter-rouge">context_vecs</code> from the previous section.</p>
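
<p>One way to verify that claim inside the test (my own assertion, not from the book) is to slice out row index 1 of <code class="language-plaintext highlighter-rouge">context_vecs</code> and compare it with the values we already asserted for <code class="language-plaintext highlighter-rouge">context_vec_2</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># row index 1 of the all-tokens result should equal the single-query result
row_2 =
  context_vecs
  |&gt; Nx.slice_along_axis(1, 1, axis: 0)
  |&gt; Nx.squeeze(axes: [0])

expected_row_2 =
  Nx.tensor([0.4418657422065735, 0.6514819860458374, 0.5683088898658752], type: {:f, 32})

assert Nx.all_close(row_2, expected_row_2, atol: 1.0e-6) |&gt; Nx.to_number() == 1
</code></pre></div></div>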

<h2 id="implementing-self-attention-with-trainable-weights">Implementing self-attention with trainable weights</h2>

<p>OK, let’s complicate things. We will implement scaled dot-product attention (this sounds scary, but we are building on top of a simple self-attention mechanism). How? We will put trainable weights into the mix.</p>

<h3 id="computing-the-attention-weights-in-a-new-kids-on-the-block-way-step-by-step">Computing the attention weights in a New Kids on the Block way, <a href="https://www.youtube.com/watch?v=VbxJv8MGCbc&amp;list=RDVbxJv8MGCbc&amp;start_radio=1">step by step</a></h3>

<p>As New Kids on the Block got their girls step by step, we will use the same approach for the self-attention mechanism.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"self-attention mechanism with trainable weights"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="c1"># Your (x^1)</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="c1"># journey (x^2)</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="c1"># starts (x^3)</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="c1"># with (x^4)</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="c1"># one (x^5)</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="c1"># step (x^6)</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="c1"># we already know this Nx trick for getting the second row of a Tensor with dimension {6, 3}</span>

    <span class="n">x_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice_along_axis</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">squeeze</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="c1"># this is first dimension of x_2, = 3</span>
    <span class="n">d_in</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">x_2</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="n">elem</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

    <span class="c1"># this is Sebastian's simplification, so we can follow the computation more easily.</span>
    <span class="c1"># in real life, d_out = d_in</span>

    <span class="n">d_out</span> <span class="o">=</span> <span class="mi">2</span>
    <span class="c1"># random key for generating the weights</span>

    <span class="n">key</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>
    <span class="n">query_weights</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Initializers</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="ss">scale:</span> <span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span> <span class="n">key</span><span class="p">)</span>
    <span class="n">key_weights</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Initializers</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="ss">scale:</span> <span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span> <span class="n">key</span><span class="p">)</span>
    <span class="n">value_weights</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Initializers</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="ss">scale:</span> <span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span> <span class="n">key</span><span class="p">)</span>

    <span class="c1"># 1x3 dot 3x2 = 1x2</span>
    <span class="c1"># when dimension rule is satisfied, no of columns = no of rows, no need to state this in Nx.dot as Nx.dot(x_2, [1], query_weights, [0])</span>
    <span class="c1"># we can say that we are projecting token from dimension 3 =&gt; 2</span>

    <span class="n">query_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x_2</span><span class="p">,</span> <span class="n">query_weights</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">query_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="n">d_out</span><span class="p">}</span>
    <span class="n">expected_query_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="o">-</span><span class="mf">0.20726783573627472</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.3094936013221741</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>
    <span class="n">expected_query_2_book</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.4306</span><span class="p">,</span> <span class="mf">1.4551</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">query_2</span><span class="p">,</span> <span class="n">expected_query_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"query_2 should match expected values exactly"</span>

    <span class="c1"># if you are also following examples from Sebastian Python examples, numbers are different</span>
    <span class="c1"># reson is different implementation of Pythorch and Nx random generator</span>

    <span class="n">refute</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">query_2</span><span class="p">,</span> <span class="n">expected_query_2_book</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"query_2 should not match expected values exactly due to different random number generators in PyTorch and Nx"</span>

    <span class="n">key_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x_2</span><span class="p">,</span> <span class="n">key_weights</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">key_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="n">d_out</span><span class="p">}</span>

    <span class="n">value_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x_2</span><span class="p">,</span> <span class="n">value_weights</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">value_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="n">d_out</span><span class="p">}</span>

    <span class="c1"># lets calculate all keys and values</span>
    <span class="c1"># 6x3 dot 3x2 = 6x2</span>
    <span class="n">keys</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">key_weights</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">keys</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="n">d_out</span><span class="p">}</span>

    <span class="c1"># 6x3 dot 3x2 = 6x2</span>
    <span class="n">values</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">value_weights</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">values</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="n">d_out</span><span class="p">}</span>

    <span class="n">keys_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice_along_axis</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">squeeze</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="c1"># 1x2 dot 2x1 = 1x1</span>
    <span class="n">attn_scores_22</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">query_2</span><span class="p">,</span> <span class="n">keys_2</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores_22</span><span class="p">)</span> <span class="o">==</span> <span class="p">{}</span>
    <span class="n">expected_attn_scores_22</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">0.1387462466955185</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_22</span><span class="p">,</span> <span class="n">expected_attn_scores_22</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span>
             <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"attn_scores_22 should match expected values exactly"</span>

    <span class="n">expected_attn_scores_22_book</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mf">1.8524</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">refute</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_22</span><span class="p">,</span> <span class="n">expected_attn_scores_22_book</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"attn_scores_22 should not match expected values exactly due to different random number generators in PyTorch and Nx"</span>

    <span class="c1"># {2} dot {6,2} over feature dim -&gt; {6}</span>
    <span class="n">attn_scores_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">query_2</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">keys</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">}</span>

    <span class="n">expected_attn_scores_2</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="o">-</span><span class="mf">0.17058327794075012</span><span class="p">,</span>
          <span class="mf">0.1387462466955185</span><span class="p">,</span>
          <span class="mf">0.14079777896404266</span><span class="p">,</span>
          <span class="mf">0.10855000466108322</span><span class="p">,</span>
          <span class="mf">0.13786746561527252</span><span class="p">,</span>
          <span class="mf">0.09178745746612549</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_scores_2</span><span class="p">,</span> <span class="n">expected_attn_scores_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"attn_scores_2 should match expected values"</span>

    <span class="c1"># for normalization we are using last (second) axis of keys</span>

    <span class="n">d_k</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">axis_size</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">attn_weights_2</span> <span class="o">=</span>
      <span class="n">attn_scores_2</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_weights_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">}</span>

    <span class="n">expected_attn_weights_2</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="mf">0.1397317796945572</span><span class="p">,</span>
          <span class="mf">0.17389537394046783</span><span class="p">,</span>
          <span class="mf">0.1741478145122528</span><span class="p">,</span>
          <span class="mf">0.1702217161655426</span><span class="p">,</span>
          <span class="mf">0.17378734052181244</span><span class="p">,</span>
          <span class="mf">0.16821600496768951</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_weights_2</span><span class="p">,</span> <span class="n">expected_attn_weights_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"attn_weights_2 should match expected values"</span>

    <span class="c1"># {6} dot {6, 2} = {2}</span>
    <span class="n">context_vec_2</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">attn_weights_2</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">values</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_vec_2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="n">d_out</span><span class="p">}</span>

    <span class="n">expected_context_vec_2</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span><span class="o">-</span><span class="mf">0.11537063866853714</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.18990936875343323</span><span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vec_2</span><span class="p">,</span> <span class="n">expected_context_vec_2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"context_vec_2 should match expected values"</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We start with the same <code class="language-plaintext highlighter-rouge">inputs</code> of six words with shape <code class="language-plaintext highlighter-rouge">{6,3}</code>. Then we introduce the three kings of LLMs: query, key, and value trainable matrices. Let’s use some scary math notation: Wq, Wk, and Wv. Again, Sebastian picks x2 from inputs, but that could be any of x1..x6. We set the dimensions of the Wq, Wk, and Wv matrices to <code class="language-plaintext highlighter-rouge">{3, 2}</code>. Three is the second axis of <code class="language-plaintext highlighter-rouge">inputs</code>, and 2 is used to make the computation easier to follow. In the real world, we would likely have <code class="language-plaintext highlighter-rouge">{3,3}</code>.</p>
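
<p>As a minimal illustration of that projection (the numbers below are made up; only the shapes matter), dotting a <code class="language-plaintext highlighter-rouge">{3}</code> token with a <code class="language-plaintext highlighter-rouge">{3, 2}</code> weight matrix gives a <code class="language-plaintext highlighter-rouge">{2}</code> vector:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical numbers, chosen only to show the {3} dot {3, 2} = {2} projection
x = Nx.tensor([0.55, 0.87, 0.66], type: {:f, 32})
w_q = Nx.tensor([[0.1, 0.4], [0.2, 0.5], [0.3, 0.6]], type: {:f, 32})

Nx.dot(x, w_q)
# =&gt; [0.55 * 0.1 + 0.87 * 0.2 + 0.66 * 0.3, 0.55 * 0.4 + 0.87 * 0.5 + 0.66 * 0.6]
# =&gt; approximately [0.427, 1.051]
</code></pre></div></div>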

<p>The next step is to fill Wq, Wk, and Wv with random values with the help of the <code class="language-plaintext highlighter-rouge">Axon.Initializers.uniform</code> function, and to compute <code class="language-plaintext highlighter-rouge">query_2</code>, <code class="language-plaintext highlighter-rouge">key_2</code>, and <code class="language-plaintext highlighter-rouge">value_2</code> for <code class="language-plaintext highlighter-rouge">x_2</code> by taking the <code class="language-plaintext highlighter-rouge">dot</code> of <code class="language-plaintext highlighter-rouge">x_2</code> and the corresponding weight matrices. The results are vectors of dimension <code class="language-plaintext highlighter-rouge">{2}</code>.</p>

<p>Sebastian explains the difference between weights and attention weights. Weight matrices are static values learned during model training, while attention weights are dynamic, context-specific values.</p>

<p>Using the <code class="language-plaintext highlighter-rouge">dot</code> product, we can also calculate <code class="language-plaintext highlighter-rouge">keys</code> and <code class="language-plaintext highlighter-rouge">values</code> by dotting <code class="language-plaintext highlighter-rouge">inputs</code> with Wk and Wv.</p>

<p>Now it’s time for <code class="language-plaintext highlighter-rouge">attn_scores_22</code>, where we dot <code class="language-plaintext highlighter-rouge">query_2</code> and <code class="language-plaintext highlighter-rouge">keys_2</code>, and for <code class="language-plaintext highlighter-rouge">attn_scores_2</code>, where we dot <code class="language-plaintext highlighter-rouge">query_2</code> and <code class="language-plaintext highlighter-rouge">keys</code>. We have prepared everything to calculate <code class="language-plaintext highlighter-rouge">attn_weights_2</code>. We first scale <code class="language-plaintext highlighter-rouge">attn_scores_2</code> by dividing each vector element by <code class="language-plaintext highlighter-rouge">Nx.sqrt</code> of the last dimension of <code class="language-plaintext highlighter-rouge">keys</code> <code class="language-plaintext highlighter-rouge">{6,2}</code>, which is 2. Then we apply the <code class="language-plaintext highlighter-rouge">softmax</code> function over the last dimension. Why are we doing this? To avoid small gradients, and we will learn that small gradients are the worst enemies of LLMs. This is called scaled dot-product attention.</p>
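
<p>Here is a minimal, standalone sketch of that scaling and <code class="language-plaintext highlighter-rouge">softmax</code> step; the score values are illustrative and not taken from the test, only the shape <code class="language-plaintext highlighter-rouge">{6}</code> matters:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># toy attention scores for six tokens; d_k is the key dimension, here 2
scores = Nx.tensor([-0.17, 0.14, 0.14, 0.11, 0.14, 0.09], type: {:f, 32})
d_k = Nx.tensor(2.0, type: {:f, 32})

attn_weights =
  scores
  |&gt; Nx.divide(Nx.sqrt(d_k))
  |&gt; Axon.Activations.softmax(axis: -1)

# attn_weights keeps shape {6} and its elements sum to 1.0
</code></pre></div></div>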

<p>Finally, we calculate <code class="language-plaintext highlighter-rouge">context_vec_2</code> (the context vector for the x2 token) with <code class="language-plaintext highlighter-rouge">Nx.dot(attn_weights_2, [0], values, [0])</code>, which dots <code class="language-plaintext highlighter-rouge">attn_weights_2</code> and <code class="language-plaintext highlighter-rouge">values</code>. Note that we need to set the axes explicitly because we are dotting dimensions <code class="language-plaintext highlighter-rouge">{6}</code> and <code class="language-plaintext highlighter-rouge">{6,2}</code>. We use <code class="language-plaintext highlighter-rouge">[0]</code> and <code class="language-plaintext highlighter-rouge">[0]</code> because those are the compatible axes, 6 and 6.</p>
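
<p>If the explicit axes in <code class="language-plaintext highlighter-rouge">Nx.dot/4</code> look unfamiliar, here is a tiny sketch (with made-up values) of contracting a <code class="language-plaintext highlighter-rouge">{6}</code> vector with a <code class="language-plaintext highlighter-rouge">{6, 2}</code> matrix over their shared axis of size 6:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># contract axis 0 of the weights vector with axis 0 of the values matrix
attn_weights = Nx.tensor([0.1, 0.2, 0.1, 0.2, 0.2, 0.2], type: {:f, 32})
values = Nx.iota({6, 2}, type: {:f, 32})

Nx.dot(attn_weights, [0], values, [0])
# =&gt; a {2} context vector: a weighted sum of the six value rows
</code></pre></div></div>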

<p>Why do we use the terms query, key, and value? The analogy comes from data retrieval: the query is the search term, the key is the index entry it is matched against, and the value is the actual stored content. In self-attention, the token we are currently processing provides the query, every input token provides a key it is compared with, and the values of the best-matching tokens contribute most to the context vector.</p>
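
<p>The same analogy in plain Elixir terms, purely illustrative and nothing to do with tensors: the query is what you look up, the keys are what you match against, and the value is what comes back:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># a toy key-value store: matching a query against the keys returns the stored value
store = %{"journey" =&gt; "embedding of journey", "starts" =&gt; "embedding of starts"}
query = "journey"

Map.get(store, query)
# =&gt; "embedding of journey"
</code></pre></div></div>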

<h3 id="implementing-compact-self-attention-modules">Implementing compact self-attention modules</h3>

<p>In this section, we will consolidate code from the previous test, as we will obviously need the self-attention mechanism later in this book. The book uses a class; we will do it via Elixir modules. Version v1 is the same as what we did in the test “self-attention mechanism with trainable weights.” The only difference is that now we calculate the full context vector for all input tokens, so we get a tensor with shape <code class="language-plaintext highlighter-rouge">{6,2}</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV1</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Simple self-attention module with trainable projection weights.

  API:

    * `new/2` and `new/3` - initialize `w_q`, `w_k`, and `w_v`
    * `forward/2` - compute full context vectors for all tokens
  """</span>

  <span class="k">defstruct</span> <span class="p">[</span><span class="ss">:w_q</span><span class="p">,</span> <span class="ss">:w_k</span><span class="p">,</span> <span class="ss">:w_v</span><span class="p">,</span> <span class="ss">:d_in</span><span class="p">,</span> <span class="ss">:d_out</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">]</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">w_q:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">w_k:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">w_v:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">d_in:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">d_out:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span> <span class="o">|</span> <span class="no">nil</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a self-attention module with randomly initialized projection weights.

  ## Arguments

    * `d_in` - input feature size.
    * `d_out` - projection/output feature size.

  ## Returns

    * `%LlmScratch.SelfAttentionV1{}` with initialized `w_q`, `w_k`, and `w_v`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a self-attention module with optional `:seed` and optional manual
  projection weights (`:w_q`, `:w_k`, `:w_v`).

  ## Arguments

    * `d_in` - input feature size.
    * `d_out` - projection/output feature size.
    * `opts` - keyword options:
      `:seed`, `:w_q`, `:w_k`, `:w_v`.

  ## Returns

    * `%LlmScratch.SelfAttentionV1{}` with initialized or provided projections.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_in</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_in</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_out</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_out</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">seed</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">)</span>

    <span class="n">key</span> <span class="o">=</span>
      <span class="k">case</span> <span class="n">seed</span> <span class="k">do</span>
        <span class="no">nil</span> <span class="o">-&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">]))</span>
        <span class="n">int</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">int</span><span class="p">)</span>
        <span class="n">other</span> <span class="o">-&gt;</span> <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
      <span class="k">end</span>

    <span class="c1"># in v1 we uniformly initialize weight tensor</span>

    <span class="n">w_q</span> <span class="o">=</span> <span class="n">init_weight</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="n">key</span><span class="p">,</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:w_q</span><span class="p">))</span>
    <span class="n">w_k</span> <span class="o">=</span> <span class="n">init_weight</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="n">key</span><span class="p">,</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:w_k</span><span class="p">))</span>
    <span class="n">w_v</span> <span class="o">=</span> <span class="n">init_weight</span><span class="p">({</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">},</span> <span class="n">key</span><span class="p">,</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:w_v</span><span class="p">))</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span><span class="ss">w_q:</span> <span class="n">w_q</span><span class="p">,</span> <span class="ss">w_k:</span> <span class="n">w_k</span><span class="p">,</span> <span class="ss">w_v:</span> <span class="n">w_v</span><span class="p">,</span> <span class="ss">d_in:</span> <span class="n">d_in</span><span class="p">,</span> <span class="ss">d_out:</span> <span class="n">d_out</span><span class="p">,</span> <span class="ss">seed:</span> <span class="n">seed</span><span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Computes context vectors for all tokens in `inputs`.

  ## Arguments

    * `sa` - `%LlmScratch.SelfAttentionV1{}` module state.
    * `inputs` - tensor of shape `{num_tokens, d_in}`.

  ## Returns

    * context tensor of shape `{num_tokens, d_out}`.
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{}</span> <span class="o">=</span> <span class="n">sa</span><span class="p">,</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">d_in</span><span class="p">)</span>

    <span class="c1"># calculate query, key and values based on inputs and weight Tensors</span>
    <span class="c1"># {no_of_tokens x token_dimension} dot {d_in, d_out} = {no_of_tokens, d_out}</span>
    <span class="c1"># now you know why we validate_input_shape that token_dimension == d_in</span>

    <span class="n">q</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_q</span><span class="p">)</span>
    <span class="n">k</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_k</span><span class="p">)</span>
    <span class="n">v</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_v</span><span class="p">)</span>

    <span class="c1"># result is context vector {num_of_tokens x d_out}</span>

    <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionCore</span><span class="o">.</span><span class="n">context_from_qkv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">d_out</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">init_weight</span><span class="p">(</span><span class="n">expected_shape</span><span class="p">,</span> <span class="n">_key</span><span class="p">,</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">provided_weight</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">if</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">)</span> <span class="o">!=</span> <span class="n">expected_shape</span> <span class="k">do</span>
      <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
            <span class="s2">"expected weight shape </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">expected_shape</span><span class="p">)</span><span class="si">}</span><span class="s2">, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">))</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>

    <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">,</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">init_weight</span><span class="p">(</span><span class="n">expected_shape</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="no">nil</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Axon</span><span class="o">.</span><span class="no">Initializers</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="ss">scale:</span> <span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="p">(</span><span class="n">expected_shape</span><span class="p">,</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span> <span class="n">key</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">init_weight</span><span class="p">(</span><span class="n">_expected_shape</span><span class="p">,</span> <span class="n">_key</span><span class="p">,</span> <span class="n">other</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"weight must be an Nx.Tensor or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">case</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="o">^</span><span class="n">expected_d_in</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="ss">:ok</span>

      <span class="n">shape</span> <span class="o">-&gt;</span>
        <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
              <span class="s2">"expected inputs shape {num_tokens, </span><span class="si">#{</span><span class="n">expected_d_in</span><span class="si">}</span><span class="s2">}, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
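
<p>Before the shared core below, a quick usage sketch of the module above. The shapes follow the test (six tokens of dimension 3 projected to dimension 2); the input values and seed are illustrative:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># hypothetical inputs: six tokens, each a 3-dimensional embedding
inputs = Nx.iota({6, 3}, type: {:f, 32}) |&gt; Nx.divide(10)

sa = LlmScratch.SelfAttentionV1.new(3, 2, seed: 123)
context = LlmScratch.SelfAttentionV1.forward(sa, inputs)

Nx.shape(context)
# =&gt; {6, 2}
</code></pre></div></div>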

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionCore</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Functions shared by LlmScratch.SelfAttentionV1 and LlmScratch.SelfAttentionV2.
  """</span>

  <span class="nv">@spec</span> <span class="n">context_from_qkv</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">())</span> <span class="p">::</span>
          <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Computes self-attention context vectors from projected query, key, and value
  tensors.

  ## Arguments

    * `q` - query tensor of shape `{num_tokens, d_out}`.
    * `k` - key tensor of shape `{num_tokens, d_out}`.
    * `v` - value tensor of shape `{num_tokens, d_out}`.
    * `d_out` - feature size used for score scaling by `sqrt(d_out)`.

  ## Returns

    * context tensor of shape `{num_tokens, d_out}`.
  """</span>
  <span class="k">def</span> <span class="n">context_from_qkv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">d_out</span><span class="p">)</span> <span class="k">do</span>

    <span class="c1"># {no_of_tokens, d_out} dot {no_of_tokens, d_out} = {no_of_tokens, no_of_tokens}</span>
    <span class="c1"># dot over second dimension d_out</span>
    
    <span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">k</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">])</span>
    <span class="n">d_k</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">d_out</span><span class="p">,</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">attn_weights</span> <span class="o">=</span>
      <span class="n">attn_scores</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">d_k</span><span class="p">))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="c1"># {no_of_tokens, no_of_tokens} dot {no_of_tokens, d_out} = {no_of_tokens, d_out}</span>
    <span class="c1"># because we are dotting row (over second dimension) with column (over first dimension)</span>

    <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">v</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>We have <code class="language-plaintext highlighter-rouge">SelfAttentionCore</code> because the <code class="language-plaintext highlighter-rouge">context_from_qkv</code> function is the same for <code class="language-plaintext highlighter-rouge">SelfAttention V1 and V2</code>. V2 differs by using <code class="language-plaintext highlighter-rouge">torch.nn.Linear</code>-style tensor initialization for weights. Why? Because it is optimized and leads to more stable and effective model training:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Self-attention module initialized from Axon dense layers.

  API:

    * `new/2` and `new/3` - initialize `w_q`, `w_k`, and `w_v`
      using `Axon.input/2 |&gt; Axon.dense/3`
    * `forward/2` - compute full context vectors for all tokens
  """</span>

  <span class="k">defstruct</span> <span class="p">[</span><span class="ss">:w_q</span><span class="p">,</span> <span class="ss">:w_k</span><span class="p">,</span> <span class="ss">:w_v</span><span class="p">,</span> <span class="ss">:d_in</span><span class="p">,</span> <span class="ss">:d_out</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">]</span>

  <span class="nv">@type</span> <span class="n">dense_weights</span> <span class="p">::</span> <span class="p">%{</span><span class="ss">kernel:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="ss">bias:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()}</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">w_q:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_k:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_v:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">d_in:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">d_out:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a self-attention module whose projection weights are initialized via
  Axon dense layers.

  ## Arguments

    * `d_in` - input feature size.
    * `d_out` - projection/output feature size.

  ## Returns

    * `%LlmScratch.SelfAttentionV2{}` with dense-initialized `w_q`, `w_k`, and `w_v`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a self-attention module with Axon dense-initialized projections.

  ## Arguments

    * `d_in` - input feature size.
    * `d_out` - projection/output feature size.
    * `opts` - keyword options:
      `:seed` (optional, deterministic initialization).

  ## Returns

    * `%LlmScratch.SelfAttentionV2{}` with dense-initialized projections.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_in</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_in</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_out</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_out</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">seed</span> <span class="o">=</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">))</span>

    <span class="n">w_q</span> <span class="o">=</span> <span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="s2">"q_proj"</span><span class="p">)</span>
    <span class="n">w_k</span> <span class="o">=</span> <span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="s2">"k_proj"</span><span class="p">)</span>
    <span class="n">w_v</span> <span class="o">=</span> <span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="s2">"v_proj"</span><span class="p">)</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span><span class="ss">w_q:</span> <span class="n">w_q</span><span class="p">,</span> <span class="ss">w_k:</span> <span class="n">w_k</span><span class="p">,</span> <span class="ss">w_v:</span> <span class="n">w_v</span><span class="p">,</span> <span class="ss">d_in:</span> <span class="n">d_in</span><span class="p">,</span> <span class="ss">d_out:</span> <span class="n">d_out</span><span class="p">,</span> <span class="ss">seed:</span> <span class="n">seed</span><span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Computes context vectors for all tokens in `inputs`.

  ## Arguments

    * `sa` - `%LlmScratch.SelfAttentionV2{}` module state.
    * `inputs` - tensor of shape `{num_tokens, d_in}`.

  ## Returns

    * context tensor of shape `{num_tokens, d_out}`.
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{}</span> <span class="o">=</span> <span class="n">sa</span><span class="p">,</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">d_in</span><span class="p">)</span>

    <span class="n">q</span> <span class="o">=</span> <span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_q</span><span class="p">)</span>
    <span class="n">k</span> <span class="o">=</span> <span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_k</span><span class="p">)</span>
    <span class="n">v</span> <span class="o">=</span> <span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">w_v</span><span class="p">)</span>
    <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionCore</span><span class="o">.</span><span class="n">context_from_qkv</span><span class="p">(</span><span class="n">q</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span> <span class="n">sa</span><span class="o">.</span><span class="n">d_out</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">layer_name</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">model</span> <span class="o">=</span>
      <span class="no">Axon</span><span class="o">.</span><span class="n">input</span><span class="p">(</span><span class="s2">"input"</span><span class="p">,</span> <span class="ss">shape:</span> <span class="p">{</span><span class="no">nil</span><span class="p">,</span> <span class="n">d_in</span><span class="p">})</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="n">dense</span><span class="p">(</span><span class="n">d_out</span><span class="p">,</span> <span class="ss">use_bias:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">name:</span> <span class="n">layer_name</span><span class="p">)</span>

    <span class="p">{</span><span class="n">init_fn</span><span class="p">,</span> <span class="n">_predict_fn</span><span class="p">}</span> <span class="o">=</span> <span class="no">Axon</span><span class="o">.</span><span class="n">build</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="ss">seed:</span> <span class="n">seed</span><span class="p">)</span>
    <span class="n">params</span> <span class="o">=</span> <span class="n">init_fn</span><span class="o">.</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">template</span><span class="p">({</span><span class="mi">1</span><span class="p">,</span> <span class="n">d_in</span><span class="p">},</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}),</span> <span class="no">Axon</span><span class="o">.</span><span class="no">ModelState</span><span class="o">.</span><span class="n">empty</span><span class="p">())</span>
    <span class="n">extract_dense_weights!</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">layer_name</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">extract_dense_weights!</span><span class="p">(%</span><span class="no">Axon</span><span class="o">.</span><span class="no">ModelState</span><span class="p">{}</span> <span class="o">=</span> <span class="n">params</span><span class="p">,</span> <span class="n">layer_name</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">params</span>
    <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">ModelState</span><span class="o">.</span><span class="n">trainable_parameters</span><span class="p">()</span>
    <span class="o">|&gt;</span> <span class="n">extract_dense_weights!</span><span class="p">(</span><span class="n">layer_name</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">extract_dense_weights!</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">layer_name</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_map</span><span class="p">(</span><span class="n">params</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">layer_params</span> <span class="o">=</span>
      <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="n">layer_name</span><span class="p">)</span> <span class="o">||</span>
        <span class="no">Enum</span><span class="o">.</span><span class="n">find_value</span><span class="p">(</span><span class="no">Map</span><span class="o">.</span><span class="n">values</span><span class="p">(</span><span class="n">params</span><span class="p">),</span> <span class="k">fn</span>
          <span class="n">layer</span> <span class="ow">when</span> <span class="n">is_map</span><span class="p">(</span><span class="n">layer</span><span class="p">)</span> <span class="o">-&gt;</span>
            <span class="n">kernel</span> <span class="o">=</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer</span><span class="p">,</span> <span class="s2">"kernel"</span><span class="p">)</span> <span class="o">||</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer</span><span class="p">,</span> <span class="ss">:kernel</span><span class="p">)</span>
            <span class="n">bias</span> <span class="o">=</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer</span><span class="p">,</span> <span class="s2">"bias"</span><span class="p">)</span> <span class="o">||</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer</span><span class="p">,</span> <span class="ss">:bias</span><span class="p">)</span>

            <span class="k">if</span> <span class="n">match?</span><span class="p">(%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{},</span> <span class="n">kernel</span><span class="p">)</span> <span class="ow">and</span> <span class="n">match?</span><span class="p">(%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{},</span> <span class="n">bias</span><span class="p">)</span> <span class="k">do</span>
              <span class="n">layer</span>
            <span class="k">else</span>
              <span class="no">nil</span>
            <span class="k">end</span>

          <span class="n">_</span> <span class="o">-&gt;</span>
            <span class="no">nil</span>
        <span class="k">end</span><span class="p">)</span>

    <span class="n">kernel</span> <span class="o">=</span> <span class="n">layer_params</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer_params</span><span class="p">,</span> <span class="s2">"kernel"</span><span class="p">)</span> <span class="o">||</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer_params</span><span class="p">,</span> <span class="ss">:kernel</span><span class="p">))</span>
    <span class="n">bias</span> <span class="o">=</span> <span class="n">layer_params</span> <span class="o">&amp;&amp;</span> <span class="p">(</span><span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer_params</span><span class="p">,</span> <span class="s2">"bias"</span><span class="p">)</span> <span class="o">||</span> <span class="no">Map</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">layer_params</span><span class="p">,</span> <span class="ss">:bias</span><span class="p">))</span>

    <span class="k">if</span> <span class="n">match?</span><span class="p">(%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{},</span> <span class="n">kernel</span><span class="p">)</span> <span class="ow">and</span> <span class="n">match?</span><span class="p">(%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{},</span> <span class="n">bias</span><span class="p">)</span> <span class="k">do</span>
      <span class="p">%{</span><span class="ss">kernel:</span> <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">(</span><span class="n">kernel</span><span class="p">,</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}),</span> <span class="ss">bias:</span> <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">(</span><span class="n">bias</span><span class="p">,</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})}</span>
    <span class="k">else</span>
      <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
            <span class="s2">"could not extract dense kernel/bias params for layer </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">layer_name</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="p">%{</span><span class="ss">kernel:</span> <span class="n">kernel</span><span class="p">,</span> <span class="ss">bias:</span> <span class="n">bias</span><span class="p">})</span> <span class="k">do</span>
    <span class="no">Nx</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">kernel</span><span class="p">),</span> <span class="n">bias</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">nil</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">])</span>
  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">seed</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">seed</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">case</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="o">^</span><span class="n">expected_d_in</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="ss">:ok</span>

      <span class="n">shape</span> <span class="o">-&gt;</span>
        <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
              <span class="s2">"expected inputs shape {num_tokens, </span><span class="si">#{</span><span class="n">expected_d_in</span><span class="si">}</span><span class="s2">}, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>This initialization is more complex than in v1 because it is based on Axon (the original Python implementation builds it on <code class="language-plaintext highlighter-rouge">torch.nn.Linear</code>). For now, we will not dive deeper into this implementation, as that would require a whole new blog post.</p>
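
<p>As a rough standalone sketch (not part of the module above, with illustrative sizes and names), this is essentially what the Axon-based initialization boils down to: build a single dense layer, initialize it, and read back a kernel of shape <code class="language-plaintext highlighter-rouge">{d_in, d_out}</code> and a bias of shape <code class="language-plaintext highlighter-rouge">{d_out}</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch, assuming Axon and Nx are installed; the layer name and seed are illustrative.
model =
  Axon.input("input", shape: {nil, 3})
  |&gt; Axon.dense(2, use_bias: true, name: "w_q")

# Building the model gives us an init function; running it against a template
# produces the initialized parameters without ever running a forward pass.
{init_fn, _predict_fn} = Axon.build(model, seed: 123)
params = init_fn.(Nx.template({1, 3}, {:f, 32}), Axon.ModelState.empty())

# The "w_q" entry holds a "kernel" tensor of shape {3, 2} and a "bias" tensor of shape {2}.
params
|&gt; Axon.ModelState.trainable_parameters()
|&gt; IO.inspect()
</code></pre></div></div>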

<p>Here are some tests where we use the v1 and v2 modules, so you can get a feel for how much less code we need to write. The context vector values differ between the two because they use different weight initializations:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="s2">"Implementing a compact self-attention module"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="c1"># Your (x^1)</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="c1"># journey (x^2)</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="c1"># starts (x^3)</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="c1"># with (x^4)</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="c1"># one (x^5)</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="c1"># step (x^6)</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">sa</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV1</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">context_vecs</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV1</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">sa</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>

    <span class="n">expected_context_vecs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.07548463344573975</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.15017275512218475</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.11537063866853714</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.18990936875343323</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.11561498790979385</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.19015151262283325</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.11222726106643677</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.1867683380842209</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.11577533185482025</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.1902845799922943</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.11009891331195831</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.18466056883335114</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">,</span> <span class="n">expected_context_vecs</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"context_vecs should match expected values exactly"</span>
  <span class="k">end</span>
</code></pre></div></div>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="n">test</span> <span class="s2">"compact self-attention module v2 uses Axon dense initialization"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">sa</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">789</span><span class="p">)</span>

    <span class="n">context_vecs</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">sa</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>

    <span class="n">expected_context_vecs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.20869487524032593</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.11512904614210129</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.1995905637741089</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.10041604191064835</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.197800412774086</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.09748103469610214</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.20753224194049835</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.11311019212007523</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.16690319776535034</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.04650232568383217</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22278699278831482</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.1379932463169098</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">,</span> <span class="n">expected_context_vecs</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"context_vecs should match expected values exactly"</span>
  <span class="k">end</span>
</code></pre></div></div>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="n">test</span> <span class="s2">"exercise 3.1 - transfer v2 weights into v1 and match outputs"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">sa_v2</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="c1"># Axon dense kernels in this project are already shaped {d_in, d_out},</span>
    <span class="c1"># matching V1's expected projection weight layout.</span>
    <span class="c1"># goal of this exercise is to show that v1 and v2 have same calculations, only difference are weight Tensors</span>

    <span class="n">sa_v1</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV1</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span>
        <span class="ss">w_q:</span> <span class="n">sa_v2</span><span class="o">.</span><span class="n">w_q</span><span class="o">.</span><span class="n">kernel</span><span class="p">,</span>
        <span class="ss">w_k:</span> <span class="n">sa_v2</span><span class="o">.</span><span class="n">w_k</span><span class="o">.</span><span class="n">kernel</span><span class="p">,</span>
        <span class="ss">w_v:</span> <span class="n">sa_v2</span><span class="o">.</span><span class="n">w_v</span><span class="o">.</span><span class="n">kernel</span>
      <span class="p">)</span>

    <span class="n">context_v2</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">sa_v2</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>
    <span class="n">context_v1</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV1</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">sa_v1</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_v1</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_v2</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_v1</span><span class="p">,</span> <span class="n">context_v2</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"after copying weights, v1 and v2 should produce the same outputs"</span>
  <span class="k">end</span>
</code></pre></div></div>

<h2 id="hiding-future-words-with-causal-attention">Hiding future words with causal attention</h2>

<p>We are moving from the simple attention mechanism to causal attention. Simple attention considers all tokens, both before and after the current token position, when calculating attention scores. Causal attention considers only the current token and the tokens before it. Why? Because we want to train the LLM to predict the next token using only previous tokens, so it cannot “cheat” by looking up the answer on the right side.</p>

<p>For example, take the six-word sentence <code class="language-plaintext highlighter-rouge">I love learning with tiny models</code>. Before masking, a query token can assign attention scores to every token in the sentence:</p>

<table>
  <thead>
    <tr>
      <th>Query \ Key</th>
      <th style="text-align: right">I</th>
      <th style="text-align: right">love</th>
      <th style="text-align: right">learning</th>
      <th style="text-align: right">with</th>
      <th style="text-align: right">tiny</th>
      <th style="text-align: right">models</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td style="text-align: right">0.25</td>
      <td style="text-align: right">0.20</td>
      <td style="text-align: right">0.18</td>
      <td style="text-align: right">0.12</td>
      <td style="text-align: right">0.13</td>
      <td style="text-align: right">0.12</td>
    </tr>
    <tr>
      <td>love</td>
      <td style="text-align: right">0.14</td>
      <td style="text-align: right">0.28</td>
      <td style="text-align: right">0.19</td>
      <td style="text-align: right">0.16</td>
      <td style="text-align: right">0.11</td>
      <td style="text-align: right">0.12</td>
    </tr>
    <tr>
      <td>learning</td>
      <td style="text-align: right">0.10</td>
      <td style="text-align: right">0.16</td>
      <td style="text-align: right">0.29</td>
      <td style="text-align: right">0.18</td>
      <td style="text-align: right">0.15</td>
      <td style="text-align: right">0.12</td>
    </tr>
    <tr>
      <td>with</td>
      <td style="text-align: right">0.09</td>
      <td style="text-align: right">0.13</td>
      <td style="text-align: right">0.19</td>
      <td style="text-align: right">0.28</td>
      <td style="text-align: right">0.17</td>
      <td style="text-align: right">0.14</td>
    </tr>
    <tr>
      <td>tiny</td>
      <td style="text-align: right">0.08</td>
      <td style="text-align: right">0.11</td>
      <td style="text-align: right">0.13</td>
      <td style="text-align: right">0.20</td>
      <td style="text-align: right">0.31</td>
      <td style="text-align: right">0.17</td>
    </tr>
    <tr>
      <td>models</td>
      <td style="text-align: right">0.07</td>
      <td style="text-align: right">0.09</td>
      <td style="text-align: right">0.11</td>
      <td style="text-align: right">0.14</td>
      <td style="text-align: right">0.21</td>
      <td style="text-align: right">0.38</td>
    </tr>
  </tbody>
</table>

<p>With causal masking, we block all attention weights above the diagonal, so each token can only attend to itself and earlier tokens:</p>

<table>
  <thead>
    <tr>
      <th>Query \ Key</th>
      <th style="text-align: right">I</th>
      <th style="text-align: right">love</th>
      <th style="text-align: right">learning</th>
      <th style="text-align: right">with</th>
      <th style="text-align: right">tiny</th>
      <th style="text-align: right">models</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td style="text-align: right">1.00</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
    </tr>
    <tr>
      <td>love</td>
      <td style="text-align: right">0.22</td>
      <td style="text-align: right">0.78</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
    </tr>
    <tr>
      <td>learning</td>
      <td style="text-align: right">0.18</td>
      <td style="text-align: right">0.27</td>
      <td style="text-align: right">0.55</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
    </tr>
    <tr>
      <td>with</td>
      <td style="text-align: right">0.12</td>
      <td style="text-align: right">0.17</td>
      <td style="text-align: right">0.26</td>
      <td style="text-align: right">0.45</td>
      <td style="text-align: right">X</td>
      <td style="text-align: right">X</td>
    </tr>
    <tr>
      <td>tiny</td>
      <td style="text-align: right">0.10</td>
      <td style="text-align: right">0.13</td>
      <td style="text-align: right">0.16</td>
      <td style="text-align: right">0.24</td>
      <td style="text-align: right">0.37</td>
      <td style="text-align: right">X</td>
    </tr>
    <tr>
      <td>models</td>
      <td style="text-align: right">0.07</td>
      <td style="text-align: right">0.09</td>
      <td style="text-align: right">0.11</td>
      <td style="text-align: right">0.14</td>
      <td style="text-align: right">0.21</td>
      <td style="text-align: right">0.38</td>
    </tr>
  </tbody>
</table>

<p>Here <code class="language-plaintext highlighter-rouge">X</code> means “masked out”. In practice those entries are set to a very large negative value before softmax, so their final attention probability becomes zero.</p>
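
<p>A quick, self-contained sketch (not part of the post’s test suite, with made-up score values) of why a very large negative value works: softmax exponentiates the scores, and <code class="language-plaintext highlighter-rouge">exp</code> of negative infinity is 0, so the masked positions get zero probability while the remaining weights still sum to 1:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch, assuming Nx and Axon are available; the scores below are made up.
scores = Nx.tensor([2.0, 1.0, :neg_infinity, :neg_infinity], type: {:f, 32})

# Softmax turns the negative-infinity entries into exactly 0.0,
# and the two unmasked entries sum to 1.
Axon.Activations.softmax(scores, axis: -1)
# =&gt; roughly [0.731, 0.269, 0.0, 0.0]
</code></pre></div></div>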

<p>The same flow can be visualized like this:</p>

<h3 id="applying-a-causal-attention-mask">Applying a causal attention mask</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------------------------+   1) apply softmax   +-----------------------------+
| Attention scores          | -------------------&gt; | Attention weights           |
| (unnormalized)            |                      | (normalized)                |
+---------------------------+                      +-----------------------------+
                                                         |
                                                         | 2) mask 0's above diagonal
                                                         v
                                            +-----------------------------+
                                            | Masked attention scores     |
                                            | (unnormalized)              |
                                            +-----------------------------+
                                                         |
                                                         | 3) normalize rows
                                                         v
                                            +-----------------------------+
                                            | Masked attention weights    |
                                            | (normalized)                |
                                            +-----------------------------+

"Normalized" means that the values in each row sum to 1.
</code></pre></div></div>
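
<p>As a small warm-up before the full test (a standalone sketch, not part of the post’s test suite, with made-up attention weights), here is the “multiply by a lower-triangular mask, then renormalize the rows” idea on a tiny 3×3 matrix:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A minimal sketch, assuming Nx is installed; the attention weights below are made up.
attn_weights =
  Nx.tensor(
    [
      [0.4, 0.3, 0.3],
      [0.2, 0.5, 0.3],
      [0.1, 0.3, 0.6]
    ],
    type: {:f, 32}
  )

# 1) lower-triangular mask: 1.0 on and below the diagonal, 0.0 above it
mask = Nx.broadcast(1.0, {3, 3}) |&gt; Nx.tril()

# 2) zero out everything above the diagonal
masked = Nx.multiply(attn_weights, mask)

# 3) renormalize each row so it sums to 1 again
row_sums = Nx.sum(masked, axes: [-1], keep_axes: true)
Nx.divide(masked, row_sums)
# first row becomes [1.0, 0.0, 0.0], second roughly [0.286, 0.714, 0.0], third is unchanged
</code></pre></div></div>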

<p>The ASCII diagram above shows what we are going to implement in our codebase. Follow the comments in the code below.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"casual attention mask"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="c1"># first step</span>
    <span class="c1"># start with self attention module v2 from our previous chapter</span>

    <span class="n">sa_v2</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="c1">#torch like linear projection</span>

    <span class="n">queries</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa_v2</span><span class="o">.</span><span class="n">w_q</span><span class="p">)</span>
    <span class="n">keys</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">sa_v2</span><span class="o">.</span><span class="n">w_k</span><span class="p">)</span>

    <span class="c1"># {6,2} dot {6,2} = {6,6} where we connect on second axis of first and second Tensor</span>
    <span class="c1"># remember the condition for Tensor dot product!</span>
    <span class="c1"># we are actually transponding second Tensor</span>

    <span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">keys</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">])</span>

    <span class="c1"># normalize attention scores with softmax to get attention weights, sum of each row is 1</span>

    <span class="n">attn_weights</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">axis_size</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">expected_attn_weights</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="mf">0.1531660407781601</span><span class="p">,</span>
            <span class="mf">0.1543799340724945</span><span class="p">,</span>
            <span class="mf">0.1536969095468521</span><span class="p">,</span>
            <span class="mf">0.1883421689271927</span><span class="p">,</span>
            <span class="mf">0.1566835194826126</span><span class="p">,</span>
            <span class="mf">0.1937314122915268</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.14633320271968842</span><span class="p">,</span>
            <span class="mf">0.14933642745018005</span><span class="p">,</span>
            <span class="mf">0.14835353195667267</span><span class="p">,</span>
            <span class="mf">0.19792792201042175</span><span class="p">,</span>
            <span class="mf">0.1513950079679489</span><span class="p">,</span>
            <span class="mf">0.2066539078950882</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.14623598754405975</span><span class="p">,</span>
            <span class="mf">0.14969584345817566</span><span class="p">,</span>
            <span class="mf">0.1487070620059967</span><span class="p">,</span>
            <span class="mf">0.19760626554489136</span><span class="p">,</span>
            <span class="mf">0.15132319927215576</span><span class="p">,</span>
            <span class="mf">0.20643165707588196</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15648190677165985</span><span class="p">,</span>
            <span class="mf">0.15691247582435608</span><span class="p">,</span>
            <span class="mf">0.15638375282287598</span><span class="p">,</span>
            <span class="mf">0.1835598349571228</span><span class="p">,</span>
            <span class="mf">0.15919820964336395</span><span class="p">,</span>
            <span class="mf">0.18746380507946014</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15005655586719513</span><span class="p">,</span>
            <span class="mf">0.1612723171710968</span><span class="p">,</span>
            <span class="mf">0.1604372262954712</span><span class="p">,</span>
            <span class="mf">0.18309712409973145</span><span class="p">,</span>
            <span class="mf">0.15431715548038483</span><span class="p">,</span>
            <span class="mf">0.1908196359872818</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15678176283836365</span><span class="p">,</span>
            <span class="mf">0.1527770310640335</span><span class="p">,</span>
            <span class="mf">0.15225917100906372</span><span class="p">,</span>
            <span class="mf">0.1877351701259613</span><span class="p">,</span>
            <span class="mf">0.15941700339317322</span><span class="p">,</span>
            <span class="mf">0.1910298764705658</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">expected_attn_weights</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"attn_weights should match expected values"</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">}</span>

    <span class="c1"># second step, zero out values above the diagonal in attention weights</span>

    <span class="n">context_length</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">axis_size</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

    <span class="n">mask_simple</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">{</span><span class="n">context_length</span><span class="p">,</span> <span class="n">context_length</span><span class="p">})</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tril</span><span class="p">()</span>

    <span class="n">mask_simple_expected</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">mask_simple</span><span class="p">,</span> <span class="n">mask_simple_expected</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"mask_simple should match expected values"</span>

    <span class="c1"># to apply mask we are multiplying attention weights with mask_simple</span>
    <span class="c1"># {6 x 6} x {6, 6} = {6, 6}</span>
    <span class="c1"># multiply first row element in first tensor with first row element in second Tensor.</span>
    <span class="c1"># then continue with second elements in row</span>
    <span class="c1"># effect is that elements multiplied with zero are zeroed out</span>

    <span class="n">masked_attn_weights</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">mask_simple</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">masked_attn_weights</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">}</span>

    <span class="n">masked_attn_weights_expected</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="mf">0.1531660407781601</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.14633320271968842</span><span class="p">,</span> <span class="mf">0.14933642745018005</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.14623598754405975</span><span class="p">,</span> <span class="mf">0.14969584345817566</span><span class="p">,</span> <span class="mf">0.1487070620059967</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
            <span class="p">[</span>
              <span class="mf">0.15648190677165985</span><span class="p">,</span>
              <span class="mf">0.15691247582435608</span><span class="p">,</span>
              <span class="mf">0.15638375282287598</span><span class="p">,</span>
              <span class="mf">0.1835598349571228</span><span class="p">,</span>
              <span class="mf">0.0</span><span class="p">,</span>
              <span class="mf">0.0</span>
            <span class="p">],</span>
            <span class="p">[</span>
              <span class="mf">0.15005655586719513</span><span class="p">,</span>
              <span class="mf">0.1612723171710968</span><span class="p">,</span>
              <span class="mf">0.1604372262954712</span><span class="p">,</span>
              <span class="mf">0.18309712409973145</span><span class="p">,</span>
              <span class="mf">0.15431715548038483</span><span class="p">,</span>
              <span class="mf">0.0</span>
            <span class="p">],</span>
            <span class="p">[</span>
              <span class="mf">0.15678176283836365</span><span class="p">,</span>
              <span class="mf">0.1527770310640335</span><span class="p">,</span>
              <span class="mf">0.15225917100906372</span><span class="p">,</span>
              <span class="mf">0.1877351701259613</span><span class="p">,</span>
              <span class="mf">0.15941700339317322</span><span class="p">,</span>
              <span class="mf">0.1910298764705658</span>
            <span class="p">]</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">masked_attn_weights</span><span class="p">,</span> <span class="n">masked_attn_weights_expected</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"masked_attn_weights should match expected values"</span>

    <span class="c1"># Third step is row normalization for masked attention weights, sum of each row must be 1</span>
    <span class="c1"># sum of each row</span>

    <span class="n">row_sums</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">masked_attn_weights</span><span class="p">,</span> <span class="ss">axes:</span> <span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="ss">keep_axes:</span> <span class="no">true</span><span class="p">)</span>

    <span class="c1"># each row element is divided with row sum</span>

    <span class="n">masked_attn_weights_norm</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">masked_attn_weights</span><span class="p">,</span> <span class="n">row_sums</span><span class="p">)</span>

    <span class="n">expected_masked_attn_weights_norm</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.49492135643959045</span><span class="p">,</span> <span class="mf">0.5050787329673767</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.32888707518577576</span><span class="p">,</span> <span class="mf">0.33666834235191345</span><span class="p">,</span> <span class="mf">0.3344445526599884</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.2395114302635193</span><span class="p">,</span>
            <span class="mf">0.2401704639196396</span><span class="p">,</span>
            <span class="mf">0.23936119675636292</span><span class="p">,</span>
            <span class="mf">0.2809569537639618</span><span class="p">,</span>
            <span class="mf">0.0</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.18544265627861023</span><span class="p">,</span>
            <span class="mf">0.1993032991886139</span><span class="p">,</span>
            <span class="mf">0.1982712745666504</span><span class="p">,</span>
            <span class="mf">0.2262747883796692</span><span class="p">,</span>
            <span class="mf">0.1907079815864563</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15678176283836365</span><span class="p">,</span>
            <span class="mf">0.1527770310640335</span><span class="p">,</span>
            <span class="mf">0.15225917100906372</span><span class="p">,</span>
            <span class="mf">0.1877351701259613</span><span class="p">,</span>
            <span class="mf">0.15941700339317322</span><span class="p">,</span>
            <span class="mf">0.1910298764705658</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">masked_attn_weights_norm</span><span class="p">,</span> <span class="n">expected_masked_attn_weights_norm</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"masked_attn_weights_norm should match expected values"</span>

    <span class="c1"># improvement, masking witn negative infinity instead of 1</span>
    <span class="c1"># this is again step 2, we start from attn_scores</span>

    <span class="n">mask</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">{</span><span class="n">context_length</span><span class="p">,</span> <span class="n">context_length</span><span class="p">})</span>
      <span class="c1"># upper triangle above diagonal</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="ss">k:</span> <span class="mi">1</span><span class="p">)</span>

    <span class="n">mask_bool</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">greater</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">)</span>
    <span class="n">neg_inf</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="ss">:neg_infinity</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">))</span>
    <span class="n">masked_neg_inf_attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">mask_bool</span><span class="p">,</span> <span class="n">neg_inf</span><span class="p">,</span> <span class="n">attn_scores</span><span class="p">)</span>

    <span class="n">expected_masked_neg_inf_att_scores</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.5037099719047546</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.7201937437057495</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.69146329164505</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.7123136520385742</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.6792438626289368</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.6886162161827087</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.3948274254798889</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.3909415006637573</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.39571473002433777</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.16911853849887848</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.3698960840702057</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.2679566442966461</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.2752986252307892</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.08846122026443481</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.3303014636039734</span><span class="p">,</span>
            <span class="ss">:neg_infinity</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="o">-</span><span class="mf">0.4973980784416199</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.533991277217865</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.53879314661026</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.24258869886398315</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.473825067281723</span><span class="p">,</span>
            <span class="o">-</span><span class="mf">0.21798484027385712</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">equal</span><span class="p">(</span><span class="n">masked_neg_inf_attn_scores</span><span class="p">,</span> <span class="n">expected_masked_neg_inf_att_scores</span><span class="p">))</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"masked should match expected values"</span>

    <span class="c1"># softmax normalization, note that rows are summing to one out of the box, thanks to negative infinity trick!</span>

    <span class="n">masked_neg_inf_attn_weights_causal</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="n">masked_neg_inf_attn_scores</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">axis_size</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

    <span class="n">expected_masked_neg_inf_attn_weights_causal</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.4949212968349457</span><span class="p">,</span> <span class="mf">0.5050786733627319</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.32888710498809814</span><span class="p">,</span> <span class="mf">0.33666837215423584</span><span class="p">,</span> <span class="mf">0.3344445526599884</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.23951144516468048</span><span class="p">,</span>
            <span class="mf">0.2401704639196396</span><span class="p">,</span>
            <span class="mf">0.2393612116575241</span><span class="p">,</span>
            <span class="mf">0.2809569537639618</span><span class="p">,</span>
            <span class="mf">0.0</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.18544265627861023</span><span class="p">,</span>
            <span class="mf">0.1993032991886139</span><span class="p">,</span>
            <span class="mf">0.1982712745666504</span><span class="p">,</span>
            <span class="mf">0.22627480328083038</span><span class="p">,</span>
            <span class="mf">0.1907079666852951</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15678176283836365</span><span class="p">,</span>
            <span class="mf">0.1527770310640335</span><span class="p">,</span>
            <span class="mf">0.15225917100906372</span><span class="p">,</span>
            <span class="mf">0.1877351701259613</span><span class="p">,</span>
            <span class="mf">0.15941700339317322</span><span class="p">,</span>
            <span class="mf">0.1910298764705658</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">masked_neg_inf_attn_weights_causal</span><span class="p">,</span> <span class="n">expected_masked_neg_inf_attn_weights_causal</span><span class="p">,</span>
             <span class="ss">atol:</span> <span class="mf">1.0e-6</span>
           <span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"masked_attn_weights_causal should match expected values"</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>I hope the code comments were descriptive enough to make clear what is going on in causal attention.</p>
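
<p>If the negative-infinity trick itself is not obvious: <code class="language-plaintext highlighter-rouge">exp(-inf)</code> is <code class="language-plaintext highlighter-rouge">0</code>, so masked positions contribute nothing to the softmax denominator and each row sums to one without a separate renormalization step. Here is a minimal, self-contained sketch; the input values are made up for illustration and are not taken from the test above.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># one attention-score row where the last two positions are causally masked
row = Nx.tensor([0.2, 0.5, :neg_infinity, :neg_infinity])

# softmax turns the -inf entries into exact zeros and the rest into a distribution
Axon.Activations.softmax(row, axis: -1)
# approximately [0.4256, 0.5744, 0.0, 0.0] -- already sums to one
</code></pre></div></div>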

<h3 id="masking-additional-attention-weights-with-dropout">Masking additional attention weights with dropout</h3>

<p>In dropout, we randomly “drop” selected values from attention-weight tensors, or from the final context-vector tensor. Dropping a value means setting it to zero. In this example, we will drop values from the attention-weights tensor. Dropout is used only during the training phase. Why do we need to drop values? To prevent model overfitting, where the model could become overly reliant on a specific set of hidden-layer units.</p>

<p>Assume the dropout probability is <code class="language-plaintext highlighter-rouge">p = 0.5</code>, which means:</p>

<ul>
  <li>each visible attention weight has a 50% chance to be dropped</li>
  <li>kept values are scaled by <code class="language-plaintext highlighter-rouge">1 / (1 - p) = 2.0</code></li>
  <li>masked values stay zero (see the sketch after this list)</li>
</ul>
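
<p>As a quick sanity check of the scaling rule, here is a minimal Nx-only sketch of inverted dropout with a hand-picked keep-mask. Both the weights and the mask below are made up for illustration; in the tests we rely on <code class="language-plaintext highlighter-rouge">Axon.Layers.dropout</code> instead.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>p = 0.5

# two rows of (already causally masked) attention weights
weights = Nx.tensor([[1.0, 0.0], [0.4949, 0.5051]])

# hand-picked keep-mask: 1.0 keeps a value, 0.0 drops it
keep = Nx.tensor([[1.0, 0.0], [0.0, 1.0]])

weights
|&gt; Nx.multiply(keep)
|&gt; Nx.divide(1.0 - p)
# [[2.0, 0.0], [0.0, 1.0102]] -- kept values are doubled, dropped values become zero
</code></pre></div></div>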

<p>Below is one possible dropout-mask example. <code class="language-plaintext highlighter-rouge">keep</code> means the value survives dropout, <code class="language-plaintext highlighter-rouge">drop</code> means it becomes <code class="language-plaintext highlighter-rouge">0.0</code>, and <code class="language-plaintext highlighter-rouge">X</code> marks positions that are already zero because of the causal mask.</p>

<table>
  <thead>
    <tr>
      <th>Query \ Key</th>
      <th>I</th>
      <th>love</th>
      <th>learning</th>
      <th>with</th>
      <th>tiny</th>
      <th>models</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td>keep</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
    </tr>
    <tr>
      <td>love</td>
      <td>drop</td>
      <td>keep</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
    </tr>
    <tr>
      <td>learning</td>
      <td>keep</td>
      <td>drop</td>
      <td>keep</td>
      <td>X</td>
      <td>X</td>
      <td>X</td>
    </tr>
    <tr>
      <td>with</td>
      <td>keep</td>
      <td>keep</td>
      <td>drop</td>
      <td>keep</td>
      <td>X</td>
      <td>X</td>
    </tr>
    <tr>
      <td>tiny</td>
      <td>drop</td>
      <td>keep</td>
      <td>keep</td>
      <td>drop</td>
      <td>keep</td>
      <td>X</td>
    </tr>
    <tr>
      <td>models</td>
      <td>keep</td>
      <td>drop</td>
      <td>keep</td>
      <td>keep</td>
      <td>drop</td>
      <td>keep</td>
    </tr>
  </tbody>
</table>

<p>Applying such a mask to the causal attention weights and scaling the kept values by <code class="language-plaintext highlighter-rouge">2.0</code> gives a final dropout result with this structure (the values here are illustrative):</p>

<table>
  <thead>
    <tr>
      <th>Query \ Key</th>
      <th style="text-align: right">I</th>
      <th style="text-align: right">love</th>
      <th style="text-align: right">learning</th>
      <th style="text-align: right">with</th>
      <th style="text-align: right">tiny</th>
      <th style="text-align: right">models</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td style="text-align: right">2.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td>love</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">1.56</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td>learning</td>
      <td style="text-align: right">0.36</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">1.10</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td>with</td>
      <td style="text-align: right">0.24</td>
      <td style="text-align: right">0.34</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.90</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td>tiny</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.26</td>
      <td style="text-align: right">0.32</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.74</td>
      <td style="text-align: right">0.00</td>
    </tr>
    <tr>
      <td>models</td>
      <td style="text-align: right">0.14</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.22</td>
      <td style="text-align: right">0.28</td>
      <td style="text-align: right">0.00</td>
      <td style="text-align: right">0.76</td>
    </tr>
  </tbody>
</table>

<p>Here is the test that applies dropout:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"dropout"</span> <span class="k">do</span>
    <span class="n">key</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>

    <span class="c1"># Set a {6 x 6} tensor filled with ones.</span>

    <span class="n">example</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">{</span><span class="mi">6</span><span class="p">,</span> <span class="mi">6</span><span class="p">})</span>
    <span class="c1"># Axon provides a dropout function.</span>

    <span class="p">%</span><span class="no">Axon</span><span class="o">.</span><span class="no">StatefulOutput</span><span class="p">{</span><span class="ss">output:</span> <span class="n">dropped</span><span class="p">,</span> <span class="ss">state:</span> <span class="p">%{</span><span class="s2">"key"</span> <span class="o">=&gt;</span> <span class="n">_new_key</span><span class="p">}}</span> <span class="o">=</span>
      <span class="no">Axon</span><span class="o">.</span><span class="no">Layers</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">example</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="ss">rate:</span> <span class="mf">0.5</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:train</span><span class="p">)</span>

    <span class="n">expected_dropped</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">dropped</span><span class="p">,</span> <span class="n">expected_dropped</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"dropped should match expected values"</span>

    <span class="c1"># Causal attention weights from the previous test.</span>

    <span class="n">masked_neg_inf_attn_weights_causal</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.4949212968349457</span><span class="p">,</span> <span class="mf">0.5050786733627319</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.32888710498809814</span><span class="p">,</span> <span class="mf">0.33666837215423584</span><span class="p">,</span> <span class="mf">0.3344445526599884</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.23951144516468048</span><span class="p">,</span>
            <span class="mf">0.2401704639196396</span><span class="p">,</span>
            <span class="mf">0.2393612116575241</span><span class="p">,</span>
            <span class="mf">0.2809569537639618</span><span class="p">,</span>
            <span class="mf">0.0</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.18544265627861023</span><span class="p">,</span>
            <span class="mf">0.1993032991886139</span><span class="p">,</span>
            <span class="mf">0.1982712745666504</span><span class="p">,</span>
            <span class="mf">0.22627480328083038</span><span class="p">,</span>
            <span class="mf">0.1907079666852951</span><span class="p">,</span>
            <span class="mf">0.0</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="mf">0.15678176283836365</span><span class="p">,</span>
            <span class="mf">0.1527770310640335</span><span class="p">,</span>
            <span class="mf">0.15225917100906372</span><span class="p">,</span>
            <span class="mf">0.1877351701259613</span><span class="p">,</span>
            <span class="mf">0.15941700339317322</span><span class="p">,</span>
            <span class="mf">0.1910298764705658</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="p">%</span><span class="no">Axon</span><span class="o">.</span><span class="no">StatefulOutput</span><span class="p">{</span><span class="ss">output:</span> <span class="n">masked_attn_weights_causal_dropped</span><span class="p">,</span> <span class="ss">state:</span> <span class="p">%{</span><span class="s2">"key"</span> <span class="o">=&gt;</span> <span class="n">_new_key</span><span class="p">}}</span> <span class="o">=</span>
      <span class="no">Axon</span><span class="o">.</span><span class="no">Layers</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">masked_neg_inf_attn_weights_causal</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="ss">rate:</span> <span class="mf">0.5</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:train</span><span class="p">)</span>

    <span class="n">expected_masked_neg_inf_attn_weights_causal_dropped</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.47902289032936096</span><span class="p">,</span> <span class="mf">0.4803409278392792</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.37088531255722046</span><span class="p">,</span> <span class="mf">0.3986065983772278</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.3814159333705902</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.30451834201812744</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.31883400678634644</span><span class="p">,</span> <span class="mf">0.3820597529411316</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span>
             <span class="n">masked_neg_inf_attn_weights_causal_dropped</span><span class="p">,</span>
             <span class="n">expected_masked_neg_inf_attn_weights_causal_dropped</span><span class="p">,</span>
             <span class="ss">atol:</span> <span class="mf">1.0e-6</span>
           <span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span><span class="p">,</span>
           <span class="s2">"masked_attn_weights_causal_dropped should match expected values"</span>

  <span class="k">end</span>
</code></pre></div></div>

<h3 id="implementing-a-compact-causal-attention-elixir-module">Implementing a compact causal attention Elixir module</h3>

<p>Next, we implement an Elixir module that combines the causal-attention and dropout steps, and then exercise it in a test. For the test, we prepare a batch consisting of two identical inputs.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">CausalAttention</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Causal self-attention over batched token sequences.

  Mirrors the PyTorch `CausalAttention` module:

    * query/key/value dense projections
    * causal upper-triangular mask stored in the module state
    * attention dropout
  """</span>
  <span class="n">alias</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span>

  <span class="k">defstruct</span> <span class="p">[</span>
    <span class="ss">:w_q</span><span class="p">,</span>
    <span class="ss">:w_k</span><span class="p">,</span>
    <span class="ss">:w_v</span><span class="p">,</span>
    <span class="ss">:mask</span><span class="p">,</span>
    <span class="ss">:d_in</span><span class="p">,</span>
    <span class="ss">:d_out</span><span class="p">,</span>
    <span class="ss">:context_length</span><span class="p">,</span>
    <span class="ss">:dropout</span><span class="p">,</span>
    <span class="ss">:qkv_bias</span><span class="p">,</span>
    <span class="ss">:seed</span>
  <span class="p">]</span>

  <span class="nv">@type</span> <span class="n">dense_weights</span> <span class="p">::</span> <span class="p">%{</span><span class="ss">kernel:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="ss">bias:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()}</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">w_q:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_k:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_v:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">mask:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">d_in:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">d_out:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">context_length:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">dropout:</span> <span class="n">float</span><span class="p">(),</span>
          <span class="ss">qkv_bias:</span> <span class="n">boolean</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a causal attention module with default `qkv_bias: false`.

  Equivalent to `new(d_in, d_out, context_length, dropout, false, [])`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">boolean</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a causal attention module with explicit `qkv_bias`.

  Equivalent to `new(d_in, d_out, context_length, dropout, qkv_bias, [])`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">boolean</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a causal attention module.

  ## Arguments

    * `d_in` - input feature size.
    * `d_out` - projection/output feature size.
    * `context_length` - maximum sequence length for the causal mask.
    * `dropout` - dropout rate in `[0, 1)`, applied to attention weights.
    * `qkv_bias` - whether query/key/value dense layers use bias.
    * `opts` - keyword options:
      `:seed` (optional, deterministic initialization).

  ## Notes

  `mask` in the struct is the Elixir/Nx equivalent of PyTorch `register_buffer`
  for the upper-triangular causal mask.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_in</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_in</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_out</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_out</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span>
             <span class="n">is_integer</span><span class="p">(</span><span class="n">context_length</span><span class="p">)</span> <span class="ow">and</span> <span class="n">context_length</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">seed</span> <span class="o">=</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">))</span>
    <span class="n">qkv_bias</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">normalize_qkv_bias</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span>
    <span class="n">dropout</span> <span class="o">=</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>

    <span class="n">w_q</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"q_proj"</span><span class="p">)</span>
    <span class="n">w_k</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"k_proj"</span><span class="p">)</span>
    <span class="n">w_v</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"v_proj"</span><span class="p">)</span>

    <span class="n">mask</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">{</span><span class="n">context_length</span><span class="p">,</span> <span class="n">context_length</span><span class="p">})</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="ss">k:</span> <span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">({</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
      <span class="ss">w_q:</span> <span class="n">w_q</span><span class="p">,</span>
      <span class="ss">w_k:</span> <span class="n">w_k</span><span class="p">,</span>
      <span class="ss">w_v:</span> <span class="n">w_v</span><span class="p">,</span>
      <span class="ss">mask:</span> <span class="n">mask</span><span class="p">,</span>
      <span class="ss">d_in:</span> <span class="n">d_in</span><span class="p">,</span>
      <span class="ss">d_out:</span> <span class="n">d_out</span><span class="p">,</span>
      <span class="ss">context_length:</span> <span class="n">context_length</span><span class="p">,</span>
      <span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">,</span>
      <span class="ss">qkv_bias:</span> <span class="n">qkv_bias</span><span class="p">,</span>
      <span class="ss">seed:</span> <span class="n">seed</span>
    <span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Computes causal self-attention context vectors for a batched input.

  ## Arguments

    * `ca` - `%LlmScratch.CausalAttention{}` module state.
    * `x` - input tensor of shape `{batch_size, num_tokens, d_in}`.
    * `opts` - keyword options:
      `:mode` (`:train` or `:inference`, default `:train`),
      `:key` (optional Nx random key for dropout when in train mode).

  ## Returns

    * context tensor of shape `{batch_size, num_tokens, d_out}`.
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{}</span> <span class="o">=</span> <span class="n">ca</span><span class="p">,</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">opts</span> <span class="p">\\</span> <span class="p">[])</span> <span class="k">do</span>
    <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">=</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ca</span><span class="o">.</span><span class="n">d_in</span><span class="p">)</span>
    <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">ca</span><span class="o">.</span><span class="n">context_length</span><span class="p">)</span>

    <span class="n">keys</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ca</span><span class="o">.</span><span class="n">w_k</span><span class="p">)</span>
    <span class="n">queries</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ca</span><span class="o">.</span><span class="n">w_q</span><span class="p">)</span>
    <span class="n">values</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ca</span><span class="o">.</span><span class="n">w_v</span><span class="p">)</span>

    <span class="c1"># {no_of_batches, num_tokens, d_out} dot {no_of_batches, num_tokens, d_out}</span>
    <span class="c1"># we are dotting tensors over axis 2 (last one), [0] means that batch is on axis 0</span>
    <span class="c1"># so for each batch we are dotting {num_tokens, d_out} dot {num_tokens, d_out}</span>

    <span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">keys</span><span class="p">,</span> <span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>

    <span class="n">mask</span> <span class="o">=</span>
      <span class="n">ca</span><span class="o">.</span><span class="n">mask</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">])</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">greater</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">({</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">})</span>

    <span class="n">neg_inf</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="ss">:neg_infinity</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">))</span>
    <span class="n">masked_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">neg_inf</span><span class="p">,</span> <span class="n">attn_scores</span><span class="p">)</span>

    <span class="n">attn_weights</span> <span class="o">=</span>
      <span class="n">masked_scores</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">axis_size</span><span class="p">(</span><span class="n">keys</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">ca</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>

    <span class="c1"># {no_of_batches, no_of_tokens, no_of_tokens} dot {no_of_batches, no_of_tokens, d_out}</span>
    <span class="c1"># {no_of_batches, no_of_tokens, d_out}</span>

    <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">values</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">])</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">%{</span><span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">},</span> <span class="n">_opts</span><span class="p">)</span> <span class="ow">when</span> <span class="n">dropout</span> <span class="o">&lt;=</span> <span class="mf">0.0</span><span class="p">,</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">attn_weights</span>

  <span class="k">defp</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">%{</span><span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">,</span> <span class="ss">seed:</span> <span class="n">seed</span><span class="p">},</span> <span class="n">opts</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:mode</span><span class="p">,</span> <span class="ss">:train</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="ss">:train</span> <span class="k">do</span>
      <span class="n">key</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:key</span><span class="p">)</span> <span class="o">||</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>

      <span class="p">%</span><span class="no">Axon</span><span class="o">.</span><span class="no">StatefulOutput</span><span class="p">{</span><span class="ss">output:</span> <span class="n">dropped</span><span class="p">,</span> <span class="ss">state:</span> <span class="p">%{</span><span class="s2">"key"</span> <span class="o">=&gt;</span> <span class="n">_new_key</span><span class="p">}}</span> <span class="o">=</span>
        <span class="no">Axon</span><span class="o">.</span><span class="no">Layers</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="ss">rate:</span> <span class="n">dropout</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:train</span><span class="p">)</span>

      <span class="n">dropped</span>
    <span class="k">else</span>
      <span class="n">attn_weights</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">nil</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">])</span>
  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">seed</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">seed</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_number</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&lt;</span> <span class="mi">1</span> <span class="k">do</span>
    <span class="n">dropout</span> <span class="o">*</span> <span class="mf">1.0</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"dropout must be a number in [0, 1), got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">case</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="o">^</span><span class="n">expected_d_in</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">}</span>

      <span class="n">shape</span> <span class="o">-&gt;</span>
        <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
              <span class="s2">"expected inputs shape {batch_size, num_tokens, </span><span class="si">#{</span><span class="n">expected_d_in</span><span class="si">}</span><span class="s2">}, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">context_length</span><span class="p">)</span> <span class="ow">when</span> <span class="n">num_tokens</span> <span class="o">&lt;=</span> <span class="n">context_length</span><span class="p">,</span>
    <span class="k">do</span><span class="p">:</span> <span class="ss">:ok</span>

  <span class="k">defp</span> <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">context_length</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
          <span class="s2">"num_tokens (</span><span class="si">#{</span><span class="n">num_tokens</span><span class="si">}</span><span class="s2">) exceeds context_length (</span><span class="si">#{</span><span class="n">context_length</span><span class="si">}</span><span class="s2">)"</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
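<p>The private helpers above make the constructor fail fast on bad configuration. As a quick illustration (my own sketch, assuming <code class="language-plaintext highlighter-rouge">new/6</code> routes <code class="language-plaintext highlighter-rouge">dropout</code> through <code class="language-plaintext highlighter-rouge">normalize_dropout/1</code> as the helpers suggest), an out-of-range dropout raises an <code class="language-plaintext highlighter-rouge">ArgumentError</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A sketch, not from the book: new/6 is assumed to call normalize_dropout/1,
# so a dropout outside [0, 1) fails fast with an ArgumentError.
test "causal attention rejects dropout outside [0, 1)" do
  assert_raise ArgumentError, fn -&gt;
    LlmScratch.CausalAttention.new(3, 2, 6, 1.5, false, seed: 123)
  end
end
</code></pre></div></div>

<p>Here is a test that runs the causal attention module on a batched input and checks the resulting context vectors:</p>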

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"causal attention matches stacked batch example"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">d_in</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="n">d_out</span> <span class="o">=</span> <span class="mi">2</span>

    <span class="c1"># batch dimension is {2,6,3}</span>

    <span class="n">batch</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">inputs</span><span class="p">,</span> <span class="n">inputs</span><span class="p">],</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span>

    <span class="c1">#context_length is 6</span>

    <span class="n">context_length</span> <span class="o">=</span> <span class="n">elem</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">batch</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="n">ca</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">CausalAttention</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">context_vecs</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">CausalAttention</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">ca</span><span class="p">,</span> <span class="n">batch</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:inference</span><span class="p">)</span>

    <span class="c1"># we projected dimension third axis 3 =&gt; 2</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>

    <span class="c1"># we batched same input twice, which means that shape on axis: 0 is 2, as first outermost list has two elements</span>

    <span class="n">expected_context_vecs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.49523380398750305</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.17632800340652466</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.07537277787923813</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.13790269196033478</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.06633053719997406</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.12039512395858765</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.11786159127950668</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.10831516981124878</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1877504140138626</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.04864511638879776</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1768769919872284</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.08047633618116379</span><span class="p">]</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.49523380398750305</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.17632800340652466</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.07537277787923813</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.13790269196033478</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.06633053719997406</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.12039512395858765</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.11786159127950668</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.10831516981124878</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1877504140138626</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.04864511638879776</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1768769919872284</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.08047633618116379</span><span class="p">]</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">context_vecs</span><span class="p">,</span> <span class="n">expected_context_vecs</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span>
  <span class="k">end</span>
</code></pre></div></div>

<h2 id="extending-single-head-attention-to-multi-head-attention">Extending single-head attention to multi-head attention</h2>

<blockquote>
  <p><strong>Warning:</strong> This part is really hard!<br />
This section covers multi-head attention, one of the most complex and subtle parts of understanding transformers and LLMs. Take your time and revisit it as needed; the payoff is worth it!</p>
</blockquote>

<p>So what are those heads? One head is one causal attention module, created with <code class="language-plaintext highlighter-rouge">LlmScratch.CausalAttention.new</code>, and the wrapper below runs the heads over the input one at a time. Taking a New Kids on the Block approach, step by step, we will first implement a multi-head causal attention module as a <code class="language-plaintext highlighter-rouge">stack</code> of causal attention modules.</p>

<h3 id="stacking-multiple-single-head-attention-layers">Stacking multiple single-head attention layers</h3>

<p>Let’s stack some causal attention modules (additional info is in the comments):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttentionWrapper</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Thin wrapper that builds multiple independent causal-attention heads and
  concatenates their outputs on the last axis.

  Mirrors the PyTorch module:

      class MultiHeadAttentionWrapper(nn.Module):
          def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
              super().__init__()
              self.heads = nn.ModuleList(
                  [CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
                   for _ in range(num_heads)]
              )

          def forward(self, x):
              return torch.cat([head(x) for head in self.heads], dim=-1)
  """</span>
  <span class="n">alias</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">CausalAttention</span>

  <span class="k">defstruct</span> <span class="p">[</span>
    <span class="ss">:heads</span><span class="p">,</span>
    <span class="ss">:d_in</span><span class="p">,</span>
    <span class="ss">:d_out</span><span class="p">,</span>
    <span class="ss">:context_length</span><span class="p">,</span>
    <span class="ss">:dropout</span><span class="p">,</span>
    <span class="ss">:num_heads</span><span class="p">,</span>
    <span class="ss">:qkv_bias</span><span class="p">,</span>
    <span class="ss">:seed</span>
  <span class="p">]</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">heads:</span> <span class="p">[</span><span class="no">CausalAttention</span><span class="o">.</span><span class="n">t</span><span class="p">()],</span>
          <span class="ss">d_in:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">d_out:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">context_length:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">dropout:</span> <span class="n">float</span><span class="p">(),</span>
          <span class="ss">num_heads:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">qkv_bias:</span> <span class="n">boolean</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a wrapper with default `qkv_bias: false`.

  Equivalent to `new(d_in, d_out, context_length, dropout, num_heads, false, [])`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">boolean</span><span class="p">())</span> <span class="p">::</span>
          <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a wrapper with explicit `qkv_bias`.

  Equivalent to `new(d_in, d_out, context_length, dropout, num_heads, qkv_bias, [])`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">number</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">boolean</span><span class="p">(),</span>
          <span class="n">keyword</span><span class="p">()</span>
        <span class="p">)</span> <span class="p">::</span>
          <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a wrapper containing `num_heads` independent causal-attention heads.

  Unlike `LlmScratch.MultiheadAttention`, this module does not share Q/K/V
  projections across heads. Each head is a full `LlmScratch.CausalAttention`
  instance with output size `d_out`, so the concatenated output width is
  `num_heads * d_out`.

  ## Arguments

    * `d_in` - input feature size for each token
    * `d_out` - output feature size produced by each head
    * `context_length` - maximum sequence length supported by each head
    * `dropout` - dropout rate in `[0, 1)`, applied inside each head
    * `num_heads` - number of independent causal-attention heads
    * `qkv_bias` - whether each head's query/key/value projections use bias
    * `opts` - keyword options for initialization

  ## Options

    * `:seed` - deterministic base seed. Each head uses `seed + head_index`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_in</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_in</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_out</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_out</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span>
             <span class="n">is_integer</span><span class="p">(</span><span class="n">context_length</span><span class="p">)</span> <span class="ow">and</span> <span class="n">context_length</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">num_heads</span><span class="p">)</span> <span class="ow">and</span>
             <span class="n">num_heads</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">seed</span> <span class="o">=</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">))</span>
    <span class="n">dropout</span> <span class="o">=</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
    <span class="n">qkv_bias</span> <span class="o">=</span> <span class="n">normalize_qkv_bias</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span>

    <span class="n">heads</span> <span class="o">=</span>
      <span class="n">for</span> <span class="n">index</span> <span class="o">&lt;-</span> <span class="mi">0</span><span class="o">..</span><span class="p">(</span><span class="n">num_heads</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="k">do</span>
        <span class="no">CausalAttention</span><span class="o">.</span><span class="n">new</span><span class="p">(</span>
          <span class="n">d_in</span><span class="p">,</span>
          <span class="n">d_out</span><span class="p">,</span>
          <span class="n">context_length</span><span class="p">,</span>
          <span class="n">dropout</span><span class="p">,</span>
          <span class="n">qkv_bias</span><span class="p">,</span>
          <span class="ss">seed:</span> <span class="n">seed</span> <span class="o">+</span> <span class="n">index</span>
        <span class="p">)</span>
      <span class="k">end</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
      <span class="ss">heads:</span> <span class="n">heads</span><span class="p">,</span>
      <span class="ss">d_in:</span> <span class="n">d_in</span><span class="p">,</span>
      <span class="ss">d_out:</span> <span class="n">d_out</span><span class="p">,</span>
      <span class="ss">context_length:</span> <span class="n">context_length</span><span class="p">,</span>
      <span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">,</span>
      <span class="ss">num_heads:</span> <span class="n">num_heads</span><span class="p">,</span>
      <span class="ss">qkv_bias:</span> <span class="n">qkv_bias</span><span class="p">,</span>
      <span class="ss">seed:</span> <span class="n">seed</span>
    <span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Runs all heads on the same input and concatenates their context vectors on the
  last axis.

  ## Arguments

    * `wrapper` - `%LlmScratch.MultiheadAttentionWrapper{}`
    * `x` - input tensor of shape `{batch_size, num_tokens, d_in}`
    * `opts` - same forward options accepted by
      `LlmScratch.CausalAttention.forward/3`

  ## Returns

    * tensor of shape `{batch_size, num_tokens, num_heads * d_out}`
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{</span><span class="ss">heads:</span> <span class="n">heads</span><span class="p">},</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">opts</span> <span class="p">\\</span> <span class="p">[])</span> <span class="k">do</span>
    <span class="n">heads</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="no">CausalAttention</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="nv">&amp;1</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">opts</span><span class="p">))</span>
    <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">concatenate</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">nil</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">])</span>
  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">seed</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">seed</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_number</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&lt;</span> <span class="mi">1</span> <span class="k">do</span>
    <span class="n">dropout</span> <span class="o">*</span> <span class="mf">1.0</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"dropout must be a number in [0, 1), got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_qkv_bias</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_boolean</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">qkv_bias</span>

  <span class="k">defp</span> <span class="n">normalize_qkv_bias</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"qkv_bias must be a boolean, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>
<span class="k">end</span>

</code></pre></div></div>

<p>Nothing fancy here. We have a new input attribute, <code class="language-plaintext highlighter-rouge">num_heads</code>. In <code class="language-plaintext highlighter-rouge">forward</code>, we sequentially calculate causal attention outputs and concatenate the results over the last dimension:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">a</span> <span class="o">=</span>
  <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span>
    <span class="p">[[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]],</span>
    <span class="p">[[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">]]</span>
  <span class="p">])</span>

<span class="n">b</span> <span class="o">=</span>
  <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span>
    <span class="p">[[</span><span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span> <span class="p">[</span><span class="mi">30</span><span class="p">,</span> <span class="mi">40</span><span class="p">]],</span>
    <span class="p">[[</span><span class="mi">50</span><span class="p">,</span> <span class="mi">60</span><span class="p">],</span> <span class="p">[</span><span class="mi">70</span><span class="p">,</span> <span class="mi">80</span><span class="p">]]</span>
  <span class="p">])</span>

<span class="n">result</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">],</span> <span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

<span class="c1"># Shapes:</span>

<span class="c1"># a: {2, 2, 2}</span>
<span class="c1"># b: {2, 2, 2}</span>
<span class="c1"># result: {2, 2, 4}</span>

<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s64</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">2</span><span class="p">][</span><span class="mi">4</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">],</span>
      <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">40</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">60</span><span class="p">],</span>
      <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">70</span><span class="p">,</span> <span class="mi">80</span><span class="p">]</span>
    <span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>

</code></pre></div></div>

<p>There is room for improvement in that sequential part. Here is a test with this wrapper in action:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"multihead attention wrapper concatenates head outputs on the last axis"</span> <span class="k">do</span>
    <span class="c1"># six tokens with dimension 3</span>
    
    <span class="n">input</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="c1"># make a batch of two identical token sequences: {2x6x3}</span>

    <span class="n">inputs</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">input</span><span class="p">,</span> <span class="n">input</span><span class="p">],</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">context_length</span> <span class="o">=</span> <span class="n">elem</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
    <span class="c1"># number of heads is 2</span>

    <span class="n">mha</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttentionWrapper</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="c1"># {2, 6, 4}</span>
    <span class="n">result</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttentionWrapper</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">mha</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:inference</span><span class="p">)</span>

    <span class="n">expected_result</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.49523380398750305</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.17632800340652466</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2037302404642105</span><span class="p">,</span> <span class="mf">0.2859067916870117</span><span class="p">],</span>
            <span class="p">[</span>
              <span class="o">-</span><span class="mf">0.07537277787923813</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.13790269196033478</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.04826965555548668</span><span class="p">,</span>
              <span class="mf">0.23433709144592285</span>
            <span class="p">],</span>
            <span class="p">[</span>
              <span class="mf">0.06633053719997406</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.12039512395858765</span><span class="p">,</span>
              <span class="mf">0.029345838353037834</span><span class="p">,</span>
              <span class="mf">0.21299715340137482</span>
            <span class="p">],</span>
            <span class="p">[</span><span class="mf">0.11786159127950668</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.10831516981124878</span><span class="p">,</span> <span class="mf">0.08328337967395782</span><span class="p">,</span> <span class="mf">0.15773256123065948</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1877504140138626</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.04864511638879776</span><span class="p">,</span> <span class="mf">0.12333470582962036</span><span class="p">,</span> <span class="mf">0.1985599249601364</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1768769919872284</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.08047633618116379</span><span class="p">,</span> <span class="mf">0.12170650064945221</span><span class="p">,</span> <span class="mf">0.146833136677742</span><span class="p">]</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.49523380398750305</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.17632800340652466</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.2037302404642105</span><span class="p">,</span> <span class="mf">0.2859067916870117</span><span class="p">],</span>
            <span class="p">[</span>
              <span class="o">-</span><span class="mf">0.07537277787923813</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.13790269196033478</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.04826965555548668</span><span class="p">,</span>
              <span class="mf">0.23433709144592285</span>
            <span class="p">],</span>
            <span class="p">[</span>
              <span class="mf">0.06633053719997406</span><span class="p">,</span>
              <span class="o">-</span><span class="mf">0.12039512395858765</span><span class="p">,</span>
              <span class="mf">0.029345838353037834</span><span class="p">,</span>
              <span class="mf">0.21299715340137482</span>
            <span class="p">],</span>
            <span class="p">[</span><span class="mf">0.11786159127950668</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.10831516981124878</span><span class="p">,</span> <span class="mf">0.08328337967395782</span><span class="p">,</span> <span class="mf">0.15773256123065948</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1877504140138626</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.04864511638879776</span><span class="p">,</span> <span class="mf">0.12333470582962036</span><span class="p">,</span> <span class="mf">0.1985599249601364</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.1768769919872284</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.08047633618116379</span><span class="p">,</span> <span class="mf">0.12170650064945221</span><span class="p">,</span> <span class="mf">0.146833136677742</span><span class="p">]</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="c1"># We have two inputs, six tokens, and four dimensions for each token.</span>
    <span class="c1"># The input has three dimensions because we project inputs to out_dim = 2,</span>
    <span class="c1"># and we have two heads that we concatenate over the last dimension: 2 + 2 = 4.</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">4</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">expected</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>Then, in the book, we get an exercise:</p>

<p><em>Change the input arguments for the MultiHeadAttentionWrapper(…, num_heads=2) call so that the output context vectors are two-dimensional instead of
four dimensional while keeping the setting num_heads=2. Hint: You don’t have to
modify the class implementation; you just have to change one of the other input
arguments.</em></p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"exercise 3.2 returns two-dimensional embedding vectors with two heads"</span> <span class="k">do</span>
    <span class="n">input</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">inputs</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">input</span><span class="p">,</span> <span class="n">input</span><span class="p">],</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">context_length</span> <span class="o">=</span> <span class="n">elem</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>

    <span class="c1"># We set d_out = 1.</span>

    <span class="n">mha</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttentionWrapper</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">result</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttentionWrapper</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">mha</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:inference</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>As we mentioned, the improvement will be in how we handle the sequential heads. But we will not use Elixir concurrency; there is a smarter way to do it.</p>
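<p>To make the road not taken concrete, here is a minimal sketch (my own, not the book’s code) of running the wrapper’s heads concurrently with <code class="language-plaintext highlighter-rouge">Task.async_stream</code>, assuming each head’s <code class="language-plaintext highlighter-rouge">forward</code> is independent and side-effect free:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch only: run the independent heads concurrently instead of Enum.map,
# then concatenate the per-head context vectors on the last axis as before.
forward_concurrent = fn %LlmScratch.MultiheadAttentionWrapper{heads: heads}, x, opts -&gt;
  heads
  |&gt; Task.async_stream(fn head -&gt; LlmScratch.CausalAttention.forward(head, x, opts) end,
    ordered: true
  )
  |&gt; Enum.map(fn {:ok, context_vecs} -&gt; context_vecs end)
  |&gt; Nx.concatenate(axis: -1)
end
</code></pre></div></div>

<p>Spawning a BEAM process per head would work, but the smarter way is to keep the whole computation inside one set of tensor operations, which is what the next module does. Let’s dig into the hardest part of the book so far.</p>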

<h3 id="implementing-multi-head-attention-with-weight-splits">Implementing multi-head attention with weight splits</h3>

<p>The first step is to combine the two modules we have: <code class="language-plaintext highlighter-rouge">LlmScratch.MultiheadAttentionWrapper</code> and <code class="language-plaintext highlighter-rouge">LlmScratch.CausalAttention</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttention</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Multi-head causal self-attention with shared Q/K/V projections and an output
  projection.

  Mirrors the PyTorch `MultiHeadAttention` module from chapter 3:

    * one query/key/value projection each with total size `d_out`
    * head splitting into `num_heads` chunks of size `head_dim`
    * causal masking over token positions
    * attention dropout
    * output projection that mixes the concatenated head outputs

  This module is different from
  `LlmScratch.MultiheadAttentionWrapper`:

    * `MultiheadAttentionWrapper` builds `num_heads` separate
      `LlmScratch.CausalAttention` modules and concatenates their outputs
    * `MultiheadAttention` uses one set of Q/K/V projections of size `d_out`,
      splits those projections into heads, runs attention per head, then applies
      one final output projection

  The expected input shape is:

      {batch_size, num_tokens, d_in}

  and the returned output shape is:

      {batch_size, num_tokens, d_out}

  Internally, projected queries, keys, and values move through these shapes:

      {batch_size, num_tokens, d_out}
      -&gt; {batch_size, num_tokens, num_heads, head_dim}
      -&gt; {batch_size, num_heads, num_tokens, head_dim}

  where `head_dim = div(d_out, num_heads)`.
  """</span>

  <span class="n">alias</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SelfAttentionV2</span>

  <span class="k">defstruct</span> <span class="p">[</span>
    <span class="ss">:w_q</span><span class="p">,</span>
    <span class="ss">:w_k</span><span class="p">,</span>
    <span class="ss">:w_v</span><span class="p">,</span>
    <span class="ss">:out_proj</span><span class="p">,</span>
    <span class="ss">:mask</span><span class="p">,</span>
    <span class="ss">:d_in</span><span class="p">,</span>
    <span class="ss">:d_out</span><span class="p">,</span>
    <span class="ss">:context_length</span><span class="p">,</span>
    <span class="ss">:dropout</span><span class="p">,</span>
    <span class="ss">:num_heads</span><span class="p">,</span>
    <span class="ss">:head_dim</span><span class="p">,</span>
    <span class="ss">:qkv_bias</span><span class="p">,</span>
    <span class="ss">:seed</span>
  <span class="p">]</span>

  <span class="nv">@type</span> <span class="n">dense_weights</span> <span class="p">::</span> <span class="p">%{</span><span class="ss">kernel:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="ss">bias:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()}</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">w_q:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_k:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">w_v:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">out_proj:</span> <span class="n">dense_weights</span><span class="p">(),</span>
          <span class="ss">mask:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">d_in:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">d_out:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">context_length:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">dropout:</span> <span class="n">float</span><span class="p">(),</span>
          <span class="ss">num_heads:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">head_dim:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">qkv_bias:</span> <span class="n">boolean</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">number</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">boolean</span><span class="p">())</span> <span class="p">::</span>
          <span class="n">t</span><span class="p">()</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">),</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="p">[])</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">number</span><span class="p">(),</span>
          <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="n">boolean</span><span class="p">(),</span>
          <span class="n">keyword</span><span class="p">()</span>
        <span class="p">)</span> <span class="p">::</span>
          <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a multi-head causal attention module.

  `d_out` must be divisible by `num_heads`, because each head receives an equal
  slice of the projected feature dimension.

  The Q/K/V projections are initialized with output size `d_out`, the
  per-head size is stored in `head_dim`, and `out_proj` maps the concatenated
  head outputs back into `d_out`.

  ## Arguments

    * `d_in` - input embedding dimension for each token
    * `d_out` - total projected output dimension across all heads
    * `context_length` - maximum sequence length supported by the causal mask
    * `dropout` - dropout rate applied to attention weights during training
    * `num_heads` - number of attention heads
    * `qkv_bias` - whether the query, key, and value projections use bias
    * `opts` - keyword options for initialization

  ## Options

    * `:seed` - deterministic initialization seed
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">context_length</span><span class="p">,</span> <span class="n">dropout</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_in</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_in</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">d_out</span><span class="p">)</span> <span class="ow">and</span> <span class="n">d_out</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span>
             <span class="n">is_integer</span><span class="p">(</span><span class="n">context_length</span><span class="p">)</span> <span class="ow">and</span> <span class="n">context_length</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">num_heads</span><span class="p">)</span> <span class="ow">and</span>
             <span class="n">num_heads</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="k">if</span> <span class="n">rem</span><span class="p">(</span><span class="n">d_out</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span> <span class="o">!=</span> <span class="mi">0</span> <span class="k">do</span>
      <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"d_out must be divisible by num_heads"</span>
    <span class="k">end</span>

    <span class="n">seed</span> <span class="o">=</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">))</span>
    <span class="n">qkv_bias</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">normalize_qkv_bias</span><span class="p">(</span><span class="n">qkv_bias</span><span class="p">)</span>
    <span class="n">dropout</span> <span class="o">=</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
    <span class="n">head_dim</span> <span class="o">=</span> <span class="n">div</span><span class="p">(</span><span class="n">d_out</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">)</span>

    <span class="n">w_q</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"q_proj"</span><span class="p">)</span>
    <span class="n">w_k</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"k_proj"</span><span class="p">)</span>
    <span class="n">w_v</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_in</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="n">qkv_bias</span><span class="p">,</span> <span class="s2">"v_proj"</span><span class="p">)</span>
    <span class="n">out_proj</span> <span class="o">=</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">init_dense_weights</span><span class="p">(</span><span class="n">d_out</span><span class="p">,</span> <span class="n">d_out</span><span class="p">,</span> <span class="n">seed</span><span class="p">,</span> <span class="no">true</span><span class="p">,</span> <span class="s2">"out_proj"</span><span class="p">)</span>

    <span class="n">mask</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="p">{</span><span class="n">context_length</span><span class="p">,</span> <span class="n">context_length</span><span class="p">})</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">triu</span><span class="p">(</span><span class="ss">k:</span> <span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">({</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
      <span class="ss">w_q:</span> <span class="n">w_q</span><span class="p">,</span>
      <span class="ss">w_k:</span> <span class="n">w_k</span><span class="p">,</span>
      <span class="ss">w_v:</span> <span class="n">w_v</span><span class="p">,</span>
      <span class="ss">out_proj:</span> <span class="n">out_proj</span><span class="p">,</span>
      <span class="ss">mask:</span> <span class="n">mask</span><span class="p">,</span>
      <span class="ss">d_in:</span> <span class="n">d_in</span><span class="p">,</span>
      <span class="ss">d_out:</span> <span class="n">d_out</span><span class="p">,</span>
      <span class="ss">context_length:</span> <span class="n">context_length</span><span class="p">,</span>
      <span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">,</span>
      <span class="ss">num_heads:</span> <span class="n">num_heads</span><span class="p">,</span>
      <span class="ss">head_dim:</span> <span class="n">head_dim</span><span class="p">,</span>
      <span class="ss">qkv_bias:</span> <span class="n">qkv_bias</span><span class="p">,</span>
      <span class="ss">seed:</span> <span class="n">seed</span>
    <span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Computes causal multi-head self-attention for batched token sequences.

  ## Arguments

    * `mha` - `%LlmScratch.MultiheadAttention{}`
    * `x` - input tensor of shape `{batch_size, num_tokens, d_in}`
    * `opts` - forward options

  ## Forward Options

    * `:mode` - `:train` or `:inference`; dropout is only applied in train mode
    * `:key` - optional `Nx.Random` key used by dropout in train mode

  ## Steps

    * project `x` into queries, keys, and values of shape
      `{batch_size, num_tokens, d_out}`
    * split the last dimension into `num_heads * head_dim`
    * compute attention scores independently for each head
    * apply the causal mask so tokens cannot attend to future tokens
    * softmax and optionally apply dropout to the attention weights
    * combine attended values across heads and reshape back to
      `{batch_size, num_tokens, d_out}`
    * apply `out_proj`

  ## Returns

    * tensor of shape `{batch_size, num_tokens, d_out}`
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{}</span> <span class="o">=</span> <span class="n">mha</span><span class="p">,</span> <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">x</span><span class="p">,</span> <span class="n">opts</span> <span class="p">\\</span> <span class="p">[])</span> <span class="k">do</span>
    <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">=</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">d_in</span><span class="p">)</span>
    <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">context_length</span><span class="p">)</span>

    <span class="n">keys</span> <span class="o">=</span>
      <span class="n">x</span>
      <span class="o">|&gt;</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_k</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">split_heads</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">head_dim</span><span class="p">)</span>

    <span class="n">queries</span> <span class="o">=</span>
      <span class="n">x</span>
      <span class="o">|&gt;</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_q</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">split_heads</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">head_dim</span><span class="p">)</span>

    <span class="n">values</span> <span class="o">=</span>
      <span class="n">x</span>
      <span class="o">|&gt;</span> <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_v</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">split_heads</span><span class="p">(</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">head_dim</span><span class="p">)</span>

    <span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">keys</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>

    <span class="n">mask</span> <span class="o">=</span>
      <span class="n">mha</span><span class="o">.</span><span class="n">mask</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">])</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">greater</span><span class="p">(</span><span class="mf">0.0</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">new_axis</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">new_axis</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">({</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">num_heads</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">})</span>

    <span class="n">neg_inf</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="ss">:neg_infinity</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">attn_scores</span><span class="p">))</span>
    <span class="n">masked_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">select</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">neg_inf</span><span class="p">,</span> <span class="n">attn_scores</span><span class="p">)</span>

    <span class="n">context</span> <span class="o">=</span>
      <span class="n">masked_scores</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">divide</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">head_dim</span><span class="p">))</span>
      <span class="o">|&gt;</span> <span class="no">Axon</span><span class="o">.</span><span class="no">Activations</span><span class="o">.</span><span class="n">softmax</span><span class="p">(</span><span class="ss">axis:</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">mha</span><span class="p">,</span> <span class="n">opts</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">([</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">values</span><span class="p">,</span> <span class="p">[</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">reshape</span><span class="p">({</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">d_out</span><span class="p">})</span>

    <span class="no">SelfAttentionV2</span><span class="o">.</span><span class="n">dense_project</span><span class="p">(</span><span class="n">context</span><span class="p">,</span> <span class="n">mha</span><span class="o">.</span><span class="n">out_proj</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="nv">@doc</span> <span class="no">false</span>
  <span class="k">defp</span> <span class="n">split_heads</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">head_dim</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">tensor</span>
    <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">reshape</span><span class="p">({</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">head_dim</span><span class="p">})</span>
    <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">%{</span><span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">},</span> <span class="n">_opts</span><span class="p">)</span> <span class="ow">when</span> <span class="n">dropout</span> <span class="o">&lt;=</span> <span class="mf">0.0</span><span class="p">,</span>
    <span class="k">do</span><span class="p">:</span> <span class="n">attn_weights</span>

  <span class="k">defp</span> <span class="n">maybe_dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="p">%{</span><span class="ss">dropout:</span> <span class="n">dropout</span><span class="p">,</span> <span class="ss">seed:</span> <span class="n">seed</span><span class="p">},</span> <span class="n">opts</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">mode</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:mode</span><span class="p">,</span> <span class="ss">:train</span><span class="p">)</span>

    <span class="k">if</span> <span class="n">mode</span> <span class="o">==</span> <span class="ss">:train</span> <span class="k">do</span>
      <span class="n">key</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:key</span><span class="p">)</span> <span class="o">||</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span>

      <span class="p">%</span><span class="no">Axon</span><span class="o">.</span><span class="no">StatefulOutput</span><span class="p">{</span><span class="ss">output:</span> <span class="n">dropped</span><span class="p">,</span> <span class="ss">state:</span> <span class="p">%{</span><span class="s2">"key"</span> <span class="o">=&gt;</span> <span class="n">_new_key</span><span class="p">}}</span> <span class="o">=</span>
        <span class="no">Axon</span><span class="o">.</span><span class="no">Layers</span><span class="o">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">attn_weights</span><span class="p">,</span> <span class="n">key</span><span class="p">,</span> <span class="ss">rate:</span> <span class="n">dropout</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:train</span><span class="p">)</span>

      <span class="n">dropped</span>
    <span class="k">else</span>
      <span class="n">attn_weights</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="no">nil</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">])</span>
  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">seed</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">seed</span>

  <span class="k">defp</span> <span class="n">normalize_seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_number</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&gt;=</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">dropout</span> <span class="o">&lt;</span> <span class="mi">1</span> <span class="k">do</span>
    <span class="n">dropout</span> <span class="o">*</span> <span class="mf">1.0</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">normalize_dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"dropout must be a number in [0, 1), got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_input_shape!</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">case</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="o">^</span><span class="n">expected_d_in</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="p">{</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">expected_d_in</span><span class="p">}</span>

      <span class="n">shape</span> <span class="o">-&gt;</span>
        <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
              <span class="s2">"expected inputs shape {batch_size, num_tokens, </span><span class="si">#{</span><span class="n">expected_d_in</span><span class="si">}</span><span class="s2">}, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">shape</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">context_length</span><span class="p">)</span> <span class="ow">when</span> <span class="n">num_tokens</span> <span class="o">&lt;=</span> <span class="n">context_length</span><span class="p">,</span>
    <span class="k">do</span><span class="p">:</span> <span class="ss">:ok</span>

  <span class="k">defp</span> <span class="n">validate_context_length!</span><span class="p">(</span><span class="n">num_tokens</span><span class="p">,</span> <span class="n">context_length</span><span class="p">)</span> <span class="k">do</span>
    <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
          <span class="s2">"num_tokens (</span><span class="si">#{</span><span class="n">num_tokens</span><span class="si">}</span><span class="s2">) exceeds context_length (</span><span class="si">#{</span><span class="n">context_length</span><span class="si">}</span><span class="s2">)"</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>In the <code class="language-plaintext highlighter-rouge">forward</code> function, the first step is familiar from <code class="language-plaintext highlighter-rouge">SelfAttentionV2</code>; there is nothing new here. Step 2 is new, so what does it mean to split the last dimension? Here is an explanation with an example.<br />
<code class="language-plaintext highlighter-rouge">split_heads/5</code> does two operations:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tensor</span>
<span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">reshape</span><span class="p">({</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">num_tokens</span><span class="p">,</span> <span class="n">num_heads</span><span class="p">,</span> <span class="n">head_dim</span><span class="p">})</span>
<span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
</code></pre></div></div>

<p>Its job is to take the last dimension, which is <code class="language-plaintext highlighter-rouge">d_out</code>, and split it into:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>d_out = num_heads * head_dim
</code></pre></div></div>

<h2 id="example">Example</h2>

<p>Assume:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">batch_size = 1</code></li>
  <li><code class="language-plaintext highlighter-rouge">num_tokens = 2</code></li>
  <li><code class="language-plaintext highlighter-rouge">num_heads = 2</code></li>
  <li><code class="language-plaintext highlighter-rouge">head_dim = 2</code></li>
</ul>

<p>So input tensor shape is:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{1, 2, 4}
</code></pre></div></div>

<p>Example tensor:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tensor</span> <span class="o">=</span>
  <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span>
      <span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">,</span> <span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">]</span>
    <span class="p">]</span>
  <span class="p">])</span>
</code></pre></div></div>

<p>This means:</p>
<ul>
  <li>batch 0</li>
  <li>token 0 has features <code class="language-plaintext highlighter-rouge">[1, 2, 3, 4]</code></li>
  <li>token 1 has features <code class="language-plaintext highlighter-rouge">[5, 6, 7, 8]</code></li>
</ul>

<h2 id="step-1-reshape">Step 1: reshape</h2>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Nx</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">tensor</span><span class="p">,</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">})</span>
</code></pre></div></div>

<p>Result:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
  <span class="p">[</span>
    <span class="p">[[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">]],</span>
    <span class="p">[[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">]]</span>
  <span class="p">]</span>
<span class="p">]</span>
</code></pre></div></div>

<p>So we have four levels of nesting: the outermost list has one element, and each list at the next three levels has two elements. This gives us the shape <code class="language-plaintext highlighter-rouge">{1, 2, 2, 2}</code>.</p>

<p>Shape is now:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{batch_size, num_tokens, num_heads, head_dim}
= {1, 2, 2, 2}
</code></pre></div></div>

<p>Interpretation:</p>
<ul>
  <li>token 0:
    <ul>
      <li>head 0 gets <code class="language-plaintext highlighter-rouge">[1, 2]</code></li>
      <li>head 1 gets <code class="language-plaintext highlighter-rouge">[3, 4]</code></li>
    </ul>
  </li>
  <li>token 1:
    <ul>
      <li>head 0 gets <code class="language-plaintext highlighter-rouge">[5, 6]</code></li>
      <li>head 1 gets <code class="language-plaintext highlighter-rouge">[7, 8]</code></li>
    </ul>
  </li>
</ul>

<h2 id="step-2-transpose">Step 2: transpose</h2>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Nx</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="ss">axes:</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
</code></pre></div></div>

<p>This changes dimension order from:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{batch, tokens, heads, head_dim}
</code></pre></div></div>

<p>to:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{batch, heads, tokens, head_dim}
</code></pre></div></div>

<p>Result:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
  <span class="p">[</span>
    <span class="p">[[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">]],</span>
    <span class="p">[[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">]]</span>
  <span class="p">]</span>
<span class="p">]</span>
</code></pre></div></div>

<p>The shape numbers are still <code class="language-plaintext highlighter-rouge">{1, 2, 2, 2}</code>, but the meaning is different:</p>
<ul>
  <li>first head now contains all tokens:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">[[1, 2], [5, 6]]</code></li>
    </ul>
  </li>
  <li>second head now contains all tokens:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">[[3, 4], [7, 8]]</code></li>
    </ul>
  </li>
</ul>

<h2 id="why-transpose">Why transpose?</h2>

<p>Because attention is computed per head. After transpose, each head has its own sequence of token vectors grouped together, which is the layout needed for the later <code class="language-plaintext highlighter-rouge">Nx.dot</code> call.</p>

<p>So <code class="language-plaintext highlighter-rouge">split_heads</code> does this:</p>

<ol>
  <li>Break each token’s full feature vector into head-sized chunks</li>
  <li>Rearrange so data is grouped by head instead of by token</li>
</ol>
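
<p>Putting both steps together, here is a minimal runnable sketch of the same reshape and transpose on the example tensor above; the hard-coded shape values come from this small example, not from the general module code:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The example {1, 2, 4} tensor from above: one batch, two tokens, d_out = 4.
tensor = Nx.tensor([[[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]])

tensor
|&gt; Nx.reshape({1, 2, 2, 2})          # {batch, tokens, heads, head_dim}
|&gt; Nx.transpose(axes: [0, 2, 1, 3])  # {batch, heads, tokens, head_dim}
# =&gt; [[[[1.0, 2.0], [5.0, 6.0]], [[3.0, 4.0], [7.0, 8.0]]]]
</code></pre></div></div>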

<p>Then we calculate:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">attn_scores</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">queries</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">keys</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">])</span>
</code></pre></div></div>

<p>This is the dot product of two <code class="language-plaintext highlighter-rouge">{1, 2, 2, 2}</code> tensors. We contract over the last axis, <code class="language-plaintext highlighter-rouge">3</code> (the <code class="language-plaintext highlighter-rouge">head_dim</code> axis), and treat axes <code class="language-plaintext highlighter-rouge">0</code> and <code class="language-plaintext highlighter-rouge">1</code> (batch and head) as batch axes, so each head multiplies its own token matrix with its transposed counterpart:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="p">[[</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">5.0</span><span class="p">,</span> <span class="mf">6.0</span><span class="p">]],</span>
    <span class="p">[[</span><span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">],</span> <span class="p">[</span><span class="mf">7.0</span><span class="p">,</span> <span class="mf">8.0</span><span class="p">]]</span>
</code></pre></div></div>
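
<p>To make the batched dot product concrete, here is a small sketch that, purely for illustration, skips the <code class="language-plaintext highlighter-rouge">w_q</code>/<code class="language-plaintext highlighter-rouge">w_k</code> projections and uses the split tensor above directly as both <code class="language-plaintext highlighter-rouge">queries</code> and <code class="language-plaintext highlighter-rouge">keys</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Illustration only: use the split_heads output from above directly as queries
# and keys, without applying the w_q/w_k projections first.
q = Nx.tensor([[[[1.0, 2.0], [5.0, 6.0]], [[3.0, 4.0], [7.0, 8.0]]]])
k = q

# Contract head_dim (axis 3), batching over batch (axis 0) and head (axis 1).
Nx.dot(q, [3], [0, 1], k, [3], [0, 1])
# =&gt; shape {1, 2, 2, 2} = {batch, heads, tokens, tokens}
# head 0: [[5.0, 17.0], [17.0, 61.0]]
# head 1: [[25.0, 53.0], [53.0, 113.0]]
</code></pre></div></div>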

<p>Then we compute <code class="language-plaintext highlighter-rouge">masked_scores</code>, just as we did in <code class="language-plaintext highlighter-rouge">CausalAttention</code>.</p>

<p>Suppose one head’s <code class="language-plaintext highlighter-rouge">attn_scores</code> is:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">[</span>
  <span class="p">[</span><span class="mf">1.2</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div></div>

<p>Then we get:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="p">[</span>
  <span class="p">[</span><span class="mf">1.2</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.3</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.4</span><span class="p">,</span> <span class="mf">0.6</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="o">-</span><span class="n">inf</span><span class="p">],</span>
  <span class="p">[</span><span class="mf">0.2</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.4</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
<span class="p">]</span>
</code></pre></div></div>

<p>That is the input for the normalization, softmax, and dropout steps we know from <code class="language-plaintext highlighter-rouge">CausalAttention</code>, which produce the attention weights; a dot product of those weights with <code class="language-plaintext highlighter-rouge">values</code> then gives the per-head context vectors. The final step is the opposite of what we did in <code class="language-plaintext highlighter-rouge">split_heads</code>: transpose and reshape to combine the heads back into the original three-axis shape. After that, we apply the <code class="language-plaintext highlighter-rouge">out_proj</code> linear projection.</p>
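
<p>As a quick sanity check, here is a sketch of the scaling and softmax applied to the second masked row above, assuming <code class="language-plaintext highlighter-rouge">head_dim = 2</code> for the scaling; the <code class="language-plaintext highlighter-rouge">-inf</code> entries end up with weight zero, so a token cannot attend to future tokens:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># One masked row from the example above; -inf marks future tokens.
row = Nx.tensor([0.3, 1.1, :neg_infinity, :neg_infinity])

row
|&gt; Nx.divide(Nx.sqrt(2))                 # scale by sqrt(head_dim), assuming head_dim = 2
|&gt; Axon.Activations.softmax(axis: -1)
# =&gt; approximately [0.36, 0.64, 0.0, 0.0]; the masked positions get zero weight
</code></pre></div></div>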

<p>How does <code class="language-plaintext highlighter-rouge">LlmScratch.MultiheadAttention</code> improve computational efficiency compared with <code class="language-plaintext highlighter-rouge">LlmScratch.MultiheadAttentionWrapper</code>? We calculate <code class="language-plaintext highlighter-rouge">keys</code>, <code class="language-plaintext highlighter-rouge">queries</code>, and <code class="language-plaintext highlighter-rouge">values</code> only once, with a single projection each for all heads, and these projections are the most computationally intensive part of multi-head attention.</p>

<p>Let’s use <code class="language-plaintext highlighter-rouge">LlmScratch.MultiheadAttention</code> in a test:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"multihead attention combines split heads and output projection in one module"</span> <span class="k">do</span>
    <span class="n">inputs</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.43</span><span class="p">,</span> <span class="mf">0.15</span><span class="p">,</span> <span class="mf">0.89</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.55</span><span class="p">,</span> <span class="mf">0.87</span><span class="p">,</span> <span class="mf">0.66</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.57</span><span class="p">,</span> <span class="mf">0.85</span><span class="p">,</span> <span class="mf">0.64</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.22</span><span class="p">,</span> <span class="mf">0.58</span><span class="p">,</span> <span class="mf">0.33</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.77</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.10</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.05</span><span class="p">,</span> <span class="mf">0.80</span><span class="p">,</span> <span class="mf">0.55</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">inputs</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="n">inputs</span><span class="p">,</span> <span class="n">inputs</span><span class="p">],</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">)</span>

    <span class="n">mha</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttention</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">result</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttention</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">mha</span><span class="p">,</span> <span class="n">inputs</span><span class="p">,</span> <span class="ss">mode:</span> <span class="ss">:inference</span><span class="p">)</span>

    <span class="n">expected</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.47695064544677734</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.23235172033309937</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.10124677419662476</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.023370809853076935</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.031040165573358536</span><span class="p">,</span> <span class="mf">0.04926960915327072</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.07956627011299133</span><span class="p">,</span> <span class="mf">0.07443806529045105</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.15488147735595703</span><span class="p">,</span> <span class="mf">0.10296830534934998</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.14108411967754364</span><span class="p">,</span> <span class="mf">0.10342075675725937</span><span class="p">]</span>
          <span class="p">],</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.47695064544677734</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.23235172033309937</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.10124677419662476</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.023370809853076935</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.031040165573358536</span><span class="p">,</span> <span class="mf">0.04926960915327072</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.07956627011299133</span><span class="p">,</span> <span class="mf">0.07443806529045105</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.15488147735595703</span><span class="p">,</span> <span class="mf">0.10296830534934998</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.14108411967754364</span><span class="p">,</span> <span class="mf">0.10342075675725937</span><span class="p">]</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">result</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">2</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="n">expected</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_number</span><span class="p">()</span> <span class="o">==</span> <span class="mi">1</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>And here it is with real-world GPT-2 Small parameter values, where we switch the default <code class="language-plaintext highlighter-rouge">Nx</code> backend to EXLA to keep it fast:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="n">test</span> <span class="s2">"exercise 3.3 initializes a GPT-2 small attention module"</span> <span class="k">do</span>
    <span class="n">previous_backend</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">()</span>
    <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">(</span><span class="no">EXLA</span><span class="o">.</span><span class="no">Backend</span><span class="p">)</span>
    <span class="n">on_exit</span><span class="p">(</span><span class="k">fn</span> <span class="o">-&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">(</span><span class="n">previous_backend</span><span class="p">)</span> <span class="k">end</span><span class="p">)</span>

    <span class="n">mha</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">MultiheadAttention</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">768</span><span class="p">,</span> <span class="mi">768</span><span class="p">,</span> <span class="mi">1024</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">d_in</span> <span class="o">==</span> <span class="mi">768</span>
    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">d_out</span> <span class="o">==</span> <span class="mi">768</span>
    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">context_length</span> <span class="o">==</span> <span class="mi">1024</span>
    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">num_heads</span> <span class="o">==</span> <span class="mi">12</span>
    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">head_dim</span> <span class="o">==</span> <span class="mi">64</span>
    <span class="n">assert</span> <span class="n">mha</span><span class="o">.</span><span class="n">qkv_bias</span> <span class="o">==</span> <span class="no">false</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">mask</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">1024</span><span class="p">,</span> <span class="mi">1024</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_q</span><span class="o">.</span><span class="n">kernel</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">768</span><span class="p">,</span> <span class="mi">768</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_k</span><span class="o">.</span><span class="n">kernel</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">768</span><span class="p">,</span> <span class="mi">768</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">w_v</span><span class="o">.</span><span class="n">kernel</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">768</span><span class="p">,</span> <span class="mi">768</span><span class="p">}</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">mha</span><span class="o">.</span><span class="n">out_proj</span><span class="o">.</span><span class="n">kernel</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">768</span><span class="p">,</span> <span class="mi">768</span><span class="p">}</span>
  <span class="k">end</span>
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>We started with SimpleAttention and finished with MultiheadAttention, which we will use later in the book. For me, this section was really hard, and I will still need several rereads before I can explain it to someone else.</p>]]></content><author><name></name></author><category term="llm-from-scratch" /><category term="ai" /><category term="tutorials" /><category term="llm" /><category term="attention" /><category term="elixir" /><summary type="html"><![CDATA[A practical walkthrough of the attention mechanism in Elixir, from simple self-attention to causal and multi-head attention, based on Chapter 3 of Build a LLM from Scratch.]]></summary></entry><entry><title type="html">Workbook Answers on Chapter 2, Build a LLM from Scratch, Working with Text Data</title><link href="https://karlosmid.com/2026/03/workbook-answers-chapter-2-build-llm-from-scratch-working-with-text-data/" rel="alternate" type="text/html" title="Workbook Answers on Chapter 2, Build a LLM from Scratch, Working with Text Data" /><published>2026-03-01T23:00:00+00:00</published><updated>2026-03-01T23:00:00+00:00</updated><id>https://karlosmid.com/2026/03/workbook-answers-chapter-2-build-llm-from-scratch-working-with-text-data</id><content type="html" xml:base="https://karlosmid.com/2026/03/workbook-answers-chapter-2-build-llm-from-scratch-working-with-text-data/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>Workbook answers and notes for Chapter 2 of <em>Build a LLM from Scratch</em> by Sebastian Raschka, focused on working with text data (tokenization, token IDs, embeddings, and context windows). But first, let’s reread Giles’s blog posts about Chapter 2 (<a href="https://www.gilesthomas.com/2024/12/llm-from-scratch-2">part 2</a>, <a href="https://www.gilesthomas.com/2024/12/llm-from-scratch-3">part 3</a>).</p>

<h2 id="giless-blog-posts">Giles’s blog posts</h2>

<p>In part 2, Giles briefly explains byte-pair encoding with <code class="language-plaintext highlighter-rouge">tiktoken</code>. <code class="language-plaintext highlighter-rouge">tiktoken</code> starts with letters, numbers, and punctuation, then learns word patterns from the input and creates smarter tokens (in my blog post, I mentioned the pattern we can find in words like <code class="language-plaintext highlighter-rouge">define</code> and <code class="language-plaintext highlighter-rouge">delfine</code>, …). It seems that Giles read the byte-pair encoding paper that I skipped.<br />
The second thing is Python <code class="language-plaintext highlighter-rouge">tiktoken</code>’s <code class="language-plaintext highlighter-rouge">allowed_special</code> option, which we use when passing special tokens like <code class="language-plaintext highlighter-rouge">|endoftext|</code>. If it encounters special tokens that are not in that list, it raises an error. Elixir <code class="language-plaintext highlighter-rouge">tiktoken</code> also accepts special tokens as input, but it does not raise an error. Here is a ChatGPT <a href="https://github.com/karlosmid/llm-from-scratch/blob/main/TIKTOKEN_SPECIAL_TOKENS_EXPLANATION.md">explanation</a> of the problem in the Elixir <code class="language-plaintext highlighter-rouge">tiktoken</code> wrapper.<br />
In part 3, Giles nicely explains the concept of batches and why we need them. Batches are just how we stack <code class="language-plaintext highlighter-rouge">{input, target}</code> pairs. One batch is a list of <code class="language-plaintext highlighter-rouge">{input, target}</code> pairs, where the number of elements in that list is <code class="language-plaintext highlighter-rouge">batch_size</code>. The second batch is the next such list. Doing LLM-related calculations on batches is faster than running calculations on each pair individually.<br />
One-hot embeddings are also explained, and this is actually a pre-step to the embedding data explained in the book.<br />
I learned something new from Giles’s understanding of Chapter 2. Let’s look at a quick batching sketch below, and then do the workbook questions!</p>
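
<p>This is only a sketch with made-up values for the window length, stride, and batch size, not the book’s exact data loader:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sliding windows over token IDs produce {input, target} pairs, where the
# target is the input shifted by one token; a batch then stacks batch_size pairs.
token_ids = Enum.to_list(1..12)
max_length = 4
batch_size = 2

pairs =
  token_ids
  |&gt; Enum.chunk_every(max_length + 1, max_length, :discard)
  |&gt; Enum.map(fn window -&gt;
    {Enum.take(window, max_length), Enum.drop(window, 1)}
  end)

batches = Enum.chunk_every(pairs, batch_size)
# =&gt; [[{[1, 2, 3, 4], [2, 3, 4, 5]}, {[5, 6, 7, 8], [6, 7, 8, 9]}]]
</code></pre></div></div>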

<h2 id="workbook-multiple-choice-questions">Workbook multiple choice questions</h2>

<h3 id="what-is-the-primary-purpose-of-tokenization-in-the-context-of-llms">What is the primary purpose of tokenization in the context of LLMs?</h3>

<p>A. Tokenization is used to convert text into lowercase.<br />
B. Tokenization splits text into individual words or special characters<br />
C. Tokenization is used to identify the parts of speech in a sentence.<br />
D. Tokenization is used to remove stop words from text.</p>

<p>My answer: B</p>

<h3 id="what-is-the-purpose-of-the-unk-token-in-a-vocabulary-used-for-llms">What is the purpose of the &lt;|unk|&gt; token in a vocabulary used for LLMs?</h3>

<p>A. The &lt;|unk|&gt; token is used to represent punctuation marks.<br />
B. The &lt;|unk|&gt; token is used to mark the beginning of a sentence.<br />
C. The &lt;|unk|&gt; token represents unknown words that were not present in the training data. <br />
D. The &lt;|unk|&gt; token is used to mark the end of a sentence.</p>

<p>My answer: C</p>

<h3 id="what-is-the-primary-task-of-an-llm-during-training">What is the primary task of an LLM during training?</h3>

<p>A. LLMs are trained to translate text from one language to another.<br />
B. LLMs are trained to summarize text.<br />
C. LLMs are trained to answer questions based on a given text.<br />
D. LLMs are trained to predict the next word in a sequence, given a preceding context.</p>

<p>My answer: D (all are LLM tasks, but D is the primary task)</p>

<h3 id="what-is-the-difference-between-absolute-positional-embeddings-and-relative-positional-embeddings">What is the difference between absolute positional embeddings and relative positional embeddings?</h3>

<p>A. Absolute positional embeddings encode the exact position of a token in a sequence, while relative positional embeddings encode the relative distance between tokens.<br />
B. Absolute positional embeddings are only used for short sequences, while relative positional embeddings are used for longer sequences.<br />
C. Absolute positional embeddings are more efficient than relative positional embeddings.<br />
D. Relative positional embeddings are more accurate than absolute positional embeddings.</p>

<p>My answer is A (or, as it is stated in the book for relative positions, how far away a token is from other tokens).</p>

<h3 id="the-purpose-of-____-in-the-context-of-llms-is-to-provide-information-about-the-order-and-location-of-tokens-within-a-sequence-helping-the-llm-understand-the-relationships-between-words">The purpose of <strong>__</strong><strong>__</strong> in the context of LLMs is to provide information about the order and location of tokens within a sequence, helping the LLM understand the relationships between words.</h3>

<p>A. attention mechanism<br />
B. positional embeddings<br />
C. tokenization</p>

<p>My answer: B</p>

<h3 id="what-is-the-final-output-of-the-input-processing-pipeline-for-an-llm-before-it-is-fed-into-the-main-llm-layers">What is the final output of the input processing pipeline for an LLM, before it is fed into the main LLM layers?</h3>

<p>A. The final output is a tensor of probabilities for each word in the vocabulary.<br />
B. The final output is a tensor of text tokens.<br />
C. The final output is a tensor of input embeddings, created by combining token embeddings and positional embeddings.<br />
D. The final output is a tensor of token IDs.</p>

<p>My answer: C</p>

<p>HoHoHo! My answers are correct!</p>

<h2 id="understanding-word-embeddings">Understanding word embeddings</h2>

<h3 id="why-are-word-embeddings-necessary-for-processing-text-data-in-deep-learning-models">Why are word embeddings necessary for processing text data in deep learning models?</h3>

<p>My answer:</p>

<p>Deep learning models are deep neural networks that work with tensors of floating-point numbers. We use floats because we need precision that we cannot achieve with integers. These values represent probabilities that particular words are related.</p>

<p>I failed to mention that text is categorical and that embedding vectors are compatible with the mathematical operations used in deep neural networks.</p>

<p>Workbook answer:
<em>Deep learning models operate on numerical data, while text is
categorical. Word embeddings convert words into continuous-valued
vectors, making them compatible with the mathematical operations used
in neural networks.</em></p>

<h3 id="what-is-the-main-idea-behind-the-word2vec-approach-to-generating-word-embeddings">What is the main idea behind the Word2Vec approach to generating word embeddings?</h3>

<p>My answer:</p>

<p>In the Word2Vec approach, we translate words into embedding vectors that tell us how close or far apart words are based on their meaning.</p>

<p>This was also an incomplete answer.</p>

<p>Workbook answer:</p>

<p><em>Word2Vec trains a neural network to predict the context of a word given
the target word or vice versa. This approach assumes that words
appearing in similar contexts tend to have similar meanings, resulting in
clustered representations of related words in the embedding space.</em></p>

<p>As I did not understand the context/target relationship, I asked ChatGPT 5.3 to help explain the workbook answer.</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>It trains a small neural network on a simple language task:

- `target -&gt; context` (predict nearby words from a center word) = `Skip-gram`
- `context -&gt; target` (predict the center word from surrounding words) = `CBOW`

Example:
- In “the cat sat on the mat”, if the target word is `cat`, context words might be `the`, `sat`.

During training, the model adjusts each word’s vector so it gets better at these predictions.

Why this creates meaning:
- Words used in similar situations (like `cat` and `dog`) tend to have similar neighboring words (`pet`, `food`, `runs`, `cute`).
- So the model learns similar vectors for them.

Result:
- Related words become clustered in embedding space.
- The vectors capture semantic relationships (similarity/relatedness) even though the model was only trained on prediction, not dictionary definitions.
</code></pre></div></div>

<h3 id="explain-the-trade-off-involved-in-choosing-the-dimensionality-of-word-embeddings">Explain the trade-off involved in choosing the dimensionality of word embeddings.</h3>

<p>My answer:</p>

<p>With bigger dimensionality we get better word prediction, but we need more processing power to train such a model and more memory to use that LLM.</p>

<p>Workbook answer:</p>

<p><em>Higher dimensionality in word embeddings can capture more nuanced
relationships between words but comes at the cost of computational
efficiency. Lower dimensionality offers faster processing but may sacrifice
some semantic detail.</em></p>

<p>I was close enough with my answer.</p>

<h3 id="how-do-llms-typically-handle-word-embeddings-compared-to-using-pretrained-models-like-word2vec">How do LLMs typically handle word embeddings compared to using pretrained models like Word2Vec?</h3>

<p>My answer:</p>

<p>I do not know how to compare word embeddings and Word2Vec.</p>

<p>Workbook answer:</p>

<p><em>LLMs often generate their own embeddings as part of the input layer and
optimize them during training. This allows for embeddings tailored to the
specific task and data, potentially leading to better performance than
using pre-trained embeddings.</em></p>

<p>Aha, so Word2Vec is a separate pretraining step that is not needed here, because the embedding vectors are created during the LLM’s own training phase.</p>

<h3 id="what-is-the-primary-challenge-associated-with-visualizing-high-dimensional-word-embeddings">What is the primary challenge associated with visualizing high-dimensional word embeddings?</h3>

<p>My answer:</p>

<p>The primary challenge is human imagination. Humans can visualize at most four dimensions: three space dimensions and time.</p>

<p>Close enough with workbook answer:</p>

<p><em>Our visual perception and common graphical representations are limited
to three dimensions or fewer. Visualizing high-dimensional embeddings
requires specialized techniques or dimensionality reduction methods.</em></p>

<h2 id="tokenizing-text">Tokenizing text</h2>

<h3 id="what-is-the-purpose-of-tokenizing-text-in-the-context-of-building-a-large-language-model">What is the purpose of tokenizing text in the context of building a large language model?</h3>

<p>My answer:<br />
As deep neural networks do not understand human text, we must represent it with tensors of floating-point numbers. To do that, we first break the text into words (or subwords) and special characters. Then we create a vocabulary where each token is represented by an integer ID. In the end, we map those IDs to embedding vectors that are ready for the mathematical operations in deep neural networks.</p>

<p>Workbook answer:
<em>Tokenization is a crucial preprocessing step for creating embeddings for
an LLM. It involves splitting input text into individual tokens, which are
either words or special characters, to prepare the text for further
processing and embedding creation.</em></p>

<p>Partially correct, because I included the embedding step under tokenization.</p>

<h3 id="describe-the-process-of-tokenizing-text-using-elixir-regular-expression-module-regex">Describe the process of tokenizing text using Elixir regular expression module Regex</h3>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r/([,.:;?_!"()\']|--|\s)/</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>
</code></pre></div></div>

<p>My answer:</p>

<p>Ok, when I see Regex, I always get chills down my spine. Here we go (with LLM help):</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">~r/regex_expression_goes_here/</code></li>
  <li><code class="language-plaintext highlighter-rouge">(capture_goes_here)</code> we tell Regex to remember what was captured in regex expression</li>
  <li><code class="language-plaintext highlighter-rouge">|</code> this separates the three groups of regex expressions</li>
  <li><code class="language-plaintext highlighter-rouge">[list_of_equal_characters]</code> so dear Regex, first try to split on the characters between <code class="language-plaintext highlighter-rouge">[</code> and <code class="language-plaintext highlighter-rouge">]</code>; those are the special characters from The Verdict novel.</li>
  <li><code class="language-plaintext highlighter-rouge">--</code> yes, The Verdict contains two hyphens in a row</li>
  <li><code class="language-plaintext highlighter-rouge">\s</code> this means any whitespace character, since whitespace can be encoded in various ways.</li>
  <li><code class="language-plaintext highlighter-rouge">include_captures</code> keep all the captured strings, which means everything that matches between <code class="language-plaintext highlighter-rouge">(</code> and <code class="language-plaintext highlighter-rouge">)</code></li>
  <li><code class="language-plaintext highlighter-rouge">trim</code> do not return empty strings, i.e. <code class="language-plaintext highlighter-rouge">""</code></li>
</ul>
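
<p>For example (a minimal sketch; the sample sentence below is made up rather than taken from The Verdict), the split keeps words, punctuation, and whitespace as separate tokens:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>text = "Hello, world -- it's me!"

Regex.split(~r/([,.:;?_!"()\']|--|\s)/, text, include_captures: true, trim: true)
# =&gt; ["Hello", ",", " ", "world", " ", "--", " ", "it", "'", "s", " ", "me", "!"]
# The whitespace-only tokens are usually filtered out in a later step.
</code></pre></div></div>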

<p>Allegedly, somebody once solved an Advent of Code puzzle using only regex.</p>

<p>Workbook answer:
<em>The ‘re.split’ function can be used to split text based on specific patterns.
By defining a regular expression that matches whitespace characters,
punctuation marks, and other special characters, we can separate the
text into individual tokens. The resulting list can then be further
processed to remove redundant whitespace characters.</em></p>

<p>I was correct, but I cheated: I asked an LLM about the <code class="language-plaintext highlighter-rouge">include_captures</code> and <code class="language-plaintext highlighter-rouge">trim</code> options.</p>

<h3 id="why-is-it-important-to-consider-capitalization-when-tokenizing-text-for-llm-training">Why is it important to consider capitalization when tokenizing text for LLM training?</h3>

<p>Didn’t know what to answer.</p>

<p>Workbook answer:<br />
<em>Capitalization helps LLMs distinguish between proper nouns and common
nouns, understand sentence structure, and learn to generate text with
proper capitalization. Therefore, preserving capitalization during
tokenization is beneficial for training effective language models.</em></p>

<p>Yes, now I remember, to distinguish noun types.</p>

<h3 id="explain-the-trade-off-between-removing-whitespaces-during-tokenization-and-keeping-them">Explain the trade-off between removing whitespaces during tokenization and keeping them.</h3>

<p>Didn’t know what to answer.</p>

<p>Workbook answer:<br />
<em>Removing whitespaces reduces memory and computing requirements.
However, keeping whitespaces can be useful for training models that are
sensitive to the exact structure of the text, such as Python code, which
relies on indentation and spacing.</em></p>

<p>Ah, the famous Python indentation! We are spoiled with Elixir syntax and <code class="language-plaintext highlighter-rouge">mix format</code>.</p>

<h3 id="match-the-term-on-the-left-to-its-description-on-the-right">Match the term on the left to its description on the right:</h3>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th> </th>
      <th>Definition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Word Embeddings</td>
      <td> </td>
      <td>The process of converting various data types, such as text, audio, or video, into a dense vector representation that deep learning models can understand.</td>
    </tr>
    <tr>
      <td>Embedding</td>
      <td> </td>
      <td>The dimensionality of a word embedding, which determines the number of dimensions used to represent each word, influencing the complexity and computational efficiency of the model.</td>
    </tr>
    <tr>
      <td>Word2Vec</td>
      <td> </td>
      <td>An algorithm that generates word embeddings by predicting the context of a word given the target word or vice versa, based on the idea that words appearing in similar contexts tend to have similar meanings.</td>
    </tr>
    <tr>
      <td>Embedding Size</td>
      <td> </td>
      <td>A method of representing words as continuous-valued vectors, allowing deep learning models to process text data.</td>
    </tr>
  </tbody>
</table>

<p>My answer:</p>

<p>Word embeddings =&gt; row 4.<br />
Embedding =&gt; row 1.<br />
Word2Vec =&gt; row 3.<br />
Embedding size =&gt; row 2.</p>

<p>Correct!</p>

<h2 id="converting-tokens-into-token-ids">Converting tokens into token IDs</h2>

<h3 id="what-is-the-purpose-of-converting-tokens-into-token-ids">What is the purpose of converting tokens into token IDs?</h3>

<p>My answer:</p>

<p>The purpose of converting tokens to IDs (integers) is to prepare them in a format that is suitable for mathematical operations in a deep neural network.</p>

<p>Workbook answer:</p>

<p><em>Converting tokens into token IDs is an intermediate step before
converting them into embedding vectors. This process allows for efficient
representation and processing of text data within a language model.</em></p>

<p>I would say my answer is correct and maybe even more concrete because it explains what this processing is in an LLM.</p>

<h3 id="how-is-a-vocabulary-created-for-tokenization">How is a vocabulary created for tokenization?</h3>

<p>My answer:</p>

<p>We sort tokens, then remove duplicates, and then create a dictionary where each token gets a unique ID.</p>

<p>Workbook answer:</p>

<p><em>A vocabulary is created by tokenizing the entire training dataset, sorting
the unique tokens alphabetically, and assigning a unique integer to each
token. This mapping allows for efficient conversion between tokens and
their corresponding integer representations.</em></p>

<p>I did not mention that tokens are sorted alphabetically, but the workbook also misses that tokens are unique.</p>
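
<p>As a minimal Elixir sketch of building such a vocabulary (the <code class="language-plaintext highlighter-rouge">tokens</code> list below is made up, standing in for the tokenizer output):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

vocab =
  tokens
  |&gt; Enum.uniq()
  |&gt; Enum.sort()
  |&gt; Enum.with_index()
  |&gt; Map.new()

# =&gt; %{"." =&gt; 0, "cat" =&gt; 1, "mat" =&gt; 2, "on" =&gt; 3, "sat" =&gt; 4, "the" =&gt; 5}
</code></pre></div></div>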

<h3 id="what-is-the-purpose-of-the-encode-function-in-the-simpletokenizerv1-module">What is the purpose of the encode function in the SimpleTokenizerV1 module?</h3>

<p>My answer:</p>

<p>The <code class="language-plaintext highlighter-rouge">encode</code> function converts tokens to vocabulary ID values.</p>

<p>Workbook answer:</p>

<p><em>The encode method takes text as input, splits it into tokens, and uses the
vocabulary to convert these tokens into their corresponding integer IDs.
This process allows for representing text data as a sequence of integers,
which can be processed by the language model.</em></p>

<p>I missed the tokenization part. Note that the workbook uses <code class="language-plaintext highlighter-rouge">method</code>, and I use <code class="language-plaintext highlighter-rouge">function</code>. If you use <code class="language-plaintext highlighter-rouge">method</code> in a conversation with Sasa Juric, he would definitely correct you!</p>

<h3 id="what-is-the-purpose-of-the-decode-function-in-the-simpletokenizerv1-module">What is the purpose of the decode function in the SimpleTokenizerV1 module?</h3>

<p>My answer:</p>

<p>The <code class="language-plaintext highlighter-rouge">decode</code> function converts vocabulary IDs to token values.</p>

<p>Workbook answer:</p>

<p><em>The decode method takes a sequence of token IDs as input and uses the
inverse vocabulary to convert these IDs back into their corresponding
text tokens. This process allows for converting the output of the language
model, which is a sequence of integers, back into human-readable text.</em></p>

<p>The workbook answer is more extensive.</p>
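
<p>A minimal sketch of such a decode function in Elixir (assuming the <code class="language-plaintext highlighter-rouge">vocab</code> map from the sketch above; a real implementation would also fix the spacing around punctuation):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def decode(ids, vocab) do
  # Invert the vocabulary: token ID -&gt; token.
  inverse_vocab = Map.new(vocab, fn {token, id} -&gt; {id, token} end)

  ids
  |&gt; Enum.map(&amp;Map.fetch!(inverse_vocab, &amp;1))
  |&gt; Enum.join(" ")
end
</code></pre></div></div>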

<h3 id="what-is-the-limitation-of-using-a-vocabulary-built-from-a-small-training-set">What is the limitation of using a vocabulary built from a small training set?</h3>

<p>My answer:</p>

<p>The limitation is that we would have a lot of <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> tokens, and our LLM would not be able to predict the next token reliably.</p>

<p>Workbook answer:</p>

<p><em>Using a vocabulary built from a small training set can lead to issues when
encountering new words or phrases not present in the training data. This
can result in errors during tokenization and decoding, highlighting the
importance of using large and diverse training sets for building robust
language models.</em></p>

<p>I did not mention that we hit <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> because the input word is not part of the training dataset.</p>

<h3 id="match-the-term-on-the-left-to-its-description-on-the-right-1">Match the term on the left to its description on the right:</h3>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th> </th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Tokenization</td>
      <td> </td>
      <td>individual units of text that result from tokenization, representing words, punctuation, or other special characters.</td>
    </tr>
    <tr>
      <td>Tokens</td>
      <td> </td>
      <td>used to define patterns in text, allowing for flexible and precise text manipulation, including tokenization.</td>
    </tr>
    <tr>
      <td>Regular Expressions</td>
      <td> </td>
      <td>initial steps taken to prepare text data for further processing, such as tokenization, which makes the text suitable for use in language models.</td>
    </tr>
    <tr>
      <td>Preprocessing</td>
      <td> </td>
      <td>splitting text into individual units, called tokens, which can be words, punctuation marks, or other special characters.</td>
    </tr>
  </tbody>
</table>

<p>My answer:</p>

<table>
  <thead>
    <tr>
      <th>Left Hand Column</th>
      <th>1</th>
      <th>2</th>
      <th>3</th>
      <th>4</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Right Hand Column</td>
      <td>4</td>
      <td>1</td>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>

<p>Correct answer.</p>

<h2 id="adding-special-context-tokens">Adding special context tokens</h2>

<h3 id="what-are-the-two-special-tokens-added-to-the-vocabulary-and-what-are-their-purposes">What are the two special tokens added to the vocabulary and what are their purposes?</h3>

<p>My answer:</p>

<p>The purpose of <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> is to mark a token that is not in the vocabulary.<br />
The purpose of <code class="language-plaintext highlighter-rouge">&lt;|could_not_remember|&gt;</code> is to mark different input sources.</p>

<p>Workbook answer:</p>

<p><em>The two special tokens added are &lt;|unk|&gt; and &lt;|endoftext|&gt;. &lt;|unk|&gt;
represents unknown words not in the training data, while &lt;|endoftext|&gt;
separates unrelated text sources, helping the LLM understand their
distinct nature.</em></p>

<p>My answer is correct, aside from the fact that I could not remember <code class="language-plaintext highlighter-rouge">endoftext</code>.</p>
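
<p>A minimal sketch of extending the vocabulary from the earlier sketch with these two special tokens (assuming <code class="language-plaintext highlighter-rouge">vocab</code> is the token-to-ID map built above):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>special_tokens = ["&lt;|endoftext|&gt;", "&lt;|unk|&gt;"]

# Special tokens get the next free IDs after the existing vocabulary entries.
special_map =
  special_tokens
  |&gt; Enum.with_index(map_size(vocab))
  |&gt; Map.new()

vocab = Map.merge(vocab, special_map)
</code></pre></div></div>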

<h3 id="how-does-the-modified-simpletokenizerv2-handle-unknown-words">How does the modified SimpleTokenizerV2 handle unknown words?</h3>

<p>My answer:</p>

<p>It adds <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> as an optional <code class="language-plaintext highlighter-rouge">special_tokens</code> attribute, so this token is used in place of all unknown tokens.</p>

<p>Workbook answer:</p>

<p><em>When encountering a word not in the vocabulary, SimpleTokenizerV2
replaces it with the &lt;|unk|&gt; token, ensuring that all words are
represented in the encoded text.</em></p>

<p>Correct answer.</p>

<h3 id="explain-the-purpose-of-the-endoftext-token-when-training-on-multiple-independent-documents">Explain the purpose of the &lt;|endoftext|&gt; token when training on multiple independent documents.</h3>

<p>My answer:</p>

<p>This special token separates different text inputs so we know when the input context changes.</p>

<p>Workbook answer:</p>

<p><em>The &lt;|endoftext|&gt; token acts as a marker between unrelated text
sources, signaling the start or end of a particular segment. This helps the
LLM understand that these texts, though concatenated for training, are
distinct entities.</em></p>

<p><code class="language-plaintext highlighter-rouge">Different segments</code> is better phrasing than <code class="language-plaintext highlighter-rouge">different text inputs</code>.</p>

<h3 id="a-piece-of-the-code-has-been-removed-from-this-listing-which-of-these-terms-has-been-removed-from-position_1">A piece of the code has been removed from this listing. Which of these terms has been removed from position_1?</h3>

<p>A unk<br />
B \n<br />
C &lt;|unk|&gt;<br />
D |unk|</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
 <span class="k">def</span> <span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">fn</span> <span class="n">token</span> <span class="o">-&gt;</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">vocab_token</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">token</span> <span class="o">==</span> <span class="n">vocab_token</span> <span class="k">end</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="k">case</span> <span class="k">do</span>
        <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="n">id</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">id</span>
        <span class="no">nil</span> <span class="o">-&gt;</span> <span class="n">find_unknown_token</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span>
      <span class="k">end</span>
    <span class="k">end</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">find_unknown_token</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Enum</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">vocab_token</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">vocab_token</span> <span class="o">==</span> <span class="s2">" position_1 "</span> <span class="k">end</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="k">case</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="n">id</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">id</span>
      <span class="no">nil</span> <span class="o">-&gt;</span> <span class="k">raise</span> <span class="s2">"Token not found in vocab: </span><span class="si">#{</span><span class="n">token</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>My answer: C</p>

<p>Note: I used Elixir instead of the original Python.</p>

<h3 id="what-are-the-additional-special-tokens-commonly-used-in-llms-and-what-are-their-functions">What are the additional special tokens commonly used in LLMs, and what are their functions?</h3>

<p>My answer:</p>

<p>We could use a token that pads all input texts, separated with <code class="language-plaintext highlighter-rouge">&lt;|endoftext|&gt;</code>, to the same length.</p>

<p>Workbook answer:</p>

<p><em>Other common special tokens include [BOS] (beginning of sequence),
[EOS] (end of sequence), and [PAD] (padding). [BOS] marks the start of
a text, [EOS] indicates the end, and [PAD] is used to extend shorter texts
to match the length of the longest text in a batch for training.</em></p>

<p><code class="language-plaintext highlighter-rouge">Padding</code> is what I could not remember.</p>

<h3 id="match-the-term-on-the-left-to-its-description-on-the-right-2">Match the term on the left to its description on the right:</h3>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th> </th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Vocabulary</td>
      <td> </td>
      <td>integer representations of tokens, used as an intermediate step before converting tokens into embedding vectors.</td>
    </tr>
    <tr>
      <td>Token IDs</td>
      <td> </td>
      <td>the dataset used to build the vocabulary and train the language model.</td>
    </tr>
    <tr>
      <td>Tokenizer</td>
      <td> </td>
      <td>a mapping from unique tokens to unique integer values, created by tokenizing the entire training dataset and sorting the tokens alphabetically.</td>
    </tr>
    <tr>
      <td>Training Set</td>
      <td> </td>
      <td>a class that implements methods for encoding text into token IDs and decoding token IDs back into text.</td>
    </tr>
  </tbody>
</table>

<p>My answer:</p>

<table>
  <thead>
    <tr>
      <th>Left Hand Column</th>
      <th>1</th>
      <th>2</th>
      <th>3</th>
      <th>4</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Right Hand Column</td>
      <td>3</td>
      <td>1</td>
      <td>4</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<p>Correct.</p>

<h2 id="byte-pair-encoding">Byte pair encoding</h2>

<h3 id="what-are-the-two-stages-in-this-diagram">What are the two stages in this diagram?</h3>

<p><img src="/assets/byte_pair_encoding_stages.png" alt="Byte pair encoding stages" /></p>

<p>My answer:</p>

<p>Stage 1 is where the byte pair encoder splits input text into tokens.<br />
Stage 2 is mapping tokens to their vocabulary ID values.</p>

<p>Correct answer.</p>

<h3 id="what-is-the-primary-advantage-of-using-byte-pair-encoding-bpe-for-tokenization-especially-when-dealing-with-unknown-words">What is the primary advantage of using Byte Pair Encoding (BPE) for tokenization, especially when dealing with unknown words?</h3>

<p>My answer:</p>

<p>The byte pair encoder splits tokens into subwords, so it can map any word into tokens.</p>

<p>Workbook answer:</p>

<p><em>BPE tokenizers break down unknown words into smaller subword units or
even individual characters. This allows them to handle any word without
needing a special &lt;|unk|&gt; token, ensuring that the tokenizer and the LLM
can process any text, even if it contains words not present in the training
data.</em></p>

<p>I did not mention that a token could be a single character.</p>

<h3 id="what-is-the-total-vocabulary-size-of-the-bpe-tokenizer-used-in-models-like-gpt-2-gpt-3-and-the-original-chatgpt">What is the total vocabulary size of the BPE tokenizer used in models like GPT-2, GPT-3, and the original ChatGPT?</h3>

<p>My answer:</p>

<p>52345</p>

<p>Workbook answer:</p>

<p><em>The BPE tokenizer used in these models has a vocabulary size of 50,257,
with the &lt;|endoftext|&gt; token assigned the largest token ID.</em></p>

<p>Not correct.</p>

<h3 id="how-does-the-bpe-tokenizer-handle-unknown-words-such-as-someunknownplace-without-using-unk-tokens">How does the BPE tokenizer handle unknown words, such as someunknownPlace, without using &lt;|unk|&gt; tokens?</h3>

<p>My answer:</p>

<p>Since the byte pair encoder uses subwords as tokens, such as <code class="language-plaintext highlighter-rouge">so</code> and <code class="language-plaintext highlighter-rouge">me</code>, it can split any word into byte tokens.</p>

<p>Workbook answer:</p>

<p><em>The BPE tokenizer breaks down unknown words into smaller subword
units or individual characters. This allows it to represent any word as a
sequence of known subword tokens or characters, enabling it to process
any text without needing a special token for unknown words.</em></p>

<p>I forgot to mention single characters.</p>
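
<p>A hedged sketch with the Elixir <code class="language-plaintext highlighter-rouge">tiktoken</code> wrapper (the same <code class="language-plaintext highlighter-rouge">encode/3</code> call that appears in the listings later in this post; I have not verified the exact IDs it returns, only that the unknown word is represented by several subword IDs instead of a single <code class="language-plaintext highlighter-rouge">&lt;|unk|&gt;</code> ID):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{:ok, token_ids} = Tiktoken.encode("code-davinci-002", "someunknownPlace", [])

# token_ids holds multiple subword token IDs for this single unknown word;
# decoding them concatenates the subwords back into "someunknownPlace".
</code></pre></div></div>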

<h3 id="what-python-library-is-used-to-implement-the-bpe-tokenizer-in-the-provided-code-example">What Python library is used to implement the BPE tokenizer in the provided code example?</h3>

<p>My answer: tiktoken</p>

<p>Correct.</p>

<h2 id="data-sampling-with-a-sliding-window">Data sampling with a sliding window</h2>

<h3 id="explain-the-purpose-of-creating-input-target-pairs-in-the-context-of-training-a-large-language-model-llm">Explain the purpose of creating input-target pairs in the context of training a large language model (LLM).</h3>

<p>My answer:</p>

<p>With input-target pairs, we encode in the training dataset what the next token is for a particular input.</p>

<p>Workbook answer:</p>

<p><em>the model with examples of text sequences and their corresponding next
words. This allows the LLM to learn the relationships between words and
predict the most likely next word in a given context.</em></p>

<p>Now I know the purpose of the sliding window.</p>

<h3 id="describe-the-sliding-window-approach-used-for-generating-input-target-pairs-and-how-it-works">Describe the sliding window approach used for generating input-target pairs and how it works.</h3>

<p>My answer:</p>

<p>In the sliding window approach, we generate input-target pairs with a stride parameter that defines how many tokens we skip in the input to define the next input-target pair.<br />
For example, with <code class="language-plaintext highlighter-rouge">My name is Karlo Smid</code>, stride <code class="language-plaintext highlighter-rouge">2</code>, and input size <code class="language-plaintext highlighter-rouge">3</code>, we have:<br />
My name is =&gt; Karlo<br />
is Karlo Smid =&gt; “”</p>

<p>Workbook answer:</p>

<p><em>The sliding window approach involves iterating through a text sequence
and extracting overlapping chunks of text as inputs. Each input chunk is
paired with the corresponding next word as the target. The window slides
across the text, creating multiple input-target pairs for training.</em></p>

<p>I correctly described the sliding window process.</p>

<h3 id="pieces-of-code-have-been-removed-from-three-places-in-this-listing-which-of-these-terms-have-been-removed-and-where-should-they-go">Pieces of code have been removed from three places in this listing. Which of these terms have been removed, and where should they go?</h3>

<p>A Nx.vector<br />
B Tiktoken<br />
C Tokenizer<br />
D Nx.tensor</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="n">chunk_dataset</span><span class="p">(</span><span class="n">txt</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">}</span> <span class="o">=</span> <span class="n">position_1</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">txt</span><span class="p">,</span> <span class="p">[</span><span class="s2">"&lt;|endoftext|&gt;"</span><span class="p">])</span>

    <span class="k">if</span> <span class="n">length</span><span class="p">(</span><span class="n">token_ids</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_length</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
      <span class="k">do</span><span class="p">:</span> <span class="k">raise</span><span class="p">(</span><span class="s2">"Number of tokenized inputs must be at least max_length + 1"</span><span class="p">)</span>

    <span class="n">num_chunks</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">token_ids</span><span class="p">)</span> <span class="o">-</span> <span class="n">max_length</span>

    <span class="n">chunks</span> <span class="o">=</span>
      <span class="mi">0</span><span class="o">..</span><span class="n">num_chunks</span><span class="o">//</span><span class="n">stride</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reduce</span><span class="p">([</span><span class="ss">input_chunks:</span> <span class="p">[],</span> <span class="ss">target_chunks:</span> <span class="p">[]],</span> <span class="k">fn</span> <span class="n">i</span><span class="p">,</span> <span class="n">acc</span> <span class="o">-&gt;</span>
        <span class="n">input_chunk</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="n">i</span><span class="o">..</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">max_length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">target_chunk</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">..</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">max_length</span><span class="p">))</span>

        <span class="p">[</span>
          <span class="ss">input_chunks:</span> <span class="p">[</span><span class="n">position_2</span><span class="p">(</span><span class="n">input_chunk</span><span class="p">)</span> <span class="o">|</span> <span class="n">acc</span><span class="p">[</span><span class="ss">:input_chunks</span><span class="p">]],</span>
          <span class="ss">target_chunks:</span> <span class="p">[</span><span class="n">position_2</span><span class="p">(</span><span class="n">target_chunk</span><span class="p">)</span> <span class="o">|</span> <span class="n">acc</span><span class="p">[</span><span class="ss">:target_chunks</span><span class="p">]]</span>
        <span class="p">]</span>
      <span class="k">end</span><span class="p">)</span>

    <span class="p">[</span>
      <span class="ss">input_chunks:</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reverse</span><span class="p">(</span><span class="n">chunks</span><span class="p">[</span><span class="ss">:input_chunks</span><span class="p">]),</span>
      <span class="ss">target_chunks:</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reverse</span><span class="p">(</span><span class="n">chunks</span><span class="p">[</span><span class="ss">:target_chunks</span><span class="p">])</span>
    <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>My answer:</p>

<p>position 1 =&gt; B<br />
  position 2 =&gt; D<br />
  position 3 =&gt; D</p>

<p>Correct, as I created the Elixir example.</p>

<h3 id="what-is-the-role-of-the-stride-parameter-in-the-gptdatasetv1-class-and-how-does-it-affect-the-generation-of-input-target-pairs">What is the role of the stride parameter in the GPTDatasetV1 class, and how does it affect the generation of input-target pairs?</h3>

<p>My answer:</p>

<p>The <code class="language-plaintext highlighter-rouge">stride</code> parameter defines the size of the sliding window. This is the number of input tokens we skip when defining the next input-target pair.</p>

<p>Workbook answer:</p>

<p><em>The stride parameter determines the step size of the sliding window. A
smaller stride results in more overlapping input chunks, while a larger
stride creates less overlap. The choice of stride influences the amount of
data generated and the potential for capturing long-range dependencies
in the text.</em></p>

<p>I did not describe how it affects input-target pairs.</p>

<h3 id="explain-the-purpose-of-the-max_length-parameter-in-the-gptdatasetv1-class-and-its-impact-on-the-input-target-pairs">Explain the purpose of the max_length parameter in the GPTDatasetV1 class and its impact on the input-target pairs.</h3>

<p>My answer:</p>

<p>The <code class="language-plaintext highlighter-rouge">max_length</code> parameter defines the number of tokens per input-target pair. Its purpose is to define how many chunks we want to split the list of input tokens into.</p>

<p>Workbook answer:</p>

<p><em>The max_length parameter defines the size of the input chunks extracted
from the text. It determines the number of tokens included in each input
sequence. A larger max_length allows the LLM to process longer
contexts, but it also increases the computational cost of training.</em></p>

<p>Again, I did not explain the effect. The effect is the context size, a famous LLM parameter!</p>

<h3 id="pieces-of-code-have-been-removed-from-two-places-in-this-listing-which-of-these-terms-have-been-removed-and-where-should-they-go">Pieces of code have been removed from two places in this listing. Which of these terms have been removed, and where should they go?</h3>

<p>A input_chunks<br />
B Tiktoken<br />
C dataset<br />
D target_chunks</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="n">create_dataloader_v1</span><span class="p">(</span><span class="n">opts</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">raw_text</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">fetch!</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:raw_text</span><span class="p">)</span>
    <span class="n">batch_size</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:batch_size</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
    <span class="n">max_length</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:max_length</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>
    <span class="n">stride</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:stride</span><span class="p">,</span> <span class="mi">128</span><span class="p">)</span>
    <span class="n">shuffle</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:shuffle</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
    <span class="n">drop_last</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:drop_last</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
    <span class="n">num_workers</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:num_workers</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

    <span class="c1"># Create dataset</span>
    <span class="p">[</span><span class="ss">input_chunks:</span> <span class="n">input_chunks</span><span class="p">,</span> <span class="ss">target_chunks:</span> <span class="n">target_chunks</span><span class="p">]</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">chunk_dataset</span><span class="p">(</span><span class="n">raw_text</span><span class="p">,</span> <span class="s2">"code-davinci-002"</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span>

    <span class="c1"># Zip input and target chunks together so each item is {input, target}</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">input_chunks</span><span class="p">,</span> <span class="n">position_1</span><span class="p">)</span>

    <span class="no">LlmScratch</span><span class="o">.</span><span class="no">DataLoader</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">position_2</span><span class="p">,</span>
      <span class="ss">batch_size:</span> <span class="n">batch_size</span><span class="p">,</span>
      <span class="ss">shuffle:</span> <span class="n">shuffle</span><span class="p">,</span>
      <span class="ss">drop_last:</span> <span class="n">drop_last</span><span class="p">,</span>
      <span class="ss">num_workers:</span> <span class="n">num_workers</span>
    <span class="p">)</span>
  <span class="k">end</span>
</code></pre></div></div>
<p>My answer:</p>

<p>position_1 =&gt; target_chunks<br />
position_2 =&gt; dataset</p>

<p>Correct, because I created the Elixir example.</p>

<h3 id="what-is-the-significance-of-using-pytorchs-dataset-and-dataloader-classes-for-creating-a-data-loader-for-llm-training">What is the significance of using PyTorch’s Dataset and DataLoader classes for creating a data loader for LLM training?</h3>

<p>My answer:</p>

<p>Dataset and DataLoader help us organize input training data into sliding windows and prepare it for the embedding stage.</p>

<p>Workbook answer:</p>

<p><em>PyTorch’s Dataset and DataLoader classes provide a convenient and
efficient way to manage and iterate over large datasets. They allow for
batching, shuffling, and parallel data loading, which are crucial for
optimizing the training process of LLMs.</em></p>

<p>I did not mention the shuffle, concurrency, and batching features.</p>

<h3 id="match-the-term-on-the-left-to-its-description-on-the-right-3">Match the term on the left to its description on the right:</h3>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th> </th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Byte Pair Encoding (BPE)</td>
      <td> </td>
      <td>words that are not present in the tokenizer’s predefined vocabulary.</td>
    </tr>
    <tr>
      <td>Subword Units</td>
      <td> </td>
      <td>total number of unique tokens that a tokenizer can recognize and represent.</td>
    </tr>
    <tr>
      <td>Out-of-Vocabulary (OOV) Words</td>
      <td> </td>
      <td>smaller units of text that a BPE tokenizer breaks down words into, which can be individual characters or combinations of characters.</td>
    </tr>
    <tr>
      <td>Vocabulary Size</td>
      <td> </td>
      <td>a tokenization scheme that breaks down words into smaller subword units or individual characters, allowing it to handle unknown words by representing them as sequences of subword tokens or characters.</td>
    </tr>
  </tbody>
</table>

<p>My answer:</p>

<table>
  <thead>
    <tr>
      <th>Left Hand Column</th>
      <th>1</th>
      <th>2</th>
      <th>3</th>
      <th>4</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Right Hand Column</td>
      <td>4</td>
      <td>3</td>
      <td>1</td>
      <td>2</td>
    </tr>
  </tbody>
</table>

<p>Correct.</p>

<h2 id="creating-token-embeddings">Creating token embeddings</h2>

<h3 id="why-are-embedding-vectors-necessary-for-training-gpt-like-llms">Why are embedding vectors necessary for training GPT-like LLMs?</h3>

<p>My answer:</p>

<p>They are necessary because, to encode how tokens relate to each other in an LLM, we need floating-point numbers instead of integer token IDs. With integers, we cannot encode these relationships because integers are discrete values.</p>

<p>Workbook answer:</p>

<p><em>Embedding vectors are essential for training GPT-like LLMs because these
models are deep neural networks that rely on the backpropagation
algorithm for learning. Backpropagation requires continuous vector
representations, which embedding vectors provide.</em></p>

<p>I missed mentioning the backpropagation algorithm.</p>

<h3 id="how-are-embedding-weights-initialized-in-the-beginning-of-llm-training">How are embedding weights initialized in the beginning of LLM training?</h3>

<p>My answer:</p>

<p>Embedding weights are initialized with random floating-point values at the beginning.</p>

<p>Workbook answer:</p>

<p><em>Embedding weights are initially assigned random values. These random
values serve as the starting point for the LLM’s learning process. During
training, the embedding weights are optimized through backpropagation
to improve the model’s performance.</em></p>

<p>This is correct, but the workbook also explains what happens to the initial weights during the training phase.</p>

<h3 id="what-is-the-missing-stage-from-this-diagram">What is the missing stage from this diagram?</h3>

<p><img src="/assets/gpt-tokenization-embeddings-diagram.png" alt="GPT pipeline with tokenization and embeddings" /></p>

<p>My answer:</p>

<p>Token IDs</p>

<p>Correct.</p>

<h3 id="describe-the-process-of-converting-a-token-id-into-an-embedding-vector-using-an-embedding-layer">Describe the process of converting a token ID into an embedding vector using an embedding layer.</h3>

<p>My answer:</p>

<p>We first create embedding weights with random values, in a tensor of shape <code class="language-plaintext highlighter-rouge">{vocabulary_size, embedding_dimensions}</code>. Each row in this tensor belongs to one token ID, using zero-based indexing: the first row is for ID 0, the second for ID 1, and so on. When we forward the embedding weights on the input IDs, we get a new tensor of shape <code class="language-plaintext highlighter-rouge">{number_of_input_ids, embedding_dimensions}</code>, and this is our embedding vector that we can use to train the LLM.</p>

<p>Workbook answer:</p>

<p><em>The embedding layer acts as a lookup table. When given a token ID, it
retrieves the corresponding embedding vector from its weight matrix.
This embedding vector is a continuous representation of the token,
allowing the LLM to process it effectively.</em></p>

<p>Correct, but the workbook gives the better explanation: the embedding layer is a lookup table for input IDs.</p>
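
<p>A minimal Nx sketch of that lookup (a vocabulary size of 6 and an embedding dimension of 3 are made-up values for illustration):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>key = Nx.Random.key(123)

# Random embedding weights: one row per vocabulary entry, shape {6, 3}.
{embedding_weights, _key} = Nx.Random.normal(key, 0.0, 1.0, shape: {6, 3})

token_ids = Nx.tensor([2, 3, 5, 1])

# Row lookup along axis 0: the result has shape {4, 3}, one embedding vector per token ID.
input_embeddings = Nx.take(embedding_weights, token_ids)
</code></pre></div></div>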

<h3 id="how-does-the-embedding-layers-weight-matrix-relate-to-the-vocabulary-size-and-embedding-dimension">How does the embedding layer’s weight matrix relate to the vocabulary size and embedding dimension?</h3>

<p>My answer:</p>

<p>The shape of the embedding layer’s weight matrix is defined as <code class="language-plaintext highlighter-rouge">{vocabulary_size, embedding_dimension}</code>.</p>

<p>Workbook answer:</p>

<p><em>The embedding layer’s weight matrix has a number of rows equal to the
vocabulary size, representing each unique token. The number of columns
corresponds to the embedding dimension, which determines the size of
the embedding vector for each token.</em></p>

<p>Correct.</p>

<h3 id="match-the-term-on-the-left-to-its-description-on-the-right-4">Match the term on the left to its description on the right:</h3>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th> </th>
      <th>Definition</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Context Size</td>
      <td> </td>
      <td>number of positions the input window is shifted when creating the next batch of input–target pairs.</td>
    </tr>
    <tr>
      <td>Input–Target Pairs</td>
      <td> </td>
      <td>a technique used to create input–target pairs from a text dataset by moving a window of tokens across the text.</td>
    </tr>
    <tr>
      <td>Sliding Window</td>
      <td> </td>
      <td>number of tokens that the LLM uses as input to predict the next word.</td>
    </tr>
    <tr>
      <td>Stride</td>
      <td> </td>
      <td>a set of data used to train an LLM, where the input is a sequence of tokens and the target is the next token in the sequence.</td>
    </tr>
  </tbody>
</table>

<p>My answers:</p>

<p>Context size =&gt; row 3<br />
Input-Target Pairs =&gt; row 4<br />
Sliding Window =&gt; row 2<br />
Stride =&gt; row 1</p>

<p>Correct.</p>

<h2 id="encoding-word-positions">Encoding word positions</h2>

<h3 id="what-is-the-main-shortcoming-of-llms-in-terms-of-token-order-and-how-is-it-addressed">What is the main shortcoming of LLMs in terms of token order and how is it addressed?</h3>

<p>My answer:</p>

<p>The main shortcoming of LLMs in terms of token order is that we always map token IDs to the same embedding vector weights because token IDs are indices of tokens in the vocabulary, and we do not have information about token position in the input text. Token position in the input text is important for next-token prediction. This problem is addressed by adding a vector with token absolute positions to the embedding vector.</p>

<p>Workbook answer:</p>

<p><em>LLMs’ self-attention mechanism lacks a notion of token order. To address
this, positional embeddings are introduced, which provide information
about the position of each token within a sequence.</em></p>

<p>Wrong. I missed that position indices are first forwarded through embedding weights and only then added to the token embeddings. I also did not mention that this is what gives the self-attention mechanism a notion of token order.</p>

<h3 id="explain-the-difference-between-absolute-and-relative-positional-embeddings">Explain the difference between absolute and relative positional embeddings.</h3>

<p>My answer:</p>

<p>In absolute positioning, we use the token index position in the input text. In relative positional embeddings, we use a value that represents how far tokens are from each other.</p>

<p>Workbook answer:</p>

<p><em>Absolute positional embeddings assign a unique embedding to each
position in a sequence, indicating its exact location. Relative positional
embeddings focus on the relative distance between tokens, allowing the
model to generalize better to sequences of varying lengths.</em></p>

<p>Correct. But exact location and index position could use a better explanation, which I provided in a later answer.</p>

<h3 id="how-are-positional-embeddings-used-in-openais-gpt-models">How are positional embeddings used in OpenAI’s GPT models?</h3>

<p>My answer:</p>

<p>We add positional embeddings to token embeddings.</p>

<p>Workbook answer:</p>

<p><em>GPT models use absolute positional embeddings that are optimized during
training. These embeddings are not fixed or predefined but are learned
alongside the model’s other parameters.</em></p>

<p>Incorrect. I did not mention absolute positions, or that those embeddings are not fixed but learned, since the positional embeddings are optimized during training together with the token embeddings.</p>

<h3 id="describe-the-process-of-creating-input-embeddings-for-an-llm-using-token-embeddings-and-positional-embeddings">Describe the process of creating input embeddings for an LLM using token embeddings and positional embeddings.</h3>

<p>My answer:</p>

<p>In the case of absolute positions, we create a positional embedding vector that has, for each token, its index, which represents distance from the beginning of the input. Then we add this positional embedding vector to the token embedding vector.</p>

<p>Workbook answer:</p>

<p><em>Token embeddings are generated by mapping token IDs to vectors.
Positional embeddings are then added to these token embeddings,
resulting in input embeddings that incorporate both token identity and
positional information.</em></p>

<p>Incorrect. I forgot one step: the positional IDs are also passed through an embedding layer to obtain positional embeddings. Only then do we add the token embedding and positional embedding vectors.</p>
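
<p>A minimal Nx sketch of that missing step (the shapes are illustrative: GPT-2’s vocabulary size of 50,257, a context length of 4, an embedding dimension of 3, and made-up token IDs):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>key = Nx.Random.key(42)

{token_embedding_weights, key} = Nx.Random.normal(key, 0.0, 1.0, shape: {50_257, 3})
{pos_embedding_weights, _key} = Nx.Random.normal(key, 0.0, 1.0, shape: {4, 3})

token_ids = Nx.tensor([40, 367, 2885, 1464])
token_embeddings = Nx.take(token_embedding_weights, token_ids)

# Positional IDs are simply 0..max_length-1, looked up in their own embedding table.
pos_embeddings = Nx.take(pos_embedding_weights, Nx.iota({4}))

# Input embeddings: element-wise sum, shape {4, 3}.
input_embeddings = Nx.add(token_embeddings, pos_embeddings)
</code></pre></div></div>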

<h3 id="what-is-the-purpose-of-the-token_embedding_layer-and-pos_embedding_layer-in-the-code-provided">What is the purpose of the token_embedding_layer and pos_embedding_layer in the code provided?</h3>

<p>My answer:</p>

<p>These are token embedding and positional embedding layers.</p>

<p>Workbook answer:</p>

<p><em>The token_embedding_layer converts token IDs into embedding vectors,
while the pos_embedding_layer generates positional embeddings based
on the position of each token in the sequence.</em></p>

<p>Incorrect. I did not mention the conversion process.</p>

<h2 id="conclusion">Conclusion</h2>

<p>That is it for the workbook questions. Open-ended questions are still hard for me, but I am satisfied with how I generally answered them in this chapter.</p>

<h2 id="references">References</h2>

<ul>
  <li><a href="https://www.manning.com/books/build-a-large-language-model-from-scratch">Build a LLM from Scratch</a></li>
</ul>]]></content><author><name></name></author><category term="llm-from-scratch" /><category term="ai" /><category term="tutorials" /><category term="llm" /><category term="nlp" /><category term="workbook" /><category term="text-data" /><summary type="html"><![CDATA[Workbook answers and notes for Chapter 2 of Build a LLM from Scratch: Working with Text Data.]]></summary></entry><entry><title type="html">Build LLM from Scratch, Chapter 2 — Working with Text Data</title><link href="https://karlosmid.com/2026/01/build-llm-from-scratch-chapter-2-working-with-text-data/" rel="alternate" type="text/html" title="Build LLM from Scratch, Chapter 2 — Working with Text Data" /><published>2026-01-27T09:00:00+00:00</published><updated>2026-01-27T09:00:00+00:00</updated><id>https://karlosmid.com/2026/01/build-llm-from-scratch-chapter-2-working-with-text-data</id><content type="html" xml:base="https://karlosmid.com/2026/01/build-llm-from-scratch-chapter-2-working-with-text-data/"><![CDATA[<h3 id="tldr">TL;DR</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>raw text
  -&gt; tokens
  -&gt; token IDs
  -&gt; sliding windows (input/target pairs)
  -&gt; token embeddings
  -&gt; positional embeddings
  -&gt; model-ready tensors
</code></pre></div></div>

<p>Now that we have <a href="/2026/01/chapter-1-recap-build-llm-from-scratch/">basic knowledge about LLMs</a>, it’s time to get our hands dirty. Sebastian’s book uses Python, but I’m taking a different route: Elixir. The reason is simple: Elixir has solid machine-learning support, and I want to dig into those libraries. All Elixir code is in my <a href="https://github.com/karlosmid/llm-from-scratch">git repo</a>.</p>

<h3 id="embeddings">Embeddings</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>human-readable text
  -&gt; tokens
  -&gt; numbers (IDs / vectors)
  -&gt; tensors
  -&gt; neural network computations
</code></pre></div></div>

<p>We use mathematical operations to train and implement neural networks. Those operations work on numbers, so human-readable text is not compatible input. We split input text (text that humans can read and understand) into tokens (for example, words and punctuation), and then represent tokens with continuous-valued vectors.
Sebastian directs us to Appendix A for a brief introduction to vectors and tensors. We need to understand the math if we want to understand LLMs under the hood. If you hate math, don’t despair: the application of this math is a chatbot, a very exciting technology.</p>

<h4 id="vectors-and-tensors-in-elixir">Vectors and Tensors in Elixir</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rank 0 -&gt; scalar
rank 1 -&gt; vector
rank 2 -&gt; matrix
rank 3+ -&gt; tensor

Nx.rank / Nx.shape -&gt; inspect dimensions
</code></pre></div></div>

<p>A tensor is a mathematical object that groups numbers. It has a rank that represents the number of dimensions. In Elixir we have the Nx library, which supports tensors. Here is a tensor of rank 0:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">iex</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span>
  <span class="mi">1</span>
<span class="o">&gt;</span>
<span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">rank</span>
<span class="mi">0</span>
</code></pre></div></div>
<p>Yes, we all know that tensor: it is a regular number. Nx handles number types. s32 stands for a signed 32-bit integer (int32), which means it occupies 32 bits in memory. Remember that next time you want to run an LLM on your machine. Every model has its size in its name. For example, 7B stands for 7 billion parameters. If each parameter occupies 32 bits (4 bytes), we need around 28 GB of memory (about 26 GiB).</p>
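<p>Here is a back-of-the-envelope check of that number in Elixir (a small sketch of the arithmetic, not code from the book):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Rough memory estimate for the weights of a 7B-parameter model at 32 bits each.
params = 7.0e9
bytes = params * 32 / 8
gb = bytes / 1.0e9                 # decimal gigabytes
gib = bytes / :math.pow(2, 30)     # binary gibibytes
IO.puts("#{Float.round(gb, 1)} GB (~#{Float.round(gib, 1)} GiB)")
# prints: 28.0 GB (~26.1 GiB)
</code></pre></div></div>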

<p>Moving forward, a one-dimensional tensor is a vector (list), and a two-dimensional tensor is a matrix (list of lists). A tensor of three dimensions is, well, just a tensor of three dimensions — or nested lists.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">iex</span><span class="p">(</span><span class="mi">12</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">])</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="o">&gt;</span>

<span class="n">iex</span><span class="p">(</span><span class="mi">13</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>

<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">1</span><span class="p">][</span><span class="mi">2</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
    <span class="p">],</span>
    <span class="p">[</span>
      <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]</span>
    <span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>
</code></pre></div></div>
<p>Tensors have some typical operations. We have already seen Nx.rank. We also have Nx.shape:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">]],</span> <span class="p">[[</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">]],</span> <span class="p">[[</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]]])</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span>
<span class="p">{</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">}</span>
</code></pre></div></div>

<p>The outer list has 3 elements, each nested list has one element, and each innermost list has two elements, hence the shape {3, 1, 2}.</p>

<p>Nx.reshape is another useful function that changes the shape of a tensor. This will be very useful later in this book.</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">],</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">]])</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">reshape</span><span class="p">({</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">})</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">2</span><span class="p">][</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>
</code></pre></div></div>

<p>PyTorch also has the <code class="language-plaintext highlighter-rouge">view</code> function, which behaves like <code class="language-plaintext highlighter-rouge">reshape</code> but works only on contiguous tensors and never copies data. Nx does not have a <code class="language-plaintext highlighter-rouge">view</code> function. The next useful function is transpose, which flips a matrix across its diagonal:</p>
<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">9</span><span class="p">]])</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>
<span class="n">iex</span><span class="p">(</span><span class="mi">23</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span><span class="mi">6</span><span class="p">],</span> <span class="p">[</span><span class="mi">7</span><span class="p">,</span><span class="mi">8</span><span class="p">,</span><span class="mi">9</span><span class="p">]])</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">transpose</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span>
    <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">7</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">8</span><span class="p">],</span>
    <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">9</span><span class="p">]</span>
  <span class="p">]</span>
<span class="o">&gt;</span>
</code></pre></div></div>

<p>We also want to multiply tensors. Note that Nx.multiply is element-wise (with broadcasting); matrix multiplication is a different operation, Nx.dot:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># rank 0 </span>
<span class="n">iex</span><span class="p">(</span><span class="mi">33</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span>
  <span class="mi">2</span>
<span class="o">&gt;</span>
<span class="c1"># rank 1</span>
<span class="n">iex</span><span class="p">(</span><span class="mi">34</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]),</span><span class="mi">2</span><span class="p">)</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">6</span><span class="p">]</span>
<span class="o">&gt;</span>
<span class="c1"># rank 2</span>
<span class="n">iex</span><span class="p">(</span><span class="mi">35</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">multiply</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]),</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]))</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">s32</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
  <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">9</span><span class="p">]</span>
<span class="o">&gt;</span>
</code></pre></div></div>
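<p>Since later chapters lean heavily on matrix multiplication, here is a small contrast between the two operations (my own sketch, not an example from the book):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>a = Nx.tensor([[1, 2], [3, 4]])
b = Nx.tensor([[5, 6], [7, 8]])

# Element-wise product: multiplies matching entries.
Nx.multiply(a, b)
# [[5, 12], [21, 32]]

# Matrix product: rows of a combined with columns of b.
Nx.dot(a, b)
# [[19, 22], [43, 50]]
</code></pre></div></div>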

<p>This detour about tensors ends here, and we survived. Elixir has a tensor implementation in the Nx library!</p>

<p>Embedding is the concept where we transform input data (human-readable text) into vector form. We can also embed video, audio, and images, so embedding is not only for text. Each data type has its own embedding model, so a text embedding model cannot be used for videos. A unit in text processing could be a word, sentence, or paragraph. Now I know what RAG is: retrieval-augmented generation. The model combines text generation with retrieval of external knowledge sources (the internet or databases).</p>

<p>An LLM is not the only way to create word embeddings. One example is the Word2Vec algorithm, which either predicts the surrounding context words from a target word (skip-gram) or the target word from its context (CBOW). The idea is that words appearing in similar contexts end up close to each other in the embedding space. What does “close” mean? For example, if a vector has 2 dimensions, words are represented as points in an x-y coordinate system, and nearby points are similar words.</p>

<p>Below is a tiny 2D coordinate sketch that shows “closeness” in the embedding space:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EMBEDDINGS: 2D COORDINATE SKETCH

y
^
|             (king)
|               *
|      (queen) *      *
|                    (prince)
|
|   (cat)  *           (dog) *
|
+----------------------------------&gt; x
</code></pre></div></div>
<p>Word2Vec has more than 2 dimensions, up to 1000. LLMs have more dimensions; for example, GPT-3 has 12,288 dimensions. The number of dimensions is a trade-off between performance and efficiency.</p>
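<p>In practice, “closeness” is usually measured with cosine similarity. Here is a minimal Nx sketch (my own, not from the book), using made-up 2D vectors:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Cosine similarity between two made-up 2D embedding vectors.
cat = Nx.tensor([2.0, 3.0])
dog = Nx.tensor([2.5, 2.8])

norm = fn t -&gt; t |&gt; Nx.multiply(t) |&gt; Nx.sum() |&gt; Nx.sqrt() end
cosine = Nx.divide(Nx.dot(cat, dog), Nx.multiply(norm.(cat), norm.(dog)))
# A value close to 1.0 means the vectors point in almost the same direction.
</code></pre></div></div>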

<h3 id="give-me-some-tokens">Give me some tokens!</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>The Verdict (raw text)
  -&gt; split text
  -&gt; keep useful punctuation
  -&gt; token list
</code></pre></div></div>

<p>We will tokenize the short story The Verdict by Edith Wharton.</p>

<p><em>The narrator visits the Riviera home of Jack Gisburn, a once‑celebrated painter who abruptly stopped working after marrying a wealthy widow. Looking at Gisburn’s last portrait, the narrator recognizes it as empty showmanship and realizes the artist’s reputation rested more on social fashion than genuine talent. The story ends with the narrator quietly keeping this “verdict” to himself, leaving Gisburn’s public acclaim undisturbed.</em></p>

<p>Here is how you can work with files and URLs in Elixir. We use the proven Req library, created by Wojtek Mach, for HTTP requests; the File module is part of Elixir’s standard library.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="s2">"the-verdict.txt character count and exact line 99 content"</span> <span class="k">do</span>
    <span class="n">url</span> <span class="o">=</span>
      <span class="s2">"https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"</span>

    <span class="n">filename</span> <span class="o">=</span> <span class="s2">"the-verdict.txt"</span>
    <span class="p">%</span><span class="no">Req</span><span class="o">.</span><span class="no">Response</span><span class="p">{</span><span class="ss">status:</span> <span class="mi">200</span><span class="p">,</span> <span class="ss">body:</span> <span class="n">body</span><span class="p">}</span> <span class="o">=</span> <span class="no">Req</span><span class="o">.</span><span class="n">get!</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
    <span class="no">File</span><span class="o">.</span><span class="n">write!</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">body</span><span class="p">)</span>
    <span class="c1"># Read the file content as a single string</span>
    
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>

    <span class="c1"># Assert on the number of characters in the file</span>
    <span class="n">char_count</span> <span class="o">=</span> <span class="no">String</span><span class="o">.</span><span class="n">length</span><span class="p">(</span><span class="n">file_content</span><span class="p">)</span>
    <span class="n">assert</span> <span class="n">char_count</span> <span class="o">==</span> <span class="mi">20_479</span>
    <span class="n">first_99_chars</span> <span class="o">=</span> <span class="no">String</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">file_content</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">99</span><span class="p">)</span>

    <span class="c1"># Assert on content of line 99</span>
    <span class="n">assert</span> <span class="n">first_99_chars</span> <span class="o">==</span>
             <span class="s2">"I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no "</span>
  <span class="k">end</span>
</code></pre></div></div>

<h4 id="regular-expressions">Regular expressions</h4>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Regex.split(...)
  -&gt; split on whitespace / punctuation
  -&gt; optionally keep separators
  -&gt; trim empties
  -&gt; tokenizer output pieces
</code></pre></div></div>

<p>For educational purposes, we will implement the first version of the tokenizer using regular expressions:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"split text on whitespace, keep the whitespace"</span> <span class="k">do</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s2">"Hello, world. This, is a test."</span>
    <span class="n">result</span> <span class="o">=</span> <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r{\s}</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">false</span><span class="p">)</span>
    <span class="n">assert</span> <span class="n">result</span> <span class="o">==</span> <span class="p">[</span><span class="s2">"Hello,"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="s2">"world."</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="s2">"This,"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="s2">"is"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="s2">"a"</span><span class="p">,</span> <span class="s2">" "</span><span class="p">,</span> <span class="s2">"test."</span><span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>The Regex module is included in Elixir. <code class="language-plaintext highlighter-rouge">\s</code> means we split the input on whitespace; <code class="language-plaintext highlighter-rouge">include_captures</code> determines whether we keep what we split on (in this example, the whitespace); and <code class="language-plaintext highlighter-rouge">trim: false</code> means empty strings produced by the split are kept in the result (<code class="language-plaintext highlighter-rouge">trim: true</code> drops them). The result is the token list in the assertion above.</p>
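<p>To make the <code class="language-plaintext highlighter-rouge">trim</code> option concrete, here is a small hypothetical iex session (input of my own choosing, not from the book):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>iex&gt; Regex.split(~r{[,.]|\s}, "Hi, there.", include_captures: true, trim: false)
["Hi", ",", "", " ", "there", ".", ""]

iex&gt; Regex.split(~r{[,.]|\s}, "Hi, there.", include_captures: true, trim: true)
["Hi", ",", " ", "there", "."]
</code></pre></div></div>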

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"split text on whitespace, commas, and periods, keep them except whitespace"</span> <span class="k">do</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s2">"Hello, world. This, is a test."</span>
    <span class="n">result</span> <span class="o">=</span> <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r{[,.]|\s}</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">result</span> <span class="o">==</span> <span class="p">[</span>
             <span class="s2">"Hello"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">" "</span><span class="p">,</span>
             <span class="s2">"world"</span><span class="p">,</span>
             <span class="s2">"."</span><span class="p">,</span>
             <span class="s2">" "</span><span class="p">,</span>
             <span class="s2">"This"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">" "</span><span class="p">,</span>
             <span class="s2">"is"</span><span class="p">,</span>
             <span class="s2">" "</span><span class="p">,</span>
             <span class="s2">"a"</span><span class="p">,</span>
             <span class="s2">" "</span><span class="p">,</span>
             <span class="s2">"test"</span><span class="p">,</span>
             <span class="s2">"."</span>
           <span class="p">]</span>

    <span class="c1"># Remove whitespaces from the result</span>
    <span class="n">result_no_whitespace</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="k">fn</span> <span class="n">s</span> <span class="o">-&gt;</span> <span class="n">s</span> <span class="o">==</span> <span class="s2">" "</span> <span class="k">end</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">result_no_whitespace</span> <span class="o">==</span> <span class="p">[</span>
             <span class="s2">"Hello"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">"world"</span><span class="p">,</span>
             <span class="s2">"."</span><span class="p">,</span>
             <span class="s2">"This"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">"is"</span><span class="p">,</span>
             <span class="s2">"a"</span><span class="p">,</span>
             <span class="s2">"test"</span><span class="p">,</span>
             <span class="s2">"."</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>Here we add the comma and period to the split characters, keep them via <code class="language-plaintext highlighter-rouge">include_captures</code>, and use <code class="language-plaintext highlighter-rouge">trim: true</code> to drop the empty strings produced by the split.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"split text on punctuation, keep them except whitespace"</span> <span class="k">do</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s2">"Hello, world. Is this-- a test?"</span>
    <span class="n">result</span> <span class="o">=</span> <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r{[,.:;?_!"()\']|--|\s}</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>
    <span class="c1"># Remove whitespaces from the result</span>
    <span class="n">result_no_whitespace</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="n">result</span><span class="p">,</span> <span class="k">fn</span> <span class="n">s</span> <span class="o">-&gt;</span> <span class="n">s</span> <span class="o">==</span> <span class="s2">" "</span> <span class="k">end</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">result_no_whitespace</span> <span class="o">==</span> <span class="p">[</span>
             <span class="s2">"Hello"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">"world"</span><span class="p">,</span>
             <span class="s2">"."</span><span class="p">,</span>
             <span class="s2">"Is"</span><span class="p">,</span>
             <span class="s2">"this"</span><span class="p">,</span>
             <span class="s2">"--"</span><span class="p">,</span>
             <span class="s2">"a"</span><span class="p">,</span>
             <span class="s2">"test"</span><span class="p">,</span>
             <span class="s2">"?"</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We extend the list of split characters to match what is used in The Verdict, and we remove whitespace from the result.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"predprocess the verdict.txt file"</span> <span class="k">do</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="s2">"the-verdict.txt"</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>

    <span class="n">result</span> <span class="o">=</span>
      <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r/([,.:;?_!"()\']|--|\s)/</span><span class="p">,</span> <span class="n">file_content</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>

    <span class="c1"># Remove whitespaces from the result - match Python's behavior: strip and filter empty</span>
    <span class="n">predprocessed_text</span> <span class="o">=</span>
      <span class="n">result</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="no">String</span><span class="o">.</span><span class="n">trim</span><span class="o">/</span><span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="nv">&amp;1</span> <span class="o">==</span> <span class="s2">""</span><span class="p">))</span>

    <span class="n">assert</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">predprocessed_text</span><span class="p">,</span> <span class="mi">0</span><span class="o">..</span><span class="mi">29</span><span class="p">)</span> <span class="o">==</span> <span class="p">[</span>
             <span class="s2">"I"</span><span class="p">,</span>
             <span class="s2">"HAD"</span><span class="p">,</span>
             <span class="s2">"always"</span><span class="p">,</span>
             <span class="s2">"thought"</span><span class="p">,</span>
             <span class="s2">"Jack"</span><span class="p">,</span>
             <span class="s2">"Gisburn"</span><span class="p">,</span>
             <span class="s2">"rather"</span><span class="p">,</span>
             <span class="s2">"a"</span><span class="p">,</span>
             <span class="s2">"cheap"</span><span class="p">,</span>
             <span class="s2">"genius"</span><span class="p">,</span>
             <span class="s2">"--"</span><span class="p">,</span>
             <span class="s2">"though"</span><span class="p">,</span>
             <span class="s2">"a"</span><span class="p">,</span>
             <span class="s2">"good"</span><span class="p">,</span>
             <span class="s2">"fellow"</span><span class="p">,</span>
             <span class="s2">"enough"</span><span class="p">,</span>
             <span class="s2">"--"</span><span class="p">,</span>
             <span class="s2">"so"</span><span class="p">,</span>
             <span class="s2">"it"</span><span class="p">,</span>
             <span class="s2">"was"</span><span class="p">,</span>
             <span class="s2">"no"</span><span class="p">,</span>
             <span class="s2">"great"</span><span class="p">,</span>
             <span class="s2">"surprise"</span><span class="p">,</span>
             <span class="s2">"to"</span><span class="p">,</span>
             <span class="s2">"me"</span><span class="p">,</span>
             <span class="s2">"to"</span><span class="p">,</span>
             <span class="s2">"hear"</span><span class="p">,</span>
             <span class="s2">"that"</span><span class="p">,</span>
             <span class="s2">","</span><span class="p">,</span>
             <span class="s2">"in"</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="n">length</span><span class="p">(</span><span class="n">predprocessed_text</span><span class="p">)</span> <span class="o">==</span> <span class="mi">4690</span>

    <span class="c1"># Use Pythonx to read the file</span>
    <span class="p">{</span><span class="n">result_obj</span><span class="p">,</span> <span class="n">_globals</span><span class="p">}</span> <span class="o">=</span>
      <span class="no">Pythonx</span><span class="o">.</span><span class="n">eval</span><span class="p">(</span>
        <span class="sd">"""
        import re
        with open("the-verdict.txt", "r", encoding="utf-8") as f:
            raw_text = f.read()
        pattern = r'''([,.:;?_!"()']|--|\s)'''
        preprocessed = re.split(pattern, raw_text)
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        preprocessed
        """</span><span class="p">,</span>
        <span class="p">%{}</span>
      <span class="p">)</span>

    <span class="n">preprocessed_text_python</span> <span class="o">=</span> <span class="no">Pythonx</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">result_obj</span><span class="p">)</span>
    <span class="n">assert</span> <span class="n">preprocessed_text_python</span> <span class="o">==</span> <span class="n">predprocessed_text</span>
 
    <span class="c1"># Verify the file content starts correctly</span>
    <span class="n">assert</span> <span class="no">String</span><span class="o">.</span><span class="n">starts_with?</span><span class="p">(</span>
             <span class="n">file_content</span><span class="p">,</span>
             <span class="s2">"I HAD always thought Jack Gisburn rather a cheap genius"</span>
           <span class="p">)</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We preprocessed The Verdict file: after the split we trim each piece and remove the empty strings. To compare the result with Sebastian’s Python implementation, we run his Python code through the Pythonx Elixir library.</p>

<h3 id="give-me-some-ids">Give me some IDs</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens
  -&gt; sort + deduplicate
  -&gt; vocabulary
  -&gt; token &lt;-&gt; integer ID mapping
  -&gt; encode / decode
</code></pre></div></div>

<p>The idea is very simple. From the tokens, we create a vocabulary: we remove duplicate tokens and sort them, and that sorted, deduplicated list is the vocabulary. Then each token gets an integer ID.</p>
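<p>Here is the same idea on a toy token list (my own sketch, not code from the book or the repo):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens = ["the", "cat", "sat", "on", "the", "mat"]

vocab =
  tokens
  |&gt; MapSet.new()       # remove duplicates
  |&gt; Enum.sort()        # deterministic order
  |&gt; Enum.with_index()  # assign integer IDs

# [{"cat", 0}, {"mat", 1}, {"on", 2}, {"sat", 3}, {"the", 4}]
</code></pre></div></div>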

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="s2">"token IDs"</span> <span class="k">do</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="s2">"the-verdict.txt"</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>

    <span class="n">result</span> <span class="o">=</span>
      <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r/([,.:;?_!"()\']|--|\s)/</span><span class="p">,</span> <span class="n">file_content</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>

    <span class="c1"># Remove whitespaces from the result - match Python's behavior: strip and filter empty</span>
    <span class="n">predprocessed_text</span> <span class="o">=</span>
      <span class="n">result</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="no">String</span><span class="o">.</span><span class="n">trim</span><span class="o">/</span><span class="mi">1</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="nv">&amp;1</span> <span class="o">==</span> <span class="s2">""</span><span class="p">))</span>

    <span class="n">all_words</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="no">MapSet</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">predprocessed_text</span><span class="p">))</span>
    <span class="n">vocab_size</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span>
    <span class="n">assert</span> <span class="n">vocab_size</span> <span class="o">==</span> <span class="mi">1130</span>

    <span class="n">vocab</span> <span class="o">=</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">with_index</span><span class="p">(</span><span class="n">all_words</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="mi">0</span><span class="o">..</span><span class="mi">50</span><span class="p">)</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{</span><span class="s2">"!"</span><span class="p">,</span> <span class="mi">0</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"</span><span class="se">\"</span><span class="s2">"</span><span class="p">,</span> <span class="mi">1</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"'"</span><span class="p">,</span> <span class="mi">2</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"("</span><span class="p">,</span> <span class="mi">3</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">")"</span><span class="p">,</span> <span class="mi">4</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">","</span><span class="p">,</span> <span class="mi">5</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"--"</span><span class="p">,</span> <span class="mi">6</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"."</span><span class="p">,</span> <span class="mi">7</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">":"</span><span class="p">,</span> <span class="mi">8</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">";"</span><span class="p">,</span> <span class="mi">9</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"?"</span><span class="p">,</span> <span class="mi">10</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"A"</span><span class="p">,</span> <span class="mi">11</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Ah"</span><span class="p">,</span> <span class="mi">12</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Among"</span><span class="p">,</span> <span class="mi">13</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"And"</span><span class="p">,</span> <span class="mi">14</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Are"</span><span class="p">,</span> <span class="mi">15</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Arrt"</span><span class="p">,</span> <span class="mi">16</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"As"</span><span class="p">,</span> <span class="mi">17</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"At"</span><span class="p">,</span> <span class="mi">18</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Be"</span><span class="p">,</span> <span class="mi">19</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Begin"</span><span class="p">,</span> <span class="mi">20</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Burlington"</span><span class="p">,</span> <span class="mi">21</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"But"</span><span class="p">,</span> <span class="mi">22</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"By"</span><span class="p">,</span> <span class="mi">23</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Carlo"</span><span class="p">,</span> <span class="mi">24</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Chicago"</span><span class="p">,</span> <span class="mi">25</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Claude"</span><span class="p">,</span> <span class="mi">26</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Come"</span><span class="p">,</span> <span class="mi">27</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Croft"</span><span class="p">,</span> <span class="mi">28</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Destroyed"</span><span class="p">,</span> <span class="mi">29</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Devonshire"</span><span class="p">,</span> <span class="mi">30</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Don"</span><span class="p">,</span> <span class="mi">31</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Dubarry"</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Emperors"</span><span class="p">,</span> <span class="mi">33</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Florence"</span><span class="p">,</span> <span class="mi">34</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"For"</span><span class="p">,</span> <span class="mi">35</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Gallery"</span><span class="p">,</span> <span class="mi">36</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Gideon"</span><span class="p">,</span> <span class="mi">37</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Gisburn"</span><span class="p">,</span> <span class="mi">38</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Gisburns"</span><span class="p">,</span> <span class="mi">39</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Grafton"</span><span class="p">,</span> <span class="mi">40</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Greek"</span><span class="p">,</span> <span class="mi">41</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Grindle"</span><span class="p">,</span> <span class="mi">42</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Grindles"</span><span class="p">,</span> <span class="mi">43</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"HAD"</span><span class="p">,</span> <span class="mi">44</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Had"</span><span class="p">,</span> <span class="mi">45</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Hang"</span><span class="p">,</span> <span class="mi">46</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Has"</span><span class="p">,</span> <span class="mi">47</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"He"</span><span class="p">,</span> <span class="mi">48</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Her"</span><span class="p">,</span> <span class="mi">49</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"Hermia"</span><span class="p">,</span> <span class="mi">50</span><span class="p">}</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>To remove duplicate tokens, we use the MapSet data structure.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">iex</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">MapSet</span><span class="o">.</span><span class="n">new</span><span class="p">([</span><span class="s2">"Karlo"</span><span class="p">,</span> <span class="s2">"karlo"</span><span class="p">,</span> <span class="s2">"Karlo"</span><span class="p">,</span> <span class="s2">"Smid"</span><span class="p">])</span>
<span class="no">MapSet</span><span class="o">.</span><span class="n">new</span><span class="p">([</span><span class="s2">"Karlo"</span><span class="p">,</span> <span class="s2">"Smid"</span><span class="p">,</span> <span class="s2">"karlo"</span><span class="p">])</span>
</code></pre></div></div>

<p>At the end, we list the first 51 vocabulary entries.</p>

<p>The last thing to do is a simple tokenizer. Python implements it as a class; in Elixir we use a module. It has the following functions:</p>

<table>
  <thead>
    <tr>
      <th>Function</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>vocab_from_file</td>
      <td>Creates a sorted token vocabulary from text file (with special tokens)</td>
    </tr>
    <tr>
      <td>encode</td>
      <td>Converts a text string into a sequence of IDs using vocabulary</td>
    </tr>
    <tr>
      <td>decode</td>
      <td>Converts a sequence of IDs back into a text string using vocabulary</td>
    </tr>
    <tr>
      <td>tokenize</td>
      <td>Splits text into tokens (words and punctuation used in The Verdict)</td>
    </tr>
  </tbody>
</table>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  A simple tokenizer that tokenizes text into words.
  """</span>
  <span class="k">def</span> <span class="n">vocab_from_file</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">special_tokens</span> <span class="p">\\</span> <span class="p">[])</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>

    <span class="n">tokens</span> <span class="o">=</span>
      <span class="n">tokenize</span><span class="p">(</span><span class="n">file_content</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">MapSet</span><span class="o">.</span><span class="n">new</span><span class="p">()</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">sort</span><span class="p">()</span>

    <span class="p">(</span><span class="n">tokens</span> <span class="o">++</span> <span class="n">special_tokens</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">with_index</span><span class="p">()</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="k">fn</span> <span class="n">token</span> <span class="o">-&gt;</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">vocab_token</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">token</span> <span class="o">==</span> <span class="n">vocab_token</span> <span class="k">end</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="k">case</span> <span class="k">do</span>
        <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="n">id</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">id</span>
        <span class="no">nil</span> <span class="o">-&gt;</span> <span class="n">find_unknown_token</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span>
      <span class="k">end</span>
    <span class="k">end</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">find_unknown_token</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="n">token</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Enum</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">vocab_token</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">vocab_token</span> <span class="o">==</span> <span class="s2">"&lt;|unk|&gt;"</span> <span class="k">end</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="k">case</span> <span class="k">do</span>
      <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="n">id</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">id</span>
      <span class="no">nil</span> <span class="o">-&gt;</span> <span class="k">raise</span> <span class="s2">"Token not found in vocab: </span><span class="si">#{</span><span class="n">token</span><span class="si">}</span><span class="s2">"</span>
    <span class="k">end</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="n">decode</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">text</span> <span class="o">=</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="k">fn</span> <span class="n">id</span> <span class="o">-&gt;</span>
        <span class="no">Enum</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">_</span><span class="p">,</span> <span class="n">vocab_id</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">id</span> <span class="o">==</span> <span class="n">vocab_id</span> <span class="k">end</span><span class="p">)</span>
        <span class="o">|&gt;</span> <span class="k">case</span> <span class="k">do</span>
          <span class="p">{</span><span class="n">vocab_token</span><span class="p">,</span> <span class="n">_</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">vocab_token</span>
          <span class="no">nil</span> <span class="o">-&gt;</span> <span class="k">raise</span> <span class="s2">"ID not found in vocab: </span><span class="si">#{</span><span class="n">id</span><span class="si">}</span><span class="s2">"</span>
        <span class="k">end</span>
      <span class="k">end</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="s2">" "</span><span class="p">)</span>

    <span class="no">Regex</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="sr">~r/\s+([,.?!"()\'])/</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="s2">"</span><span class="se">\\</span><span class="s2">1"</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">tokenize</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Regex</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="sr">~r/([,.:;?_!"()\']|--|\s)/</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="ss">include_captures:</span> <span class="no">true</span><span class="p">,</span> <span class="ss">trim:</span> <span class="no">true</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="o">&amp;</span><span class="no">String</span><span class="o">.</span><span class="n">trim</span><span class="o">/</span><span class="mi">1</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reject</span><span class="p">(</span><span class="o">&amp;</span><span class="p">(</span><span class="nv">&amp;1</span> <span class="o">==</span> <span class="s2">""</span><span class="p">))</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
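<p>One design note: <code class="language-plaintext highlighter-rouge">encode</code> and <code class="language-plaintext highlighter-rouge">decode</code> scan the vocabulary list with Enum.find, which is linear per token. For a vocabulary of about 1,100 entries that is fine; for larger vocabularies you could build maps in both directions for constant-time lookups. A possible sketch (my own, not part of the repo):</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># vocab is a list of {token, id} tuples, so it converts directly into maps.
token_to_id = Map.new(vocab)
id_to_token = Map.new(vocab, fn {token, id} -&gt; {id, token} end)

Map.fetch!(token_to_id, "the")   # =&gt; the integer ID of "the"
Map.fetch!(id_to_token, 0)       # =&gt; "!" for the vocabulary built above
</code></pre></div></div>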

<p>And now we can use the simple tokenizer:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="s2">"encode and decode text with simple tokenizer"</span> <span class="k">do</span>
    <span class="n">filename</span> <span class="o">=</span> <span class="s2">"the-verdict.txt"</span>
    <span class="n">vocab</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">vocab_from_file</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>

    <span class="n">text</span> <span class="o">=</span> <span class="sd">"""
    "It's the last he painted, you know," 
           Mrs. Gisburn said with pardonable pride.
    """</span>

    <span class="n">encoded_text</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">encoded_text</span> <span class="o">==</span> <span class="p">[</span>
             <span class="mi">1</span><span class="p">,</span>
             <span class="mi">56</span><span class="p">,</span>
             <span class="mi">2</span><span class="p">,</span>
             <span class="mi">850</span><span class="p">,</span>
             <span class="mi">988</span><span class="p">,</span>
             <span class="mi">602</span><span class="p">,</span>
             <span class="mi">533</span><span class="p">,</span>
             <span class="mi">746</span><span class="p">,</span>
             <span class="mi">5</span><span class="p">,</span>
             <span class="mi">1126</span><span class="p">,</span>
             <span class="mi">596</span><span class="p">,</span>
             <span class="mi">5</span><span class="p">,</span>
             <span class="mi">1</span><span class="p">,</span>
             <span class="mi">67</span><span class="p">,</span>
             <span class="mi">7</span><span class="p">,</span>
             <span class="mi">38</span><span class="p">,</span>
             <span class="mi">851</span><span class="p">,</span>
             <span class="mi">1108</span><span class="p">,</span>
             <span class="mi">754</span><span class="p">,</span>
             <span class="mi">793</span><span class="p">,</span>
             <span class="mi">7</span>
           <span class="p">]</span>

    <span class="n">decoded_text</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">encoded_text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">decoded_text</span> <span class="o">==</span>
             <span class="s2">"</span><span class="se">\"</span><span class="s2"> It' s the last he painted, you know,</span><span class="se">\"</span><span class="s2"> Mrs. Gisburn said with pardonable pride."</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>What about unknown tokens?</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="n">test</span> <span class="s2">"missing token in vocab raises an error"</span> <span class="k">do</span>
    <span class="n">vocab</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">vocab_from_file</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">)</span>
    <span class="n">text</span> <span class="o">=</span> <span class="s2">"Hello, do you like tea. Is this-- a test?"</span>

    <span class="n">assert_raise</span> <span class="no">RuntimeError</span><span class="p">,</span> <span class="s2">"Token not found in vocab: Hello"</span><span class="p">,</span> <span class="k">fn</span> <span class="o">-&gt;</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span>
    <span class="k">end</span>
  <span class="k">end</span>
</code></pre></div></div>

<h3 id="i-am-a-special-token">I am a special token</h3>

<p>In the token world we also have special tokens. These tokens have a special purpose: they set context in the training data. For example, when we encounter a token that is not in the vocabulary, raising a RuntimeError is not a useful training practice; instead, we mark that we hit an unknown token and move on. For that purpose, we use the token &lt;|unk|&gt;. We are training our LLM on one short story, The Verdict, but in the real world we will use many books. To mark where a book ends, we use the token &lt;|endoftext|&gt; as its last token; the training algorithm then knows that a new context starts with the next book. Let’s see that in our <code class="language-plaintext highlighter-rouge">SimpleTokenizerV1</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="n">test</span> <span class="s2">"encode and decode text with special token"</span> <span class="k">do</span>
    <span class="n">vocab</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">vocab_from_file</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">,</span> <span class="p">[</span><span class="s2">"&lt;|endoftext|&gt;"</span><span class="p">,</span> <span class="s2">"&lt;|unk|&gt;"</span><span class="p">])</span>

    <span class="n">assert</span> <span class="n">length</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">==</span> <span class="mi">1132</span>
    <span class="n">last_five</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">vocab</span><span class="p">,</span> <span class="o">-</span><span class="mi">5</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">last_five</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{</span><span class="s2">"younger"</span><span class="p">,</span> <span class="mi">1127</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"your"</span><span class="p">,</span> <span class="mi">1128</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"yourself"</span><span class="p">,</span> <span class="mi">1129</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"&lt;|endoftext|&gt;"</span><span class="p">,</span> <span class="mi">1130</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">"&lt;|unk|&gt;"</span><span class="p">,</span> <span class="mi">1131</span><span class="p">}</span>
           <span class="p">]</span>

    <span class="n">text1</span> <span class="o">=</span> <span class="s2">"Hello, do you like tea?"</span>
    <span class="n">text2</span> <span class="o">=</span> <span class="s2">"In the sunlit terraces of the palace."</span>
    <span class="n">text</span> <span class="o">=</span> <span class="n">text1</span> <span class="o">&lt;&gt;</span> <span class="s2">" &lt;|endoftext|&gt; "</span> <span class="o">&lt;&gt;</span> <span class="n">text2</span>

    <span class="n">assert</span> <span class="n">text</span> <span class="o">==</span> <span class="s2">"Hello, do you like tea? &lt;|endoftext|&gt; In the sunlit terraces of the palace."</span>

    <span class="n">assert</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="n">vocab</span><span class="p">)</span> <span class="o">==</span> <span class="p">[</span>
             <span class="mi">1131</span><span class="p">,</span>
             <span class="mi">5</span><span class="p">,</span>
             <span class="mi">355</span><span class="p">,</span>
             <span class="mi">1126</span><span class="p">,</span>
             <span class="mi">628</span><span class="p">,</span>
             <span class="mi">975</span><span class="p">,</span>
             <span class="mi">10</span><span class="p">,</span>
             <span class="mi">1130</span><span class="p">,</span>
             <span class="mi">55</span><span class="p">,</span>
             <span class="mi">988</span><span class="p">,</span>
             <span class="mi">956</span><span class="p">,</span>
             <span class="mi">984</span><span class="p">,</span>
             <span class="mi">722</span><span class="p">,</span>
             <span class="mi">988</span><span class="p">,</span>
             <span class="mi">1131</span><span class="p">,</span>
             <span class="mi">7</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="n">text</span>
           <span class="o">|&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span>
           <span class="o">|&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">SimpleTokenizerV1</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">vocab</span><span class="p">)</span> <span class="o">==</span>
             <span class="s2">"&lt;|unk|&gt;, do you like tea? &lt;|endoftext|&gt; In the sunlit terraces of the &lt;|unk|&gt;."</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We add &lt;|endoftext|&gt; and &lt;|unk|&gt; at the end of our vocabulary, and we concatenate two text examples with different contexts that contain unknown vocabulary tokens. At the end of the test, you see what we get after the <code class="language-plaintext highlighter-rouge">encode() |&gt; decode()</code> sequence. You can use other tokens for the same purpose, like [BOS] and [EOS] for the beginning and end of a sequence, or [PAD] for padding input texts of different lengths to the same length.</p>
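
<p>To make the [PAD] idea concrete, here is a minimal sketch (my own illustration, not from the book or our tokenizer) that pads lists of token ids to the same length using a hypothetical <code class="language-plaintext highlighter-rouge">pad_id</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># A hypothetical pad_id; in a real vocab this would be the id of a [PAD] token.
pad_id = 1132

pad_to_same_length = fn sequences -&gt;
  max_len = sequences |&gt; Enum.map(&amp;length/1) |&gt; Enum.max()

  Enum.map(sequences, fn seq -&gt;
    seq ++ List.duplicate(pad_id, max_len - length(seq))
  end)
end

pad_to_same_length.([[1131, 5, 355], [55, 988]])
# =&gt; [[1131, 5, 355], [55, 988, 1132]]
</code></pre></div></div>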

<h3 id="byte-pair-encoding">Byte pair encoding</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>text
  -&gt; bytes / subword pieces
  -&gt; frequent merges
  -&gt; BPE tokens
  -&gt; handles unknown words better
</code></pre></div></div>

<p>Byte pair encoding (BPE) is the real deal because it was used to tokenize the training data for LLMs like GPT-2, GPT-3, and the original models used in ChatGPT. Sebastian does not go into the details of the <a href="https://arxiv.org/abs/2306.16837">Byte pair encoding algorithm</a> because it is a complicated topic. In the byte pair encoding algorithm, text is first broken into individual symbols, like a, b, c, … Then, in repeated passes, it merges pairs of symbols that frequently appear together, like de in dentist or define. That is the basic idea; the rest you can find in the paper. Because of that, byte pair encoding can handle all words, not just words from the vocabulary. tiktoken is a popular library with a byte pair encoding implementation. There is a Rust implementation, and thanks to the rustler library, we have tiktoken for Elixir. Note how <code class="language-plaintext highlighter-rouge">someunknownPlace</code> is successfully encoded and decoded.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"byte pair encoding using gpt2 tiktoken"</span> <span class="k">do</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">"code-davinci-002"</span>

    <span class="n">text</span> <span class="o">=</span> <span class="s2">"Hello, do you like tea? &lt;|endoftext|&gt; In the sunlit terraces\of someunknownPlace."</span>

    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">encoded_tokens</span><span class="p">}</span> <span class="o">=</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">text</span><span class="p">,</span> <span class="p">[</span><span class="s2">"&lt;|endoftext|&gt;"</span><span class="p">])</span>

    <span class="n">assert</span> <span class="n">encoded_tokens</span> <span class="o">==</span> <span class="p">[</span>
             <span class="mi">15496</span><span class="p">,</span>
             <span class="mi">11</span><span class="p">,</span>
             <span class="mi">466</span><span class="p">,</span>
             <span class="mi">345</span><span class="p">,</span>
             <span class="mi">588</span><span class="p">,</span>
             <span class="mi">8887</span><span class="p">,</span>
             <span class="mi">30</span><span class="p">,</span>
             <span class="mi">220</span><span class="p">,</span>
             <span class="mi">50256</span><span class="p">,</span>
             <span class="mi">554</span><span class="p">,</span>
             <span class="mi">262</span><span class="p">,</span>
             <span class="mi">4252</span><span class="p">,</span>
             <span class="mi">18250</span><span class="p">,</span>
             <span class="mi">8812</span><span class="p">,</span>
             <span class="mi">2114</span><span class="p">,</span>
             <span class="mi">1659</span><span class="p">,</span>
             <span class="mi">617</span><span class="p">,</span>
             <span class="mi">34680</span><span class="p">,</span>
             <span class="mi">27271</span><span class="p">,</span>
             <span class="mi">13</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">encoded_tokens</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">text</span><span class="p">}</span>
  <span class="k">end</span>
</code></pre></div></div>
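
<p>The merging itself happens inside tiktoken. To give a feeling for the idea, here is a toy sketch of a single merge step (my own illustration, not the actual GPT-2/tiktoken algorithm): count adjacent pairs of symbols, pick the most frequent pair, and merge it everywhere:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>words = [["d", "e", "n", "t", "i", "s", "t"], ["d", "e", "f", "i", "n", "e"]]

# Count every adjacent pair of symbols and pick the most frequent one.
{[a, b], _count} =
  words
  |&gt; Enum.flat_map(&amp;Enum.chunk_every(&amp;1, 2, 1, :discard))
  |&gt; Enum.frequencies()
  |&gt; Enum.max_by(fn {_pair, count} -&gt; count end)
# =&gt; {["d", "e"], 2}

# Merge the winning pair left to right in every word.
merged =
  Enum.map(words, fn word -&gt;
    word
    |&gt; Enum.reduce([], fn
      token, [prev | rest] when prev == a and token == b -&gt; [a &lt;&gt; b | rest]
      token, acc -&gt; [token | acc]
    end)
    |&gt; Enum.reverse()
  end)
# =&gt; [["de", "n", "t", "i", "s", "t"], ["de", "f", "i", "n", "e"]]
</code></pre></div></div>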

<h3 id="sliding-windows">Sliding windows</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>token sequence
  -&gt; window(size=max_length, stride)
  -&gt; input chunk
  -&gt; target chunk (shifted by 1)
  -&gt; many training samples
</code></pre></div></div>

<p>OK, this section was hard for me; I did several reading passes. For training the LLM, we do not use just the embedded input text. We need embedded input/target pairs. The idea is rather simple: we use a sliding window.</p>

<p>Example with sliding window size <code class="language-plaintext highlighter-rouge">5</code> (text only, no embeddings):</p>

<p>Text:
<code class="language-plaintext highlighter-rouge">I love learning how to build small language models</code></p>

<p>Tokens:
<code class="language-plaintext highlighter-rouge">["I", "love", "learning", "how", "to", "build", "small", "language", "models"]</code></p>

<p>Training input/target pairs:</p>

<ol>
  <li>input: <code class="language-plaintext highlighter-rouge">["I", "love", "learning", "how", "to"]</code> -&gt; target: <code class="language-plaintext highlighter-rouge">"build"</code></li>
  <li>input: <code class="language-plaintext highlighter-rouge">["love", "learning", "how", "to", "build"]</code> -&gt; target: <code class="language-plaintext highlighter-rouge">"small"</code></li>
  <li>input: <code class="language-plaintext highlighter-rouge">["learning", "how", "to", "build", "small"]</code> -&gt; target: <code class="language-plaintext highlighter-rouge">"language"</code></li>
  <li>input: <code class="language-plaintext highlighter-rouge">["how", "to", "build", "small", "language"]</code> -&gt; target: <code class="language-plaintext highlighter-rouge">"models"</code></li>
</ol>
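
<p>As a quick sanity check of the idea (my own sketch, using plain strings instead of token ids), the same four pairs can be produced with <code class="language-plaintext highlighter-rouge">Enum.chunk_every/4</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens = ~w(I love learning how to build small language models)
window_size = 5

pairs =
  tokens
  |&gt; Enum.chunk_every(window_size + 1, 1, :discard)
  |&gt; Enum.map(fn chunk -&gt; {Enum.take(chunk, window_size), List.last(chunk)} end)
# =&gt; [{["I", "love", "learning", "how", "to"], "build"}, ...]
</code></pre></div></div>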

<p>Let’s do that in Elixir using The Verdict story. We drop the first 50 tokens of The Verdict because the passage that follows them is a more interesting part of the story.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"data sampling with sliding window"</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">"code-davinci-002"</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">encoded_tokens</span><span class="p">}</span> <span class="o">=</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">file_content</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">length</span><span class="p">(</span><span class="n">encoded_tokens</span><span class="p">)</span> <span class="o">==</span> <span class="mi">5145</span>
    <span class="n">encoded_last_50</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="n">encoded_tokens</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
    <span class="n">context_size</span> <span class="o">=</span> <span class="mi">4</span>

    <span class="n">context_desired_pairs</span> <span class="o">=</span>
      <span class="n">for</span> <span class="n">i</span> <span class="o">&lt;-</span> <span class="mi">1</span><span class="o">..</span><span class="n">context_size</span> <span class="k">do</span>
        <span class="n">context</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">encoded_last_50</span><span class="p">,</span> <span class="mi">0</span><span class="o">..</span><span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">desired</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">at</span><span class="p">(</span><span class="n">encoded_last_50</span><span class="p">,</span> <span class="n">i</span><span class="p">)</span>
        <span class="p">{</span><span class="n">context</span><span class="p">,</span> <span class="n">desired</span><span class="p">}</span>
      <span class="k">end</span>

    <span class="n">assert</span> <span class="n">context_desired_pairs</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{[</span><span class="mi">290</span><span class="p">],</span> <span class="mi">4920</span><span class="p">},</span>
             <span class="p">{[</span><span class="mi">290</span><span class="p">,</span> <span class="mi">4920</span><span class="p">],</span> <span class="mi">2241</span><span class="p">},</span>
             <span class="p">{[</span><span class="mi">290</span><span class="p">,</span> <span class="mi">4920</span><span class="p">,</span> <span class="mi">2241</span><span class="p">],</span> <span class="mi">287</span><span class="p">},</span>
             <span class="p">{[</span><span class="mi">290</span><span class="p">,</span> <span class="mi">4920</span><span class="p">,</span> <span class="mi">2241</span><span class="p">,</span> <span class="mi">287</span><span class="p">],</span> <span class="mi">257</span><span class="p">}</span>
           <span class="p">]</span>

    <span class="n">decoded_context_desired_pairs</span> <span class="o">=</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">context_desired_pairs</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">current_context</span><span class="p">,</span> <span class="n">current_desired</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">decoded_current_context</span><span class="p">}</span> <span class="o">=</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">current_context</span><span class="p">)</span>
        <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">decoded_current_desired</span><span class="p">}</span> <span class="o">=</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">decode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="p">[</span><span class="n">current_desired</span><span class="p">])</span>
        <span class="p">{</span><span class="n">decoded_current_context</span><span class="p">,</span> <span class="n">decoded_current_desired</span><span class="p">}</span>
      <span class="k">end</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">decoded_context_desired_pairs</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{</span><span class="s2">" and"</span><span class="p">,</span> <span class="s2">" established"</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">" and established"</span><span class="p">,</span> <span class="s2">" himself"</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">" and established himself"</span><span class="p">,</span> <span class="s2">" in"</span><span class="p">},</span>
             <span class="p">{</span><span class="s2">" and established himself in"</span><span class="p">,</span> <span class="s2">" a"</span><span class="p">}</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<blockquote>
  <p><strong><em>Note:</em></strong> <code class="language-plaintext highlighter-rouge">DataSet</code> and <code class="language-plaintext highlighter-rouge">DataLoader</code> are the hardest part so far. I will do my best, but you will have to use your imagination to make these concepts easier to understand.</p>
</blockquote>

<p>Next, we need DataLoader and DataSet helpers. In the Python world we use PyTorch classes; the last time I checked, there are no such modules in any Elixir library. If you know of one, please let me know (LinkedIn is the best option). I crafted DataLoader and DataSet with LLM help:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span> <span class="k">do</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Tokenizes `txt` and builds next-token prediction windows.

  ## Parameters

  - `txt` - input text to tokenize (`String.t()`)
  - `model` - tokenizer model name passed to `Tiktoken.encode/3` (`String.t()`)
  - `max_length` - number of tokens per input/target chunk (`pos_integer()`)
  - `stride` - number of tokens to move between consecutive windows (`pos_integer()`)

  Returns keyword list with:
  - `:input_chunks` - list of input tensors shaped `{max_length}`
  - `:target_chunks` - list of target tensors shaped `{max_length}`

  Each target chunk is the corresponding input chunk shifted by one token.
  Windows are generated with step `stride`.
  """</span>
  <span class="k">def</span> <span class="n">chunk_dataset</span><span class="p">(</span><span class="n">txt</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">}</span> <span class="o">=</span> <span class="no">Tiktoken</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="n">txt</span><span class="p">,</span> <span class="p">[</span><span class="s2">"&lt;|endoftext|&gt;"</span><span class="p">])</span>

    <span class="k">if</span> <span class="n">length</span><span class="p">(</span><span class="n">token_ids</span><span class="p">)</span> <span class="o">&lt;</span> <span class="n">max_length</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span>
      <span class="k">do</span><span class="p">:</span> <span class="k">raise</span><span class="p">(</span><span class="s2">"Number of tokenized inputs must be at least max_length + 1"</span><span class="p">)</span>

    <span class="n">num_chunks</span> <span class="o">=</span> <span class="n">length</span><span class="p">(</span><span class="n">token_ids</span><span class="p">)</span> <span class="o">-</span> <span class="n">max_length</span>

    <span class="n">chunks</span> <span class="o">=</span>
      <span class="mi">0</span><span class="o">..</span><span class="n">num_chunks</span><span class="o">//</span><span class="n">stride</span>
      <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reduce</span><span class="p">([</span><span class="ss">input_chunks:</span> <span class="p">[],</span> <span class="ss">target_chunks:</span> <span class="p">[]],</span> <span class="k">fn</span> <span class="n">i</span><span class="p">,</span> <span class="n">acc</span> <span class="o">-&gt;</span>
        <span class="n">input_chunk</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="n">i</span><span class="o">..</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">max_length</span> <span class="o">-</span> <span class="mi">1</span><span class="p">))</span>
        <span class="n">target_chunk</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">token_ids</span><span class="p">,</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">..</span><span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">max_length</span><span class="p">))</span>

        <span class="p">[</span>
          <span class="ss">input_chunks:</span> <span class="p">[</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">input_chunk</span><span class="p">)</span> <span class="o">|</span> <span class="n">acc</span><span class="p">[</span><span class="ss">:input_chunks</span><span class="p">]],</span>
          <span class="ss">target_chunks:</span> <span class="p">[</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">target_chunk</span><span class="p">)</span> <span class="o">|</span> <span class="n">acc</span><span class="p">[</span><span class="ss">:target_chunks</span><span class="p">]]</span>
        <span class="p">]</span>
      <span class="k">end</span><span class="p">)</span>

    <span class="p">[</span>
      <span class="ss">input_chunks:</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reverse</span><span class="p">(</span><span class="n">chunks</span><span class="p">[</span><span class="ss">:input_chunks</span><span class="p">]),</span>
      <span class="ss">target_chunks:</span> <span class="no">Enum</span><span class="o">.</span><span class="n">reverse</span><span class="p">(</span><span class="n">chunks</span><span class="p">[</span><span class="ss">:target_chunks</span><span class="p">])</span>
    <span class="p">]</span>
  <span class="k">end</span>

  <span class="nv">@doc</span> <span class="sd">"""
  Creates a `LlmScratch.DataLoader` from raw text using GPT-style input/target pairs.

  ## Parameters

  - `opts` - keyword options controlling dataset chunking and dataloader behavior

  ## Options (`opts`)

  - `:raw_text` (required) - source text to tokenize
  - `:batch_size` (default: `4`) - number of `{input, target}` pairs per batch
  - `:max_length` (default: `256`) - token length of each sequence chunk
  - `:stride` (default: `128`) - step between consecutive windows
  - `:shuffle` (default: `true`) - whether to shuffle dataset before cycling
  - `:drop_last` (default: `true`) - whether to drop incomplete batches
  - `:num_workers` (default: `0`) - worker count for async iteration
  """</span>
  <span class="k">def</span> <span class="n">create_dataloader_v1</span><span class="p">(</span><span class="n">opts</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">raw_text</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">fetch!</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:raw_text</span><span class="p">)</span>
    <span class="n">batch_size</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:batch_size</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span>
    <span class="n">max_length</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:max_length</span><span class="p">,</span> <span class="mi">256</span><span class="p">)</span>
    <span class="n">stride</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:stride</span><span class="p">,</span> <span class="mi">128</span><span class="p">)</span>
    <span class="n">shuffle</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:shuffle</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
    <span class="n">drop_last</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:drop_last</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
    <span class="n">num_workers</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:num_workers</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

    <span class="c1"># Create dataset</span>
    <span class="p">[</span><span class="ss">input_chunks:</span> <span class="n">input_chunks</span><span class="p">,</span> <span class="ss">target_chunks:</span> <span class="n">target_chunks</span><span class="p">]</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">chunk_dataset</span><span class="p">(</span><span class="n">raw_text</span><span class="p">,</span> <span class="s2">"code-davinci-002"</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span>

    <span class="c1"># Zip input and target chunks together so each item is {input, target}</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">zip</span><span class="p">(</span><span class="n">input_chunks</span><span class="p">,</span> <span class="n">target_chunks</span><span class="p">)</span>

    <span class="no">LlmScratch</span><span class="o">.</span><span class="no">DataLoader</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span>
      <span class="ss">batch_size:</span> <span class="n">batch_size</span><span class="p">,</span>
      <span class="ss">shuffle:</span> <span class="n">shuffle</span><span class="p">,</span>
      <span class="ss">drop_last:</span> <span class="n">drop_last</span><span class="p">,</span>
      <span class="ss">num_workers:</span> <span class="n">num_workers</span>
    <span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>Let’s start with the <code class="language-plaintext highlighter-rouge">chunk_dataset</code> function and a test that uses it:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="s2">"chunk dataset"</span> <span class="k">do</span>
    <span class="n">txt</span> <span class="o">=</span> <span class="s2">"Hello, do you like tea? &lt;|endoftext|&gt; In the sunlit terraces of someunknownPlace."</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">"code-davinci-002"</span>
    <span class="n">max_length</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">-</span> <span class="mi">1</span>
    <span class="n">stride</span> <span class="o">=</span> <span class="mi">5</span>
    <span class="n">chunks</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">chunk_dataset</span><span class="p">(</span><span class="n">txt</span><span class="p">,</span> <span class="n">model</span><span class="p">,</span> <span class="n">max_length</span><span class="p">,</span> <span class="n">stride</span><span class="p">)</span>
    <span class="n">assert</span> <span class="n">length</span><span class="p">(</span><span class="n">chunks</span><span class="p">)</span> <span class="o">==</span> <span class="mi">2</span>

    <span class="p">[</span><span class="ss">input_chunks:</span> <span class="n">input_chunks</span><span class="p">,</span> <span class="ss">target_chunks:</span> <span class="n">target_chunks</span><span class="p">]</span> <span class="o">=</span> <span class="n">chunks</span>

    <span class="n">assert</span> <span class="n">input_chunks</span> <span class="o">==</span> <span class="p">[</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">15496</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">466</span><span class="p">,</span> <span class="mi">345</span><span class="p">,</span> <span class="mi">588</span><span class="p">,</span> <span class="mi">8887</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">50256</span><span class="p">]),</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">8887</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">50256</span><span class="p">,</span> <span class="mi">554</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">4252</span><span class="p">,</span> <span class="mi">18250</span><span class="p">,</span> <span class="mi">8812</span><span class="p">]),</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">262</span><span class="p">,</span> <span class="mi">4252</span><span class="p">,</span> <span class="mi">18250</span><span class="p">,</span> <span class="mi">8812</span><span class="p">,</span> <span class="mi">2114</span><span class="p">,</span> <span class="mi">286</span><span class="p">,</span> <span class="mi">617</span><span class="p">,</span> <span class="mi">34680</span><span class="p">,</span> <span class="mi">27271</span><span class="p">])</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="n">target_chunks</span> <span class="o">==</span> <span class="p">[</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">11</span><span class="p">,</span> <span class="mi">466</span><span class="p">,</span> <span class="mi">345</span><span class="p">,</span> <span class="mi">588</span><span class="p">,</span> <span class="mi">8887</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">50256</span><span class="p">,</span> <span class="mi">554</span><span class="p">]),</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">30</span><span class="p">,</span> <span class="mi">220</span><span class="p">,</span> <span class="mi">50256</span><span class="p">,</span> <span class="mi">554</span><span class="p">,</span> <span class="mi">262</span><span class="p">,</span> <span class="mi">4252</span><span class="p">,</span> <span class="mi">18250</span><span class="p">,</span> <span class="mi">8812</span><span class="p">,</span> <span class="mi">2114</span><span class="p">]),</span>
             <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">4252</span><span class="p">,</span> <span class="mi">18250</span><span class="p">,</span> <span class="mi">8812</span><span class="p">,</span> <span class="mi">2114</span><span class="p">,</span> <span class="mi">286</span><span class="p">,</span> <span class="mi">617</span><span class="p">,</span> <span class="mi">34680</span><span class="p">,</span> <span class="mi">27271</span><span class="p">,</span> <span class="mi">13</span><span class="p">])</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>
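
<p>To see where the three windows above come from, here is a small sketch (my own, reusing the numbers from this test and assuming the test text encodes to 20 tokens) of the index at which each window starts:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Assumption: the test text above encodes to 20 tokens.
num_tokens = 20
max_length = 9
stride = 5

# One window every `stride` tokens, as long as a full window still fits.
window_starts = Enum.to_list(0..(num_tokens - max_length)//stride)
# =&gt; [0, 5, 10]
</code></pre></div></div>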

<p><code class="language-plaintext highlighter-rouge">chunk_dataset</code> has four attributes, please check the function documentation. <code class="language-plaintext highlighter-rouge">txt</code> is input text that we want to tokenize. <code class="language-plaintext highlighter-rouge">model</code> is a code for tokenizer model. We aligned with model used in book, “code-davinci-002” (but it took me some trial and errors to figure out which code model to use, but little tester in me enjoyed that part!).<br />
  <code class="language-plaintext highlighter-rouge">max_length</code> defines number of Tensor elements, or using Tensor language, Nx.shape is {9} as Nx.rank is 1. So we have a list of 9 elements. You can see that shape in test assert. And that is actually window size.<br />
  <code class="language-plaintext highlighter-rouge">stride</code> is how many tokens we skip when we start new window. This is why in input_chunks, second tensor starts with token 8887, because this is first tensor element with index 5 (where we start counting from 0).<br />
  So we are cunking encoded input text in the shape that will be used to train our model.<br />
  <code class="language-plaintext highlighter-rouge">chunk_dataset</code> function is used in <code class="language-plaintext highlighter-rouge">create_dataloader_v1</code>, so let’s learn about that function also with a test example.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"gpt dataset v1"</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">"code-davinci-002"</span>

    <span class="n">dataloader</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">create_dataloader_v1</span><span class="p">(</span>
        <span class="ss">raw_text:</span> <span class="n">file_content</span><span class="p">,</span>
        <span class="ss">batch_size:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="ss">max_length:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="ss">stride:</span> <span class="mi">1</span><span class="p">,</span>
        <span class="ss">shuffle:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">drop_last:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">num_workers:</span> <span class="mi">0</span>
      <span class="p">)</span>

    <span class="n">batch_0</span> <span class="o">=</span> <span class="n">dataloader</span><span class="o">.</span><span class="n">stream</span> <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">at</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="p">[{</span><span class="n">input_chunk</span><span class="p">,</span> <span class="n">target_chunk</span><span class="p">}]</span> <span class="o">=</span> <span class="n">batch_0</span>
    <span class="n">input_decoded_chunk</span> <span class="o">=</span> <span class="n">decode_token_pieces</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_flat_list</span><span class="p">(</span><span class="n">input_chunk</span><span class="p">))</span> 
    <span class="n">target_decoded_chunk</span> <span class="o">=</span> <span class="n">decode_token_pieces</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_flat_list</span><span class="p">(</span><span class="n">target_chunk</span><span class="p">))</span>  
    <span class="n">assert</span> <span class="n">batch_0</span> <span class="o">==</span> <span class="p">[{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])}]</span>
    <span class="n">assert</span> <span class="p">[{</span><span class="n">input_decoded_chunk</span><span class="p">,</span> <span class="n">target_decoded_chunk</span><span class="p">}]</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{[</span><span class="s2">"I"</span><span class="p">,</span> <span class="s2">" H"</span><span class="p">,</span> <span class="s2">"AD"</span><span class="p">,</span> <span class="s2">" always"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" H"</span><span class="p">,</span> <span class="s2">"AD"</span><span class="p">,</span> <span class="s2">" always"</span><span class="p">,</span> <span class="s2">" thought"</span><span class="p">]}</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>The input parameters are well documented in the @doc comment; please check that documentation first. <code class="language-plaintext highlighter-rouge">raw_text</code> is human-readable text, in our case The Verdict story. <code class="language-plaintext highlighter-rouge">batch_size</code> determines the number of {input_tensor, target_tensor} pairs in one batch. <code class="language-plaintext highlighter-rouge">max_length</code> and <code class="language-plaintext highlighter-rouge">stride</code> we already know from the <code class="language-plaintext highlighter-rouge">chunk_dataset</code> function. The last three parameters, <code class="language-plaintext highlighter-rouge">shuffle</code>, <code class="language-plaintext highlighter-rouge">drop_last</code>, and <code class="language-plaintext highlighter-rouge">num_workers</code>, we just pass to <code class="language-plaintext highlighter-rouge">DataLoader</code>; they will be explained next as part of the <code class="language-plaintext highlighter-rouge">DataLoader</code> explanation.<br />
  To see how the data grows with <code class="language-plaintext highlighter-rouge">batch_size</code> 8, we have a test for that. In practice, <code class="language-plaintext highlighter-rouge">batch_size</code> is at least 256.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>  <span class="n">test</span> <span class="s2">"gpt dataset v1, batch_size is 8"</span> <span class="k">do</span>
    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">)</span>
    <span class="n">model</span> <span class="o">=</span> <span class="s2">"code-davinci-002"</span>

    <span class="n">dataloader</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">create_dataloader_v1</span><span class="p">(</span>
        <span class="ss">raw_text:</span> <span class="n">file_content</span><span class="p">,</span>
        <span class="ss">batch_size:</span> <span class="mi">8</span><span class="p">,</span>
        <span class="ss">max_length:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="ss">stride:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="ss">shuffle:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">drop_last:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">num_workers:</span> <span class="mi">0</span>
      <span class="p">)</span>

    <span class="n">batch_0</span> <span class="o">=</span> <span class="n">dataloader</span><span class="o">.</span><span class="n">stream</span> <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">at</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>

    <span class="n">decoded_batch_0</span> <span class="o">=</span>
      <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">batch_0</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">input_chunk</span><span class="p">,</span> <span class="n">target_chunk</span><span class="p">}</span> <span class="o">-&gt;</span>
        <span class="p">{</span>
          <span class="n">decode_token_pieces</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_flat_list</span><span class="p">(</span><span class="n">input_chunk</span><span class="p">)),</span>
          <span class="n">decode_token_pieces</span><span class="p">(</span><span class="n">model</span><span class="p">,</span> <span class="no">Nx</span><span class="o">.</span><span class="n">to_flat_list</span><span class="p">(</span><span class="n">target_chunk</span><span class="p">))</span>
        <span class="p">}</span>
      <span class="k">end</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">batch_0</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">,</span> <span class="mi">10899</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">,</span> <span class="mi">15632</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">15632</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">922</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">922</span><span class="p">,</span> <span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">568</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">568</span><span class="p">,</span> <span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">,</span> <span class="mi">1049</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1049</span><span class="p">,</span> <span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">,</span> <span class="mi">284</span><span class="p">])},</span>
             <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">284</span><span class="p">,</span> <span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">287</span><span class="p">])}</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="n">decoded_batch_0</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">{[</span><span class="s2">"I"</span><span class="p">,</span> <span class="s2">" H"</span><span class="p">,</span> <span class="s2">"AD"</span><span class="p">,</span> <span class="s2">" always"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" H"</span><span class="p">,</span> <span class="s2">"AD"</span><span class="p">,</span> <span class="s2">" always"</span><span class="p">,</span> <span class="s2">" thought"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">" thought"</span><span class="p">,</span> <span class="s2">" Jack"</span><span class="p">,</span> <span class="s2">" G"</span><span class="p">,</span> <span class="s2">"is"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" Jack"</span><span class="p">,</span> <span class="s2">" G"</span><span class="p">,</span> <span class="s2">"is"</span><span class="p">,</span> <span class="s2">"burn"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">"burn"</span><span class="p">,</span> <span class="s2">" rather"</span><span class="p">,</span> <span class="s2">" a"</span><span class="p">,</span> <span class="s2">" cheap"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" rather"</span><span class="p">,</span> <span class="s2">" a"</span><span class="p">,</span> <span class="s2">" cheap"</span><span class="p">,</span> <span class="s2">" genius"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">" genius"</span><span class="p">,</span> <span class="s2">"--"</span><span class="p">,</span> <span class="s2">"though"</span><span class="p">,</span> <span class="s2">" a"</span><span class="p">],</span> <span class="p">[</span><span class="s2">"--"</span><span class="p">,</span> <span class="s2">"though"</span><span class="p">,</span> <span class="s2">" a"</span><span class="p">,</span> <span class="s2">" good"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">" good"</span><span class="p">,</span> <span class="s2">" fellow"</span><span class="p">,</span> <span class="s2">" enough"</span><span class="p">,</span> <span class="s2">"--"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" fellow"</span><span class="p">,</span> <span class="s2">" enough"</span><span class="p">,</span> <span class="s2">"--"</span><span class="p">,</span> <span class="s2">"so"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">"so"</span><span class="p">,</span> <span class="s2">" it"</span><span class="p">,</span> <span class="s2">" was"</span><span class="p">,</span> <span class="s2">" no"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" it"</span><span class="p">,</span> <span class="s2">" was"</span><span class="p">,</span> <span class="s2">" no"</span><span class="p">,</span> <span class="s2">" great"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">" great"</span><span class="p">,</span> <span class="s2">" surprise"</span><span class="p">,</span> <span class="s2">" to"</span><span class="p">,</span> <span class="s2">" me"</span><span class="p">],</span> <span class="p">[</span><span class="s2">" surprise"</span><span class="p">,</span> <span class="s2">" to"</span><span class="p">,</span> <span class="s2">" me"</span><span class="p">,</span> <span class="s2">" to"</span><span class="p">]},</span>
             <span class="p">{[</span><span class="s2">" to"</span><span class="p">,</span> <span class="s2">" hear"</span><span class="p">,</span> <span class="s2">" that"</span><span class="p">,</span> <span class="s2">","</span><span class="p">],</span> <span class="p">[</span><span class="s2">" hear"</span><span class="p">,</span> <span class="s2">" that"</span><span class="p">,</span> <span class="s2">","</span><span class="p">,</span> <span class="s2">" in"</span><span class="p">]}</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>And now it is time to tackle the <code class="language-plaintext highlighter-rouge">DataLoader</code> beast:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">DataLoader</span> <span class="k">do</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Builds a data loader map from an in-memory dataset.

  Returns a map with:
  - `:stream` - an infinite stream of batches
  - `:batch_size` - configured batch size (default `32`)
  - `:drop_last` - whether incomplete batches are dropped
  - `:num_workers` - concurrency used by `iterate/2`

  ## Options
  - `:batch_size` - number of samples per batch (default: `32`)
  - `:shuffle` - shuffles dataset once before cycling (default: `true`)
  - `:drop_last` - drops batches smaller than `:batch_size` (default: `false`)
  - `:num_workers` - parallel workers for iteration (default: `0`)
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">opts</span> <span class="p">\\</span> <span class="p">[])</span> <span class="ow">when</span> <span class="n">is_list</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">batch_size</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:batch_size</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
    <span class="n">shuffle</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:shuffle</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span>
    <span class="n">drop_last</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:drop_last</span><span class="p">,</span> <span class="no">false</span><span class="p">)</span>
    <span class="n">num_workers</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:num_workers</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>

    <span class="n">stream</span> <span class="o">=</span>
      <span class="n">dataset</span>
      <span class="o">|&gt;</span> <span class="n">prepare_dataset</span><span class="p">(</span><span class="n">shuffle</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="no">Stream</span><span class="o">.</span><span class="n">chunk_every</span><span class="p">(</span><span class="n">batch_size</span><span class="p">)</span>
      <span class="o">|&gt;</span> <span class="n">filter_incomplete_batches</span><span class="p">(</span><span class="n">drop_last</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">)</span>

    <span class="p">%{</span>
      <span class="ss">stream:</span> <span class="n">stream</span><span class="p">,</span>
      <span class="ss">batch_size:</span> <span class="n">batch_size</span><span class="p">,</span>
      <span class="ss">drop_last:</span> <span class="n">drop_last</span><span class="p">,</span>
      <span class="ss">num_workers:</span> <span class="n">num_workers</span>
    <span class="p">}</span>
  <span class="k">end</span>
  
  <span class="nv">@doc</span> <span class="sd">"""
  Iterates over loader batches and applies `fun` to each batch.

  When `num_workers` is `0`, batches are processed sequentially.
  When `num_workers` is greater than `0`, batches are processed concurrently
  using `Task.async_stream/3`.
  """</span>
  <span class="k">def</span> <span class="n">iterate</span><span class="p">(%{</span><span class="ss">stream:</span> <span class="n">stream</span><span class="p">,</span> <span class="ss">num_workers:</span> <span class="mi">0</span><span class="p">},</span> <span class="n">fun</span><span class="p">)</span> <span class="ow">when</span> <span class="n">is_function</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Enum</span><span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="n">fun</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">def</span> <span class="n">iterate</span><span class="p">(%{</span><span class="ss">stream:</span> <span class="n">stream</span><span class="p">,</span> <span class="ss">num_workers:</span> <span class="n">num_workers</span><span class="p">},</span> <span class="n">fun</span><span class="p">)</span>
      <span class="ow">when</span> <span class="n">is_function</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="ow">and</span> <span class="n">num_workers</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">stream</span>
    <span class="o">|&gt;</span> <span class="no">Task</span><span class="o">.</span><span class="n">async_stream</span><span class="p">(</span><span class="n">fun</span><span class="p">,</span> <span class="ss">max_concurrency:</span> <span class="n">num_workers</span><span class="p">,</span> <span class="ss">ordered:</span> <span class="no">false</span><span class="p">)</span>
    <span class="o">|&gt;</span> <span class="no">Stream</span><span class="o">.</span><span class="n">run</span><span class="p">()</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">prepare_dataset</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="no">true</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">dataset</span>
    <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">shuffle</span><span class="p">()</span>
    <span class="o">|&gt;</span> <span class="no">Stream</span><span class="o">.</span><span class="n">cycle</span><span class="p">()</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">prepare_dataset</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="no">false</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Stream</span><span class="o">.</span><span class="n">cycle</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">filter_incomplete_batches</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="no">true</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">)</span> <span class="k">do</span>
    <span class="no">Stream</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="o">&amp;</span><span class="p">(</span><span class="n">length</span><span class="p">(</span><span class="nv">&amp;1</span><span class="p">)</span> <span class="o">==</span> <span class="n">batch_size</span><span class="p">))</span>
  <span class="k">end</span>

  <span class="k">defp</span> <span class="n">filter_incomplete_batches</span><span class="p">(</span><span class="n">stream</span><span class="p">,</span> <span class="no">false</span><span class="p">,</span> <span class="n">_batch_size</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">stream</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>The documentation starts with the map that the function returns. The first key in that map is <code class="language-plaintext highlighter-rouge">:stream</code>, an infinite stream of batches. So what does an infinite stream really mean? If you check the private function <code class="language-plaintext highlighter-rouge">prepare_dataset</code>, you will see that it returns a call to <code class="language-plaintext highlighter-rouge">Stream.cycle(dataset)</code>. We know that <code class="language-plaintext highlighter-rouge">dataset</code> holds our chunks of input data, but the input data itself is not infinite; it is The Verdict story. Here is an example of what <code class="language-plaintext highlighter-rouge">Stream.cycle</code> does:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span> <span class="o">|&gt;</span> <span class="no">Stream</span><span class="o">.</span><span class="n">cycle</span><span class="p">()</span> <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>
<span class="c1"># =&gt; [1,2,3,1,2,3,1,2]</span>
</code></pre></div></div>
<p>As the name states, it just cycles through the input list, repeating it indefinitely. <code class="language-plaintext highlighter-rouge">Stream.cycle</code> does not materialize the repeated elements in memory; it is lazy and only produces the next element when asked. <code class="language-plaintext highlighter-rouge">Stream.cycle</code> always goes hand in hand with <code class="language-plaintext highlighter-rouge">Enum</code> function calls: <code class="language-plaintext highlighter-rouge">Enum.take(8)</code> takes the first 8 elements of the stream.<br />
But this will run indefinitely:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="p">[</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">]</span> <span class="o">|&gt;</span> <span class="no">Stream</span><span class="o">.</span><span class="n">cycle</span><span class="p">()</span> <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">each</span><span class="p">(</span><span class="k">fn</span> <span class="n">x</span> <span class="o">-&gt;</span> <span class="n">x</span> <span class="k">end</span><span class="p">)</span>
</code></pre></div></div>

<p>Because <code class="language-plaintext highlighter-rouge">Stream.cycle</code> generates an infinite stream.<br />
<code class="language-plaintext highlighter-rouge">batch_size</code> we already know from <code class="language-plaintext highlighter-rouge">create_dataloader_v1</code>: it is the number of samples per batch, i.e. how we chunk the <code class="language-plaintext highlighter-rouge">dataset</code>.<br />
<code class="language-plaintext highlighter-rouge">shuffle</code> randomizes the original order of the input <code class="language-plaintext highlighter-rouge">dataset</code> that we got from <code class="language-plaintext highlighter-rouge">chunk_dataset</code>, which helps us train a better LLM model.<br />
<code class="language-plaintext highlighter-rouge">num_workers</code> is used by the <code class="language-plaintext highlighter-rouge">iterate</code> function, which runs a function of our choice over the batches. If <code class="language-plaintext highlighter-rouge">num_workers</code> is greater than 0, the batches are processed concurrently.<br />
<code class="language-plaintext highlighter-rouge">drop_last</code> decides whether a trailing incomplete batch is kept or dropped; a short sketch below shows the chunking, and the test after it exercises both settings:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"data loader drop_last true drops trailing incomplete batch from tuple dataset"</span> <span class="k">do</span>
    <span class="n">dataset</span> <span class="o">=</span> <span class="p">[</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">,</span> <span class="mi">10899</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">,</span> <span class="mi">15632</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">15632</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">922</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">922</span><span class="p">,</span> <span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">568</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">568</span><span class="p">,</span> <span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">,</span> <span class="mi">1049</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1049</span><span class="p">,</span> <span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">,</span> <span class="mi">284</span><span class="p">])},</span>
      <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">284</span><span class="p">,</span> <span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">287</span><span class="p">])}</span>
    <span class="p">]</span>

    <span class="n">batch_size</span> <span class="o">=</span> <span class="mi">3</span>

    <span class="n">dataloader_keep_last</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">DataLoader</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="ss">batch_size:</span> <span class="n">batch_size</span><span class="p">,</span> <span class="ss">shuffle:</span> <span class="no">false</span><span class="p">,</span> <span class="ss">drop_last:</span> <span class="no">false</span><span class="p">)</span>

    <span class="n">dataloader_drop_last</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">DataLoader</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="ss">batch_size:</span> <span class="n">batch_size</span><span class="p">,</span> <span class="ss">shuffle:</span> <span class="no">false</span><span class="p">,</span> <span class="ss">drop_last:</span> <span class="no">true</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Enum</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">dataloader_keep_last</span><span class="o">.</span><span class="n">stream</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">,</span> <span class="mi">10899</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">,</span> <span class="mi">15632</span><span class="p">])}</span>
             <span class="p">],</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">15632</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">922</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">922</span><span class="p">,</span> <span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">568</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">568</span><span class="p">,</span> <span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">,</span> <span class="mi">1049</span><span class="p">])}</span>
             <span class="p">],</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1049</span><span class="p">,</span> <span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">,</span> <span class="mi">284</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">284</span><span class="p">,</span> <span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">,</span> <span class="mi">287</span><span class="p">])}</span>
             <span class="p">]</span>
           <span class="p">]</span>

    <span class="n">assert</span> <span class="no">Enum</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">dataloader_drop_last</span><span class="o">.</span><span class="n">stream</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span> <span class="o">==</span> <span class="p">[</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">,</span> <span class="mi">10899</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">,</span> <span class="mi">15632</span><span class="p">])}</span>
             <span class="p">],</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">15632</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">922</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">922</span><span class="p">,</span> <span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">568</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">568</span><span class="p">,</span> <span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">,</span> <span class="mi">1049</span><span class="p">])}</span>
             <span class="p">],</span>
             <span class="p">[</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">,</span> <span class="mi">1807</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">,</span> <span class="mi">10899</span><span class="p">])},</span>
               <span class="p">{</span><span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">]),</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">,</span> <span class="mi">15632</span><span class="p">])}</span>
             <span class="p">]</span>
           <span class="p">]</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>If we divide the number of dataset elements by <code class="language-plaintext highlighter-rouge">batch_size</code>, in this test we get 8/3: two full batches with a remainder of 2 elements. With <code class="language-plaintext highlighter-rouge">drop_last: true</code>, that incomplete final batch is dropped, so every batch the model sees has exactly <code class="language-plaintext highlighter-rouge">batch_size</code> elements.</p>
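<p>Putting it together, here is a small hedged sketch of driving the loader with <code class="language-plaintext highlighter-rouge">iterate/2</code> and two workers. The data is a placeholder list (any list satisfies the <code class="language-plaintext highlighter-rouge">is_list/1</code> guard), and bounding the stream with <code class="language-plaintext highlighter-rouge">Stream.take/2</code> is only there so the example terminates:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
# Placeholder data, not the tokenized tuples from the tests.
dataset = Enum.to_list(1..8)
loader = LlmScratch.DataLoader.new(dataset, batch_size: 3, shuffle: false, num_workers: 2)

# The loader's stream cycles forever, so we bound it before handing it to iterate/2,
# which runs the callback concurrently via Task.async_stream/3 because num_workers &gt; 0.
loader
|&gt; Map.update!(:stream, &amp;Stream.take(&amp;1, 10))
|&gt; LlmScratch.DataLoader.iterate(fn batch -&gt; IO.inspect(length(batch), label: "batch size") end)
</code></pre></div></div>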

<p>That is it for the dataloader, which helps organize training data. Now we move to token embeddings.</p>

<h3 id="token-embeddings">Token embeddings</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>input token IDs {batch, seq_len}
  -&gt; embedding lookup table {vocab_size, emb_dim}
  -&gt; token vectors {batch, seq_len, emb_dim}
</code></pre></div></div>

<p>To use token IDs for LLM training, we must convert them (remember Bruce’s <code class="language-plaintext highlighter-rouge">construct |&gt; reduce |&gt; convert</code>?) into embedding vectors. But what is an embedding vector, how do we create it, and why do we need it?<br />
  We need embeddings because an LLM is a neural network, and token IDs like 7 and 8 are just integers with no built-in notion of similarity. Embeddings use floating-point vectors so the model can learn relationships between tokens. We also use matrices for the same reason: to represent how tokens relate to one another.
  Next is the “what” and “how.” Python has <code class="language-plaintext highlighter-rouge">torch.nn.Embedding</code>. In Elixir, we have Nx and Axon. Meet <code class="language-plaintext highlighter-rouge">LlmScratch.EmbeddingNative</code>.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
  <span class="k">defmodule</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span> <span class="k">do</span>
  <span class="nv">@moduledoc</span> <span class="sd">"""
  Native embedding layer implemented with Nx and Axon.

  Provides a small API compatible with `LlmScratch.Embedding`:

    * `new/3`
    * `weight/1`
    * `forward/2`
    * `call/2`
  """</span>

  <span class="k">defstruct</span> <span class="p">[</span><span class="ss">:weight</span><span class="p">,</span> <span class="ss">:vocab_size</span><span class="p">,</span> <span class="ss">:embedding_dim</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">]</span>

  <span class="nv">@type</span> <span class="n">t</span> <span class="p">::</span> <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
          <span class="ss">weight:</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">(),</span>
          <span class="ss">vocab_size:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">embedding_dim:</span> <span class="n">pos_integer</span><span class="p">(),</span>
          <span class="ss">seed:</span> <span class="n">integer</span><span class="p">()</span> <span class="o">|</span> <span class="no">nil</span>
        <span class="p">}</span>

  <span class="nv">@spec</span> <span class="n">new</span><span class="p">(</span><span class="n">pos_integer</span><span class="p">(),</span> <span class="n">pos_integer</span><span class="p">(),</span> <span class="n">keyword</span><span class="p">())</span> <span class="p">::</span> <span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Creates a new embedding layer.

  ## Options

    * `:seed` - integer seed used to initialize weights deterministically.
    * `:weight` - optional pre-initialized weight tensor with shape
      `{vocab_size, embedding_dim}`.
  """</span>
  <span class="k">def</span> <span class="n">new</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">opts</span> <span class="p">\\</span> <span class="p">[])</span>
      <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">)</span> <span class="ow">and</span> <span class="n">vocab_size</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="ow">and</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">embedding_dim</span><span class="p">)</span> <span class="ow">and</span>
             <span class="n">embedding_dim</span> <span class="o">&gt;</span> <span class="mi">0</span> <span class="k">do</span>
    <span class="n">seed</span> <span class="o">=</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:seed</span><span class="p">)</span>

    <span class="n">weight</span> <span class="o">=</span>
      <span class="k">case</span> <span class="no">Keyword</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">opts</span><span class="p">,</span> <span class="ss">:weight</span><span class="p">)</span> <span class="k">do</span>
        <span class="p">%</span><span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="p">{}</span> <span class="o">=</span> <span class="n">provided_weight</span> <span class="o">-&gt;</span>
          <span class="n">expected_shape</span> <span class="o">=</span> <span class="p">{</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">}</span>

          <span class="k">if</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">)</span> <span class="o">!=</span> <span class="n">expected_shape</span> <span class="k">do</span>
            <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span>
                  <span class="s2">"expected :weight shape </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">expected_shape</span><span class="p">)</span><span class="si">}</span><span class="s2">, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">))</span><span class="si">}</span><span class="s2">"</span>
          <span class="k">end</span>

          <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">(</span><span class="n">provided_weight</span><span class="p">,</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

        <span class="no">nil</span> <span class="o">-&gt;</span>
          <span class="n">init_weight</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">seed</span><span class="p">)</span>
      <span class="k">end</span>

    <span class="p">%</span><span class="bp">__MODULE__</span><span class="p">{</span>
      <span class="ss">weight:</span> <span class="n">weight</span><span class="p">,</span>
      <span class="ss">vocab_size:</span> <span class="n">vocab_size</span><span class="p">,</span>
      <span class="ss">embedding_dim:</span> <span class="n">embedding_dim</span><span class="p">,</span>
      <span class="ss">seed:</span> <span class="n">seed</span>
    <span class="p">}</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">weight</span><span class="p">(</span><span class="n">t</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Returns the embedding weight matrix.

  The returned tensor has shape `{vocab_size, embedding_dim}`.
  """</span>
  <span class="k">def</span> <span class="n">weight</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{</span><span class="ss">weight:</span> <span class="n">weight</span><span class="p">}),</span> <span class="k">do</span><span class="p">:</span> <span class="n">weight</span>

  <span class="nv">@spec</span> <span class="n">forward</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Maps token ids to embedding vectors.

  `token_ids` is cast to `{:s, 64}` and used as row indices into the
  embedding matrix.
  """</span>
  <span class="k">def</span> <span class="n">forward</span><span class="p">(%</span><span class="bp">__MODULE__</span><span class="p">{</span><span class="ss">weight:</span> <span class="n">weight</span><span class="p">},</span> <span class="n">token_ids</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">token_ids</span>
    <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">as_type</span><span class="p">({</span><span class="ss">:s</span><span class="p">,</span> <span class="mi">64</span><span class="p">})</span>
    <span class="o">|&gt;</span> <span class="n">then</span><span class="p">(</span><span class="o">&amp;</span><span class="no">Nx</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="nv">&amp;1</span><span class="p">,</span> <span class="ss">axis:</span> <span class="mi">0</span><span class="p">))</span>
  <span class="k">end</span>

  <span class="nv">@spec</span> <span class="n">call</span><span class="p">(</span><span class="n">t</span><span class="p">(),</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">())</span> <span class="p">::</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Tensor</span><span class="o">.</span><span class="n">t</span><span class="p">()</span>
  <span class="nv">@doc</span> <span class="sd">"""
  Alias for `forward/2`.
  """</span>
  <span class="k">def</span> <span class="n">call</span><span class="p">(</span><span class="n">embedding</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">),</span> <span class="k">do</span><span class="p">:</span> <span class="n">forward</span><span class="p">(</span><span class="n">embedding</span><span class="p">,</span> <span class="n">token_ids</span><span class="p">)</span>

  <span class="k">defp</span> <span class="n">init_weight</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="n">seed</span><span class="p">)</span> <span class="k">do</span>
    <span class="n">key</span> <span class="o">=</span>
      <span class="k">case</span> <span class="n">seed</span> <span class="k">do</span>
        <span class="no">nil</span> <span class="o">-&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="no">System</span><span class="o">.</span><span class="n">unique_integer</span><span class="p">([</span><span class="ss">:positive</span><span class="p">]))</span>
        <span class="n">int</span> <span class="ow">when</span> <span class="n">is_integer</span><span class="p">(</span><span class="n">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="n">int</span><span class="p">)</span>
        <span class="n">other</span> <span class="o">-&gt;</span> <span class="k">raise</span> <span class="no">ArgumentError</span><span class="p">,</span> <span class="s2">"seed must be an integer or nil, got: </span><span class="si">#{</span><span class="n">inspect</span><span class="p">(</span><span class="n">other</span><span class="p">)</span><span class="si">}</span><span class="s2">"</span>
      <span class="k">end</span>

    <span class="no">Axon</span><span class="o">.</span><span class="no">Initializers</span><span class="o">.</span><span class="n">uniform</span><span class="p">(</span><span class="ss">scale:</span> <span class="mf">1.0</span><span class="p">)</span><span class="o">.</span><span class="p">({</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">},</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">},</span> <span class="n">key</span><span class="p">)</span>
  <span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>

<p>This module returns a struct that includes vocabulary size, embedding dimension, seed, and a weight tensor.<br />
The <code class="language-plaintext highlighter-rouge">new</code> function requires vocabulary size and embedding dimension, while <code class="language-plaintext highlighter-rouge">weight</code> and <code class="language-plaintext highlighter-rouge">seed</code> are optional. If we provide <code class="language-plaintext highlighter-rouge">weight</code>, it must be an <code class="language-plaintext highlighter-rouge">Nx.Tensor</code> with shape <code class="language-plaintext highlighter-rouge">{vocab_size, embedding_dim}</code>. Then <code class="language-plaintext highlighter-rouge">Nx.as_type</code> casts the weight values to <code class="language-plaintext highlighter-rouge">float32</code>.<br />
If we do not provide <code class="language-plaintext highlighter-rouge">weight</code>, the private <code class="language-plaintext highlighter-rouge">init_weight</code> function is used. First, it calculates a random key from <code class="language-plaintext highlighter-rouge">seed</code>. If <code class="language-plaintext highlighter-rouge">seed</code> is not set, it uses a random positive integer. Why do we need a random key? Because we start with random <code class="language-plaintext highlighter-rouge">weight</code> values for token IDs. <code class="language-plaintext highlighter-rouge">LlmScratch.Random.manual_seed(int)</code> is based on <code class="language-plaintext highlighter-rouge">Nx.Random.key(int)</code>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span>
<span class="c1">#Nx.Tensor&lt;</span>
  <span class="n">u32</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span>
  <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">123</span><span class="p">]</span>
<span class="o">&gt;</span>
<span class="n">iex</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span>
<span class="p">{</span><span class="mi">2</span><span class="p">}</span>
<span class="n">iex</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span><span class="o">&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="no">Random</span><span class="o">.</span><span class="n">key</span><span class="p">(</span><span class="mi">123</span><span class="p">)</span> <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">rank</span>
<span class="mi">1</span>
</code></pre></div></div>

<p>So the random key is an <code class="language-plaintext highlighter-rouge">Nx.Tensor</code> (because we do math operations with other <code class="language-plaintext highlighter-rouge">Nx.Tensor</code>s), with shape <code class="language-plaintext highlighter-rouge">{2}</code> and rank 1, where values are 0 and <code class="language-plaintext highlighter-rouge">seed</code>.</p>
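<p>As a small aside (this snippet is not part of the module), the key is what makes initialization reproducible: the same key always produces the same "random" tensor, which is why a fixed seed gives identical embedding weights on every run:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
key = Nx.Random.key(123)

# Both calls receive the same key, so they return identical tensors.
{t1, _next_key} = Nx.Random.uniform(key, shape: {2, 3})
{t2, _next_key} = Nx.Random.uniform(key, shape: {2, 3})

Nx.equal(t1, t2) |&gt; Nx.all() |&gt; Nx.to_number()
# =&gt; 1
</code></pre></div></div>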

<p>The final call is <code class="language-plaintext highlighter-rouge">Axon.Initializers.uniform(scale: 1.0)</code>, which generates an initializer function that fills weights with random values from a uniform distribution. The <code class="language-plaintext highlighter-rouge">weight</code> shape is <code class="language-plaintext highlighter-rouge">{vocab_size, embedding_dim}</code>, values are cast to <code class="language-plaintext highlighter-rouge">float32</code>, and the random key controls initialization.<br />
<code class="language-plaintext highlighter-rouge">forward</code> maps token IDs to weights. Let’s look at a test to see how this works:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"Elixir-style EmbeddingNative with manual_seed (torch.nn.Embedding equivalent)"</span> <span class="k">do</span>
    <span class="n">vocab_size</span> <span class="o">=</span> <span class="mi">6</span>
    <span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">3</span>

    <span class="n">embedding</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="n">assert</span> <span class="n">embedding</span><span class="o">.</span><span class="n">vocab_size</span> <span class="o">==</span> <span class="n">vocab_size</span>
    <span class="n">assert</span> <span class="n">embedding</span><span class="o">.</span><span class="n">embedding_dim</span> <span class="o">==</span> <span class="n">embedding_dim</span>

    <span class="n">weight</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">weight</span><span class="p">(</span><span class="n">embedding</span><span class="p">)</span>

    <span class="n">expected_weight</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span><span class="mf">0.3373701572418213</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.1777772158384323</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16895616054534912</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">0.9177640080451965</span><span class="p">,</span> <span class="mf">1.5809690952301025</span><span class="p">,</span> <span class="mf">1.3010399341583252</span><span class="p">],</span>
          <span class="p">[</span><span class="mf">1.275301218032837</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.20095309615135193</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16056379675865173</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">0.40148791670799255</span><span class="p">,</span> <span class="mf">0.966571569442749</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.1481444835662842</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">1.158868670463562</span><span class="p">,</span> <span class="mf">0.32547101378440857</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.6315054297447205</span><span class="p">],</span>
          <span class="p">[</span><span class="o">-</span><span class="mf">2.839993953704834</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.7848533391952515</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.4095723628997803</span><span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="n">expected_weight</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">),</span>
           <span class="s2">"Embedding weights should match PyTorch's weights exactly with seed=123"</span>

    <span class="n">row_3</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">slice</span><span class="p">(</span><span class="n">weight</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">])</span>

    <span class="n">expected_row_3</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="o">-</span><span class="mf">0.40148791670799255</span><span class="p">,</span> <span class="mf">0.966571569442749</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.1481444835662842</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">row_3</span><span class="p">,</span> <span class="n">expected_row_3</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">)</span>

    <span class="n">input_ids</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([[</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">1</span><span class="p">]],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:s</span><span class="p">,</span> <span class="mi">64</span><span class="p">})</span>
    <span class="n">embeddings</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">embedding</span><span class="p">,</span> <span class="n">input_ids</span><span class="p">)</span>

    <span class="n">expected_forward</span> <span class="o">=</span>
      <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span>
        <span class="p">[</span>
          <span class="p">[</span>
            <span class="p">[</span><span class="mf">1.275301218032837</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.20095309615135193</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.16056379675865173</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">0.40148791670799255</span><span class="p">,</span> <span class="mf">0.966571569442749</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.1481444835662842</span><span class="p">],</span>
            <span class="p">[</span><span class="o">-</span><span class="mf">2.839993953704834</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.7848533391952515</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.4095723628997803</span><span class="p">],</span>
            <span class="p">[</span><span class="mf">0.9177640080451965</span><span class="p">,</span> <span class="mf">1.5809690952301025</span><span class="p">,</span> <span class="mf">1.3010399341583252</span><span class="p">]</span>
          <span class="p">]</span>
        <span class="p">],</span>
        <span class="ss">type:</span> <span class="p">{</span><span class="ss">:f</span><span class="p">,</span> <span class="mi">32</span><span class="p">}</span>
      <span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">embeddings</span><span class="p">,</span> <span class="n">expected_forward</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">),</span>
           <span class="s2">"Embeddings from forward pass should match expected values exactly"</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>In this test, for simplicity, we have a vocabulary of six tokens, with possible values <code class="language-plaintext highlighter-rouge">[0, 1, 2, 3, 4, 5]</code>. We create an embedding with shape <code class="language-plaintext highlighter-rouge">{6, 3}</code>, and <code class="language-plaintext highlighter-rouge">expected_weight</code> is an <code class="language-plaintext highlighter-rouge">Nx.tensor</code> with random values based on seed 123. We do not assert with <code class="language-plaintext highlighter-rouge">==</code>; we use <code class="language-plaintext highlighter-rouge">Nx.all_close</code> with a threshold of <code class="language-plaintext highlighter-rouge">1.0e-6</code> (<code class="language-plaintext highlighter-rouge">0.000001</code>), so values are considered equal if they differ only within that tolerance. The reason is that the weights are floating-point values, so bit-exact equality is too strict.<br />
 <code class="language-plaintext highlighter-rouge">Nx.slice</code> is a handy way to extract data from a weight <code class="language-plaintext highlighter-rouge">Nx.Tensor</code>. We extract from the fourth row and first column, taking one row and three columns.<br />
 Here is an example of using <code class="language-plaintext highlighter-rouge">forward</code>. We first create token IDs as an <code class="language-plaintext highlighter-rouge">Nx.Tensor</code> of type <code class="language-plaintext highlighter-rouge">{:s, 64}</code>, and then call <code class="language-plaintext highlighter-rouge">forward</code> on the embedding. The result is an <code class="language-plaintext highlighter-rouge">Nx.Tensor</code> that contains the 3rd, 4th, 6th, and 2nd rows from the <code class="language-plaintext highlighter-rouge">weight</code> tensor, because those are the ID values (zero-based indexing). Under the hood, <code class="language-plaintext highlighter-rouge">forward</code> uses <code class="language-plaintext highlighter-rouge">Nx.take</code>.</p>
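
<p>To make the last point concrete, here is a minimal sketch of the row-lookup idea behind <code class="language-plaintext highlighter-rouge">Nx.take</code>, using a toy weight tensor instead of the real embedding layer:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy {6, 3} weight tensor; in the test above it comes from the embedding layer.
weight = Nx.iota({6, 3}, type: {:f, 32})

# Token IDs select rows of the weight tensor, one row per ID.
input_ids = Nx.tensor([[2, 3, 5, 1]], type: {:s, 64})
embeddings = Nx.take(weight, input_ids)

# embeddings has shape {1, 4, 3}: rows 2, 3, 5, and 1 of the weight tensor.
</code></pre></div></div>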

<h3 id="encoding-token-positions">Encoding token positions</h3>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>token embeddings (same token -&gt; same vector)
                +
positional embeddings (position 0..n-1)
                |
                v
order-aware embeddings for the transformer
</code></pre></div></div>

<p>We are almost done. We have token embedding vectors, but one important ingredient is still missing: token position (order) in the input text. This means that for our previous token IDs, <code class="language-plaintext highlighter-rouge">[2,3,5,1]</code>, token ID <code class="language-plaintext highlighter-rouge">3</code> will always be mapped to the same row in the weight tensor, regardless of where it appears in the sequence. Here is a ChatGPT 5.3 explanation:</p>
<blockquote>
  <p><strong>Note</strong>:</p>
</blockquote>

<blockquote>
  <p>It’s a problem because language meaning depends on word order, not just which words appear.
If the embedding layer maps a token ID to the same vector everywhere, then:</p>

  <ul>
    <li>“dog bites man” and “man bites dog” start with the same set of token vectors (just reordered)</li>
    <li>The model has no built-in way to know which token came first, second, etc.</li>
    <li>So it loses syntax/order information (who did what to whom)</li>
  </ul>

  <p>In other words, token embeddings alone encode identity (“this is the word dog”), but not position (“this dog is at position 3”).</p>

  <p>That’s why transformers add positional encodings/positional embeddings: to combine</p>
  <ul>
    <li>token meaning</li>
    <li>token position<br />
so the model can distinguish sequences with the same tokens in different orders.</li>
  </ul>
</blockquote>

<p>We have absolute and relative positional embeddings. With absolute positional embeddings, we encode the position of each token. With relative positional embeddings, we encode <code class="language-plaintext highlighter-rouge">how far apart</code> tokens are from each other. OpenAI ChatGPT uses absolute positional embeddings. Let’s add positional embeddings in a more realistic example:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">test</span> <span class="s2">"positional embedding"</span> <span class="k">do</span>
    <span class="n">previous_backend</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">()</span>
    <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">(</span><span class="no">EXLA</span><span class="o">.</span><span class="no">Backend</span><span class="p">)</span>
    <span class="n">on_exit</span><span class="p">(</span><span class="k">fn</span> <span class="o">-&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">default_backend</span><span class="p">(</span><span class="n">previous_backend</span><span class="p">)</span> <span class="k">end</span><span class="p">)</span>

    <span class="n">vocab_size</span> <span class="o">=</span> <span class="mi">50257</span>
    <span class="n">embedding_dim</span> <span class="o">=</span> <span class="mi">256</span>

    <span class="n">token_embeding_layer</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="n">vocab_size</span><span class="p">,</span> <span class="n">embedding_dim</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>

    <span class="p">{</span><span class="ss">:ok</span><span class="p">,</span> <span class="n">file_content</span><span class="p">}</span> <span class="o">=</span> <span class="no">File</span><span class="o">.</span><span class="n">read</span><span class="p">(</span><span class="s2">"the-verdict.txt"</span><span class="p">)</span>

    <span class="n">dataloader</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">GptDatasetV1</span><span class="o">.</span><span class="n">create_dataloader_v1</span><span class="p">(</span>
        <span class="ss">raw_text:</span> <span class="n">file_content</span><span class="p">,</span>
        <span class="ss">batch_size:</span> <span class="mi">8</span><span class="p">,</span>
        <span class="ss">max_length:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="ss">stride:</span> <span class="mi">4</span><span class="p">,</span>
        <span class="ss">shuffle:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">drop_last:</span> <span class="no">false</span><span class="p">,</span>
        <span class="ss">num_workers:</span> <span class="mi">0</span>
      <span class="p">)</span>

    <span class="n">batch_0</span> <span class="o">=</span> <span class="n">dataloader</span><span class="o">.</span><span class="n">stream</span> <span class="o">|&gt;</span> <span class="no">Enum</span><span class="o">.</span><span class="n">at</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">inputs_list</span> <span class="o">=</span> <span class="no">Enum</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">batch_0</span><span class="p">,</span> <span class="k">fn</span> <span class="p">{</span><span class="n">input</span><span class="p">,</span> <span class="n">_target</span><span class="p">}</span> <span class="o">-&gt;</span> <span class="n">input</span> <span class="k">end</span><span class="p">)</span>

    <span class="c1"># Stack list of tensors into a single tensor: [tensor1, tensor2, ...] -&gt; tensor with shape [batch_size, seq_len]</span>
    <span class="n">inputs</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">inputs_list</span><span class="p">)</span>

    <span class="n">expected_inputs</span> <span class="o">=</span>
      <span class="p">[</span>
        <span class="p">[</span><span class="mi">40</span><span class="p">,</span> <span class="mi">367</span><span class="p">,</span> <span class="mi">2885</span><span class="p">,</span> <span class="mi">1464</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">1807</span><span class="p">,</span> <span class="mi">3619</span><span class="p">,</span> <span class="mi">402</span><span class="p">,</span> <span class="mi">271</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">10899</span><span class="p">,</span> <span class="mi">2138</span><span class="p">,</span> <span class="mi">257</span><span class="p">,</span> <span class="mi">7026</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">15632</span><span class="p">,</span> <span class="mi">438</span><span class="p">,</span> <span class="mi">2016</span><span class="p">,</span> <span class="mi">257</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">922</span><span class="p">,</span> <span class="mi">5891</span><span class="p">,</span> <span class="mi">1576</span><span class="p">,</span> <span class="mi">438</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">568</span><span class="p">,</span> <span class="mi">340</span><span class="p">,</span> <span class="mi">373</span><span class="p">,</span> <span class="mi">645</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">1049</span><span class="p">,</span> <span class="mi">5975</span><span class="p">,</span> <span class="mi">284</span><span class="p">,</span> <span class="mi">502</span><span class="p">],</span>
        <span class="p">[</span><span class="mi">284</span><span class="p">,</span> <span class="mi">3285</span><span class="p">,</span> <span class="mi">326</span><span class="p">,</span> <span class="mi">11</span><span class="p">]</span>
      <span class="p">]</span>
      <span class="o">|&gt;</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">(</span><span class="ss">type:</span> <span class="p">{</span><span class="ss">:s</span><span class="p">,</span> <span class="mi">32</span><span class="p">})</span>

    <span class="c1"># Verify inputs match expected output exactly</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">all_close</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">expected_inputs</span><span class="p">,</span> <span class="ss">atol:</span> <span class="mf">1.0e-6</span><span class="p">),</span>
           <span class="s2">"Input embeddings should match expected values exactly"</span>

    <span class="c1"># Assert the size/shape of inputs is [8, 4]</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">inputs</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">}</span>

    <span class="c1"># Get token embeddings: shape [8, 4, 256]</span>
    <span class="n">token_embeddings</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">token_embeding_layer</span><span class="p">,</span> <span class="n">inputs</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">token_embeddings</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">256</span><span class="p">}</span>

    <span class="c1"># Create positional embedding layer: vocab_size=4 (positions 0,1,2,3), embedding_dim=256</span>
    <span class="n">positional_embedding_layer</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">new</span><span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">256</span><span class="p">,</span> <span class="ss">seed:</span> <span class="mi">123</span><span class="p">)</span>
    <span class="n">positional_embedding_weights</span> <span class="o">=</span> <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">weight</span><span class="p">(</span><span class="n">positional_embedding_layer</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">positional_embedding_weights</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">4</span><span class="p">,</span> <span class="mi">256</span><span class="p">}</span>

    <span class="c1"># Create positional indices: [0, 1, 2, 3] for each position in the sequence</span>
    <span class="c1"># Shape: [4] -&gt; expand to [1, 4] -&gt; broadcast to [8, 4]</span>
    <span class="n">positional_indices</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">tensor</span><span class="p">([</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">],</span> <span class="ss">type:</span> <span class="p">{</span><span class="ss">:s</span><span class="p">,</span> <span class="mi">64</span><span class="p">})</span>
    <span class="n">positional_indices_batch</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">broadcast</span><span class="p">(</span><span class="no">Nx</span><span class="o">.</span><span class="n">new_axis</span><span class="p">(</span><span class="n">positional_indices</span><span class="p">,</span> <span class="mi">0</span><span class="p">),</span> <span class="p">{</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">})</span>

    <span class="c1"># Get positional embeddings: shape [8, 4, 256]</span>
    <span class="n">positional_embeddings</span> <span class="o">=</span>
      <span class="no">LlmScratch</span><span class="o">.</span><span class="no">EmbeddingNative</span><span class="o">.</span><span class="n">forward</span><span class="p">(</span><span class="n">positional_embedding_layer</span><span class="p">,</span> <span class="n">positional_indices_batch</span><span class="p">)</span>

    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">positional_embeddings</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">256</span><span class="p">}</span>

    <span class="c1"># Add token embeddings and positional embeddings: shape [8, 4, 256]</span>
    <span class="n">embeddings_sum</span> <span class="o">=</span> <span class="no">Nx</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">token_embeddings</span><span class="p">,</span> <span class="n">positional_embeddings</span><span class="p">)</span>
    <span class="n">assert</span> <span class="no">Nx</span><span class="o">.</span><span class="n">shape</span><span class="p">(</span><span class="n">embeddings_sum</span><span class="p">)</span> <span class="o">==</span> <span class="p">{</span><span class="mi">8</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">256</span><span class="p">}</span>
  <span class="k">end</span>
</code></pre></div></div>

<p>We first set the EXLA backend because Elixir is slow for this type of computation without it. On my MacBook Pro M3, this test takes about 71 seconds without EXLA, but under a second with EXLA. Then we use the full vocabulary size of the BPE tokenizer and an embedding dimension of 256. GPT-3 uses 12,288 dimensions. We load <em>The Verdict</em> story with a batch size of 8, a maximum length of 4, and a stride of 4. Then we extract the inputs from the first batch (<code class="language-plaintext highlighter-rouge">batch_0</code>). To get token embeddings, we pass the inputs through the embedding layer, which gives us a tensor with shape <code class="language-plaintext highlighter-rouge">{8,4,256}</code>.<br />
  Next, we create absolute positional embeddings. The positional “vocabulary” size is 4 (one entry per position in the sequence), and the number of dimensions is the same: 256. The positional inputs are zero-based indices, <code class="language-plaintext highlighter-rouge">[0,1,2,3]</code>. <code class="language-plaintext highlighter-rouge">Nx.broadcast</code> is a handy way to turn these positional indices into an <code class="language-plaintext highlighter-rouge">Nx.Tensor</code> with shape <code class="language-plaintext highlighter-rouge">{8, 4}</code>.<br />
  The last thing to do is add these two tensors with <code class="language-plaintext highlighter-rouge">Nx.add</code>. Now we are ready to build the attention mechanism!</p>
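
<p>Here is a small, self-contained sketch (values taken from the test above, not the real dataloader) of how those positional indices can be built and broadcast:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Positions 0..3, one per slot in the sequence.
positions = Nx.iota({4}, type: {:s, 64})

# Add a batch axis and broadcast to the batch shape {8, 4};
# every row of the result is [0, 1, 2, 3].
positional_indices_batch = Nx.broadcast(Nx.new_axis(positions, 0), {8, 4})

# The same positional rows are then looked up for every sequence in the batch,
# so the position information is identical across the batch.
</code></pre></div></div>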

<h3 id="conclusion">Conclusion</h3>

<p>We finally got our hands dirty with Elixir code, and we survived because we were able to do everything in Elixir instead of Python. EXLA is supported by Nx out of the box, which helps with heavy computations, because this is one area where Elixir does not shine. I reread this post, and now I understand the codebase and the theory better. Next, I want to reread Giles’s blog posts about Chapter 2 (<a href="https://www.gilesthomas.com/2024/12/llm-from-scratch-2">part 2</a>, <a href="https://www.gilesthomas.com/2024/12/llm-from-scratch-3">part 3</a>) to fill possible gaps in my knowledge and answer the Chapter 2 questions from the workbook.</p>]]></content><author><name></name></author><category term="llm-from-scratch" /><category term="ai" /><category term="tutorials" /><category term="llm" /><category term="nlp" /><category term="text-prep" /><summary type="html"><![CDATA[An Elixir/Nx walkthrough of preparing text for LLM training: tokenization, token IDs, BPE, sliding windows, token embeddings, and positional embeddings.]]></summary></entry><entry><title type="html">Foundation of Taking Testing Seriously</title><link href="https://karlosmid.com/2026/01/foundation-taking-testing-seriously/" rel="alternate" type="text/html" title="Foundation of Taking Testing Seriously" /><published>2026-01-23T00:00:00+00:00</published><updated>2026-01-23T00:00:00+00:00</updated><id>https://karlosmid.com/2026/01/foundation-taking-testing-seriously</id><content type="html" xml:base="https://karlosmid.com/2026/01/foundation-taking-testing-seriously/"><![CDATA[<p><img src="/assets/foundations.jpg" alt="Cover of Isaac Asimov’s Foundation (1951)" /></p>

<h3 id="tldr">TL;DR</h3>

<p>In the <a href="/2025/12/taking-testing-seriously-chapter-1/">previous post on Taking Testing Seriously Chapter 1</a>, the authors explained why we need another book about software testing. The reason is very simple: a book like that has never been written about software testing. It sets the tone and philosophy so you can decide if this approach is for you or not. Let’s move on to Chapter 2, The Foundation. Again, I will move through my <a href="https://botie.app">Botie.app</a> notes to deepen my understanding of the foundations.</p>

<h3 id="el3e">el3e</h3>

<p>The authors stress again that <em>Testing is a human, social, and technical activity</em>. To be honest, we all start with the technical activity, forget the other two, and because of that miss a lot of bugs. It is human because most of our colleagues are human (I also have two LLM colleagues), and social because humans are social beings.</p>

<p>As it should, the foundation states the definition of software testing. It is the el3e mnemonic:</p>

<p><em>Software testing is the process of evaluating the product by learning about it by experimenting, experiencing and exploring with it.</em></p>

<p>Isn’t this more fun to do than to write browser automation scripts!?</p>

<h3 id="the-evidence">The evidence</h3>

<p>We are gathering evidence about the product under test. To determine if this evidence matters to the client, we need to apply the CRC pattern. CRC is my addition, not James and Michael’s. CRC comes from Elixir; it was developed and popularized by Bruce Tate, a well-known Elixir instructor. If you’re curious to learn more about Bruce Tate’s teachings or Elixir in general, check out <a href="https://grox.io">Groxio</a>, his training site. CRC stands for Construct, Reduce, Convert. We first need to construct the evidence, telling a story about what we found in the product. Then we need to reduce it to the bone so it contains only the essential information we can sell to the client. Convert is the final rewrite, in terms of the client’s business domain (this risk could be realized).</p>

<h3 id="learning-opportunity">Learning opportunity</h3>

<p>The authors state that each product failure is a learning opportunity for a software tester. I would add it is the same for the developer. If you do not have the urge to open champagne when you find a product issue, then maybe software testing is not the right career choice. Here is my example. I am developing a feature that integrates a client product with another product. During the connect phase, we get, via REST API, a number of objects from the other product. My integration handles hundreds of objects, while others handle tens. This integration happens in the background, and it takes time—I can see hundreds of requests in the logs. I kept that in the back of my mind. The application was working, but it strangely changed data in the UI after the logs were done. QA reported one issue that I could not replicate at first because I was waiting for the load to be done. QA did not know about those loads and used the UI immediately. Keeping in mind the loading time for the logs, I managed to replicate and fix the async issue. I changed the order of the data load so we first load the data that is immediately used in the UI. I have never dismissed those REST API requests in the logs since they finally fit the application model.</p>

<h3 id="checking-vs-testing">Checking vs testing</h3>

<p>I have witnessed a lot of bad blood spilled in conversations about this topic. Here is what James and Michael say about it. In programming, you need to understand both the business and machine sides of the story. Memory management, networking, and disk are parts of the machine story. In the end, you are programming a machine. Programmers perform checking, where checking is the mechanistic (automated) process of verifying propositions about the product. Those propositions can only be true or false; there is never a third option. And now the important thing about quality.</p>

<p><em>Quality is an opinion, not a fact; quality cannot be verified.</em></p>

<p>If quality is an opinion, how could we assure or assist quality (QA)? We cannot quantify quality, because it is <em>value to a person that matters (Weinberg)</em>. Testing cannot be automated, but checking can. Testing is an <em>encounter with a product, an activity, not an artifact</em>. Testing is not an idea. It is everything a tester does as part of their role to fulfill the mission of testing.</p>

<h3 id="testing">Testing</h3>

<p>Deep testing uncovers bugs that are challenging to detect and issues that truly matter. Preparing for deep testing requires more time and effort, making it a more demanding activity, while shallow testing typically exposes the obvious bugs. Oracles help us decide whether something is actually a bug or not, and in deep testing, we have to be especially thoughtful when designing these oracles.</p>

<p>You probably recognize this pattern: we often start with shallow testing and then proceed to deep testing. A bug that matters is one that ties directly to significant business risk.</p>

<p>For testers, the product is often a mystery—we are searching for problems, not simply confirming that things work. Our real aim is to uncover business risks.</p>

<p>Testers use heuristics, not rigid algorithms. Clear terminology is crucial because abbreviations can cause confusion. Testing is an event: testers work on behalf of a client and should do their best to simulate, anticipate, and inhabit the client’s perspective. In <a href="https://www.moderntesting.org/">Modern Testing</a>, Alan Page states that only the customer can truly evaluate the quality of a product. This isn’t a contradiction: our job is to prepare the product for real customers, ensuring that when it reaches them for the final test, it will meet or exceed their expectations. Some argue that it is impossible to test exactly like a client because clients are driven by business success and will use the application in ways testers might never imagine. So what counts as fake testing? Here are a few signals:</p>

<ul>
  <li>creating test cases before seeing the product</li>
  <li>blindly following a script and dismissing all side findings</li>
  <li>reporting only pass/fail rates</li>
  <li>constraining testing to whatever the test tool supports</li>
  <li>using a process we do not understand</li>
  <li>pushing reported bugs so aggressively that it looks like we are angling to take over as project lead</li>
</ul>

<p>We need to be humble about our testing expertise (I still consider myself an evolving software developer and tester who can learn new things about software testing from anybody).</p>

<p><em>I have expertise, instead of I am expert.</em></p>

<h3 id="the-model">The Model</h3>

<p>So how do we test? We create a model of a product. Two different testers would have two different models of the same product. We try to answer the question: how good is the product? A flowchart diagram or product requirements are examples of a product model. How do we create those models? We do black/white/gray box testing. We explore the product using one of those techniques, and based on the findings (evidence), we create a better product model. And there is a famous testing fallacy that we do not need to do integration testing because we have unit testing. Do not fall for that—black/white/gray box testing can uncover a lot of interesting product evidence. If you run out of testing ideas while you try to simulate the product user, page 35 has a list of sixteen types of product user.</p>

<h3 id="jerry-weinberg">Jerry Weinberg</h3>

<p><em>Quality is value to some person.</em></p>

<p>Bragging time. I had the pleasure of meeting Jerry in the flesh at the STP testing conference in 2010, where he signed my copy of <em>Perfect Software and Other Illusions About Testing</em>. Michael and James’s work is deeply inspired by Jerry’s work. When you ask them what book they recommend on software testing, the answer is “Anything from Jerry Weinberg.” Remember, quality is value, and as we have already stated, value cannot be quantified into numbers, and because of that it cannot be measured (verified). So we gather product evidence, report it, and then the user decides whether it is valuable.</p>

<p>Then we have the <em>ludic fallacy</em>. This is when product quality is defined as a set of predefined tests that must pass. If those pass, we claim the product has quality. Or there could be a set of requirements. And then we stop listening to the users of the product.</p>

<h3 id="risk">Risk</h3>

<p>Risk is the danger of the product doing something bad or failing to do something good that it is supposed to do. Risk potential directs our testing because we cannot test all product variations. Risk could be unknown, identified, or tested. Unknown risks are the hardest—we interact with the product and then must figure out which evidence could be a risk. For an identified risk we have direction on how to test. “Tested” is when we have evidence that the risk could not be fulfilled.</p>

<h3 id="context">Context</h3>

<p>It is common that for any question you ask a context-driven tester, the answer is, “It depends on the context.” There are three principles of context-driven testing:</p>

<ul>
  <li>the value of any practice depends on its context</li>
  <li>there are good practices in the context, but there are no best practices</li>
  <li>people who work together are the most important part of any project’s context</li>
</ul>

<p>Simple as that. And yes, project context changes over time—especially in startups when pivoting is a normal modus operandi. Every tester must have a testing mission. What is the purpose of our testing? That mission must be agreed upon with the client. Mixing other roles into testing is not good; the reason is simple: testing is a full-time job. But we can ask for testing help from other teammates. In that case, we must take care that testing is convergent, not divergent.</p>

<h3 id="the-knowledge">The knowledge</h3>

<p>We have tacit and explicit knowledge. Tacit knowledge is for solving problems, and explicit knowledge is when we can explain those problems. Exploratory testing is moving through a space and learning about it. We use the product, and we learn about it by making choices when we encounter the unknown. <em>If others make choices for us, then they are explorers</em>; we are just a vessel. Opponents of exploratory testing say that there is no formality in it, but they are wrong. Exploratory testing moves from freedom to formality as we discover more product patterns. We cannot get something right the first time; this is why we iterate. At the beginning of the project we have confusion, and this is expected. We must know how to frame our testing. We must explain how our testing is related to business needs. In general systems thinking (Jerry Weinberg), the trick is to know what and when to ignore findings about the product. How do testing and strategy relate to each other? Testing produces a story of product status, and strategy is the story of how we are going to achieve that testing.</p>

<h3 id="the-boundary-bug">The boundary bug</h3>

<p>I found one instance of a boundary bug: the last footnote only has the number, but the footnote text is missing.</p>

<p><img src="/assets/boundary_bug.jpeg" alt="Boundary bug" /></p>

<h3 id="the-heuristic">The heuristic</h3>

<p><em>A heuristic is a fallible means of solving a problem.</em> For example, my sink has a 9 cm drain hole. In order to clean the sink trap, I need a big screwdriver. My heuristic was that my biggest screwdriver would be enough. That turned out not to be true; my heuristic failed.</p>

<p>Solving a problem can involve many things—for example, understanding the situation, discovering the problem, or making a decision. Our job is to craft heuristics that are less likely to fail. Now we can also explain best practices vs. good practices. A “best practice” is when some authority claims, without proof, that a particular practice should be used to solve a problem. Heuristics tell a different story: here is a practice, here is how it works, why it works, and how it may not work—so be careful. You need skill to apply heuristics, and you need practice to build that skill. Working with heuristics is how you develop judgement in your context.</p>

<h3 id="the-soul-of-rst">The soul of rst</h3>

<p>rst teaches us about “enoughness”—how to recognize that we can stop testing. Testers naturally have distance from the product (they did not create it), which often makes them better at spotting issues than developers. We should focus when we feel confused and defocus when we feel frustrated. During focused testing, change one variable at a time; during defocused testing, change many. Focus reduces chaos and confusion, while defocus helps us uncover new ideas.</p>

<p>For rst, experience is what happens to the tester during the testing activity. The challenge is capturing that experience, which is different from merely collecting screencasts or logs. Sensemaking is a social activity where we interpret data, but unconscious bias can cause us to miss bugs. Testing is an experiment—we seek surprises in the product. If we already know what will happen, it’s just a demonstration. We aim for product knowledge first, then surprises. Because any metric can be gamed, assessment relies on thoughtful, open-ended questions about quality. That’s how rst evaluates testing, not by counting bugs.</p>

<h3 id="closing">Closing</h3>

<p>Just as Asimov’s <em>Foundation</em> collected the Empire’s knowledge, this chapter gathers the essential rst building blocks. And guess what—we start testing in the next chapters.</p>]]></content><author><name></name></author><category term="rapid software testing" /><category term="book-notes" /><category term="testing" /><category term="rst" /><summary type="html"><![CDATA[My foundation plan for studying Taking Testing Seriously, the foundations, so the lessons from James Bach and Michael Bolton become habits instead of inspirational quotes.]]></summary></entry><entry><title type="html">Chapter 1 recap of Build LLM from Scratch</title><link href="https://karlosmid.com/2026/01/chapter-1-recap-build-llm-from-scratch/" rel="alternate" type="text/html" title="Chapter 1 recap of Build LLM from Scratch" /><published>2026-01-20T00:00:00+00:00</published><updated>2026-01-20T00:00:00+00:00</updated><id>https://karlosmid.com/2026/01/chapter-1-recap-build-llm-from-scratch</id><content type="html" xml:base="https://karlosmid.com/2026/01/chapter-1-recap-build-llm-from-scratch/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>I reviewed every Chapter 1 study note I published and reread Giles Thomas’s first “LLM from Scratch” blog to cement the fundamentals, capturing several new takeaways from the chapter.</p>

<h2 id="integration-test-of-llm-on-my-blog-posts">Integration test of LLM on my blog posts</h2>

<p>Recently I switched from WordPress to Jekyll on GitHub Pages, which gave me a more efficient workflow. I now draft posts in Cursor, use Cursor or Codex agents, and no longer copy and paste to the ChatGPT UI. When I finished my recent LLM series, I asked an agent to check and fix the grammar. I knew that several answers to the Chapter 1 study questions were incomplete or wrong. Because I trusted the LLM, I never reviewed its edits. Since it predicts the next token based on patterns, it “fixed” my incorrect answers instead of just polishing the grammar. So I lost my history of learning about LLMs. My prompt never stated “fix grammar only,” so the model tried to improve the content. I attempted a review using git diff, but with Markdown every paragraph sits on a single line, which made tracking changes hard. From now on I will proofread the rendered page in my local Jekyll instance.</p>

<p>That experience reinforced Chapter 1’s explanation of how an LLM works: it predicts the next token based on the weights it has learned.</p>

<h2 id="what-i-learned-from-giles-thomass-blog-post">What I learned from Giles Thomas’s blog post</h2>

<p>The first Transformer architecture had both an encoder and a decoder. I had missed that the encoder is also a neural network with its own attention blocks: the encoder produces contextual token representations, and the decoder consumes them to generate output tokens. Today we mostly see decoder-only architectures, yet encoder-only models still exist. Google’s BERT is a well-known example that handles classification and masked-word prediction but not token generation. Giles also highlighted that traditional machine learning includes algorithms that auto-label data from their inputs—Naive Bayes is one such case, so “automatic labeling” is not unique to LLMs.</p>

<h2 id="conclusion">Conclusion</h2>

<p>An important part of learning about LLMs is to use them. Applying one to my own posts showed me how it behaves on real content and why I must review every change. Rereading Giles’s blog post was also the right move; it reinforced my Chapter 1 knowledge and clarified the points I previously glossed over.</p>]]></content><author><name></name></author><category term="llm" /><category term="llm" /><category term="study-notes" /><summary type="html"><![CDATA[Re-read my Chapter 1 study posts and Giles Thomas’s companion article to reinforce Sebastian Raschka’s Build LLMs from Scratch takeaways and capture the new insights I missed the first time.]]></summary></entry><entry><title type="html">Building a large language model</title><link href="https://karlosmid.com/2026/01/building-a-large-language-model/" rel="alternate" type="text/html" title="Building a large language model" /><published>2026-01-12T00:00:00+00:00</published><updated>2026-01-12T00:00:00+00:00</updated><id>https://karlosmid.com/2026/01/building-a-large-language-model</id><content type="html" xml:base="https://karlosmid.com/2026/01/building-a-large-language-model/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>This is the sixth post in my LLM learning series. In the <a href="/2026/01/a-closer-look-at-the-gpt-architecture/">previous post</a>, we explored the GPT architecture in detail. Today, we dive into the process of building a large language model.</p>

<h2 id="what-are-the-three-main-stages-involved-in-building-a-large-language-model-from-scratch">What are the three main stages involved in building a large language model from scratch?</h2>

<p>The first stage is data preparation. Various data sources like books, web forums, and blog posts must be scraped into files and prepared for LLM training. The second stage is LLM pretraining, where we create a base model that understands the training language’s grammar, patterns, and nuances.
The final stage is fine-tuning the base model on a smaller dataset so the LLM can perform specific tasks with more accuracy.</p>

<p><em>The three stages are implementing the LLM architecture and data
preparation process, pretraining an LLM to create a foundation model,
and fine-tuning the foundation model for specific tasks.</em></p>

<p>I cheated a little bit on this question. I only knew two stages, so I reached for my botie.app notes and remembered that the first stage is data preparation.</p>

<h2 id="what-is-the-key-idea-behind-the-transformer-architecture-used-in-llms">What is the key idea behind the transformer architecture used in LLMs?</h2>

<p>The key idea is auto-labeling of input data during the model pretraining stage.</p>

<p><em>The transformer architecture utilizes an attention mechanism that allows
the LLM to selectively access the entire input sequence when generating
output, word by word.</em></p>

<p>This one was a total miss :)</p>

<h2 id="what-is-the-primary-task-used-for-pretraining-llms-like-gpt-3">What is the primary task used for pretraining LLMs like GPT-3?</h2>

<p>The primary task in the pretraining stage is to make the LLM understand language grammar, patterns, and nuances.</p>

<p><em>LLMs like GPT-3 are pretrained on a massive corpus of text by predicting
the next word in a sentence, using this prediction as a label.</em></p>

<p>Another failure.</p>

<h2 id="explain-the-concept-of-emergent-properties-in-llms">Explain the concept of emergent properties in LLMs.</h2>

<p>The emergent properties concept is when an LLM can solve tasks that we did not expect it to solve, like language translation.</p>

<p><em>While the primary pretraining task for GPT-like models is next-word
prediction, they exhibit emergent properties, meaning they can perform
tasks like classification, translation, and summarization without explicit
training for those tasks</em></p>

<p>I forgot to mention summarization and classification as examples of emergent tasks.</p>

<h2 id="why-is-fine-tuning-a-pretrained-llm-beneficial-for-specific-tasks">Why is fine-tuning a pretrained LLM beneficial for specific tasks?</h2>

<p>It is beneficial because the LLM can then better solve specific tasks in the domain of the fine-tuned dataset. An example is answering questions about specific medical problems.</p>

<p><em>Fine-tuning a pretrained LLM on a custom dataset allows it to specialize in
specific tasks and potentially outperform general LLMs on those tasks.</em></p>

<p>The key point I forgot is that it can outperform the general (base) LLM in specific tasks.</p>

<h2 id="put-these-stages-of-creating-a-pretrained-llm-base-model-in-order">Put these stages of creating a pretrained LLM (base model) in order:</h2>

<p>A. Evaluate the model’s performance on text generation tasks.<br />
B. Implement the transformer decoder architecture (GPT-like).<br />
C. Prepare the text data by cleaning and tokenizing it.<br />
D. Train the model using a next-word prediction task on a large text dataset.</p>

<table>
  <thead>
    <tr>
      <th>Step</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>C</td>
      <td>Prepare the text data by cleaning and tokenizing it.</td>
    </tr>
    <tr>
      <td>B</td>
      <td>Implement the transformer decoder architecture (GPT-like).</td>
    </tr>
    <tr>
      <td>D</td>
      <td>Train the model using a next-word prediction task on a large text dataset.</td>
    </tr>
    <tr>
      <td>A</td>
      <td>Evaluate the model’s performance on text generation tasks.</td>
    </tr>
  </tbody>
</table>

<p>It seems that I am good at table questions; this one was correct.</p>

<h2 id="match-the-term-to-its-description-on-the-right">Match the term to its description on the right:</h2>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Autoregressive Model</td>
      <td>The task of predicting the next word in a sequence, which is used to train GPT models.</td>
    </tr>
    <tr>
      <td>Self-Supervised Learning</td>
      <td>A type of machine learning where the model learns from the data itself, without requiring explicit labels.</td>
    </tr>
    <tr>
      <td>Next-Word Prediction</td>
      <td>A type of model that generates text by predicting the next word in a sequence based on the words that have already been generated.</td>
    </tr>
    <tr>
      <td>Decoder-Only Architecture</td>
      <td>The architecture of GPT models, which uses only the decoder portion of the transformer architecture, making it suitable for text generation.</td>
    </tr>
  </tbody>
</table>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Row</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Autoregressive Model</td>
      <td>Row 3</td>
    </tr>
    <tr>
      <td>Self-Supervised Learning</td>
      <td>Row 2</td>
    </tr>
    <tr>
      <td>Next-Word Prediction</td>
      <td>Row 1</td>
    </tr>
    <tr>
      <td>Decoder-Only Architecture</td>
      <td>Row 4</td>
    </tr>
  </tbody>
</table>

<h2 id="conclusion">Conclusion</h2>

<p>This section was also hard for me; mostly I could not remember the facts that I read one month ago. To recap, I will read chapter 1 again, read all my blog posts for chapter 1, and read <a href="https://www.gilesthomas.com/2024/12/llm-from-scratch-1">Giles’s blog part 1</a>.</p>]]></content><author><name></name></author><category term="llm" /><category term="llm" /><category term="ai" /><category term="machine-learning" /><summary type="html"><![CDATA[Exploring the three main stages of building an LLM — data preparation, pretraining, and fine-tuning — along with key concepts like transformer architecture, emergent properties, and self-supervised learning.]]></summary></entry><entry><title type="html">A Closer Look at the GPT Architecture</title><link href="https://karlosmid.com/2026/01/a-closer-look-at-the-gpt-architecture/" rel="alternate" type="text/html" title="A Closer Look at the GPT Architecture" /><published>2026-01-05T00:00:00+00:00</published><updated>2026-01-05T00:00:00+00:00</updated><id>https://karlosmid.com/2026/01/a-closer-look-at-the-gpt-architecture</id><content type="html" xml:base="https://karlosmid.com/2026/01/a-closer-look-at-the-gpt-architecture/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>In the <a href="/2026/01/utilizing-large-datasets/">previous post</a>, I discussed utilizing large datasets. Today, I am continuing with the study questions from <em>Build LLM from Scratch</em> by Sebastian Raschka, and the topic is a closer look at the GPT architecture.</p>

<h2 id="what-is-the-primary-task-that-gpt-models-are-trained-on-and-how-does-this-relate-to-their-ability-to-perform-other-tasks-like-translation">What is the primary task that GPT models are trained on, and how does this relate to their ability to perform other tasks like translation?</h2>

<p>The primary task that GPT models are trained on is to understand language nuances, context, and patterns.</p>

<p><em>GPT models are primarily trained on a next-word prediction task, which
involves predicting the next word in a sequence. This seemingly simple
task allows the models to learn the relationships between words and
phrases, enabling them to perform other tasks like translation, even
though they were not explicitly trained for it.</em></p>

<p>I totally failed on this question. I did not know what the primary task of a GPT model is. Ouch!</p>

<h2 id="explain-the-concept-of-self-supervised-learning-in-the-context-of-gpt-models">Explain the concept of self-supervised learning in the context of GPT models.</h2>

<p>The concept of self-supervised learning is where we send input to the decoder module gradually, first sending the first token, then we send the first and predicted token back to the decoder module, and we repeat that until there is no predicted token.</p>

<p><em>GPT models utilize self-supervised learning, where the model learns from
the data itself without requiring explicit labels. In the case of GPT, the
next word in a sentence serves as the label for the model to predict,
allowing for training on massive unlabeled text datasets.</em></p>

<p>Again, a total miss. Ok, I knew that the next word is important in the concept of self-supervised learning. So the number of labels in one sentence is equal to the number of words (tokens) in the sentence. And what comes before that label also matters.</p>
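
<p>To make the “next word as label” idea concrete, here is a tiny sketch (the sentence and variable names are made up for illustration) of how one sentence yields many training pairs:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Every prefix of the sentence becomes a training sample,
# and the token that follows it becomes the label.
pairs =
  for n &lt;- 1..(length(tokens) - 1) do
    {Enum.take(tokens, n), Enum.at(tokens, n)}
  end

# {["the"], "cat"}
# {["the", "cat"], "sat"}
# {["the", "cat", "sat"], "on"} ... and so on.
</code></pre></div></div>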

<h2 id="how-does-the-gpt-architecture-differ-from-the-original-transformer-architecture-and-what-are-the-implications-of-this-difference">How does the GPT architecture differ from the original transformer architecture, and what are the implications of this difference?</h2>

<p>The original transformer architecture had an encoder and decoder, and the LLM transformer only has a decoder, where encoding/decoding happens.</p>

<p><em>The GPT architecture uses only the decoder portion of the transformer,
making it a decoder-only model. This design makes it suitable for text
generation and next-word prediction tasks, as it generates text one word
at a time in a unidirectional, left-to-right manner.</em></p>

<p>This is what I do not understand: who does the encoder job in an LLM where we only have a decoder? So I asked Cursor:</p>

<p><em>In transformer architecture for llm, there is only decoder. Who then does encoder work from old transformer architecture</em></p>

<p>The answer:</p>

<h2 id="original-transformer-2017--encoder-decoder">Original Transformer (2017) — Encoder-Decoder</h2>

<p><strong>Encoder’s job</strong>: Read the entire input and create a rich representation (understanding)</p>
<ul>
  <li>Example: Read “The cat sat on the mat” and understand its meaning</li>
</ul>

<p><strong>Decoder’s job</strong>: Generate output using the encoder’s representation</p>
<ul>
  <li>Example: Generate translation “Le chat s’est assis sur le tapis”</li>
</ul>

<p>The encoder could see ALL tokens at once (bidirectional attention).</p>

<h2 id="gpts-decoder-only-architecture">GPT’s Decoder-Only Architecture</h2>

<p>In GPT, the <strong>self-attention mechanism does both jobs</strong>.</p>

<h3 id="how-self-attention-encodes">How Self-Attention “Encodes”</h3>

<p>When processing “The cat sat on the”:</p>
<ol>
  <li>Each token attends to all <strong>previous</strong> tokens</li>
  <li>“on” can look at “The”, “cat”, “sat” to understand context</li>
  <li>“the” can look at “The”, “cat”, “sat”, “on” to understand context</li>
</ol>

<p>This builds up a contextual representation — similar to what an encoder does!</p>

<h3 id="key-difference-causal-masking">Key Difference: Causal Masking</h3>

<table>
  <thead>
    <tr>
      <th>Architecture</th>
      <th>What each token can see</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Encoder (BERT)</td>
      <td>All tokens (past AND future)</td>
    </tr>
    <tr>
      <td>Decoder (GPT)</td>
      <td>Only past tokens (causal mask)</td>
    </tr>
  </tbody>
</table>

<p>BERT (encoder): “sat” sees → The, cat, <strong>sat</strong>, on, the, mat</p>

<p>GPT (decoder): “sat” sees → The, cat, <strong>sat</strong> ← can’t see future</p>

<h3 id="why-it-works">Why It Works</h3>

<p>The decoder’s self-attention layers progressively build understanding through multiple layers:</p>
<ul>
  <li><strong>Layer 1</strong>: Basic word relationships</li>
  <li><strong>Layer 2-N</strong>: Deeper semantic understanding</li>
  <li><strong>Final layers</strong>: Rich contextual representation</li>
</ul>

<p>So the “encoding” happens <strong>implicitly</strong> within the decoder’s self-attention layers, just restricted to only seeing past context.</p>

<p>So we reached the layers concept, but more about that in future posts.</p>
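
<p>The causal mask itself is easy to picture. Here is a minimal Nx sketch (my own illustration, not from the book) of a 4x4 mask in which position <em>i</em> may attend only to positions up to <em>i</em>:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>seq_len = 4
rows = Nx.iota({seq_len, 1})
cols = Nx.iota({1, seq_len})

# 1 where a token may attend (row index &gt;= column index), 0 otherwise:
# [[1, 0, 0, 0],
#  [1, 1, 0, 0],
#  [1, 1, 1, 0],
#  [1, 1, 1, 1]]
causal_mask = Nx.greater_equal(rows, cols)
</code></pre></div></div>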

<h2 id="what-is-happening-at-the-labels-1-and-2-in-the-diagram">What is happening at the labels 1 and 2 in the diagram?</h2>

<p><img src="/assets/decoder_stages.png" alt="Decoder Stages" /></p>

<table>
  <thead>
    <tr>
      <th>Label</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>“is” token is the predicted token for “This”</td>
    </tr>
    <tr>
      <td>2</td>
      <td>“This is” is the output of iteration 1 and is the input for iteration 2</td>
    </tr>
  </tbody>
</table>

<p>The answer:</p>

<table>
  <thead>
    <tr>
      <th>Label</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>The next word is created based on the input text</td>
    </tr>
    <tr>
      <td>2</td>
      <td>The output of the previous round becomes the input to the next round</td>
    </tr>
  </tbody>
</table>

<p>Finally, one correct answer from my side!</p>

<h2 id="what-is-the-significance-of-gpt-models-being-considered-autoregressive-models">What is the significance of GPT models being considered autoregressive models?</h2>

<p>The significance of autoregressive models is that as we move through iterations, the model gets better at predicting the next token.</p>

<p><em>Autoregressive models, like GPT, incorporate their previous outputs as
inputs for future predictions. This means that each new word generated
by GPT is based on the preceding sequence, ensuring coherence and
fluency in the generated text.</em></p>

<p>By “getting better and better at predicting new tokens,” I meant predicting the correct token. But what is a correct token? It seems to be one that ensures coherence and fluency in the text. The predicted token should logically connect to the previous tokens.</p>

<p>Example:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Coherent (what GPT produces):
&gt; "The cat sat on the mat. It was warm from the sunlight streaming through the window."
"It" clearly refers to "the mat"
The second sentence logically connects to the first
The narrative flows naturally
Incoherent (if words were generated independently):
&gt; "The cat sat on the mat. Purple ideas sleep furiously yesterday."
No connection between sentences
No logical flow
</code></pre></div></div>

<h2 id="describe-the-relationship-between-the-size-and-complexity-of-gpt-models-and-their-capabilities">Describe the relationship between the size and complexity of GPT models and their capabilities.</h2>

<p>As an LLM model is trained on a large set of input tokens (size is in billions), the LLM consists of a large set of weight parameters that help it to consistently predict the next token. The LLM is more complex as it contains a larger set of weight parameters.</p>

<p><em>GPT models, particularly GPT-3, are significantly larger than the original
transformer model, with a greater number of layers and parameters. This
increased size and complexity contribute to their ability to perform a
wider range of tasks and achieve higher accuracy.</em></p>

<p>The correct answer is that a GPT model is able to perform a wider range of tasks and achieve higher accuracy, versus my answer of consistently predicting the next token, which is wrong. Tasks could include:</p>
<ul>
  <li>predict next token</li>
  <li>classify input</li>
  <li>summarize input</li>
  <li>generate essay</li>
</ul>

<h2 id="match-the-term-on-the-left-to-its-description-on-the-right">Match the term on the left to its description on the right:</h2>

<p><img src="/assets/models.png" alt="GPT Model Structure" />
| Term              | Description |
|——————-|————-|
| pretrained models | row 2       |
| fine-tuning       | row 3       |
| base model        | row 1       |</p>

<p>Correct answer</p>

<h2 id="conclusion">Conclusion</h2>

<p>For me, this part was tough. We are scratching the surface of LLM magic, and things are getting complicated.</p>]]></content><author><name></name></author><category term="LLM" /><category term="llm" /><category term="gpt" /><category term="transformer" /><category term="architecture" /><category term="deep-learning" /><summary type="html"><![CDATA[Exploring GPT architecture through study questions — understanding next-word prediction, self-supervised learning, decoder-only design, autoregressive generation, and how model size impacts capabilities.]]></summary></entry><entry><title type="html">Utilizing Large Datasets</title><link href="https://karlosmid.com/2026/01/utilizing-large-datasets/" rel="alternate" type="text/html" title="Utilizing Large Datasets" /><published>2026-01-04T11:00:00+00:00</published><updated>2026-01-04T11:00:00+00:00</updated><id>https://karlosmid.com/2026/01/utilizing-large-datasets</id><content type="html" xml:base="https://karlosmid.com/2026/01/utilizing-large-datasets/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>In the <a href="/2026/01/introduction-to-transformer-architecture/">previous post</a>, I discussed the Transformer architecture. Today, I am continuing with the study questions from <em>Build LLM from Scratch</em> by Sebastian Raschka, and the topic is utilizing large datasets.</p>

<h2 id="what-are-the-key-characteristics-of-the-training-datasets-used-for-large-language-models-like-gpt-3-and-bert">What are the key characteristics of the training datasets used for large language models like GPT-3 and BERT?</h2>

<p>The key characteristics of the training dataset are size and diversity. For GPT-3, the training corpus had 410 billion tokens. For diversity, the dataset consisted of books, Wikipedia, blog posts, and text from sites like Reddit.</p>

<p><em>These datasets are vast, encompassing billions of words and covering a
wide range of topics and languages. They are designed to expose the
models to diverse text, enabling them to learn language syntax,
semantics, and context.</em></p>

<p>I forgot to mention that various languages are used.</p>

<h2 id="explain-the-significance-of-the-size-and-diversity-of-the-training-dataset-for-the-performance-of-large-language-models">Explain the significance of the size and diversity of the training dataset for the performance of large language models.</h2>

<p>As the training dataset becomes larger and more diverse, the LLM can predict the next word more precisely because it builds a better internal model of the language.</p>

<p><em>The scale and diversity of the training data allow these models to perform
well on various tasks, including those requiring general knowledge. The
models learn to understand and generate text that reflects the real-world
complexities of language</em></p>

<p>I forgot to mention that this is what allows the LLM to perform well on various tasks.</p>

<h2 id="what-is-the-concept-of-tokenization-in-the-context-of-large-language-models">What is the concept of ‘tokenization’ in the context of large language models?</h2>

<p>To tokenize input text means to break it into tokens, where a token is, for example, a word or a punctuation mark such as “?”. A token can also be smaller than a word. The attention mechanism then operates on those tokens to determine how strongly they relate to each other.</p>

<p><em>Tokenization is the process of converting text into individual units called
tokens, which are the basic building blocks that the model reads and
processes. These tokens can be words, punctuation marks, or other
meaningful units of text.</em></p>

<p>This I answered correctly.</p>
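
<p>To make tokenization concrete, here is a minimal word-level tokenizer sketch in Elixir. It is my own illustration (the NaiveTokenizer module name is made up), and real LLMs use subword schemes such as byte pair encoding rather than whole words:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code>defmodule NaiveTokenizer do
  # Split text into word and punctuation tokens (a very naive scheme).
  def tokenize(text) do
    Regex.scan(~r/\w+|[^\w\s]/u, text) |&gt; List.flatten()
  end

  # Map each token to an integer id, the form the model actually consumes.
  def to_ids(tokens) do
    vocab = tokens |&gt; Enum.uniq() |&gt; Enum.with_index() |&gt; Map.new()
    Enum.map(tokens, fn token -&gt; Map.fetch!(vocab, token) end)
  end
end

tokens = NaiveTokenizer.tokenize("Is this a token?")
IO.inspect(tokens)                        # ["Is", "this", "a", "token", "?"]
IO.inspect(NaiveTokenizer.to_ids(tokens)) # [0, 1, 2, 3, 4]
</code></pre></div></div>

<p>The attention mechanism then works on those integer ids, or rather on the embeddings derived from them, never on the raw text.</p>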

<h2 id="describe-the-concept-of-pretraining-in-the-context-of-large-language-models-and-its-significance">Describe the concept of ‘pretraining’ in the context of large language models and its significance.</h2>

<p>In the pretraining phase, we train a large language model on a general dataset; the goal is for the LLM to develop a general understanding of language structure, concepts, and patterns.</p>

<p><em>Pretraining involves training a large language model on a massive dataset
to learn general language patterns and knowledge. This pre-trained
model serves as a foundation, making it adaptable for various
downstream tasks through fine-tuning, which involves further training on
specific datasets for specific applications.</em></p>

<p>I forgot to mention that the result of the pretraining phase is a base for the fine-tuning process, where we train the LLM for specific tasks.</p>
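
<p>As a sketch of what “learning general language patterns” means in practice (my own illustration, not code from the book): pretraining is self-supervised next-token prediction, so the training pairs are simply shifted slices of the token stream, with no human labels required.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Self-supervised next-token pairs: the input is the context so far,
# and the label is whatever token actually comes next in the corpus.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

pairs =
  for i &lt;- 1..(length(tokens) - 1) do
    {Enum.take(tokens, i), Enum.at(tokens, i)}
  end

IO.inspect(pairs)
# [{["the"], "cat"}, {["the", "cat"], "sat"}, ...]
</code></pre></div></div>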

<h2 id="explain-the-concept-of-fine-tuning-in-the-context-of-large-language-models-and-its-advantages">Explain the concept of ‘fine-tuning’ in the context of large language models and its advantages.</h2>

<p>Fine-tuning is a phase in which we use a much smaller dataset than in pretraining. The goal is to teach the LLM to solve a specific task, for example, detecting whether an email is spam or not. Fine-tuning is done on a pretrained LLM that already has general knowledge of the language.</p>

<p><em>Fine-tuning involves further training a pre-trained large language model
on a smaller, task-specific dataset. This process adapts the model to
perform well on specific tasks, such as text summarization or question
answering, while leveraging the general knowledge learned during
pretraining.</em></p>

<p>This is a correct answer.</p>
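
<p>A minimal sketch of what a fine-tuning dataset for the spam example might look like (the texts and labels below are made up for illustration): compared to pretraining data, it is tiny, labelled, and tied to a single task.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical labelled dataset for classification fine-tuning.
# The pretrained model already "knows" the language; here it only learns the task.
dataset = [
  {"You won a free cruise, click here!", :spam},
  {"Meeting moved to 10am tomorrow", :ham},
  {"Lowest prices on watches!!!", :spam},
  {"Can you review my pull request?", :ham}
]

Enum.each(dataset, fn {text, label} -&gt;
  IO.puts("#{label}: #{text}")
end)
</code></pre></div></div>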

<h2 id="match-the-term-on-the-left-to-its-description-on-the-right">Match the term on the left to its description on the right:</h2>

<p><img src="/assets/large_dataset.png" alt="Large Dataset Matching Exercise" /></p>

<table>
  <thead>
    <tr>
      <th>Term</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Encoder</td>
      <td>row 3</td>
    </tr>
    <tr>
      <td>Decoder</td>
      <td>row 2</td>
    </tr>
    <tr>
      <td>Self-attention mechanism</td>
      <td>row 4</td>
    </tr>
    <tr>
      <td>Zero-shot learning</td>
      <td>row 1</td>
    </tr>
  </tbody>
</table>

<p>This is correct!</p>

<h2 id="conclusion">Conclusion</h2>

<p>An LLM without a dataset cannot exist. We learned the most important characteristics of LLM datasets.</p>]]></content><author><name></name></author><category term="llm" /><summary type="html"><![CDATA[Exploring the role of large datasets in LLMs — tokenization, pretraining, and fine-tuning. Study questions from Build LLM from Scratch by Sebastian Raschka.]]></summary></entry><entry><title type="html">Introduction to Transformer Architecture</title><link href="https://karlosmid.com/2026/01/introduction-to-transformer-architecture/" rel="alternate" type="text/html" title="Introduction to Transformer Architecture" /><published>2026-01-03T10:00:00+00:00</published><updated>2026-01-03T10:00:00+00:00</updated><id>https://karlosmid.com/2026/01/introduction-to-transformer-architecture</id><content type="html" xml:base="https://karlosmid.com/2026/01/introduction-to-transformer-architecture/"><![CDATA[<h2 id="tldr">TL;DR</h2>

<p>In the <a href="/2026/01/stages-of-building-and-using-llms/">previous post</a>, I discussed the stages of building and using LLMs. Today, I am continuing with the study questions from <em>Build LLM from Scratch</em> by Sebastian Raschka, and the topic is the Transformer architecture.</p>

<h2 id="what-is-the-transformer-architecture-and-what-is-its-significance-in-the-development-of-llms">What is the transformer architecture and what is its significance in the development of LLMs?</h2>

<p>The Transformer architecture consists of encoder and decoder parts. In the encoder, input text is broken into tokens, and each token is transformed into an embedding vector. The embedding vector encodes the token’s relation to other tokens. Token embeddings are input to the decoder, and the decoder predicts the most probable next word. LLMs only use the decoder part, where input text is broken into tokens, tokens are encoded to embeddings, and decoded to predict the next word.</p>

<p><em>The transformer architecture is a deep neural network architecture that
revolutionized natural language processing. It’s the foundation for most
modern LLMs, enabling them to process and understand language
effectively.</em></p>

<p>I forgot to state that this is a deep neural network. My understanding was that LLMs use only the decoder part, and this is true for GPT-style models: the original Transformer had both an encoder and a decoder, while newer GPT-style models keep only the decoder.</p>
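
<p>As a tiny illustration of the “token to embedding vector” step described above (my own sketch with made-up numbers, not code from the book): an embedding layer is essentially a lookup table from token ids to learned vectors.</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical 3-dimensional embedding table with invented values; real models
# learn these vectors during training and use hundreds or thousands of dimensions.
embedding_table = %{
  0 =&gt; [0.1, -0.4, 0.7],   # "the"
  1 =&gt; [0.9, 0.2, -0.3],   # "cat"
  2 =&gt; [-0.5, 0.8, 0.1]    # "sat"
}

token_ids = [0, 1, 2]
embeddings = Enum.map(token_ids, fn id -&gt; embedding_table[id] end)
IO.inspect(embeddings)
</code></pre></div></div>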

<h2 id="which-label-in-the-diagram-does-the-output-of-the-embeddings-stage-go-to">Which label in the diagram does the output of the Embeddings stage go to?</h2>

<p><img src="/assets/transformer.png" alt="Transformer Architecture" /></p>

<p>Label 2. This picture shows the original Transformer with encoder/decoder blocks, where the embeddings are input to the decoder blocks. The original Transformer was designed for translation, and this picture shows translation from English to German.</p>

<h2 id="describe-the-two-main-components-of-the-transformer-architecture-and-their-roles-in-language-processing">Describe the two main components of the transformer architecture and their roles in language processing.</h2>

<p>The two main components of the original Transformer are the Encoder and the Decoder. The Encoder breaks text into tokens and then embeds each token into a vector of numerical values. The Decoder uses these embeddings as input and, based on them, predicts the next word.</p>

<p><em>The transformer architecture consists of an encoder and a decoder. The
encoder processes the input text and converts it into numerical
representations, while the decoder uses these representations to
generate the output text.</em></p>

<p>On the spot! It seems that I correctly grasped the Transformer concept!</p>

<h2 id="what-is-the-self-attention-mechanism-and-how-does-it-contribute-to-the-transformers-effectiveness">What is the self-attention mechanism and how does it contribute to the transformer’s effectiveness?</h2>

<p>The self-attention mechanism is encoded in embeddings and helps the decoder pay attention to selected parts of the input text to generate the next token (word).</p>

<p><em>The self-attention mechanism allows the transformer to weigh the
importance of different words in a sequence relative to each other. This
helps the model capture long-range dependencies and contextual
relationships, leading to more coherent and relevant output.</em></p>

<p>I missed the key concept: words pay attention to each other, and the result is an “understanding” of conceptual word relationships.</p>
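
<p>To see what “words paying attention to each other” means numerically, here is a toy dot-product attention sketch with invented vectors. It is a simplified illustration only; the real mechanism uses learned query, key, and value projections:</p>

<div class="language-elixir highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy dot-product attention: one query vector attends over three key vectors.
# All numbers are invented; in a real model they come from learned projections.
dot = fn a, b -&gt;
  Enum.zip(a, b) |&gt; Enum.map(fn {x, y} -&gt; x * y end) |&gt; Enum.sum()
end

softmax = fn scores -&gt;
  exps = Enum.map(scores, fn s -&gt; :math.exp(s) end)
  total = Enum.sum(exps)
  Enum.map(exps, fn e -&gt; e / total end)
end

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]

scores = Enum.map(keys, fn key -&gt; dot.(query, key) end)
weights = softmax.(scores)
IO.inspect(weights)
# The first key gets the largest weight because it is most similar to the query.
</code></pre></div></div>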

<h2 id="explain-the-key-differences-between-bert-and-gpt-models-in-terms-of-their-training-approaches-and-primary-applications">Explain the key differences between BERT and GPT models in terms of their training approaches and primary applications.</h2>

<p>BERT is a text classifier, and GPT is a chatbot.</p>

<p><em>BERT focuses on masked word prediction and excels in tasks like text classification, while GPT is designed for generative tasks like text completion, translation, and summarization.</em></p>

<p>I learned that the three main capabilities of GPT-style chatbots are text completion, translation, and summarization.</p>
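
<p>A small illustration of the difference in training objectives, with my own example sentence:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>BERT-style masked word prediction:
Input:  "The cat sat on the [MASK]."
Target: "mat"

GPT-style next-word prediction (autoregressive):
Input:  "The cat sat on the"
Target: "mat"
</code></pre></div></div>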

<h2 id="what-are-zero-shot-and-few-shot-learning-and-how-do-they-relate-to-gpt-models">What are zero-shot and few-shot learning, and how do they relate to GPT models?</h2>

<p>GPT completes sentences. It can also perform tasks. In zero-shot, we provide no examples and let the model figure it out. In few-shot, we provide several examples to guide the output.</p>

<p><em>Zero-shot learning allows GPT models to perform tasks without prior
training on specific examples, while few-shot learning enables them to
learn from a minimal number of examples. These capabilities
demonstrate GPT’s versatility and adaptability</em></p>

<p>I initially got zero-shot wrong; as the name says, we provide zero examples.</p>

<p>Zero-shot example:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Classify the sentiment of this review as positive or negative:

"The food was amazing and the service was excellent!"

Sentiment:

Answer: positive
</code></pre></div></div>

<p>Few-shot example:</p>

<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Classify the sentiment of reviews:

Review: "Terrible experience, never coming back"
Sentiment: negative

Review: "Best pizza I've ever had!"
Sentiment: positive

Review: "The movie was boring and too long"
Sentiment: negative

Review: "I love this product, works perfectly!"
Sentiment:

Answer: positive
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>

<p>The Transformer architecture is the key concept of LLMs, and it is not a simple one. Answering these questions gave us an easy introduction to it.</p>]]></content><author><name></name></author><category term="llm" /><summary type="html"><![CDATA[Diving into the Transformer architecture — encoder vs decoder, self-attention, BERT vs GPT, and zero-shot/few-shot learning. Study questions from Build LLM from Scratch.]]></summary></entry></feed>