Rough Notes

Testing Qwen3-Coder-Next - II

2026-03-31T00:00:00+00:00

While writing a more challenging test, I saw a new Qwen-based model in Reddit. Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF uses Supervised Fine-Tuning and LoRA with more than 13,000 reasoning examples generated with Claude Opus (mostly 4.6). The datasets used are listed here.

So I decided to test both Qwen3-Coder-Next and this new model to see the differences.

Running the models

As with Qwen3-Coder, I’ll use the 4-bit quantization, the only one that can reasonably run on my PC:

 hf download /
    Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2-GGUF   /
    --include "Qwen3.5-27B.Q4_K_M.gguf"   /
    --local-dir ./models/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

To run the models, I’m making some compromises:

llama-server /
    -m ./models/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2/Qwen3.5-27B.Q4_K_M.gguf /
    --jinja /
    --n-gpu-layers auto /
    --ctx-size 16384 /
    --batch-size 1024 /
    --ubatch-size 512 /
    -n -1

With 16GB of VRAM, not all the layers will fit on the GPU, so --n-gpu-layers auto let’s llama.cpp figure out the best configuration. For the context, --ctx-size 16384 was my first try, and it happened to be the right size for this exercise. For Qwen3-Coder-Next, I used the same non-params config as before:

llama-server /
    -m ./models/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q4_K_XL.gguf

The test

I asked ChatGPT for a prompt for this test. He suggested this problem:

Implement a small local workflow engine in Python.

The full prompt is here. It defines some requirements (parallel execution, retries, replay,…), a specific json format, states and behavior, and some constraints. It will take me quite some time to implement this by myself.

The solutions

Qwen3-Coder-Next

It takes two and a half minutes to produce a solution. Less than 6,000 tokens, at 40 tokens/s

Llama.cpp data for Qwen3-Coder-Next

The full Qwen3-Coder-Next answer is here. It makes some comments on the design, gives a proposed file structure, and then shows the code in just one file for simplicity.

Comments are sparse, which is fine by me, but mostly useless. Naming is ok. Too much indentation at times (an if inside an if inside a while inside a with…). I’m not looking at the logic, I’ll ask someone (something?) to do it for me.

The answer includes a table showing how the code implements each requirement, with special emphasis on how it avoids race conditions (spoiler: it doesn’t). It finishes with instructions to run the code and a list of possible improvements.

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

This is a thinking model, so we get some “thinking time” before getting the solution. Also, unlike the previous model, this one is not based on a MoE architecture, so all layers are always active, and we can’t move them from RAM to VRAM. Some layers will use the CPU instead of the GPU, reducing performance.

Llama.cpp data for Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

The full answer is here. The answer starts with the reasoning context, where the model thinks of a plan before implementing the code.

The style is similar to Qwen3-Coder, but it looks less elegant: more loops, a long succession of if-elif statements for the events, function comments, but without any additional information.

Apart from that, instructions on running the code and the justification for some of the decisions. The justifications are too generic, so the relationship with the code isn’t clear.

The evaluations

To decide how good the solutions are, I sent all the info to both ChatGPT (GPT5.4 Thinking) and Claude 4.6 Extended. This is the prompt I used:

The prompt.md file contains a coding problem to test two local LLM models. The answers from each model are qwen3-coder-answer.md and qwen3.5-opus-answer.md. Analyze and compare the answers. List the good and the not-so-good from their answers, and give a final score to each.

The answers from each model are: ChatGPT, Claude.

Here is the summary from GPT 5.4 Thinking (model A is Qwen3-Coder-Next, model B is Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2):

Everybode passes! I wish I had a teacher like this at college.

So it doesn’t look too bad, until you look at the details on the most basic feature requested. For Qwen3-Coder:

Event log implementation is flawed

Uses temp + rename per append, but writes only one line → overwrites log

This breaks persistence and replay (critical bug).

For Qwen3.5-Opus:

Event log correctness issue (same as Model A)

Same temp+rename pattern → overwrites file → catastrophic.

I don’t know if I’d give a 7.5 for something that doesn’t work at all.

Let’s see what Claude thinks:

Not so fast!

Now, that sounds like a more reasonable score if, indeed, the basic features don’t work (although Claude thinks that the Opus version is almost ok, mmm, suspicious).

So, who’s right? There is only one way to find out: ~~review the code myself~~ ask the models again! But this time with the evidence from the other model 😈:

This is Claude’s review of the same code. The conclusions are different. Review them and tell me if you agree or you think Claude is wrong.

This is ChatGPT’s review of the same code. The conclusions are different. Review them and tell me if you agree or you think ChatGPT is wrong.

These are the answers: ChatGPT, Claude.

Sorry Claude, you were right

Ok, you were not that bad either, ChatGPT

Summary

In the end, neither version is good enough. Qwen3-Coder-Next makes a couple of really critical mistakes that make the whole project unusable, while Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2 errors are not as dramatic, but are bad enough to need serious revisions. Interestingly, once the critical bugs are fixed, the Qwen3-Coder-Next solution is better.

I would say that we can use these models to assist in the coding, but not as autonomous agents.

As a reference, this is a solution proposed by ChatGPT with Claude’s approval.

Testing Qwen3-Coder-Next

2026-03-24T00:00:00+00:00

Are local models useful for coding on a normal PC? Let’s find out.

Using `llmfit`

A good first step is using llmfit to figure out what some good options are:

llmfit running on my Linux box

The Inst column shows if the model is supported (-), if it is available (✓), if it is available through llama.cpp (L), Ollama (O), … Most of the data shown is scraped from the web, but for some models, it is calculated based on a few heuristics. This explains the difference between Qwen3-Coder-Next and Qwen3-Coder-Next-FP8. These two are the same model, with the -FP8 version already quantized to 8 bits.

But the way llmfit works is searching for the optimal quantized model (for example, from Unsloth) and calculating the scores from there. The differences then come from the fact that it uses the scraped data for Qwen3-Coder-Next, but for Qwen3-Coder-Next-FP8 it infers some params.

For example, these models (or model, because they are both the same) use a MoE architecture, where instead of a big FFN, we have several different small FFNs. According to the HF card, Qwen3-Coder-Next (and the FP8 variant) has 512 experts, of which only 10 are activated at the same time, with one additional shared expert. Each expert has a hidden layer of 512 dimensions.

So the main difference is in the number of active parameters, that is, the number of params for the active experts: for Qwen3-Coder-Next we have the 3B set in the HF card, but for Qwen3-Coder-Next-FP8 we have almost 5.5B. So this would make a better use of the GPU, while also leaving more RAM free. Maybe even allowing a better quantization.

Downloading the model

In any case, this is all wrong: there is only one model to quantize, and the FP8 (80GB) version wouldn’t fit in my RAM + VRAM.

So I’m going to download this one:

hf download unsloth/Qwen3-Coder-Next-GGUF   --include "Qwen3-Coder-Next-UD-Q4_K_XL.gguf"   --local-dir ./models/Qwen3-Coder-Next

The UD-Q4_K_XL quantization is a special unsloth dynamic quantization that is supposed to perform better.

Testing the model

Once downloaded, we run it with llama.cpp:

llama-server -m ./models/Qwen3-Coder-Next/Qwen3-Coder-Next-UD-Q4_K_XL.gguf

A simple exercise

The code generated is in this file. Qwen also added these comments:

Not bad!

The result isn’t bad: there are things to improve (project in FROM clause, location, no real reason for the ORDER BY), but the code works.

I will try something more complex when I have time.

Using Telegram to control Claude

2026-03-22T00:00:00+00:00

Anthropic has just released a feature called Channels, which allows a Claude session to communicate with an external system in one of two ways:

listening for events on an HTTP port
polling an external API

From https://code.claude.com/docs/en/channels-reference

The channel is an MCP server automatically run when we use the --channels parameter when launching Claude.

Testing channels with Telegram

We first need to create a Telegram bot and safely store the bot token.

Creating a bot in Telegram

We also have to verify that Claude’s version is 2.1.80 or later. To run the MCP server, Claude also needs a JavaScript runtime. I’m using bun, but I think node and deno are valid options too.

The next step is installing the Telegram plugin inside Claude:

/plugin marketplace add anthropics/claude-plugins-official # In case it wasn't already added
/plugin install telegram@claude-plugins-official
/reload-plugins

Now the Telegram plugin should be available, and we just have to configure the bot token:

/telegram:configure 123456789:AAHfiqksKZ8...

The final step is to pair the bot with Claude. For that, we DM the bot we created before:

If you did everything ok, you only need to say Hi! once

The bot’s response includes the command to pair the session inside Claude:

/telegram:access pair a12345

I said ‘Hi’ three times because the MCP server for the channel wasn’t working. I installed bun but I didn’t make sure the PATH was updated, so the MCP server was failing. You can check the MCP server status with /mcp:

Checking the mcp status

And that’s it. Now I can use Telegram as a front-end for an existing Claude session:

Now I want the same for Codex

Old man tries things, Nano Banana edition

2026-03-15T00:00:00+00:00

Update

This entry was written when the blog looked like this:

This blog used to be this ugly

Original text

I use Gemini or ChatGPT to generate images for my posts. And they are getting really good at this, especially Gemini’s Nano Banana 2.

I used Nano Banana 2 to generate two images. With this prompt:

A 10-year old kid playing a conversational game with a ZX Spectrum. The monitor is an old TV. The image is photorealistic, but with aesthetics from the 80s, and a Polaroid-like look. Make the proportions 16-9.

The first one was this:

Is the Spectrum too big or the rest of the image too small?

The mix of right and wrong details is amusing. The general aesthetic is very good, the key elements (ZX Spectrum, game on old TV, kid) are quite good (the Spectrum is too big). The game text is legible and consistent with conversational games.

On the other hand, too many posters and books mean lots of garbled text or deformed faces.

This is the second one:

Greetings Thorin!

This one is even better, in part because most of the text is out of focus. Some text in the foreground is garbled, but some out-of-focus text looks good (the cassette on the TV seems to read “The Hobbit”, and the imperial trooper in the background is in front of something that reads like Star Wars). The Spectrum has better proportions, and everything looks consistent with the date helpfully included in the Polaroid border.

My favourite detail is the game on the TV: it is actually The Hobbit, probably the best known conversational game for the Spectrum, but with graphics that look more like those of an NES.

As a check, I used the same prompt with ChatGPT Images:

TODO

While the worst image quality is consistent with an 80s Polaroid, everything in the image is worse: the Spectrum-like computer, the generic non-game-like text in the TV, the Rubik cube with a wrong perspective…

I’ll stick with Nano Banana 2 for the future.

Writing an agentic conversational game - II

2026-03-14T00:00:00+00:00

The implementation

Once the plan is defined, it’s time to start coding. I just fed OpenAI Codex the prompt as ChatGPT had defined it and generated this version: https://github.com/javiercancela/virtual-world/tree/v1

Just as a test, I used Cursor in Auto mode to add `uv:

Use uv, please

Running the code results in this interaction:

This image does not resolve into a safe action

Ok, now we need some actual local LLM to test. I deleted my previous llama.cpp install, so I downloaded and compiled it again:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16

For the model, we use the Qwen3.5 family, with the Unsloth quantization:

hf download unsloth/Qwen3.5-4B-GGUF   --include "Qwen3.5-4B-UD-Q4_K_XL.gguf"   --local-dir ./models/Qwen3.5-4B
hf download unsloth/Qwen3.5-9B-GGUF   --include "Qwen3.5-9B-UD-Q5_K_XL.gguf"   --local-dir ./models/Qwen3.5-9B

The 4B version for routing, the 9B version for the rest.

We run one instance of llama.cpp for each, in different ports. To fit this into my system, I run the 9B model with the GPU and the 4B one with the CPU:

 ./llama.cpp/build/bin/llama-server  /
    -m ./models/Qwen3.5-9B/Qwen3.5-9B-UD-Q5_K_XL.gguf   /
    --alias Qwen3.5-9B   /
    --port 8081

CUDA_VISIBLE_DEVICES=""  ./llama.cpp/build/bin/llama-server  /
    -m ./models/Qwen3.5-4B/Qwen3.5-4B-UD-Q4_K_XL.gguf /
    -ngl 0  /
    -t 8  /
    -c 2048  /
    --port 8082

I used CUDA_VISIBLE_DEVICES=”” to force using the CPU and avoid compiling this model specifically for the CPU. But to make it work, I had to add the -ngl 0 param too. The option -c 2048 limits the context to use less memory, because context usage for routing is small.

I found an error after trying with the llama servers running:

...
slot update_slots: id  3 | task 910 | prompt processing done, n_tokens = 110, batch.n_tokens = 110
slot print_timing: id  3 | task 910 |
prompt eval time =      56.08 ms /   110 tokens (    0.51 ms per token,  1961.52 tokens per second)
 eval time =    2224.34 ms /   128 tokens (   17.38 ms per token,    57.55 tokens per second)
 total time =    2280.42 ms /   238 tokens
slot      release: id  3 | task 910 | stop processing: n_tokens = 237, truncated = 0
srv  update_slots: all slots are idle
srv          stop: cancel task, id_task = 910
srv  update_slots: all slots are idle
srv    operator(): got exception: {"error":{"code":500,"message":"Failed to parse input at pos 494: ","type":"server_error"}}
srv  log_server_r: done request: POST /v1/chat/completions 127.0.0.1 500

I just pasted the log to Claude and asked him to fix it. This is the fixed version: https://github.com/javiercancela/virtual-world/tree/v2.

This is the end result for this version:

Night Desk
Rain hammers against the steel shutters as Mara Voss leans over her desk, her
eyes fixed on the locked exit door. The only sound is the rhythmic clatter of
keys against the steel cabinet and the soft rustle of paper in the visitor
ledger. She offers a knowing smile, suggesting that the storm outside is just
as trapped as you are inside.
Commands: inspect , talk to Mara, use  on , take ,
move , inventory, look, help, quit.

>  What is the time according to the clock?
Rain drums against the high windows as Mara Voss measures your story, her eyes
fixed on the locked steel cabinet and the sealed exit door. The only things
within reach are the visitor ledger, the coat rack, the framed photo, and the
desk, all set against the cold glow of the office.

> Examine the framed photo
The photograph reveals Mara with a commendation ribbon pinned to her jacket,
while the engraved plaque beneath reads: "Attention kept us alive." It feels
less like sentiment and more like a test, the stark white paper contrasting
with the dim, humming light of the security office.

> Examine the visitor ledger
The ledger lies open on the desk, its final entry underlined twice in stark
red pencil, while the name "Silas Vale" stares back from the visitor's
signature. Mara's sharp eyes are already scanning the page, waiting for you
to confirm whether that name belongs in the report or the witness box.

> Ask Mara about Silas Vale
Rain drums against the high windows as Mara Voss measures your story, her gaze
fixed on the locked steel cabinet and sealed exit door. The only things within
reach are the visitor ledger, the framed photo, the coat rack, and the heavy
steel cabinet that remains stubbornly closed.

> talk to Mara
"The ink on that ledger doesn't dry quickly when someone is watching you, so
why are you still looking at me?" Mara's voice is low and steady, cutting
through the rain outside as she avoids your eyes. "If you believe Silas Vale's
name belongs in the witness box, then you know exactly what to do with the
ledger before I say another word."

> Yes, I think it begongs in the witness box
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> open the desk
Using the desk that way does nothing useful.

> open the steel cabinet
The keypad waits for a three-digit code you do not have yet.

> inventory
Inventory: empty.

> take the ledger
You cannot take the visitor ledger.

> inspect the desk
The desk is a cluttered fortress of incident forms and a dead flashlight, the
visitor ledger already open to the latest page as if waiting for an answer.
The ink on the recent entries glistens under the flickering light, demanding
that you look closer at who has been here.

> read the ledger
Inventory: empty.

> inspect the ledger
The ledger's final entry trembles under double red underlines, the ink stark
against the white page. You trace the name Silas Vale, knowing Mara's eyes will
follow the trail of that signature. The silence in the office feels heavier
now, waiting for the next clue to fall into place.

> help
Commands: inspect , talk to Mara, use  on , take ,
move , inventory, look, help, quit.

> move ledger
You shift the visitor ledger, but it changes nothing useful.

> inspect the coat rack
Mara's raincoat drips onto the tiles, the metallic clink of a leather key clip
hanging empty from the sleeve seam. The rack remains barren, confirming that
whatever key belongs there is not on it. The silence of the office is broken
only by the rhythmic tapping of water against the floor.

> inspect the leather key clip
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> tak the key clip
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> take the key clip
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> inspect Mara's raincoat
Mara Voss watches your every move with controlled impatience, her eyes scanning
the room as if expecting a trap. She looks tired and suspicious, refusing to
volunteer the cabinet code despite your persistent inquiries. The silence in
the Security Office grows heavier with each passing second.

> inspect raincoat
The empty clip suggests the important key is stored elsewhere.

> move the coat rack
You shift the desk, but it changes nothing useful.

> move coat rack
You drag the coat rack aside, revealing only the slick sheen of wet tile and a
faint trail of grit beneath it. No spare key is hidden in the shadows of the
wood, and the silence of the office remains unbroken by the sound of brass or
distant sirens.

> move framed photo
You straighten the framed photo, revealing a pristine, dust-free rectangle behind
it that offers no hidden switch. The recent handling of the frame only deepens
the room's quiet theme: notice the details, do not expect tricks. The air in the
Security Office remains still, waiting for the next observation.

> talk to mara
"Careful with the coat rack, I've been testing to see if you'll notice the grit
trail beneath it. If you think I'm hiding something in the office, you should be looking at the visitor ledger instead."

> move visitor ledger
You shift the visitor ledger, but it changes nothing useful.

> take visitor ledger
You cannot take the visitor ledger.

> talk to Silas Vale
The ledger you just inspected was a dead end, Silas, but it confirmed that
someone else has been watching this room. If you truly want to know what I'm
hiding, you need to get out here and talk to the visitor in the corner, not keep
asking me.

> talk to the visitor
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> talk to mara
I know you want answers, Mara, but the visitor in the corner holds the truth I
cannot give you. Go inspect the ledger again; if that's a dead end, then someone
else is watching this room, and you're the only one who can see it.

> inspect the corner
That does not resolve into a safe action. Rephrase it with a clear object or
request.

> tell mara Silas Vale
Silas Vale is a man of many talents, yes, but his ledger is the only place his
true nature hides. Go back to the desk, open the ledger, and tell me what you
find there. If the pages are blank, then someone else is watching this room,
and you're the only one who can see it.

> open the ledger
Using the visitor ledger that way does nothing useful.

> inspect ledger
The ledger's final entry trembles under the red pencil, its double underlining
stark against the dim light. You recognize the name Silas Vale from the
visitor log, a detail Mara would likely dissect with sharp curiosity. The air
in the office feels heavy with the weight of a secret waiting to be uncovered.

> talk to mara
Silence is the only thing that keeps the shadows from shifting, so stop talking
and look at the ledger again. If you think you see something there, I will
believe you; if not, then someone else is already watching us.

> tell Mara I see something
The red pencil on the ledger is the only thing that matters right now, yet you
claim the steel cabinet holds a truth I cannot yet read. If that cabinet
contains more than a key, then the shadows are already shifting, and I am no
longer the only one watching us. Go back to the ledger; the answer to what
you see is written there, not in the cold steel.

> Silas Vale
That does not resolve into a safe action. Rephrase it with a clear object or
request.

Something is wrong with how the game decides that Mara should trust me. So the next step will be fixing this.

Writing an agentic conversational game - I

2026-03-12T00:00:00+00:00

The idea

My first computer was a Sinclair ZX81, but after only a few months, my father sold it and bought a ZX Spectrum. Thanks to the Spectrum, I learnt to code, but most of the time I spent with this computer, and this was a good part of my childhood, I was playing games.

And one of my favourite types of games was conversational games.

I liked the book more than the movies. And the game more than the book.

So why not try to recreate games like this one, but with LLMs to have actual conversations? How hard can it be? (Famous last words). Let’s find out!

The plan

My go-to model is GPT. At the moment of starting this project, GPT 5.4 was already announced but not yet released, so I used GPT 5.3 (web interface).

This is the initial prompt:

A reasonable request

And this is the answer:

Ha! Not that fast!

Ok then, let’s clarify things first:

Here you go!

And finally:

Finally!

The plan is quite comprehensive.

The review

But is it a good plan? Let’s ask Claude! I fed Claude Opus 4.6 the original prompt, the questions from ChatGPT, my answers, and the plan. This was Claude’s answer:

Not bad

I was interested in knowing more about the model recommendations:

My specs suck

I have an NVIDIA GeForce RTX 5060 Ti with an AMD Ryzen 9 7900 × 24 and 64 GB of RAM running on Ubuntu 24.04. This was his (its?) recommendation:

Outdated

A problem with conversations like this one is that the model finds a recent result (Qwen-3) but doesn’t bother to check if there is a better option that wasn’t available when the model was trained, as is the case with Qwen-3.5, which was announced in mid February. (To be fair, this happens less with ChatGPT). I have to ask Claude to double-check:

Are you sure?!

After that, he did a web search to get all the data:

Claude thinks he is talking to Trump

I find it interesting that he uses this sycophantic tone to recognize the mistake.

We are now ready to start the coding phase.

Reinforcement Learning II - Basic algorithms

2026-02-17T00:00:00+00:00

With the basic engine created, the next step is adding a couple of basic non-RL agents.

The heuristic agent

The first agent will follow a simple heuristic: make the most central move possible. For this, I wrote this prompt:

Add a new “heuristic” agent. The agent:

if it has a winning move, it plays it

if the opponent has one winning move, it blocks it

if a move gives the opponent a winning move, it avoids it

for the rest of the cases, it prioritices central moves, the more in the center the better

for ties, choose a random move

It implements it without a hitch. I test it against the random agent, playing the heuristic as the second player:

Running 1000 games: Random vs Heuristic

 100/1000 games completed
 200/1000 games completed
 300/1000 games completed
 400/1000 games completed
 500/1000 games completed
 600/1000 games completed
 700/1000 games completed
 800/1000 games completed
 900/1000 games completed
 1000/1000 games completed

========================================
Results
========================================

Random (Player 1):
 Wins:     13 (  1.3%)

Heuristic (Player 2):
 Wins:    987 ( 98.7%)

Draws:        0 (  0.0%)

Total:     1000 games
========================================

That simple rule makes all the difference, and all of a sudden, playing randomly is an almost sure defeat.

The minimax agent

I (vibe)coded the heuristic agent in January, but I didn’t continue the post until February, when the new Claude Opus 4.6 and GPT-5.3-Codex were simultaneously released. So the new code comes from these two models (mostly GPT).

My next prompt:

Add a minimax agent to use as baseline for future Q-Learning agents (like the heuristic agent)

Minimax is an algorithm that can be used in two-player games with perfect information. The basic idea is to build a tree with all possible moves. If I’m the first player (the maximizer), at the beginning of the game, the board is empty, and that is the root of the tree.

From there, we get seven edges, one for one of my possible moves, that end in nodes corresponding to the state of the board after that move (a piece at the bottom of the selected column). From each node grow seven edges with the possible moves of my opponent (the minimizer), and so on.

This calculation repeats until all edges get to a terminal state (win, lose, or draw), or we get to a predefined evaluation limit (we can build the whole tree because it is computationally too expensive)

Once we have the tree, we compute a score for each leaf (the terminal nodes), and we choose one of them: if the edges getting to the nodes correspond to moves for the adversary (minimizer nodes), we choose the lowest score, if to my moves (maximizer nodes), we choose the highest one. The reason is that we’ll compute the score so that the higher the score, the better for the maximizer player, and we assume the minizer we’ll choose the move we evaluate as the worst for us.

Example image tree. Source.

And how do we compute the score? We also use a heuristic for this. The one Codex chose (the idea seems logical, the detail may be suboptimal, or maybe not) is to add points based on how many pieces in a row you have, with an additional score based on how central your pieces are. This is how Claude explains the evaluation function:

Claude in Cursor

The application lets you choose the depth of the tree, four by default. It works, but it is a little slow. I imagine that I should be using all 24 cores for this, so I prompted Codex again:

Make the benchmark.py script multithreaded so it can run several games at the same time

When I tried the first implementation, the result was way worse than the single-thread version. So I tried again:

There is some problem with the multithreading. Now the execution takes way longer. Anaylize the code with detail and identify the issues.

It finally creates a multithreaded version that improves speed.

The Minimax agent beats the heuristic, although not for that much:

Select Player 1 (X):

 1. Random
 2. Heuristic
 3. Minimax

Enter choice: 2

Select Player 2 (O):

 1. Random
 2. Heuristic
 3. Minimax

Enter choice: 3
Select Player 2 (O) minimax depth [4]:

Number of games [1000]: 10000
Workers [24]:

Running 10000 games on 24 workers: Heuristic vs Minimax(d=4)

 4800/10000 games completed
 7400/10000 games completed
 10000/10000 games completed

========================================
Results
========================================

Heuristic (Player 1):
 Wins:   3904 ( 39.0%)

Minimax(d=4) (Player 2):
 Wins:   4818 ( 48.2%)

Draws:     1278 ( 12.8%)

Total:    10000 games
========================================

This version of the code is here.

Reinforcement Learning II - Basic algorithms

2026-02-17T00:00:00+00:00

With the basic engine created, the next step is adding a couple of basic non-RL agents.

The heuristic agent

The first agent will follow a simple heuristic: make the most central move possible. For this, I wrote this prompt:

Add a new “heuristic” agent. The agent:

if it has a winning move, it plays it

if the opponent has one winning move, it blocks it

if a move gives the opponent a winning move, it avoids it

for the rest of the cases, it prioritices central moves, the more in the center the better

for ties, choose a random move

It implements it without a hitch. I test it against the random agent, playing the heuristic as the second player:

Running 1000 games: Random vs Heuristic

 100/1000 games completed
 200/1000 games completed
 300/1000 games completed
 400/1000 games completed
 500/1000 games completed
 600/1000 games completed
 700/1000 games completed
 800/1000 games completed
 900/1000 games completed
 1000/1000 games completed

========================================
Results
========================================

Random (Player 1):
 Wins:     13 (  1.3%)

Heuristic (Player 2):
 Wins:    987 ( 98.7%)

Draws:        0 (  0.0%)

Total:     1000 games
========================================

That simple rule makes all the difference, and all of a sudden, playing randomly is an almost sure defeat.

The minimax agent

My next prompt:

Add a minimax agent to use as baseline for future Q-Learning agents (like the heuristic agent)

Example image tree. Source.

Claude in Cursor

The application lets you choose the depth of the tree, four by default. It works, but it is a little slow. I imagine that I should be using all 24 cores for this, so I prompted Codex again:

Make the benchmark.py script multithreaded so it can run several games at the same time

When I tried the first implementation, the result was way worse than the single-thread version. So I tried again:

There is some problem with the multithreading. Now the execution takes way longer. Anaylize the code with detail and identify the issues.

It finally creates a multithreaded version that improves speed.

The Minimax agent beats the heuristic, although not for that much:

Select Player 1 (X):

 1. Random
 2. Heuristic
 3. Minimax

Enter choice: 2

Select Player 2 (O):

 1. Random
 2. Heuristic
 3. Minimax

Enter choice: 3
Select Player 2 (O) minimax depth [4]:

Number of games [1000]: 10000
Workers [24]:

Running 10000 games on 24 workers: Heuristic vs Minimax(d=4)

 4800/10000 games completed
 7400/10000 games completed
 10000/10000 games completed

========================================
Results
========================================

Heuristic (Player 1):
 Wins:   3904 ( 39.0%)

Minimax(d=4) (Player 2):
 Wins:   4818 ( 48.2%)

Draws:     1278 ( 12.8%)

Total:    10000 games
========================================

This version of the code is here.

Running a multimodal embedding model for image classification

2026-02-07T00:00:00+00:00

I had the idea of creating an application to classify a big set of images at work. I wasn’t fully convinced that it was going to work because it required processing a couple of hundred thousand images, but it was worth a try.

The technical approach

I’m using an OpenCLIP model specifically trained in the domain I’m interested in. OpenCLIP models are dual-encoder architectures: one image encoder and one text encoder, with both encoders mapping their inputs into a shared embedding space. They are trained with contrastive loss. Given a batch of image–text pairs, the model produces normalized embeddings:

vᵢ = image_encoder(Iᵢ)   ∈ ℝᵈ
tᵢ = text_encoder(Tᵢ)    ∈ ℝᵈ
||vᵢ|| = ||tᵢ|| = 1

A similarity matrix is built over all image–text combinations in the batch:

Sᵢⱼ = vᵢ · tⱼ

where τ is a learned temperature, cross-entropy loss is applied in both directions (image→text and text→image) to maximize similarity for matched pairs and minimize it against all other in-batch pairs. After training, the model supports zero-shot image–text matching via embedding similarity.

This illustration from the OpenAI original paper makes a good representation of the process:

Original text: Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training examples. At test time, the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the target dataset’s classes.

The code for the paper is in this repo. There are lots of available models derived from this architecture, some of which can be found on HuggingFace.

The application

All the code for the application was written by either Claude Code or OpenAI Codex, using Gemini to double-check some things. I took the role of a requirements analyst (this was a thing in the past), and just looked at the code in critical sections (database access, credentials management).

For the first step, we scale the image to 224px in the smaller dimension and center-crop it to make the image square, as the model expects. After that, we use a producer-consumer pattern to retrieve images and process them in batches through the model. We store the resulting embeddings in a parquet file (good enough!) with the image id.

For the classification step, we need a list of prompts describing each category in different ways. I used three or four descriptions for each category, and even tried with negative descriptions, even though I don’t think they helped. I also added some neutral descriptions to act as a threshold.

For each set of descriptions, we calculate the normalized mean:

 embeddings = encode_texts(model, tokenizer, prompts, device)
 mean_emb = embeddings.mean(axis=0) # axis=0 calculates the mean of each feature based on the value of that feature for each of the embedding vectors
 mean_emb = mean_emb / np.linalg.norm(mean_emb)

The algorithm for the classification calculates the cosine similarity between image embeddings and description embeddings, and decides based on a score. Something like this:

for img_emb in image_embeddings:
  img_categories: list[tuple[str, float]] = []
  neutral_sim = float(img_emb @ neutral_embedding)

  for title, (pos_emb, neg_emb) in category_embeddings.items():
    pos_sim = float(img_emb @ pos_emb)
    neg_sim = float(img_emb @ neg_emb)
    score = pos_sim - neg_sim
    threshold = category_thresholds.get(title, score_threshold)

    # Match if score exceeds threshold and positive beats neutral by margin.
    if score >= threshold and pos_sim >= neutral_sim + neutral_margin:
      img_categories.append((title, score))

There is a lot to improve in this exercise, but the overall results were positive. I will probably try one or two different approaches to solve this same problem, but at least I know that there is one thing that mostly works.

Reinforcement Learning I - Connect 4

2026-01-25T00:00:00+00:00

I’m going to vibe code a Connect 4 game. Not because I specially like Connect-4, but because Connect-4 looks like a good game to learn Reinforcement Learning: the world is the board, the agent is the player, and the actions are the plays. The final reward is, of course, winning the game. It isn’t trivial like tic-tac-toe, but it is orders of magnitude smaller than chess or go.

I will try not to touch a single line of code or documentation. Everything will be created by Claude Code or OpenAI Codex (depending on my mood), and committed as is to this public repository. I’m reusing an old repo from a previous non-vibe coding attempt. I removed all the code, but not the previous commits.

Vibe coding

After cloning the repo and launching Claude Code, I gave it the following prompt (typos included; most prompts contain typos and other errors because English is not my native language, and I didn’t want to spend time double-checking or using Grammarly, as I did for this text):

Write a game engine for a connect-4 game. The engine should contain only the logic for setting moves, checking results, tracking who should play, but no graphical representation and no algorithm to decide the best move. Only an interface to check status, heck current player, and play a move. The engine will be used to train different Reinforcement Learning models in the future.

The result looks good, but all the code is in just one file, and is full of comments. My preference is having multiple small files with specialized code, and relying on naming for documentation, using comments only for complex logic. So my next prompt is:

Refactor the code with the following goals:

Small files or classes, with very limited scopes: for example, game and board should be different classes. Try to make classes and modules as small in scope as possible, while still being useful.

Remove comments unless there is complex logic to explain. With limited scope, method and variable names should be self-explaining.

The result is in this commit. I think the code is ok. It includes things like this:

    def get_state_flat(self) -> np.ndarray:
        return self._board.get_grid().flatten()

This code is unnecessary now, but it will be useful to train the models.

Next step is updating the documentation. I’m instructing Claude to use an AGENTS.md instead of its default CLAUDE.md, to make it more model-agnostic (no prompt without its typo):

Update the documentation. README.md with a brier summary of the app and instructions to test and run. AGENTS.md with full information for LLM agents.

Not bad

Ok, now we need someone to play against. The baseline model will be one player making random moves. Any future model should perform better than random.

Add the first algorithm, a model that just plays a random move. The random algorithm will just choose one of the legals move at random.

Code looks good, but we need a way to test it. I want to play against any model, so we add code to do so:

Now add a new module to play a game. The program will show the available agents to play with, and the player will select one, as well as the option to be player one or two. The game will be shown in the console, where the player will press the keys 1 to 7 to select the move.

It works!, but it is slightly confusing because columns are named 0-6, and you don’t know which pieces are yours. So let’s fix it:

Modify the board to show columns as 1-7 instead of 0-6. Also, show a legend with the symbol for the player (X or O)

Done! Now it looks good:

Exciting game!

The final step is creating a way to test the models by playing multiple games:

create a module to test models. The module will run two models against each other, one with white and the other with black, or even the same model against itself. The module will ask for the models to play and the number of games (1000 by default), and will show the final results for wins, loses, and draws both in number and percentage

Nicely done!

The first player has a slight advantage in the game by playing first, and that shows in the percentages.

This is the final version for today. In total, I spent around half an hour on this code, and all the prompts I used are the ones in this post.

Training the models should be more challenging. We’ll see.

Rough Notes

Testing Qwen3-Coder-Next - II

Running the models

The test

The solutions

Qwen3-Coder-Next

Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-v2

The evaluations

Summary

Testing Qwen3-Coder-Next

Using llmfit

Downloading the model

Testing the model

Using Telegram to control Claude

Testing channels with Telegram

Old man tries things, Nano Banana edition

** Update **

Original text

Writing an agentic conversational game - II

The implementation

Writing an agentic conversational game - I

The idea

The plan

The review

Reinforcement Learning II - Basic algorithms

The heuristic agent

The minimax agent

Reinforcement Learning II - Basic algorithms

The heuristic agent

The minimax agent

Running a multimodal embedding model for image classification

The technical approach

The application

Reinforcement Learning I - Connect 4

Vibe coding

Using `llmfit`

Update