Some Creativity

Stop Building State Machines for Your AI Agents (use Durable Functions instead)

Sid — Wed, 25 Feb 2026 06:31:15 +0000

I built a sample that I think captures something important: AI agents that interact with the real world need workflows that pause, and Durable Functions make this much easier than current alternatives.

The Problem

Say you’re building a support agent. A customer asks for a refund. The agent can look up the order, check the return policy, and decide a refund is warranted — but it can’t just issue the refund. A human needs to approve it.

User in Teams:

Supervisor Dashboard:

So now you need to:

Save the pending request somewhere
Pause the workflow
Wait for a supervisor to approve or reject (could be hours or days)
Resume exactly where you left off
Process the refund and notify the customer

The typical approach? A state machine. You model every state (pending_approval, approved, processing, completed), every transition, and wire up polling or webhooks to detect when things change. You write a bunch of glue code to serialize context, handle edge cases, and coordinate between services.

It works. It’s also tedious, error-prone, and obscures what’s actually a simple workflow.

The Durable Functions Approach

Let’s start with the diagram. A customer asks the bot for a refund. The bot uses AI to look up the order, creates a case, and starts a Durable Functions orchestration that pauses until a supervisor approves or rejects it. Once approved, the orchestrator processes the refund and notifies the customer, all without polling or a state machine.

Here’s the entire approval workflow in my sample:

			
import { OrchestrationHandler } from 'durable-functions';

export const supportCaseOrchestrator: OrchestrationHandler = function* (context) {
  const { caseId, action } = context.df.getInput();
  // Mark as pending
  yield context.df.callActivity('updateCase', { caseId, status: 'pending_approval' });
  // Wait for a human: costs nothing while paused
  const approvalTask = context.df.waitForExternalEvent('Approval');
  // Compute the deadline from the replay-safe clock, never from Date.now()
  const sevenDaysFromNow = new Date(context.df.currentUtcDateTime.getTime() + 7 * 24 * 60 * 60 * 1000);
  const timeoutTask = context.df.createTimer(sevenDaysFromNow);
  const winner = yield context.df.Task.any([approvalTask, timeoutTask]);
  // Cancel the timer if the approval won, so the instance can finish right away
  if (!timeoutTask.isCompleted) {
    timeoutTask.cancel();
  }
  if (winner === approvalTask && approvalTask.result.approved) {
    yield context.df.callActivity('updateCase', { caseId, status: 'approved' });
    if (action === 'refund') {
      yield context.df.callActivity('issueRefund', { caseId });
    }
    yield context.df.callActivity('notifyBot', { caseId, message: 'Approved!' });
  } else {
    // Reached on an explicit rejection or when the seven-day timer fires first
    yield context.df.callActivity('updateCase', { caseId, status: 'rejected' });
    yield context.df.callActivity('notifyBot', { caseId, message: 'Rejected.' });
  }
};

		

That’s it. Read it top to bottom: it’s just the workflow. No state machine. No polling. No webhook plumbing. The orchestrator pauses at the yield that waits on waitForExternalEvent, the framework checkpoints its history to storage, and the function stops executing entirely.

When a supervisor clicks “Approve” in the dashboard, the dashboard calls the Durable Functions HTTP API with:

raiseEvent('Approval', { approved: true })

passing the case ID. The framework matches this to the paused orchestration instance, replays its checkpointed history to rebuild local state, and resumes execution from the exact yield where it was waiting. The orchestrator then runs the remaining steps (update the case, process the refund, notify the customer) as if no time had passed.
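To make the approve click concrete, here is a minimal sketch of what the dashboard side could look like when it talks to the Durable Functions HTTP API directly. It is not the sample's actual code: the helper names (`buildRaiseEventUrl`, `sendApproval`), the base URL, and the assumption that the orchestration instance ID is the case ID are all mine.

```typescript
// Sketch only: raise the 'Approval' event on a paused orchestration over HTTP.
// baseUrl, taskHub and systemKey are placeholders for your own function app.
interface RaiseEventTarget {
  baseUrl: string;    // e.g. http://localhost:7071 when running locally
  instanceId: string; // assumed here to be the case ID
  eventName: string;  // must match the name passed to waitForExternalEvent
  taskHub?: string;
  systemKey?: string; // sent as the "code" query parameter outside local runs
}

function buildRaiseEventUrl(t: RaiseEventTarget): string {
  const path =
    `/runtime/webhooks/durabletask/instances/${encodeURIComponent(t.instanceId)}` +
    `/raiseEvent/${encodeURIComponent(t.eventName)}`;
  const query = new URLSearchParams();
  if (t.taskHub) query.set('taskHub', t.taskHub);
  if (t.systemKey) query.set('code', t.systemKey);
  const qs = query.toString();
  return t.baseUrl.replace(/\/+$/, '') + path + (qs ? `?${qs}` : '');
}

async function sendApproval(t: RaiseEventTarget, approved: boolean): Promise<void> {
  // The JSON body becomes the event payload the orchestrator reads from approvalTask.result
  const res = await fetch(buildRaiseEventUrl(t), {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ approved }),
  });
  if (!res.ok) throw new Error(`raiseEvent failed with HTTP ${res.status}`);
}
```

A reject button would call the same helper with `approved` set to false, which sends the orchestrator down its else branch.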

Key: waitForExternalEvent costs nothing while waiting. No process running. No timer ticking. No compute billed. Each customer’s case gets its own orchestration instance, waiting independently.

Why This Matters for AI Agents

As we build agents that do more than just answer questions, agents that take actions, trigger workflows, and interact with external systems, we’re going to hit this pattern constantly:

Refund approvals: agent submits, human approves
Deployment requests: agent prepares a change, human confirms
Escalations: agent triages, human takes over
Multi-step processes: agent starts, waits for external data, continues

Every one of these is a “pause and wait” problem. You could solve each one with a state machine, a database, and some glue code. Or you could write the workflow as a straight-line function and let the infrastructure handle the rest.

What About the Alternatives?

Approach	How it works	Why it hurts
Polling loop	Bot checks a “pending” flag in a database every N seconds	Wastes compute. 1,000 pending cases = 1,000 polling loops. Latency depends on poll interval.
Queue + worker	Bot writes to a queue; worker picks up after approval	You build the state machine yourself: track which step each case is on, handle retries, deal with poison messages. “Wait for approval” doesn’t map naturally to a queue.
Webhook callback	Bot registers a callback URL; approval service calls it	Bot must be running when the callback arrives hours later. If it restarts, the callback URL may be stale. No built-in retry or state tracking.
Database + cron	Store pending cases in DB, cron job checks for approved ones	Same polling problem. Cron frequency = latency floor. State machine lives in application code. Error handling is manual.
Durable Functions	`waitForExternalEvent` pauses at zero cost; `raiseEvent` resumes instantly	Requires Azure Functions runtime. But: no polling, no state machine code, built-in retry, scales to thousands of concurrent cases.

Durable Functions win here because:

Zero-cost waiting: a case pending for 3 days uses no compute until approved
No state machine: the orchestrator reads like a sequential function, but the framework handles checkpointing, replay, and fault tolerance
Parallel independence: Alice’s refund and Bob’s escalation are separate instances; approving one doesn’t affect the other

The Full Sample

The durable-support-agent sample has three pieces:

A Teams bot that uses GPT-4o with tool calling to handle customer support — order lookups, knowledge base search, refund requests, escalations
Azure Durable Functions that orchestrate the approval workflow with zero-cost pausing
A Next.js dashboard where supervisors approve or reject pending cases

The whole thing runs locally. The bot creates cases, the orchestrator pauses, the dashboard lets you approve, and the customer gets notified, all coordinated through a workflow you can read in 30 lines.

If you’re building agents that need human-in-the-loop workflows, give Durable Functions a look.

Learn More

Azure Durable Functions overview — what they are and how they work
Human interaction pattern — the exact pattern used in this sample (waitForExternalEvent + raiseEvent)
Durable Functions for JavaScript/TypeScript — quickstart for the Node.js SDK
Orchestrator function constraints — rules for deterministic replay (important to understand before writing orchestrators)
Timers in Durable Functions — how createTimer works for timeouts and deadlines
durable-support-agent sample — the full source code for this post

Giving OpenClaw Its Own Identity, And a Sandbox to Run In

Sid — Mon, 16 Feb 2026 08:35:40 +0000

OpenClaw is an open-source AI agent framework that gives LLMs real tools and autonomy. It already has a built-in Teams channel, but it works as a traditional bot using delegated auth, meaning the agent acts with your permissions.

OpenClaw A365 takes a different approach. Instead of a bot wearing your credentials, it gives the agent its own identity in your Microsoft 365 tenant, sandboxes its runtime, and makes every action observable to IT – all while extending its reach beyond Teams to Outlook, Word, Excel, and PowerPoint.

Two things I kept coming back to while investigating OpenClaw:

1. Agents need their own identity, not yours.

Traditional bot frameworks use delegated auth — the agent acts as you, with access to everything you can see. That’s terrifying when the agent can reason and take actions autonomously, especially as they get more capable.

With A365’s agentic-identity model, the agent gets its own Entra ID account (e.g. agent@contoso.com). You share a resource, like a calendar, with it like you would a colleague. It only sees what you’ve explicitly granted.

(See demo video)

Audit logs and the Observability stack show the agent acted, not you via some app. This is how trust should work.

2. If an agent can run code, you need to control what it can reach.

OpenClaw agents can generate and execute code, including network requests. OpenClaw A365 enforces network policy at the container level via iptables. You choose: unrestricted, locked down to Microsoft + your LLM provider, or a custom allowlist. The agent cannot call a domain you haven’t approved.
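To illustrate what the three modes mean, here is a small sketch of the allow/deny decision. The real enforcement happens in iptables at the container level, not in application code, so treat this purely as a model of the policy semantics. The type names and the domains in the locked-down list are hypothetical.

```typescript
// Sketch only: models the three policy modes, not OpenClaw A365's actual code.
type NetworkPolicy =
  | { mode: 'unrestricted' }
  | { mode: 'locked-down' }
  | { mode: 'custom'; allowlist: string[] };

// Hypothetical stand-ins for "Microsoft + your LLM provider".
const LOCKED_DOWN_DOMAINS = ['microsoft.com', 'api.openai.com'];

// A host matches a domain if it is that domain or one of its subdomains.
function matchesDomain(host: string, domain: string): boolean {
  const h = host.toLowerCase();
  const d = domain.toLowerCase();
  return h === d || h.endsWith('.' + d);
}

function isHostAllowed(host: string, policy: NetworkPolicy): boolean {
  switch (policy.mode) {
    case 'unrestricted':
      return true;
    case 'locked-down':
      return LOCKED_DOWN_DOMAINS.some((d) => matchesDomain(host, d));
    case 'custom':
      return policy.allowlist.some((d) => matchesDomain(host, d));
  }
}
```

Matching on a leading dot keeps lookalike hosts such as `evilmicrosoft.com` from slipping through a suffix check.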

Combining a real identity with least-privilege access and a sandboxed runtime gets us closer to highly autonomous agents that are still observable, governable, and safe to deploy in the enterprise.

Why this matters

Agent 365 was released in preview to Frontier customers last November at Ignite. It was a super-intense push for me, my team, and many others across the company. Back then, we didn’t know that an agent framework like OpenClaw would arrive and make it obvious to everyone why agents need their own identities, sandboxed runtimes, and observability.

The fact that the platform was already there waiting speaks to the foresight Microsoft had. Hope to see Google and other identity providers follow suit.

Links

Demo video: https://youtu.be/7uD2vyfBUUs

GitHub: https://github.com/SidU/openclaw-a365

GoodDocs

Sid — Sat, 27 Dec 2025 09:54:15 +0000

Many of the docs we write exist to help teams make better decisions by writing down the thinking and reviewing it with others. With AI, it is easier than ever to generate a doc from a few words of a prompt, but when a draft looks “done” too quickly, important context and key aspects can get skipped. That is why teams often adopt doc templates: they force the right questions to show up every time.

There are real benefits to standardized doc formats when you work with many people. A consistent template reminds you of the things you missed before and trains the team to avoid repeating past mistakes. It keeps everyone aligned on what needs to be answered, makes reviews dramatically faster, and helps new teammates find what they need without decoding each author’s personal style.

The downside is toil: filling in every section takes time, which is exactly when people reach for AI and generate a draft from a few keywords. That is useful, but it can also skip critical thinking. The challenge is letting people use AI for speed while still ensuring the important parts are covered. That is where GoodDocs comes in.

GoodDocs solves that by making documentation easy to write and easy to trust, even when AI helps produce the first draft. It encourages using AI as a thought-partner and research-partner, with an additional review layer that checks for missing reasoning, while still reducing toil so doc creators can focus on shipping real improvements for customers and business impact.

We already have all the pieces: GitHub for storage and version control, GitHub Actions on PRs to run validation automatically, Codex/Claude-Code/GitHub Copilot CLI as the orchestration and review layer, and VS Code/Cursor as the editor. GoodDocs brings those parts together into a single, lightweight system for structured docs.

How to use it

Set up by:
- Creating a repo using GoodDocs as a template.
- Cloning your repo locally.
Initialize the repo defaults with make init. This is a one-time step.
Run Codex/Claude-Code/GitHub Copilot CLI in your repo in terminal.
Create a new doc with make new-doc, then draft it using the $doc-author skill.
Edit your doc using your favorite editor, filling out all the sections.
Share your doc by opening a PR from your branch.
Validation runs automatically, and optional LLM review can run when enabled (ensure OPENAI_API_KEY is set in https://github.com/your_account/repo/settings/secrets/actions)

Example

This repository includes a complete example document at docs/example/0001-example.md. It follows the template, passes validation, and shows the expected level of detail across sections like Motivation, Proposed Solution, and Alternatives & Open Questions.

You can view a sample PR where the LLM left template-based review comments here.

How to customize it

(You want to do this to get the real value out of this)

You can tune GoodDocs to match your org. In your repo that you created from GoodDocs as a template:

Update templates/doc-template.md to change the doc format, and edit schema/doc_rules.json to adjust validation rules, required sections, or quality heuristics. If you need multiple doc types, add new templates and doc type entries so each format has its own rules and folder.
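To give a feel for what a required-sections rule does, here is a rough sketch of a validator. The rule shape below is hypothetical and much simpler than whatever schema/doc_rules.json really contains, and the function names are mine, so read it as an illustration of the idea rather than GoodDocs' implementation.

```typescript
// Sketch only: a hypothetical rule shape, not the real doc_rules.json schema.
interface DocRules {
  requiredSections: string[];
  minWordsPerSection: number;
}

interface ValidationIssue {
  section: string;
  problem: 'missing' | 'too-short';
}

// Split a markdown doc into heading -> body text, using "## " headings.
function splitSections(markdown: string): Map<string, string> {
  const sections = new Map<string, string>();
  let current: string | null = null;
  for (const line of markdown.split('\n')) {
    const heading = /^##\s+(.*)$/.exec(line);
    if (heading) {
      current = (heading[1] ?? '').trim();
      sections.set(current, '');
    } else if (current !== null) {
      sections.set(current, (sections.get(current) ?? '') + line + '\n');
    }
  }
  return sections;
}

function validateDoc(markdown: string, rules: DocRules): ValidationIssue[] {
  const sections = splitSections(markdown);
  const issues: ValidationIssue[] = [];
  for (const name of rules.requiredSections) {
    const body = sections.get(name);
    if (body === undefined) {
      issues.push({ section: name, problem: 'missing' });
      continue;
    }
    const words = body.split(/\s+/).filter((w) => w.length > 0).length;
    if (words < rules.minWordsPerSection) {
      issues.push({ section: name, problem: 'too-short' });
    }
  }
  return issues;
}
```

A check like this catches the "looks done too quickly" failure mode cheaply; the LLM review layer can then focus on whether the reasoning inside each section holds up.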

Common customization examples and why they help:

PRDs to capture customer context, success metrics, and rollout plans in a consistent way.
Dev design docs / RFCs to force clarity on trade-offs, API contracts, and migration plans before code is written.
Decisions (ADR-style) to keep a durable record of why a choice was made and what alternatives were considered.
Operations / incident playbooks to standardize escalation, post-mortem learnings, runbooks, and recovery steps.
Compliance or security reviews to ensure required checks are documented and auditable.

Controlling AI Agent Participation in Group Conversations (Koala)

Sid — Sat, 22 Nov 2025 16:01:24 +0000

Last Friday, we had the opportunity to hear from Justin Weisz, Stephanie Houde, Steven Ross, and the IBM team about their research on controlling AI agent participation in group conversations. They ran a set of studies with a Slack bot called Koala to understand how an agent should behave in live multiparty brainstorming sessions. Read on for what they found. Their results are important for how we think about designing agents in collaborative spaces like Teams.

Koala

They built Koala, an LLM-based conversational agent prototype that runs as a Slack bot.

They ran two studies with Koala to measure its impact on brainstorming, using the findings from Study 1 to refine and evolve the agent for Study 2.

Study Setup

Same groups tested across:
1. No AI
2. Koala Reactive (responds only when addressed via @mention)
3. Koala Proactive (decides when to speak)
Tasks: 3-min brainstorming -> pick top 3 ideas.

High-level Findings

Everyone preferred having Koala over no AI
- Shows everyone appreciated having an agent while brainstorming
Strong preference for Reactive over Proactive in v1.
Koala contributed 73% of all ideas; 33% of top ideas.
(Takeaway: AI boosted volume and quality.)

Advantages (from Study 1)

Removes “white page” problem; helps groups start.
Speeds up brainstorming.
Adds structure; pseudo-moderator.
Summaries keep the group on track.
Validates user ideas.
Fills knowledge gaps.
Visible human-AI collaboration sparks more ideas.

Disadvantages (from Study 1)

Proactive mode = distracting, intrusive, overwhelming.
- Too long, too frequent, wrong timing.
- “Dominated the conversation.”
Stifling effect (“boxed myself in,” production blocking).
Inaccurate / hallucinated summaries.

What Participants Wanted

Control over when, how often, and how much Koala contributes.
Ability to steer behavior mid-conversation.
Combine reactive + selective proactive behaviors.
Agent should wait when humans are actively typing.
Option to ask permission before interjecting (“Want me to share top 3?”).

Koala II Improvements

Model upgrade to Llama 3 led to fewer hallucinations, longer context.
Prompt updates: more targeted suggestions, less domination.
Tunable “value threshold” for proactivity.
UI control panel:
- Reactive vs proactive toggle.
- Proactive contribution threshold (High / Medium / Low).
- Where messages appear: in-channel vs thread.
- Long-message truncation.

In short: let users choose how Koala should interact, let them steer how it responds via a message mid-conversation, and offer pre-built personas to pick from.
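Here is a minimal sketch of how the mode toggle and the contribution threshold could gate a single turn. It is my own illustration, not Koala II's code: the numeric cutoffs, the `valueScore` input, and all the names are assumptions.

```typescript
// Sketch only: hypothetical gating logic inspired by the Koala II controls.
type Mode = 'reactive' | 'proactive';
type Threshold = 'high' | 'medium' | 'low';

interface ParticipationSettings {
  mode: Mode;
  threshold: Threshold;
}

interface TurnContext {
  addressedToAgent: boolean; // the agent was mentioned directly
  humansTyping: boolean;     // someone is actively typing right now
  valueScore: number;        // 0..1 estimate of how valuable the contribution is
}

// Hypothetical cutoffs; a higher threshold means the agent speaks less often.
const CUTOFF: Record<Threshold, number> = { high: 0.8, medium: 0.5, low: 0.2 };

function shouldContribute(s: ParticipationSettings, t: TurnContext): boolean {
  if (t.addressedToAgent) return true;     // always answer a direct mention
  if (s.mode === 'reactive') return false; // reactive mode never volunteers
  if (t.humansTyping) return false;        // wait while humans are mid-thought
  return t.valueScore >= CUTOFF[s.threshold];
}
```

Keeping this decision in plain code outside the model lines up with the paper's point that external logic is needed because LLM self-regulation is unreliable.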

Study 2: Results

Koala II perceived as quieter, better paced, more on-topic.
Felt more natural and less interruptive.
Big reversal:
- No group switched from Proactive to Reactive.
- When tuned, people preferred the improved Proactive version.
Threaded replies were a failed expectation (this surprised me initially, but makes sense):
- People thought it would reduce noise, but it worsened collaboration.
Tone complaints: Koala II occasionally too human (“That’s a great idea!”).

Three groups tried the option of having Koala II respond in thread rather than in channel, thinking it would reduce their distraction from Koala II. Surprisingly, it had the opposite effect. P1.1 explained how it took time to “look through everyone’s threads… taking away from our collaboration.” Many other participants made similar comments, suggesting that threaded replies may not be suited to the real-time nature of a brainstorming task.

User Control Insights (Study 2)

Controls rated highly useful (avg 4.46/5).
People want to change settings dynamically during the session.
Different tasks → different proactivity levels.
Natural-language steering is attractive but risky (misinterpretation, pollutes conversation).
Roles and personas were preferred as high-level modes, but users still want low-level knobs.

Social + Governance Findings

Adjusting AI settings inside a group is socially sensitive:
- Users felt “intrusive” making unilateral changes.
- But small teams were more accepting.
Possible needs:
- Admin roles
- Voting on behavioral changes
- Visibility of changes

Taxonomy of Control (Paper’s Main Contribution)

When the agent contributes
- Triggers (all messages, direct address, silence, bursts of activity).
- Filters (value threshold, relevance).
- Rate (delay, pacing, matching human cadence).
What the agent contributes
- Content type (conservative vs wild ideas).
- Style (tone, length, enthusiasm, formatting).
- Modality (text, emojis, images, etc.).
Where the agent contributes
- In channel vs thread.
- Future: other UI surfaces depending on context.
How behaviors are specified
- UI controls.
- Natural language steering.
- High-level roles.
- Personas.
- Granularity control (coarse vs fine).
Who can change the settings
- Permissions, visibility rules, group norms.
Implementation
- Prompt engineering.
- External logic (needed because LLM self-regulation is unreliable).
- Real-time control mechanisms, not static presets.

Key Design Insight

Proactivity is not binary. It is multi-dimensional and must be dynamically adjustable by the group.
No single “best” setting; ideal behavior depends on:
- group preferences
- moment-to-moment context
- stage of collaboration

Going forward, what to explore next:

Personalized AI behavior in collaborative settings.
Context-aware proactivity (detect active human exchange, detect pauses).
Allow different groups/situations to choose different behavior patterns.
The right approach: a configurable system, not a fixed algorithm.

Inner Thoughts – Notes

Sid — Sat, 15 Nov 2025 11:32:15 +0000

We had the opportunity to host Bruce Liu, one of the authors of the Inner Thoughts paper, in our team’s AI learning session today. Sharing my key takeaways.

Key Takeaways

Giving an agent a persona and having it run a continuous internal monologue leads to more natural participation in group conversations.
The system generates multiple candidate thoughts, evaluates them on:
- relevance
- information gap
- impact
- appropriateness
  … and only expresses a thought if motivation passes a threshold.
This makes the agent selective, not reactive. It avoids over-speaking and feels socially aware.
The authors fine-tuned GPT-3.5 on the MPC (Multiparty Chat Corpus) dataset to predict the next speaker, and prompt the model to generate a response based on its persona if it is selected by the prediction. They compared the Inner Thoughts approach against this baseline.
- Dataset: https://github.com/sashank06/MPC-Corpus
- Paper: http://www.lrec-conf.org/proceedings/lrec2010/pdf/85_Paper.pdf
The overall loop is:
- Trigger – Initiating the thought process (when someone posts a message, or when a silence threshold is reached in this paper)
- Retrieval – Accessing relevant memories and context
- Thought Formation – Generating potential thoughts
- Evaluation – Assessing intrinsic motivation to express thoughts
- Participation – Deciding when and how to engage in conversation
The important idea: Not every thought should be spoken.
Another interesting idea was that they used different prompts to simulate System-1 vs System-2 thinking (thinking fast-and-slow) to generate thoughts.
- They use a simple developer-set probability to choose between fast System-1 thoughts and slower System-2 reasoning, but this idea opens the door to far more sophisticated, context-aware switching.
The agent behaves more like a participant in the conversation, not a tool that gets invoked when @ mentioned.
The code is clean and packaged well: https://github.com/xybruceliu/thoughtful-agents
- It should be very tractable to use inside a Teams SDK agent for Python.
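To make the evaluate-then-maybe-speak step concrete, here is a small sketch. It does not use the thoughtful-agents API: the names are hypothetical, and averaging the four scores into one motivation value is my simplification, not necessarily how the paper combines them.

```typescript
// Sketch only: hypothetical types, not the thoughtful-agents package API.
interface ThoughtScores {
  relevance: number;       // each score is assumed to be in 0..1
  informationGap: number;
  impact: number;
  appropriateness: number;
}

interface Thought {
  text: string;
  scores: ThoughtScores;
}

// Assumption: motivation is the plain average of the four criteria.
function motivation(s: ThoughtScores): number {
  return (s.relevance + s.informationGap + s.impact + s.appropriateness) / 4;
}

// Return the most motivated thought, or null when nothing clears the bar.
function selectThought(candidates: Thought[], threshold: number): Thought | null {
  let best: Thought | null = null;
  for (const c of candidates) {
    if (best === null || motivation(c.scores) > motivation(best.scores)) best = c;
  }
  return best !== null && motivation(best.scores) >= threshold ? best : null;
}

// Developer-set probability picks between fast and slow thought generation.
function pickThinkingMode(
  system2Probability: number,
  random: () => number = Math.random,
): 'system1' | 'system2' {
  return random() < system2Probability ? 'system2' : 'system1';
}
```

Returning null is the important path: it is the code-level version of "not every thought should be spoken."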

MUCA – Notes

Sid — Sat, 08 Nov 2025 01:22:52 +0000

In today’s AI Learning session, we had the opportunity to meet Manqing Mao and Jianzhe Lin who co-authored MUCA. Capturing my notes here. There are several interesting ideas in the paper that are applicable to multi-human <-> agent collaboration.

Multi-User Chat Assistant (MUCA): Framework for LLM-Mediated Group Conversations

MUCA targets multi-user, single-agent interactions — a challenging setting where a chatbot must reason not only about what to say but also when and to whom. The system operationalizes these through the 3W design dimensions:

What – selecting relevant content that advances the discussion or resolves conflicts.
When – determining optimal response timing to balance engagement without interruption.
Who – identifying the intended recipient(s) of the response within a group context.

Together, these govern a chatbot’s role as a supportive and context-aware participant in group discussions, rather than a turn-taking speaker responding to each message individually.

Core Modules

Sub-topic Generator
Initializes structured sub-topics from the conversation goal, agenda, or hints, enabling MUCA to guide discussions along coherent and logically connected threads rather than reacting opportunistically to each message.
Dialog Analyzer
Continuously interprets conversation state through several sub-modules:
- Sub-topic Status Update – tracks whether topics are not discussed, being discussed, or well-discussed, providing situational awareness.
- Utterance Feature Extractor – identifies which sub-topics are active within the current window, crucial for managing multi-threaded discussions.
- Accumulative Summary Update – maintains rolling summaries per participant to preserve long-term conversational context efficiently.
- Participant Feature Extractor – quantifies engagement (frequency, length, and focus of contributions) to detect lurkers or dominant speakers and inform adaptive participation strategies.
Utterance Strategies Arbitrator
Selects one of seven dialog acts, ranked by heuristic confidence and contextual triggers, to determine MUCA’s next move. Each act has trigger conditions, warm-up, and cool-down turns to manage pacing:
- Direct Chatting: Respond immediately when pinged directly.
- Initiative Summarization: Periodically generate concise summaries to improve shared understanding.
- Participation Encouragement: Invite quieter participants to contribute using gentle, personalized prompts.
- Sub-topic Transition: Detect when a topic is exhausted or stale and guide the group to a new one.
- Conflict Resolution: Summarize opposing views and propose synthesis or consensus paths.
- In-context Chime-in: Contribute timely insights or clarifications when conversation flow stalls or questions remain unanswered.
- Keep Silence: Default behavior to avoid over-participation when no act is warranted, preserving conversational balance.
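The arbitration step can be sketched roughly as follows. This is my own reading of the description above, not MUCA's code: the rule shape, the rank values, and the way warm-up and cool-down are counted in turns are all assumptions.

```typescript
// Sketch only: a hypothetical arbitrator over MUCA's seven dialog acts.
type DialogAct =
  | 'direct-chatting'
  | 'initiative-summarization'
  | 'participation-encouragement'
  | 'sub-topic-transition'
  | 'conflict-resolution'
  | 'in-context-chime-in'
  | 'keep-silence';

interface ConversationState {
  turn: number;
  pingedDirectly: boolean;
  unsummarizedMessages: number;
}

interface ActRule {
  act: DialogAct;
  rank: number;          // lower number wins (assumed ordering)
  warmUpTurns: number;   // act is unavailable before this turn
  coolDownTurns: number; // minimum turns to wait after the act was last used
  triggered: (state: ConversationState) => boolean;
}

function chooseAct(
  rules: ActRule[],
  state: ConversationState,
  lastUsed: Map<DialogAct, number>,
): DialogAct {
  const eligible = rules
    .filter((r) => state.turn >= r.warmUpTurns)
    .filter((r) => {
      const last = lastUsed.get(r.act);
      return last === undefined || state.turn - last > r.coolDownTurns;
    })
    .filter((r) => r.triggered(state))
    .sort((a, b) => a.rank - b.rank);
  const first = eligible[0];
  const chosen: DialogAct = first ? first.act : 'keep-silence'; // silence is the default
  if (chosen !== 'keep-silence') lastUsed.set(chosen, state.turn);
  return chosen;
}
```

Warm-up and cool-down are what keep an act like summarization from firing on every turn once its trigger condition holds.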

Design Challenges Addressed

Stuck Conversation Advancement: Detects stagnation and injects contextually appropriate insights to re-ignite progress.
Multi-threaded Discussion Management: Tracks overlapping topics and participant clusters to sustain coherence in complex group exchanges.
Responsiveness Requirement: Maintains timely yet non-intrusive responses despite asynchronous, high-traffic chat environments.
Participation Evenness: Uses data-driven engagement metrics to encourage balanced contributions across users.
Conflict Resolution: Applies summarization and consensus-seeking acts to mediate disputes or align diverging viewpoints constructively.

Key Contribution

MUCA provides the first structured framework enabling LLMs to function as facilitators in group settings. By uniting the 3W dimensions, a modular analysis pipeline, and dialog-act arbitration, it transforms large language models from reactive responders into proactive conversation participants capable of maintaining context, inclusivity, and flow in multi-participant discussions.

Paper: https://arxiv.org/pdf/2401.04883v1

Embeddings & Similarity Metrics

Sid — Sat, 27 Sep 2025 22:08:26 +0000

When asked what embedding model and similarity metric they’ve used, most people answer something like: “OpenAI embeddings with cosine similarity.”

That’s a perfectly valid answer. But it leads to deeper questions:

What if you’re working with an open-source embedding model like BERT-base or MiniLM-base? Can you still use cosine similarity?
What if you come across code that’s using Euclidean distance with OpenAI embeddings — is that wrong?
Are there scenarios where Euclidean distance is actually better?
Do recommendation systems have different considerations than RAG systems?

These were some of the questions we dug into in our team learning session last Friday. Let’s walk through the key takeaways.

First: the difference between Euclidean distance and cosine similarity

At a glance both compare vectors, but they focus on different things:

Euclidean distance: compares the endpoints of the vectors. It’s the straight-line distance between two points.
Cosine similarity: compares the directions. It measures the angle between vectors, ignoring how long they are.

Euclidean distance

For simplicity’s sake, let’s take two vectors $\mathbf{a}$ and $\mathbf{b}$ drawn from the origin. The Euclidean distance between them is just the straight-line distance between their endpoints (the tips of the arrows). If you put a ruler between the tips, that’s the number you’d get.

Mathematically:

$$d(\mathbf{a}, \mathbf{b}) = \lVert \mathbf{a} - \mathbf{b} \rVert = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

This makes it clear why length matters here: even if two vectors point in almost the same direction, if one is much longer, the distance between their endpoints will still be large.

Cosine similarity

While Euclidean distance looks at the endpoints of vectors, cosine similarity only looks at their direction. Imagine projecting every vector onto the unit circle: cosine measures how close those directions are, regardless of how long the arrows are.

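Mathematically:

$$\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

Here $\mathbf{a} \cdot \mathbf{b}$ is the dot product and $\theta$ is the angle between the two vectors. The lengths $\lVert \mathbf{a} \rVert$ and $\lVert \mathbf{b} \rVert$ cancel out, which is why cosine similarity is independent of vector magnitude.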

If the angle is 0° (vectors point the same way), cosine = 1 → perfectly similar.
If the angle is 90° (orthogonal), cosine = 0 → no similarity.
If the angle is 180° (opposite directions), cosine = –1 → maximally dissimilar.

Visually: even if one arrow is much longer, if they point in the same direction their cosine similarity is still 1.
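Both metrics fit in a few lines of code. The sketch below uses plain arrays and made-up 2D vectors (the function names are mine) to show the same-direction, orthogonal, and opposite cases.

```typescript
// Sketch: the two metrics on plain number arrays.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Straight-line distance between the tips of the two arrows.
function euclideanDistance(a: number[], b: number[]): number {
  return norm(a.map((x, i) => x - b[i]));
}

// Cosine of the angle between the arrows; lengths cancel out.
function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / (norm(a) * norm(b));
}

const shortArrow = [3, 4];   // length 5
const longArrow = [30, 40];  // same direction, length 50

console.log(cosineSimilarity(shortArrow, longArrow));  // 1
console.log(euclideanDistance(shortArrow, longArrow)); // 45
console.log(cosineSimilarity(shortArrow, [4, -3]));    // 0 (orthogonal)
console.log(cosineSimilarity(shortArrow, [-3, -4]));   // -1 (opposite)
```

The first two lines of output are the whole story: identical direction, so cosine says "same", while the ruler between the tips still reads 45.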

The intuition with three vectors

Imagine three vectors: A, B, and C:

As you can see, B is more aligned in direction with A than C is.
With Euclidean distance, A–C (~ 9.2) looks closer than A–B (~ 13.5) because C’s tip is nearer to A’s tip, even though the angles are different.
With cosine similarity, A–B wins, because alignment (angle) matters more than raw length.

This is exactly the situation where Euclidean and cosine will disagree on ordering. So, this is why you need to be mindful of your choice of the comparison metric.
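Here is a quick sketch that reproduces this kind of disagreement. The coordinates are ones I made up for illustration, so the distances will not match the numbers quoted above, but the pattern is the same: C's tip is closer to A, while B points in nearly the same direction as A.

```typescript
// Sketch with hypothetical coordinates (not the ones from the figure).
const A = [8, 6];
const B = [20, 14]; // nearly the same direction as A, but much longer
const C = [9, 0];   // tip is close to A's tip, but the angle is different

function euclid(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) * (x - b[i]), 0));
}

function cosine(a: number[], b: number[]): number {
  const dotAB = a.reduce((s, x, i) => s + x * b[i], 0);
  const lenA = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const lenB = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return dotAB / (lenA * lenB);
}

// Smaller distance means closer; larger cosine means closer.
const nearestByEuclidean = euclid(A, C) < euclid(A, B) ? 'C' : 'B';
const nearestByCosine = cosine(A, C) > cosine(A, B) ? 'C' : 'B';

console.log(nearestByEuclidean); // C
console.log(nearestByCosine);    // B
```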

Why normalization matters

A common trick is to normalize vectors so their length is 1 (i.e., put them on the unit circle or unit sphere). The math looks like this:

$$\hat{\mathbf{a}} = \frac{\mathbf{a}}{\lVert \mathbf{a} \rVert}$$

Basically, take the vector and divide each component by its length.

When both vectors are normalized, this distance is just another way of measuring the angle between them — which is exactly what cosine similarity does.
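One way to see why is to expand the squared distance between two unit vectors, written here as $\hat{\mathbf{a}}$ and $\hat{\mathbf{b}}$ with angle $\theta$ between them:

```latex
\begin{aligned}
\lVert \hat{\mathbf{a}} - \hat{\mathbf{b}} \rVert^2
  &= \lVert \hat{\mathbf{a}} \rVert^2 + \lVert \hat{\mathbf{b}} \rVert^2
     - 2\, \hat{\mathbf{a}} \cdot \hat{\mathbf{b}} \\
  &= 1 + 1 - 2\cos\theta \\
  &= 2 - 2\cos\theta
\end{aligned}
```

Since $2 - 2\cos\theta$ shrinks exactly when $\cos\theta$ grows, sorting by smallest distance and sorting by largest cosine give the same order.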

So, in our example, after normalization B is closest to A, followed by C – with both cosine similarity and Euclidean distance.

OpenAI embeddings already come normalized to unit length. Most people use cosine similarity with them without a second thought, but if you use Euclidean distance instead, you’ll get the same neighbors: the rankings are identical.

When magnitude matters: why not always normalize?

It’s tempting to think you should always normalize embeddings and stick to cosine similarity. After all, that’s what most semantic search and RAG systems do. But normalization isn’t always the right move, because sometimes the magnitude of the embedding carries meaning.

Remember, the dot product between two vectors is:

$$\mathbf{a} \cdot \mathbf{b} = \lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert \cos\theta$$

That means it encodes both alignment (the angle) and magnitude (the length of each vector). If length itself encodes a signal you care about, dot product or Euclidean distance can be the right tool, while cosine would wash that information away.

Examples:

Number of views on a video – a 10,000-view video might need to be treated differently from a 100-view video, even if the content is otherwise identical.
Price of an item – if embeddings include “price” as one axis, Euclidean distance will reflect a real dollar gap ($499 vs. $1,999), not just semantic similarity.
Quantity sold / demand – embeddings that include sales volume should allow high-demand items to naturally stand apart from slow movers.
User activity level in recommendations – in collaborative filtering systems, highly active users often have embeddings with larger norms. Dot product/Euclidean distance naturally lets that popularity signal influence similarity scores.
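A tiny sketch of the effect, using made-up 2D "embeddings" where the norm stands in for something like view count. The vectors and names are hypothetical; the point is only that cosine cannot tell the two items apart while dot product can.

```typescript
// Sketch: same direction, different norms (values invented for illustration).
const userVector = [3, 4];
const lowViewsItem = [6, 8];    // norm 10
const highViewsItem = [30, 40]; // norm 50, same direction

function dotProduct(a: number[], b: number[]): number {
  return a.reduce((s, x, i) => s + x * b[i], 0);
}

function cosineSim(a: number[], b: number[]): number {
  return dotProduct(a, b) / Math.sqrt(dotProduct(a, a) * dotProduct(b, b));
}

console.log(cosineSim(userVector, lowViewsItem));   // 1
console.log(cosineSim(userVector, highViewsItem));  // 1 (a tie: magnitude is gone)
console.log(dotProduct(userVector, lowViewsItem));  // 50
console.log(dotProduct(userVector, highViewsItem)); // 250 (the bigger norm wins)
```

Whether that boost is desirable depends on whether the norm really encodes a signal you want in the ranking.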

In practice, large-scale recommendation systems have successfully leveraged this property. For example, Yahoo’s Prod2Vec approach (Grbovic et al., 2015) applied Word2Vec-style training to user interaction sequences. They found that the resulting embeddings captured not only “semantic” relations between products, but also popularity and frequency effects in the vector norms which were signals that were directly useful for recommendations.

So, you might think: does this mean I don’t have to worry about Euclidean or dot product in RAG systems? The answer is: usually you don’t. But here’s the fun part: most vector databases (FAISS, Pinecone, Weaviate, Milvus, etc.) implement cosine similarity by normalizing embeddings once and then using dot product internally. Why dot product? Because once embeddings are normalized, dot product works the same for ranking as Euclidean, but is faster to compute.

My own small experiment, described below, confirmed this: Dot product was slightly faster than Euclidean on normalized embeddings (~1.1× speedup in my run), since it’s just multiply-and-sum with no subtractions/squares.

After normalization, cosine and Euclidean gave identical nearest-neighbor rankings.

The Experiment

Generated ~20,000 database vectors and 200 query vectors with an embedding size of 384 (roughly what you’d get from MiniLM).
For each query, retrieved the top-K neighbors using:
1. Dot product (cosine if vectors are normalized)
2. Squared Euclidean distance

Tested both on raw vectors and on normalized vectors (so that $\lVert \mathbf{x} \rVert = 1$).

Results:

On normalized vectors, cosine and Euclidean produced identical neighbor rankings.
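In terms of performance, dot product was about 1.1× faster than Euclidean on normalized embeddings. That’s because dot product is just multiply-and-sum:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i$$

While squared Euclidean requires subtracting, squaring, and adding:

$$\lVert \mathbf{a} - \mathbf{b} \rVert^2 = \sum_{i=1}^{n} (a_i - b_i)^2$$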

So Euclidean does more work per dimension, even if the final square root is skipped.
On raw (unnormalized) vectors, Euclidean and cosine gave different rankings, because vector length influences Euclidean distance but is canceled out in cosine.

Takeaways from the experiment:

After normalization, dot product, cosine, and Euclidean distance are effectively the same in terms of ranking.
Dot product is slightly faster in practice, which explains why most vector databases implement cosine as “normalize once, then use dot product.”
Before normalization, you can get very different results. Euclidean reflects both angle and magnitude, while cosine reflects only angle.

Recommendation vs. RAG systems

In RAG systems, you care primarily about semantic similarity. Normalization is almost always what you want, so cosine (or normalized Euclidean) is the default.
In recommendation systems, embeddings often mix semantic and behavioral signals. Magnitude might encode popularity, confidence, or frequency. In this world, dot product or Euclidean without normalization can be useful.
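A toy illustration of the magnitude point, with made-up two-dimensional vectors (the item names and values are hypothetical): two items point in exactly the same direction, but one has three times the norm, standing in for a popularity signal.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


user = [1.0, 0.5]
niche_item = [0.8, 0.4]    # same direction as the user vector
popular_item = [2.4, 1.2]  # same direction, 3x the norm (hypothetical popularity)

dot_niche = dot(user, niche_item)
dot_popular = dot(user, popular_item)
cos_niche = cosine(user, niche_item)
cos_popular = cosine(user, popular_item)

print("dot:   ", dot_niche, dot_popular)
print("cosine:", cos_niche, cos_popular)
```

Dot product ranks the larger-norm item first, while cosine scores the two as equally similar. Whether that extra push is desirable depends on whether the norm actually carries a signal you want.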

Decision Tree: When to Use Which

Key Takeaways

Cosine similarity: great when direction = meaning; normalization removes scale.
Euclidean distance: great when raw magnitudes carry interpretable meaning.
Normalization: turns Euclidean into cosine for ranking purposes.
OpenAI embeddings: already normalized, so Euclidean and cosine rank the same.
Good rule of thumb in selecting the best similarity metric: match it to the one used to train your embedding model.
Recommendations vs RAG: recommendations often want magnitude, RAG almost never does.

References

Context Rot

Sid — Sun, 21 Sep 2025 11:17:49 +0000

Last Friday, our learning session covered Context Rot, a paper from the Chroma vector database team on how longer inputs affect LLM performance.

They ran experiments with 18 leading LLMs, like o3, GPT-4.1, Claude, Gemini, and Qwen, on needle-in-a-haystack style questions, then measured how often the models gave the right answer.

The best way to TLDR is to just watch this ~7 minute video on YouTube:

Here are the key takeaways:

Longer context hurts: Don’t overload models with full reports or long histories. As irrelevant text piles up, even strong models miss answers. Keep inputs lean for reliable results.

Clarity of the query matters: Vague questions get worse answers in long contexts. Since you can’t rely on users to always be precise, systems must rewrite queries, for example by rephrasing them into clearer forms, mapping them to structured intent, or combining them with retrieval to anchor the request.

Distractors amplify errors: Models can be tricked by irrelevant but similar text. In compliance or legal reviews, this means they might confuse one clause for another. Systems must filter out look-alike noise.
- Use embeddings + keyword anchors together (semantic + lexical match).
- Enforce entity checks (IDs, dates, names must align).
- Apply re-ranking models to filter passages that look close but don’t directly answer.
- Train retrievers with negative samples (examples of near-duplicate but irrelevant text).
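As a rough sketch of the first two bullets, the filter below combines a lexical overlap check with an entity check on top of scores that an embedding retriever is assumed to have already produced. The ID format, the threshold, and all function names are hypothetical choices for illustration, not something taken from the paper.

```python
import re

# Hypothetical ID format such as "CT-2041"; adjust to your own domain.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+\b")


def extract_ids(text):
    return set(ID_PATTERN.findall(text))


def keyword_overlap(query, passage):
    """Fraction of query terms that also appear in the passage."""
    q_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    p_terms = set(re.findall(r"[a-z0-9]+", passage.lower()))
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def filter_candidates(query, candidates, min_overlap=0.3):
    """candidates: (passage, semantic_score) pairs from an embedding retriever."""
    query_ids = extract_ids(query)
    kept = []
    for passage, score in candidates:
        # Entity check: every ID named in the query must appear in the passage.
        if query_ids and not query_ids <= extract_ids(passage):
            continue
        # Keyword anchor: require some lexical overlap, not just a close embedding.
        if keyword_overlap(query, passage) < min_overlap:
            continue
        kept.append((passage, score))
    return sorted(kept, key=lambda item: -item[1])


query = "What is the termination notice period in contract CT-2041?"
candidates = [
    ("Contract CT-2041: either party may terminate with 60 days written notice.", 0.82),
    ("Contract CT-2014: either party may terminate with 30 days written notice.", 0.85),
    ("Our office will be closed for the holiday period.", 0.40),
]
kept = filter_candidates(query, candidates)
print(kept)
```

Here the look-alike clause from CT-2014 has the higher semantic score but is dropped because its ID does not match the one in the query.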

Structure of irrelevant content matters: Clean, coherent irrelevant text is more distracting than random noise. That means polished background material can actually reduce accuracy if not filtered.

Focused input beats full input: Retrieval layers or context filters that feed only what’s relevant improve both accuracy and cost. Businesses should invest in these instead of relying on raw long-context alone.

Exact repetition breaks down: Researchers asked models to simply copy long blocks of text word-for-word, and fidelity dropped as the blocks got longer. If a model can’t copy long sequences reliably, it can’t be trusted to surface exact details (IDs, contract terms, medical dosages). Retrieval workflows must include verification.
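One simple form such verification could take is checking that every exact detail in the model’s answer appears verbatim in the source it was given. The sketch below treats any token containing a digit as an “exact detail”, which is a deliberately crude assumption; the function names and sample strings are made up for illustration.

```python
def exact_details(text):
    """Tokens containing a digit, with surrounding punctuation stripped."""
    return {
        token.strip(".,;:()")
        for token in text.split()
        if any(ch.isdigit() for ch in token)
    }


def unsupported_details(answer, source):
    """Exact details in the answer that never appear in the source."""
    return sorted(exact_details(answer) - exact_details(source))


source = "Administer 250 mg every 8 hours. Order ID RX-77120."
answer = "The dose is 250 mg every 6 hours (order RX-77120)."

problems = unsupported_details(answer, source)
print(problems)
```

An answer with a non-empty result would be flagged for review or regenerated rather than shown to the user as-is.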

So the bigger question is: should you solve context rot as an app developer, or wait for the big labs to solve it?

Big labs will keep improving the physics of long context: better positional encodings, more efficient attention, training strategies that reduce decay. But those fixes won’t handle your domain specifics: which clauses in a contract matter, which patient record fields are critical, or how to enforce compliance rules. That’s squarely on app and agent developers, and those investments should be durable.

Retrieval layers, query normalization, and verification pipelines will remain useful even if models get better, because they enforce governance, add trust, and cut costs.

What may become obsolete are low-level hacks like custom chunk sizes. So the right strategy seems to be not to wait. Build domain-aware context engineering now, knowing labs will lift the floor while your systems enforce the ceiling.

One strategy is to win customers who care about context-rot being solved well today, even if some of that work gets thrown away as labs improve. Those early wins give you a base to move into higher-value scenarios while competitors catch up later on the basics with less effort.

Defining “AGI”

Sid — Mon, 08 Sep 2025 05:19:58 +0000

This week, one of the papers my team discussed was the spicily titled “What The F*ck Is Artificial General Intelligence?” by Michael Timothy Bennett, which I found after hearing him on MLST (still one of my favorite podcasts, with a super high signal-to-noise ratio). Several interesting points came up:

It’s a Western thing: Someone mentioned that the whole concept of “AGI” feels very Western. In Eastern thought, intelligence is everywhere on a spectrum. Even very simple life forms like cells demonstrate intelligence by communicating with each other. For example, cells exchange chemical signals when they meet, adapt their behavior, and coordinate responses. This broader framing aligns with Bennett’s critique of anthropocentric definitions of intelligence.
Kids: Someone pointed out how their 2-year-old can pick up concepts after just a few repetitions. That speed of skill acquisition, and doing so with very little data, is central to generalist intelligence. Bennett frames this as adaptation with limited resources, which also brings energy efficiency into the picture.
Energy: We debated whether energy cost should be part of the definition. If something burns the energy of a star to reach human-level capability, is that really AGI? Bennett argues adaptability includes both sample efficiency and energy efficiency, so by his framing it matters.
New Science: We agreed that being able to discover new science, as Bennett calls out with the “artificial scientist” framing, is a key marker of AGI. It’s more than just doing tasks; it’s also about prioritizing, experimenting, and finding new knowledge.
It’s a spectrum: There was consensus that intelligence isn’t binary but a spectrum: at the high end are systems that not only learn new skills but do so efficiently, making them “more intelligent” than others that reach the same outcome at much higher cost.
Methods: On methods, we noted that search is necessary but not sufficient—you can’t just brute-force your way through the unknown. Approximation (fitting the messy world) is also critical. Bennett calls these the two foundational tools, and points out both are inefficient in different ways.
Hybrid: The group leaned toward hybrid architectures (like AlphaGo, or more recent blends like o3 and AlphaGeometry) as the likely path forward. Bennett also highlights cognitive architectures that try to integrate perception, reasoning, and memory, exactly the kind of fusion we thought made sense.
Finally, we asked the “is GPT-5 AGI?” question. We realized how quickly the goal-posts move. If someone had shown us GPT-5 in ChatGPT just a few years ago, we’d probably have called it AGI on the spot. Bennett makes the same observation: public hype keeps redefining AGI as whatever we don’t yet have.

MCP Universe

Sid — Sun, 31 Aug 2025 21:09:16 +0000

Salesforce AI’s new MCP-Universe benchmark puts frontier models through 200+ real-world tool-use tasks. The results: GPT-5 lands at 43.7%, Grok-4 at 33.3%, and Claude-Sonnet at 29.4%.

The rest of this post breaks down why these numbers are so much lower than BFCL, what domains drag models down most, and what the findings mean for teams wiring MCP into their platforms.

TLDR:

Frontier models underperform: GPT‑5 tops out at 43.72% success, Grok‑4 at 33.33%, and Claude‑4.0‑Sonnet at 29.44%, while the best open‑source model reaches 24.68% (details in the paper).
Failures are driven by three core challenges:
- long contexts that balloon across multi‑step tool use,
- unfamiliar/underspecified tool interfaces that trigger API misuse (the “unknown‑tools” problem), and
- distraction from large sets of unrelated tools.
Simple mitigations help inconsistently:
- per‑step summarization and a pre‑task “exploration” phase yield domain and model‑specific gains but no universal lift.
Models generally follow formats well but falter on content correctness, especially on dynamic, time‑sensitive tasks.
Domain difficulty varies sharply (location navigation is uniformly hard; GPT‑5 fares best in finance and 3D).
Agent architecture matters: o3 with OpenAI Agent SDK outperforms o3 with ReAct.
Links:
- Paper: https://arxiv.org/abs/2508.14704
- Project: https://mcp-universe.github.io
- Code: https://github.com/SalesforceAIResearch/MCP-Universe

Takeaways for teams integrating MCP:

Limit Tool Exposure: Avoid exposing LLMs to overly large or noisy tool environments. Curate and scope tool sets to minimize “cognitive” load and improve selection accuracy.
Orchestration Design Matters: Design orchestration layers that guide LLMs toward relevant tools. Consider SDK-level constraints or routing logic to reduce ambiguity.
Platform Implications: Integration strategies should account for tool density and relevance filtering. Explore tooling levers that help LLMs navigate complex tool ecosystems more effectively (constrain and route the toolset, tighten tool interfaces, shape returned data, limit long-context growth, standardize errors, etc.).
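To make “curate and scope tool sets” concrete, here is a deliberately naive sketch of a routing step that narrows the tool list before the model sees it. The registry format, tool names, stopword list, and keyword-overlap scoring are all hypothetical; a real system would more likely use embeddings or explicit routing rules.

```python
STOPWORDS = {"a", "an", "the", "for", "to", "of", "in", "and"}


def terms(text):
    """Lowercased words with common stopwords removed."""
    return {word for word in text.lower().split() if word not in STOPWORDS}


def select_tools(task, tools, max_tools=2):
    """Return names of the tools whose descriptions best match the task."""
    task_terms = terms(task)
    scored = []
    for tool in tools:
        overlap = len(task_terms & terms(tool["description"]))
        if overlap > 0:
            scored.append((overlap, tool["name"]))
    # Highest overlap first; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:max_tools]]


tools = [
    {"name": "maps_route", "description": "compute a driving route between two locations"},
    {"name": "stock_quote", "description": "get the latest stock price for a ticker"},
    {"name": "blender_render", "description": "render a 3d scene"},
    {"name": "web_search", "description": "search the web for pages"},
]

selected = select_tools("find the latest stock price for ticker MSFT", tools)
print(selected)
```

Only the selected tools would then be included in the model’s context, which keeps unrelated tools from competing for attention.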

But BFCL shows the frontier models at more than 70% accuracy?!

Berkeley Function Calling Leaderboard (BFCL) has frontier LLMs like GLM-4.5, Claude-Opus-4, and Claude-Sonnet clearing around 70% overall accuracy, so a natural question is why the numbers on MCP-Universe are so much lower (e.g., GLM-4.5 scores 70.85 on BFCL but 24.68 on MCP-Universe).

The crux is that MCP-Universe is wired into real MCP servers and long contexts, while BFCL scores models on a curated, static dataset.

MCP-Universe leans on multi-step reasoning where small errors can snowball.
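A back-of-the-envelope way to see the snowball effect, under the simplifying assumption (mine, not the paper’s) that each of $n$ steps succeeds independently with the same probability $p$:

```latex
P(\text{task succeeds}) = p^{\,n},
\qquad
0.95^{10} \approx 0.60,
\qquad
0.90^{10} \approx 0.35 .
```

So a model that gets each individual tool call right 95% of the time would still finish only about 60% of ten-step tasks under this assumption, which is one way a strong single-call score can coexist with a much lower multi-step score.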

The large number of unrelated tools in MCP-Universe (to mimic real-world messiness) is another factor.

What domains did they test on?

See chart below:

Example of a task:

So, what does this mean?

You get what you measure. Now that MCP-Universe is showing frontier LLMs struggling, the developers behind those models have a clear target to chase. Expect the accuracy of real-world MCP tool calls to climb fast in the coming months.