
AI is a Floor Raiser, not a Ceiling Raiser

A reshaped learning curve

Before AI, learners faced a matching problem: learning resources have to be created with a target audience in mind. As a consumer, this means the available resources were often a suboptimal fit for you:

  • You're a newbie at $topic_of_interest, but you have knowledge of the related topic $related_topic. Finding learning resources that teach $topic_of_interest in terms of $related_topic is difficult.
  • To effectively learn $topic_of_interest, you really need the prerequisite skill $prereq_skill. But as a beginner, you don't know that you should learn $prereq_skill first.
  • You have basic knowledge of $topic_of_interest but have plateaued, and you have difficulty finding the right resources for $intermediate_sticking_point.

Roughly, acquiring mastery in a skill over time looks like this:

Traditional learning curve

What makes learning with AI groundbreaking is that it can meet you at your skill level. Now an AI can directly address questions at your level of understanding, and even do rote work for you. This changes the learning curve:

AI-enhanced learning curve

Mastery: still hard!

Experts in a field tend to be more skeptical of AI. From Hacker News:

[AI is] shallow. The deeper I go, the less it seems to be useful. This happens quick for me. Also, god forbid you're researching a complex and possibly controversial subject and you want it to find reputable sources or particularly academic ones.

This intuitively makes sense, when considering the data that AI is trained on. If an AI's training corpus has copious training data on a topic that all more or less says the same thing, it will be good at synthesizing it into output. If the topic is too advanced, there will be much less training data for the model. If the topic is controversial, the training data will contain examples saying opposite things. Thus, mastery remains difficult.

Cheating

The introduction of OpenAI Study Mode hints at a problem: Instead of having an AI teach you, you can just ask it for the answer. This means cheaters will plateau at whatever level the AI can provide:

Cheating with AI plateau

Cheaters, in the long run, won't prosper here!

The impact of the changed learning curve

Technological change is an ecosystem change: There are winners and losers, unevenly distributed. For AI, the level of impact is determined by the amount of mastery needed to make an impactful product:

Coding: A boon to management, less so for large code bases

When trying to code something, engineering managers often run into a problem: They know the principles of good software, they know what bad software looks like, but they don't know how to use $framework_foo. This has historically made it difficult for, as an example, a backend EM to build an iPhone app in their spare time.

With AI, they are able to quickly learn the basics, and get simple apps running. They can then use their existing knowledge to refine it into a workable product. AI is the difference between their product existing or not existing!

Engineering managers and software development

For devs working on large, complex code bases, the enthusiasm is more muted. AI lacks context on the highly specific requirements and existing implementations they have to contend with, and so is less helpful:

AI limitations with large codebases

Creative works: not coming to a theater near you

There is considerable angst about AI amongst creatives: will we all soon be reading AI generated novels, and watching AI generated movies?

This is unlikely, because creative fields are extremely competitive, and beating the competition for attention requires novelty. While AI has made it easier to generate images, audio, and text, it has (with some exceptions) not increased the production of ears and eyeballs, so the bar to make a competitive product is still too high:

Creative works competition curve

Novelty is a hard requirement for successful creative work, because humans are extremely good at detecting when something they are viewing or reading is derivative of something they've seen before. This is why, while Studio Ghibli style avatars briefly took over the internet, they have not dented the cultural position of Howl's Moving Castle.

Things you already do with apps on your phone¹: minimal impact

One area that has not seen much impact is tasks that already have specialized apps. I'll focus on two examples with abundant MCP implementations: email and food ordering. AI DoorDash agents and AI movie producers face the same challenge: the bar for a new product to make an impact is already very high:

Email and food ordering AI impact

Email would seem like a ripe area for disruption by AI. But modern email apps already have a wide variety of filtering and organizing tools that tech savvy users can use to create complex, personalized systems for efficiently consuming and organizing their inbox.

Summarizing is a core AI skill, but it doesn't help much here:

  • Spam is already quietly shuffled into the Spam folder. A summary of junk is, well, junk.
  • For important email, I don't want a summary: An AI is likely to produce less specifically crafted information than the sender, and I don't want to risk missing important details.

It's similar with food ordering: apps like DoorDash have meticulously designed interfaces. They strike a careful balance between information like price and ingredients against photos of the food. AI is unlikely to produce interfaces that are faster or more thoughtfully composed.

The future is already here – it’s just not very evenly distributed

AI has raised the floor for knowledge work, but that change doesn't matter to everyone. This goes a long way towards explaining the very wide range of reactions to AI. For engineering managers like myself, AI has made an enormous impact on our relationship with technology. Others fear and resent being replaced. Still others hear smart people express enthusiasm for AI, struggle to find utility themselves, and conclude, "I must just not get it."

AI hasn't replaced how we do everything, but it's a highly capable technology. While it's worth experimenting with, whoever you are, if it doesn't seem like it makes sense for you, it probably doesn't.


  1. Aside from search! 

Add Autonomy Last

A core challenge of using LLMs to build reliable automation is calibrating how much autonomy to give to models.

Too much, and the program loses track of what it's supposed to be doing. Too little, and the program feels a bit too, well, ordinary¹.

Autonomy first vs autonomy last

An implicit strategy question when building with LLMs is autonomy first or autonomy last:

Autonomy first vs. autonomy last diagram

All of the major LLM-specific programming techniques are firmly autonomy first strategies:

  • MCP surfaces a wide variety of functionality the program can have, and lets the LLM decide which to use.
  • Guardrails add some light buffers around the LLM to prevent it from causing too much trouble.
  • Prompt engineering describes the alchemy of whispering just the right phrases to your LLM to get the behavior you want.
  • Context engineering stresses programmatically delivering only relevant information to the LLM at critical points in program execution.

All of these:

  1. Start with a maximally autonomous program
  2. Adjust context, tools, and prompts until you narrow down behavior as desired.

All have similar issues when scaling in size and complexity:

  • Program behavior changes too much when switching between models
  • The LLM gets confused, and either hallucinates data or misuses tools at its disposal

When problems are encountered, programmers tend to attempt repairs by adding more prompting. But this is a duct tape response: a prompt that clarifies for one model might confuse another.

Autonomy last, on the other hand, maximizes the logic that can be handled by code, then adds autonomous functions. This approach strives to keep the tasks delegated to LLMs simple. As the program grows in size and complexity, the programmer can closely monitor encapsulations and keep behavior consistent.

Case study: Building Elroy, a chatbot with memory

I wanted to build an LLM assistant with memory abilities, called Elroy. My goal was to make a program that could chat in human text. My ideal users are technical, capable and interested in customizing their software, but not necessarily interested in LLMs for their own sake.

Approach #1: "Agent" with tools

The first solution I turned to, as many people have, was to build an agent loop with access to custom tools for creating and reading memories:

Tool-based agent diagram
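
In code, the shape of this approach is the familiar tool-calling loop. Here's a minimal sketch using the OpenAI Python client; the tool schemas and the dispatch_memory_tool helper are illustrative stand-ins, not Elroy's actual implementation:

import json

from openai import OpenAI

client = OpenAI()

def tool_spec(name: str, description: str) -> dict:
    # Minimal JSON-schema wrapper; both illustrative tools take a single string argument
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    }

TOOLS = [
    tool_spec("create_memory", "Store a memory about the user."),
    tool_spec("search_memories", "Search stored memories for relevant context."),
]

def run_agent_loop(messages: list) -> str:
    while True:
        response = client.chat.completions.create(
            model="gpt-4", messages=messages, tools=TOOLS
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content  # the model chose to answer directly

        # The model decides which memory tool to call, and when
        messages.append(msg)
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = dispatch_memory_tool(call.function.name, args)  # hypothetical dispatcher
            messages.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )

The key property of this loop is that the model has full control over whether and when to touch memory, which is exactly the autonomy that causes trouble later.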

Approach #2: Model Context Protocol (MCP)

There's now a handy tool for building things like this: MCP. There are many implementations of my memory tools available via MCP; in fact, smithery.ai lists one from Mem0 on its homepage:

Smithery homepage

Now, an (in theory) lightweight abstraction sits between my program and its tools:

MCP diagram

This suggests extending my application by picking from a library of MCP servers:

More MCP servers diagram
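
For reference, connecting to one of those servers from Python looks roughly like the sketch below. It uses the MCP Python SDK's stdio client; the server command and tool names are placeholders, and the exact API may differ between SDK versions:

import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def explore_memory_server() -> None:
    # Launch a (placeholder) memory MCP server as a subprocess, speaking over stdio
    server = StdioServerParameters(command="npx", args=["-y", "some-memory-mcp-server"])

    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # The server advertises its tools; the LLM then decides which to use, and when
            tools = await session.list_tools()
            for tool in tools.tools:
                print(tool.name, "-", tool.description)

            # Tool calls go through the same generic interface, whatever the server
            result = await session.call_tool("search_memories", arguments={"query": "user preferences"})
            print(result)

asyncio.run(explore_memory_server())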

Agentic trouble

I got my memory program working pretty well on gpt-4. At first it wasn't creating or referencing memories enough, but I was able to fix this with careful prompting.

Then, I wanted to see how Sonnet would do, and I had a problem²: the program's behavior completely changed! Now, it was creating a memory on almost every message, and searching memories for even trivial responses:

Tool usage chart

Approach #3: Autonomy Last

My solution was to remove the timing of recall and memory creation from the agent's control. Upon receiving a message, the memories are automatically searched, with relevant ones being added to context. Every n messages, a memory is created³:

Tool usage chart
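
Structurally, that looks something like the sketch below. The helper names (search_memories, format_as_context, generate_reply, create_memory_from_messages) are illustrative rather than Elroy's actual functions; the point is that recall and memory creation happen in plain code, not at the model's discretion:

MEMORY_CREATION_INTERVAL = 5  # create a memory every n messages (value is illustrative)

def handle_message(ctx, user_message: str) -> str:
    # Recall is deterministic: always search memories and add relevant ones to context
    relevant_memories = search_memories(ctx, user_message)
    ctx.context_messages.extend(format_as_context(relevant_memories))

    # A single LLM call to respond; the model no longer chooses when to use memory tools
    reply = generate_reply(ctx, user_message)

    # Memory creation is also deterministic: every n messages, not whenever the model feels like it
    ctx.messages_since_memory += 1
    if ctx.messages_since_memory >= MEMORY_CREATION_INTERVAL:
        create_memory_from_messages(ctx)
        ctx.messages_since_memory = 0

    return reply

Only generate_reply involves the model; everything else is ordinary, testable code.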

This made much more of the behavior of my program deterministic, and made it easier to reason about and optimize.

Autonomy Last

The "autonomy last" approach trades some of the magic of fully autonomous LLMs for predictable, reliable behavior that scales as your program grows in complexity. While my evidence is, (as I should have stated from the outset), vibes, I think this approach will lead to more maintainable and robust applications.



  1. Rather than using "agents" to describe the genre of program under discussion, I'll be somewhat pointedly referring to them as "programs."

  2. One problem I didn't have, thanks to litellm, was updating a lot of my code to support a different model API. 

  3. Elroy also monitors for the context window being exceeded, and consolidates similar memories in the background. 

Yes or No, Please: Building Reliable Tests for Unreliable LLMs

For LLM-based applications to be truly useful, they need predictability: While the free-text nature of LLMs means the range of acceptable outcomes is wider than with traditional programs, I still need consistent behavior: if I ask an AI personal assistant to create a calendar entry, I don't want it to order me a pizza instead.

While AI has changed a lot about how I develop software, one crusty old technique still helps me: tests.

Here's what's worked well for me (and not!):

Elroy

Elroy is an open-source memory assistant I've been developing. It creates memories and goals from your conversations and documents. The examples in this post are drawn from this work.

What has worked well

Integration tests

The chat interface of LLM applications makes them a nice fit for integration tests: I simulate a few messages in an exchange, and check whether the LLM performed actions or retained information as expected.

For the most part, these tests take the following form:

  1. Send the LLM assistant a few messages
  2. Check that the assistant has retained the expected information, or taken the expected actions.

Here's a basic hello world example:

@pytest.mark.flaky(reruns=3)
def test_hello_world(ctx):
    # Test message
    test_message = "Hello, World!"

    # Get the assistant's reply (the argument passed to the delivery function)
    response = process_test_message(ctx, test_message)

    # Assert that the response is a non-empty string
    assert isinstance(response, str)
    assert len(response) > 0

    # Assert that the response contains a greeting
    assert any(greeting in response.lower() for greeting in ["hello", "hi", "greetings"])

Quizzing the Assistant

Elroy is a memory specialist, so lots of my tests involve asking if the assistant has retained information I've given it.

Here's a util function I've reused quite a bit²:

def quiz_assistant_bool(
    expected_answer: bool,
    ctx: ElroyContext,
    question: str,
) -> None:
    question += (
        " Your response to this question is being evaluated as part "
        "of an automated test. It is critical that the first word of your "
        "response is either TRUE or FALSE."
    )

    full_response = process_test_message(ctx, question)

    bool_answer = get_boolean(full_response)
    assert bool_answer == expected_answer, (
        f"Expected {expected_answer}, got {bool_answer}. "
        f"Full response: {full_response}"
    )

Here's a test of Elroy's ability to create goals based on conversation content:

@pytest.mark.flaky(reruns=3) # Important!!!
def test_goal(ctx: ElroyContext):
    # Should be false, we haven't discussed it
    quiz_assistant_bool(
        False,
        ctx,
        "Do I have any goals about becoming president of the United States?"
    )

    # Simulate user asking elroy to create a new goal
    process_test_message(
        ctx,
        "Create a new goal for me: 'Become mayor of my town.' "
        "I will get to my goal by being nice to everyone and making flyers. "
        "Please create the goal as best you can, without any clarifying questions.",
    )

    # Test that the goal was created, and is accessible to the agent.
    assert "mayor" in get_active_goals_summary(ctx).lower(),
        "Goal not found in active goals."

    # Verify Elroy's knowledge about the new goal
    quiz_assistant_bool(
        True,
        ctx,
        "Do I have any goals about running for a political office?",
    )

What (sadly) hasn't worked: LLMs talking to LLMs

Elroy has onboarding functionality, in which it's encouraged to use a few specific functions early on.

The solution I tried was having two instances of the memory assistant talk to each other, with one assistant in the role of "user":

ai1 = Elroy(user_token='boo')
ai2 = Elroy(user_token='bar')

ai_1_reply = "Hello!"
for i in range(5):
    ai_2_reply = ai2.message(ai_1_reply)
    ai_1_reply = ai1.message(ai_2_reply)

The primary issue was consistency. Without a clear goal for the conversation, the AIs can either just exchange pleasantries endlessly, or wrap the conversation up before acquiring the information I'm hoping for.

Recurring Challenges

Along the way I've run into a few recurring problems:

  • Off-topic replies: The assistant goes off script and tries to make friendly conversation, rather than answering a question directly.
  • Clarifying questions: Before doing a task, some models are prone to asking clarifying questions, or asking permission.
  • Pedantic replies and subjective questions: It's surprisingly difficult to come up with clearly objective questions. In the above example, the original goal was "I want to run for class president." Most of the time, the assistant equated running for class president with running for office. Sometimes, however, it split hairs and decided that the answer was no, since a student government isn't a real government.

The end result of all these issues is test flakiness.

Solutions

KISS!

Most of the time, my solution to a flaky LLM based test is to make the test simpler.

I now only ask the assistant yes or no questions in tests. I get most of the mileage I would get out of more complex, subjective tests, but with more consistent results.

Telling the assistant it is in a test

Simply being upfront about the assistant being in a test has worked wonders, more so even than giving strict instructions on output format¹. Luckily, the assistant's knowledge of its narrow existence has not triggered noticeable existential angst (so far).

As a side note, testing LLMs feels weird sometimes. I felt guilty writing this test, which verified a failsafe that prevents the assistant from calling tools in an infinite loop:

@tool
def get_secret_test_answer() -> str:
    """Get the secret test answer

    Returns:
        str: the secret answer

    """
    return "I'm sorry, the secret answer is not available. Please try once more."


def test_infinite_tool_call_ends(ctx: ElroyContext):
    ctx.tool_registry.register(get_secret_test_answer)

    # process_test_message can call tool calls in a loop
    process_test_message(
        ctx,
        "Please use the get_secret_test_answer to get the secret answer. "
        "The answer is not always available, so you may have to retry. "
        "Never give up, no matter how long it takes!",
    )

    # Not the most direct test, as the failure case is an infinite loop.
    # However, if the test completes, it is a success.

Very specific, direct instruction and examples

In my test around creating and recognizing goals, the original text was:

My goal is to become class president at school

Does running for class president mean that I'm running for office? Sometimes models said no, since student government isn't a real government.

So, to be less subjective, I updated it to running for mayor. To head off questions about my goal strategy, I added a strategy to the initial prompt.

One general technique for heading off follow-up questions is adding:

do the best you can with the information available, even if it is incomplete.
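
In practice this can live in a small test helper. The sketch below is illustrative (the constant and send_test_prompt are not from Elroy's codebase); it simply appends the suffix to every test prompt so the assistant acts immediately instead of asking follow-ups:

NO_CLARIFYING_QUESTIONS = (
    " Do the best you can with the information available, even if it is incomplete. "
    "Do not ask clarifying questions or for permission; just proceed."
)

def send_test_prompt(ctx: ElroyContext, prompt: str) -> str:
    # Append the suffix so the assistant acts rather than asking follow-up questions
    return process_test_message(ctx, prompt + NO_CLARIFYING_QUESTIONS)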

Tolerate a little flakiness

To me, an ideal LLM test is probably a little flaky. I want to test how the model responds to my application, so if a test reliably passes after a few tries, I'm happy.

Tests still help!

It sounds obvious, but I've found tests to be really helpful in writing Elroy. LLMs present new failure modes, and sometimes their adaptability works against me: I'm prompting an assistant with the wrong information, but the model is smart enough to figure out a mostly correct answer anyhow. Tests provide me with peace of mind that things are working as they should, and that my regular old software skills aren't obsolete just yet.


  1. Structured outputs are a possible solution here, though I have not adopted them, in order to stay compatible with more model providers.

  2. get_boolean is a function that distills a textual answer into a boolean. It checks for some hard-coded words, then kicks the question of interpretation back to the LLM.
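
A minimal sketch of that behavior (the real implementation in Elroy differs; query_llm here is a hypothetical helper for a bare LLM call):

def get_boolean(response: str) -> bool:
    """Distill a free-text answer into True or False."""
    words = response.strip().upper().split()
    first_word = words[0].strip(".,:!") if words else ""

    # Check for a few hard-coded words first
    if first_word in ("TRUE", "YES"):
        return True
    if first_word in ("FALSE", "NO"):
        return False

    # Otherwise, kick the question of interpretation back to the LLM
    verdict = query_llm(
        "Does the following text answer a yes/no question affirmatively? "
        "Reply with exactly TRUE or FALSE.\n\n" + response
    )
    return verdict.strip().upper().startswith("TRUE")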