VIDEO SCRIPT: How AI Agent Harnesses Actually Work

For Burke Holland | ~15-20 minutes

COLD OPEN

[ON CAMERA]

Hey, so — have you ever watched Copilot spin for 30 seconds and thought: what is it actually doing in there?

Like, you asked it to refactor a module, and it just... goes. It reads files, edits files, runs tests, reads the output, edits more files. And it looks almost like a developer working. But it's not magic. There's a loop running in actual code, and once you understand it, you'll write better prompts, build better AI features, and stop feeling like you're shouting into a black box.

That's what this video is about. We're going to open up the VS Code Copilot source code and actually look at the machinery. Real files. Real code. And I'm going to connect it to some fundamentals about how LLMs work — because those fundamentals directly change how you should use these tools.

Let's go.

SECTION 1: TOKENS — THE UNIT OF EVERYTHING

[SLIDE: "What is a token?"]

Before we touch any code, we need to talk about tokens. Not because I want to give you a lecture, but because tokens are the unit of everything in this world. Context limits, costs, latency — it all comes back to tokens.

So: a token is roughly a word fragment. Not a word, a fragment. "unbelievable" might be 3-4 tokens. "API" is 1 token. A space character before a word is often part of the token. The vocabulary is learned from data before the model itself is trained, typically with an algorithm called BPE (byte pair encoding). You don't need to memorize that, just know that a token is not a character, and it's not a word.

[SLIDE: rough token examples — "Hello" = 1, " world" = 1, "unbelievable" = 3-4, code snippet showing how dense code tokenizes]

Here's why it matters practically:

Context limits. Every model has a context window: the maximum number of tokens it can "see" at once. GPT-4o is 128k. Claude 3.5 is 200k. When you have a long conversation and attach a bunch of files, you're filling up that window. Once it's full, something has to give: either the harness drops or summarizes older conversation history, so the model effectively forgets it, or the request fails outright.

Costs. You pay per token. Input tokens (what goes in) are cheaper than output tokens (what comes out). In an agentic loop — which we'll get to — the model might make 20 LLM calls to complete a task. Each call has an input that includes the entire conversation history so far. Those costs add up, and caching (which we'll also cover) is the main lever to manage them.

Prompt design. When you write prompts, you're writing tokens. Long, rambling instructions aren't just annoying — they cost money and eat into the context window that could be used for your actual code. Precision is free. Verbosity is expensive.
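
[SLIDE: back-of-the-envelope token math]

To put numbers on that cost point, here's a rough sketch. The function name and the token counts are made up for illustration, they're not measurements from Copilot, and it assumes every round adds the same amount of history.

```typescript
// Rough sketch: input tokens sent across an agentic loop.
// All numbers and names are illustrative assumptions.

/** Input tokens sent on each call when history grows by a fixed amount per round. */
function inputTokensPerCall(baseTokens: number, tokensPerRound: number, calls: number): number[] {
    const perCall: number[] = [];
    for (let i = 0; i < calls; i++) {
        // Call i resends the base prompt plus every earlier round's tool call and result.
        perCall.push(baseTokens + i * tokensPerRound);
    }
    return perCall;
}

const tokensByCall = inputTokensPerCall(5_000, 2_000, 20);
const totalInputTokens = tokensByCall.reduce((sum, n) => sum + n, 0);

console.log(tokensByCall[0]);   // 5000 on the first call
console.log(tokensByCall[19]);  // 43000 on the twentieth call
console.log(totalInputTokens);  // 480000 input tokens across the whole loop
```

Same task, twenty calls, and the last call alone is more than eight times the size of the first. That's the growth the rest of this video keeps coming back to.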

SECTION 2: HOW LLMS ACTUALLY GENERATE TEXT

[SLIDE: "Autoregressive generation"]

Okay, here's the part most people skip over, and it's actually important. How does an LLM generate a response?

The answer is: one token at a time, left to right, where each new token is predicted based on all the previous tokens.

This is called autoregressive generation. You give the model a sequence of tokens — your prompt — and it predicts: "given everything so far, what's the most likely next token?" It samples that token, appends it to the sequence, and repeats. Over and over until it hits a stop token or a length limit.

[SLIDE: diagram showing token sequence building up, each step adding one token]

[System prompt] [User message] → token1
[System prompt] [User message] [token1] → token2
[System prompt] [User message] [token1] [token2] → token3
...

This has a direct practical implication: the model has no memory between calls. Every single time you call the LLM, you're sending the entire conversation history from scratch. There's no "session state" on the model side. The illusion of continuity comes from you — or the harness — collecting all the previous messages and sending them every time.
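
[SLIDE: the generation loop as code]

If it helps to see that loop as code, here's a toy sketch. The "model" is a hypothetical lookup table standing in for a real network, and it always picks the same next token, where a real model samples from a probability distribution.

```typescript
// Toy sketch of autoregressive decoding. toyModel is a made-up stand-in
// for a real neural network.

type NextTokenFn = (sequence: readonly string[]) => string;

const STOP_TOKEN = "<stop>";

/** Repeatedly predicts one token from the whole sequence so far and appends it. */
function generate(prompt: readonly string[], predictNext: NextTokenFn, maxNewTokens: number): string[] {
    const sequence = [...prompt];
    for (let step = 0; step < maxNewTokens; step++) {
        const token = predictNext(sequence); // the model sees everything so far
        if (token === STOP_TOKEN) {
            break;
        }
        sequence.push(token);
    }
    return sequence.slice(prompt.length); // only the newly generated tokens
}

// A fake model that chooses the next token from the last one it sees.
const nextTokenTable: Record<string, string> = {
    "Hello": ",",
    ",": " world",
    " world": "!",
    "!": STOP_TOKEN,
};
const toyModel: NextTokenFn = (sequence) => {
    const last = sequence[sequence.length - 1] ?? "";
    return nextTokenTable[last] ?? STOP_TOKEN;
};

const generated = generate(["Hello"], toyModel, 10);
console.log(generated.join("")); // ", world!"
```

Notice that generate keeps no state between calls. Everything the model "knows" is in the sequence you hand it.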

SECTION 3: ATTENTION — WHY THE MODEL CAN "LOOK BACK"

[SLIDE: "Attention mechanism — briefly"]

One more foundational concept, and I'll be quick.

The "magic" that lets the model connect a word from 10,000 tokens ago to what it's generating right now is called attention. Specifically, self-attention. In the decoder-style models behind tools like Copilot, every token in the sequence can attend to (look at) every token that came before it and figure out which ones are relevant to predicting the next token.

[SLIDE: simplified attention visualization — token connecting to relevant earlier tokens]

The insight is: it's not reading left to right sequentially like a human. When it processes your prompt, it's doing this massive parallel operation where every position in the sequence considers every earlier position at once. That's what the transformer architecture is built on.

Why does this matter to you? Two reasons:

Context placement matters. The model "attends" differently to things at the start of a prompt vs. the end. Key instructions and constraints generally should be at the top of the system prompt, not buried in the middle of a long document you attached.
The model is reading everything at once. When the agent is in its loop — and we'll look at the actual code for this — each call to the LLM is sending a growing blob of context: the system prompt, all the conversation history, every tool call, every tool result. The model processes all of that simultaneously. Which means by call 15 of an agentic loop, the context is enormous.
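
[SLIDE: attention weights as a small matrix]

For the curious, here's a tiny sketch of how attention weights come out of raw scores. In the decoder-style models used for chat, each position only looks at itself and earlier positions, so the sketch masks out later ones. The scores are made-up numbers; a real model computes them from learned query and key vectors.

```typescript
// Sketch of causal self-attention weights. Scores are invented for illustration.

/** Softmax over a list of numbers, shifted by the max for numerical stability. */
function softmax(values: number[]): number[] {
    const max = Math.max(...values);
    const exps = values.map(v => Math.exp(v - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
}

/** Row i gets weights over positions 0..i only; later positions get weight 0. */
function causalAttentionWeights(scores: number[][]): number[][] {
    return scores.map((row, i) => {
        const visible = softmax(row.slice(0, i + 1));
        const hidden = new Array<number>(row.length - i - 1).fill(0);
        return [...visible, ...hidden];
    });
}

const attentionWeights = causalAttentionWeights([
    [1.0, 9.0, 9.0],
    [2.0, 1.0, 9.0],
    [0.5, 3.0, 1.0],
]);
console.log(attentionWeights[0]); // [1, 0, 0]: the first token can only attend to itself
```

Each row sums to 1, so you can read it as "how much of my attention goes to each earlier token."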

SECTION 4: PROMPT CACHING — THE PERFORMANCE TRICK

[SLIDE: "Prompt Caching"]

Here's something that doesn't get talked about enough: prompt caching.

The problem is this: in an agentic loop, each LLM call gets more expensive because the conversation history keeps growing. By the 10th iteration, you can easily be sending several times as many input tokens as on the first call. But most of those tokens haven't changed: they're the same system prompt, the same tool results from earlier rounds.

Prompt caching lets you say to the API: "I'm going to mark certain points in this prompt as cache breakpoints. If the content up to that breakpoint hasn't changed since the last call, don't recompute it, use the cached KV-state." Cached input tokens are billed at a steep discount, up to roughly 90% off the uncached price depending on the provider.
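
[SLIDE: what a cache hit is worth]

Here's what that discount does to a single call, as a quick sketch. The 90% figure and the token counts are assumptions for illustration; check your provider's pricing for real numbers.

```typescript
// Sketch: how many input tokens you effectively pay full price for
// when part of the prompt is served from cache. Numbers are assumptions.

function effectiveInputTokens(totalTokens: number, cachedTokens: number, discountPercent: number): number {
    const uncached = totalTokens - cachedTokens;
    // Cached tokens still cost something, just a fraction of the normal rate.
    return uncached + (cachedTokens * (100 - discountPercent)) / 100;
}

console.log(effectiveInputTokens(40_000, 0, 90));      // 40000: nothing cached
console.log(effectiveInputTokens(40_000, 38_000, 90)); // 5800: most of the prompt cached
```

So a 40k-token call late in the loop can cost about the same as a 6k-token call, as long as the first 38k tokens are a cache hit.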

[DEMO: Open src/extension/intents/node/cacheBreakpoints.ts in vscode-copilot-chat]

Let's look at how Copilot actually implements this. This is the file cacheBreakpoints.ts:

/**
 * Prompt cache breakpoint strategy:
 *
 * The prompt is structured like
 * - System message
 * - Custom instructions
 * - Global context message (has prompt-tsx cache breakpoint)
 * - History
 * - Current user message with extra context
 * - Current tool call rounds
 *
 * Below the current user message, we add cache breakpoints to the last tool result in each round.
 * We add one to the current user message.
 * And above the current user message, we add breakpoints to an assistant message with no tool calls.
 */
export function addCacheBreakpoints(messages: Raw.ChatMessage[]) {
    const MaxCacheBreakpoints = 4;
    // ...walking messages in reverse, placing breakpoints strategically
}

See what they're doing? They're identifying the most stable parts of the prompt — the system message, previous tool results that won't change — and placing cache breakpoints after them. The current round's content (which is new every iteration) stays outside the cache. The historical context (which doesn't change) gets cached.

The comment says it all: "There will always be a cache miss when a new turn starts because the previous messages move. During the agentic loop, each request will have a hit on the previous tool result message."
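
[SLIDE: simplified breakpoint placement]

To make that strategy concrete, here's a simplified sketch of my own. It is not the code from cacheBreakpoints.ts: the Message type and markBreakpoints are hypothetical, and it only follows the rules spelled out in the header comment.

```typescript
// Simplified, hypothetical sketch of the strategy in the header comment.

type Role = "system" | "user" | "assistant" | "tool";

interface Message {
    role: Role;
    hasToolCalls?: boolean;    // assistant messages that requested tools
    cacheBreakpoint?: boolean; // set by markBreakpoints
}

const MAX_BREAKPOINTS = 4;

function markBreakpoints(messages: Message[]): void {
    let remaining = MAX_BREAKPOINTS;
    let seenCurrentUser = false;
    // Walk newest to oldest so the most recent stable points get the budget.
    for (let i = messages.length - 1; i >= 0 && remaining > 0; i--) {
        const message = messages[i];
        const next: Message | undefined = messages[i + 1];
        let mark = false;
        if (!seenCurrentUser) {
            if (message.role === "user") {
                seenCurrentUser = true; // the current user message
                mark = true;
            } else {
                // Below the user message: the last tool result of each round.
                mark = message.role === "tool" && next?.role !== "tool";
            }
        } else {
            // Above the user message: assistant messages with no tool calls.
            mark = message.role === "assistant" && !message.hasToolCalls;
        }
        if (mark) {
            message.cacheBreakpoint = true;
            remaining--;
        }
    }
}

const conversation: Message[] = [
    { role: "system" },
    { role: "user" },                          // earlier turn
    { role: "assistant" },                     // earlier answer, no tool calls
    { role: "user" },                          // current user message
    { role: "assistant", hasToolCalls: true }, // round 1
    { role: "tool" },
    { role: "tool" },
    { role: "assistant", hasToolCalls: true }, // round 2
    { role: "tool" },
];
markBreakpoints(conversation);
const markedIndexes = conversation
    .map((m, i) => (m.cacheBreakpoint ? i : -1))
    .filter(i => i >= 0);
console.log(markedIndexes); // [2, 3, 6, 8]
```

Four breakpoints, all spent on spots where the content before them won't change on the next request.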

[SLIDE: diagram showing prompt structure with stable/cached region vs dynamic region]

What this means for you as a developer: if you're building an agentic system, put your stable context — your system prompt, your tool definitions, your background knowledge — before the dynamic content. Structure your prompts so the stable parts are at the top and the changing parts are at the bottom. You'll pay dramatically less per iteration.
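
[SLIDE: why ordering matters for cache hits]

Here's a small sketch of why the ordering matters. As far as I know, the prompt caches I've mentioned match on an exact prefix, so the sketch just measures how many leading chunks two consecutive requests share. The chunk strings are invented for illustration.

```typescript
// Sketch: how much of a request is a reusable prefix of the previous one.

function sharedPrefixLength(previous: readonly string[], next: readonly string[]): number {
    let i = 0;
    while (i < previous.length && i < next.length && previous[i] === next[i]) {
        i++;
    }
    return i;
}

const systemPrompt = "SYSTEM: you are a coding agent";
const toolDocs = "TOOLS: read_file, run_tests";

// Stable content first, changing content last.
const stableFirstA = [systemPrompt, toolDocs, "round 1 result"];
const stableFirstB = [systemPrompt, toolDocs, "round 1 result", "round 2 result"];

// Changing content first (for example, a timestamp at the top).
const dynamicFirstA = ["time: 10:00:01", systemPrompt, toolDocs];
const dynamicFirstB = ["time: 10:00:07", systemPrompt, toolDocs];

console.log(sharedPrefixLength(stableFirstA, stableFirstB));   // 3
console.log(sharedPrefixLength(dynamicFirstA, dynamicFirstB)); // 0
```

One timestamp at the top of the prompt and the shared prefix drops to zero, even though almost everything after it is identical.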

SECTION 5: THE AGENT HARNESS — THE LOOP

[SLIDE: "The Agent Loop"]

Okay. Here's the thing everyone wants to understand.

When you use Copilot in agent mode and it reads five files, creates three new ones, runs your tests, and fixes the failures, none of that is the model doing something special. It's a loop. A while(true) loop in TypeScript. And I can show you the exact code.

[DEMO: Open src/extension/intents/node/toolCallingLoop.ts in vscode-copilot-chat]

This is toolCallingLoop.ts. This is the agent harness. And the relevant method is _runLoop:

private async _runLoop(
    outputStream: ChatResponseStream | undefined, 
    token: CancellationToken
): Promise<IToolCallLoopResult> {
    let i = 0;
    let lastResult: IToolCallSingleResult | undefined;

    while (true) {
        if (lastResult && i++ >= this.options.toolCallLimit) {
            // hit the limit — stop or confirm
            lastResult = this.hitToolCallLimit(outputStream, lastResult);
            break;
        }

        const result = await this.runOne(outputStream, i, token);
        this.toolCallRounds.push(result.round);

        if (!result.round.toolCalls.length || 
            result.response.type !== ChatFetchResponseType.Success) {
            // No tool calls → model is done. Run stop hooks, then break.
            // ...stop hook logic...
            break;
        }
    }
    // ...
}

There it is. while(true). Each iteration:

Build the prompt (with all context and history)
Call the LLM (runOne)
If the response has tool calls → execute them, loop back
If no tool calls → we're done

The loop stops when the model produces a response with no tool calls — it's essentially saying "I don't need to do anything else, here's my answer." Or it hits toolCallLimit, which defaults to something reasonable and can grow in autopilot mode up to 200.
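
[SLIDE: the loop in thirty lines]

If you want to play with the idea yourself, here's a minimal sketch of the same pattern. Everything in it is hypothetical (the types, the fake model, the tool table), and it's synchronous to keep it short, where a real harness awaits network calls.

```typescript
// Minimal sketch of a tool-calling loop with a fake model.

interface ToolCall { name: string; args: string }
interface ModelResponse { text: string; toolCalls: ToolCall[] }
type ModelFn = (transcript: readonly string[]) => ModelResponse;
type ToolTable = { [name: string]: ((args: string) => string) | undefined };

function runLoop(
    userMessage: string,
    model: ModelFn,
    tools: ToolTable,
    toolCallLimit: number,
): { answer: string; rounds: number } {
    const transcript: string[] = [`user: ${userMessage}`];
    let rounds = 0;
    while (true) {
        if (rounds >= toolCallLimit) {
            return { answer: "(stopped: tool call limit reached)", rounds };
        }
        const response = model(transcript); // one LLM call with the full transcript
        rounds++;
        if (response.toolCalls.length === 0) {
            return { answer: response.text, rounds }; // no tool calls: the model is done
        }
        for (const call of response.toolCalls) {
            const tool = tools[call.name];
            const result = tool ? tool(call.args) : `error: unknown tool ${call.name}`;
            transcript.push(`tool_call: ${call.name}(${call.args})`, `tool_result: ${result}`);
        }
    }
}

// Fake model: asks to read a file once, then answers using the result it sees.
const fakeModel: ModelFn = (transcript) => {
    const result = transcript.find(line => line.startsWith("tool_result: "));
    return result
        ? { text: `Done. I saw: ${result.slice("tool_result: ".length)}`, toolCalls: [] }
        : { text: "", toolCalls: [{ name: "read_file", args: "/src/auth/index.ts" }] };
};

const outcome = runLoop("Summarize the auth module", fakeModel, {
    read_file: (path) => `contents of ${path}`,
}, 5);
console.log(outcome); // { answer: "Done. I saw: contents of /src/auth/index.ts", rounds: 2 }
```

Two calls to the model: one that asks for a file, one that answers. The only "memory" is the transcript array the loop keeps appending to.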

[SLIDE: flowchart of the loop — LLM call → check for tool calls → execute tools → loop or exit]

Let me read you this comment from the IToolCallingLoopOptions interface because it's a dead giveaway:

export interface IToolCallingLoopOptions {
    toolCallLimit: number;
    /**
     * What to do when the limit is hit. Defaults to Stop.
     * If set to confirm you can use isToolCallLimitCancellation 
     * and isToolCallIterationIncrease to get followup data.
     */
    onHitToolCallLimit?: ToolCallLimitBehavior;
}

In normal agent mode, hitting the limit shows you a message: "Copilot has been working on this problem for a while. It can continue to iterate, or you can send a new message." In autopilot mode, it silently extends the limit — up to 200 — without asking.

SECTION 6: WHAT `runOne` ACTUALLY DOES

[DEMO: still in toolCallingLoop.ts, scroll to runOne method]

Let me show you what happens inside a single iteration — the runOne method. This is where the real work happens:

public async runOne(
    outputStream: ChatResponseStream | undefined, 
    iterationNumber: number, 
    token: CancellationToken
): Promise<IToolCallSingleResult> {

    // 1. Get available tools
    let availableTools = await this.getAvailableTools(outputStream, token);
    
    // 2. Build the prompt — all conversation history, tool results, etc.
    const context = this.createPromptContext(availableTools, outputStream);
    const buildPromptResult = await this.buildPrompt2(context, outputStream, token);

    // 3. Convert tools to OpenAI function definitions
    const promptContextTools = availableTools.length 
        ? availableTools.map(toolInfo => ({
            name: toolInfo.name,
            description: toolInfo.description,
            parameters: toolInfo.inputSchema,
        }))
        : undefined;

    // 4. Call the LLM, collecting tool calls from the stream
    const toolCalls: IToolCall[] = [];
    const fetchResult = await this.fetch({
        messages: buildPromptResult.messages,
        requestOptions: {
            tools: promptContextTools?.map(tool => ({
                function: { name: tool.name, description: tool.description, parameters: tool.parameters },
                type: 'function',
            })),
        },
        finishedCb: async (text, index, delta) => {
            // Collect tool calls as they stream in
            if (delta.copilotToolCalls) {
                toolCalls.push(...delta.copilotToolCalls.map(call => ({
                    ...call,
                    id: this.createInternalToolCallId(call.id),
                    arguments: call.arguments === '' ? '{}' : call.arguments
                })));
            }
        },
        // ...
    }, token);

    // 5. Return the result — including any tool calls the model made
    return {
        response: fetchResult,
        round: ToolCallRound.create({ response: fetchResult.value, toolCalls, /* ... */ }),
        // ...
    };
}

[SLIDE: sequence diagram for one loop iteration]

Four things to notice here:

One: getAvailableTools is called on every iteration. The tool list can change mid-loop. Some tools only become available after certain conditions are met.

Two: buildPrompt2 is building the entire prompt fresh every time — including the history of all previous tool calls and their results. Those results are what the model uses to figure out what to do next.

Three: Tools get converted to the OpenAI function-calling format on the fly: { type: 'function', function: { name, description, parameters } }. The parameters use JSON Schema. That's the contract the model gets.

Four: Tool calls come back in the stream as delta.copilotToolCalls. They're accumulated as the response streams in. Once the stream completes, those tool calls get packaged into a ToolCallRound and on the next iteration, their results get included in the prompt.
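
[SLIDE: the two conversions, isolated]

Points three and four are easy to try in isolation. Here's a small sketch; ToolInfo and the helper names are hypothetical, and the output shape follows the function-calling format I just described.

```typescript
// Sketch of the tool conversion and the empty-arguments fix-up.

interface ToolInfo { name: string; description: string; inputSchema: object | undefined }

interface FunctionTool {
    type: "function";
    function: { name: string; description: string; parameters: object | undefined };
}

/** Converts tool metadata into the function-calling shape sent with the request. */
function toFunctionTools(tools: readonly ToolInfo[]): FunctionTool[] | undefined {
    if (tools.length === 0) {
        return undefined; // no tools: send no tools array at all
    }
    return tools.map(tool => ({
        type: "function" as const,
        function: { name: tool.name, description: tool.description, parameters: tool.inputSchema },
    }));
}

/** A tool call with no arguments can arrive as an empty string; make it valid JSON. */
function normalizeArguments(raw: string): string {
    return raw === "" ? "{}" : raw;
}

const converted = toFunctionTools([
    {
        name: "read_file",
        description: "Read the contents of a file.",
        inputSchema: { type: "object", required: ["filePath"], properties: { filePath: { type: "string" } } },
    },
]);
console.log(converted?.[0].function.name);       // "read_file"
console.log(JSON.parse(normalizeArguments(""))); // {}
```

That empty-string check looks trivial, but without it a single JSON.parse on a no-argument tool call would throw and take the whole iteration down with it.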

SECTION 7: HOW TOOLS ARE REGISTERED

[DEMO: Open src/vs/workbench/contrib/chat/common/tools/builtinTools/tools.ts in vscode]

Now let's look at the other side: how do tools get registered in the first place?

export class BuiltinToolsContribution extends Disposable 
    implements IWorkbenchContribution {

    constructor(
        @ILanguageModelToolsService toolsService: ILanguageModelToolsService,
        @IInstantiationService instantiationService: IInstantiationService,
    ) {
        super();

        const editTool = instantiationService.createInstance(EditTool);
        this._register(toolsService.registerTool(EditToolData, editTool));

        const askQuestionsTool = this._register(
            instantiationService.createInstance(AskQuestionsTool)
        );
        this._register(toolsService.registerTool(AskQuestionsToolData, askQuestionsTool));
        
        // ...many more tools...
        
        const taskCompleteTool = instantiationService.createInstance(TaskCompleteTool);
        this._register(toolsService.registerTool(TaskCompleteToolData, taskCompleteTool));
    }
}

It's a contribution. VS Code's dependency injection system instantiates this class on startup, and it registers all the built-in tools with the tool service. Each registerTool call takes two things: the tool data (metadata — name, description, schema) and the tool implementation (the actual code that runs).
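
[SLIDE: a registry in miniature]

The data-plus-implementation split is worth seeing on its own. Here's a stripped-down sketch of a registry. To be clear, this is not VS Code's ILanguageModelToolsService; the class and interface names are mine, for illustration.

```typescript
// Hypothetical miniature tool registry.

interface ToolData { id: string; modelDescription: string }
interface ToolImpl { invoke(input: unknown): string }
interface Registration { dispose(): void }

class ToolRegistry {
    private readonly tools = new Map<string, { data: ToolData; impl: ToolImpl }>();

    /** Registers metadata and implementation together; dispose() unregisters. */
    registerTool(data: ToolData, impl: ToolImpl): Registration {
        if (this.tools.has(data.id)) {
            throw new Error(`Tool "${data.id}" is already registered`);
        }
        this.tools.set(data.id, { data, impl });
        return { dispose: () => { this.tools.delete(data.id); } };
    }

    /** Runs the implementation for a tool the model asked for. */
    invoke(id: string, input: unknown): string {
        const entry = this.tools.get(id);
        if (!entry) {
            throw new Error(`Unknown tool "${id}"`);
        }
        return entry.impl.invoke(input);
    }

    /** Only the metadata goes to the model, never the implementation. */
    listForModel(): ToolData[] {
        return Array.from(this.tools.values()).map(entry => entry.data);
    }
}

const registry = new ToolRegistry();
const registration = registry.registerTool(
    { id: "echo", modelDescription: "Returns its input as text." },
    { invoke: (input) => String(input) },
);
console.log(registry.invoke("echo", "hi"));  // "hi"
registration.dispose();
console.log(registry.listForModel().length); // 0
```

The disposable return value is the same idea as the this._register(...) calls in the real class: when the owner goes away, the tool goes away with it.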

[DEMO: Open src/vs/workbench/contrib/chat/common/tools/languageModelToolsService.ts, look at IToolData]

Here's the IToolData interface — this is the metadata the model actually sees:

export interface IToolData {
    readonly id: string;
    readonly displayName: string;
    readonly modelDescription: string;    // ← this goes to the LLM
    readonly inputSchema?: IJSONSchema;   // ← JSON Schema for parameters
    readonly tags?: readonly string[];
    readonly canBeReferencedInPrompt?: boolean;
    readonly runsInWorkspace?: boolean;
    readonly canRequestPreApproval?: boolean;   // can ask before running
    readonly canRequestPostApproval?: boolean;  // can ask after running
    // ...
}

The modelDescription field is literally what the LLM reads to understand what the tool does and when to call it. This is your tool's documentation for the model. Write it badly and the model will misuse the tool or not use it at all.

[DEMO: Open src/extension/tools/node/readFileTool.tsx in vscode-copilot-chat]

Here's a real example — the read_file tool:

export const readFileV2Description: vscode.LanguageModelToolInformation = {
    name: ToolName.ReadFile,
    description: 'Read the contents of a file. Line numbers are 1-indexed. ' +
        'This tool will truncate its output at 2000 lines and may be called ' +
        'repeatedly with offset and limit parameters to read larger files in chunks. ' +
        'Binary files use offset/limit as byte offsets.',
    inputSchema: {
        type: 'object',
        required: ['filePath'],
        properties: {
            filePath: {
                description: 'The absolute path of the file to read.',
                type: 'string'
            },
            offset: {
                description: 'Optional: the 1-based line number to start reading from. ' +
                    'Only use this if the file is too large to read at once.',
                type: 'number'
            },
            limit: {
                description: 'Optional: the maximum number of lines to read. ' +
                    'Only use this together with offset if the file is too large.',
                type: 'number'
            },
        }
    },
};

Notice how carefully written this description is. It tells the model: the truncation limit, when to use offset/limit, what the line numbering convention is. All of this gets sent to the LLM every single call, in the tools array. The model reads these descriptions to decide which tools to call and how to call them.
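
[SLIDE: the offset/limit contract]

To show what that contract means in practice, here's a sketch that applies it to an in-memory string instead of a real file. readLines and MAX_LINES are hypothetical names; only the 2000-line cap and the 1-based offset come from the description.

```typescript
// Sketch of chunked reading with a 1-based offset and a line limit.

const MAX_LINES = 2000;

interface ReadResult { text: string; startLine: number; endLine: number; truncated: boolean }

function readLines(content: string, offset = 1, limit = MAX_LINES): ReadResult {
    const lines = content.split("\n");
    const start = Math.max(1, offset);        // 1-based, inclusive
    const count = Math.min(limit, MAX_LINES); // never return more than the cap
    const end = Math.min(lines.length, start + count - 1);
    return {
        text: lines.slice(start - 1, end).join("\n"),
        startLine: start,
        endLine: end,
        truncated: end < lines.length,        // more to read with a higher offset
    };
}

const bigFile = Array.from({ length: 5000 }, (_, i) => `line ${i + 1}`).join("\n");
const firstChunk = readLines(bigFile);
const secondChunk = readLines(bigFile, firstChunk.endLine + 1);
console.log(firstChunk.endLine, firstChunk.truncated);   // 2000 true
console.log(secondChunk.startLine, secondChunk.endLine); // 2001 4000
```

That second call is exactly what the description is coaching the model to do: see the truncation, then come back with an offset.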

SECTION 8: WHAT THE PROMPT ACTUALLY LOOKS LIKE

[SLIDE: "Anatomy of an Agent Prompt"]

Let me show you the structure of what actually gets sent to the LLM during an agentic loop. This is from agentPrompt.tsx — the main prompt component for agent mode.

[DEMO: Open src/extension/prompts/node/agent/agentPrompt.tsx]

The prompt is built with a library called @vscode/prompt-tsx — basically JSX for prompts. It renders into a list of messages. The structure, in order, is:

┌─────────────────────────────────────────┐
│  System message (base instructions)     │  ← stable, cached
│  Safety rules                           │
│  Custom instructions (user-defined)     │  ← fairly stable
│  Workspace context (file structure)     │  ← stable per turn
│  ─────────────────── cache breakpoint ─│
│  Conversation history (prior turns)     │  ← grows over time
│  Previous tool call rounds              │  ← grows over time
│  ─────────────────── cache breakpoint ─│
│  Current user message                   │  ← new each turn
│  Current tool results                   │  ← new each iteration
└─────────────────────────────────────────┘

[SLIDE: same diagram, highlighted showing "what the model sees on call 10 of the loop"]

By iteration 10, that "current tool results" section is growing fast — it contains 9 rounds of tool calls and results. The system prompt and conversation history? They've been cached since iteration 1.

And here's the key thing: the model doesn't "remember" between iterations. Every single time you call it, you send this whole thing from scratch. The illusion of the model "working toward a goal" comes entirely from the prompt containing the accumulated history of what it's done so far.

The agent is stateless. The harness provides the state.
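
[SLIDE: the harness owns the state]

Here's that idea as a sketch. HarnessState and buildMessages are hypothetical names, and the ordering follows the comment we saw in cacheBreakpoints.ts: stable content first, growing content last.

```typescript
// Sketch: the harness holds the state and rebuilds the full message list
// on every iteration.

interface ChatMessage { role: "system" | "user" | "assistant" | "tool"; content: string }

interface HarnessState {
    systemPrompt: string;
    customInstructions: string;
    priorTurns: ChatMessage[];  // earlier user/assistant exchanges
    currentUserMessage: string;
    toolRounds: ChatMessage[];  // assistant tool calls and tool results so far
}

/** Builds the whole prompt from scratch; the model remembers nothing itself. */
function buildMessages(state: HarnessState): ChatMessage[] {
    return [
        { role: "system", content: state.systemPrompt },
        { role: "system", content: state.customInstructions },
        ...state.priorTurns,
        { role: "user", content: state.currentUserMessage },
        ...state.toolRounds,
    ];
}

const harnessState: HarnessState = {
    systemPrompt: "You are a coding agent.",
    customInstructions: "Prefer small, focused changes.",
    priorTurns: [],
    currentUserMessage: "Refactor the auth module to use JWT.",
    toolRounds: [],
};

const firstRequest = buildMessages(harnessState);

// After one round, the harness records what happened and rebuilds everything.
harnessState.toolRounds.push(
    { role: "assistant", content: "tool call: read_file /src/auth/index.ts" },
    { role: "tool", content: "export function login() {}" },
);
const secondRequest = buildMessages(harnessState);

console.log(firstRequest.length, secondRequest.length); // 3 5
```

The first three messages of the second request are identical to the first request. That's both the statelessness and the cacheable prefix, in one picture.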

SECTION 9: TOOLS → TOOL RESULTS → NEXT PROMPT

[SLIDE: "The Full Cycle"]

Let me walk through one complete cycle so you can see it clearly:

[SLIDE: step-by-step numbered diagram]

Step 1: User says: "Refactor the auth module to use JWT."

Step 2: The harness builds the initial prompt and calls the LLM with the available tools listed.

Step 3: LLM responds: "I need to read the auth module first." It outputs a tool call:

{
  "type": "function",
  "id": "call_abc123",
  "function": {
    "name": "read_file",
    "arguments": "{\"filePath\": \"/src/auth/index.ts\"}"
  }
}

Step 4: The loop collects this. No text output yet (or a brief thinking comment). The iteration ends.

Step 5: The tool read_file is invoked with those arguments. It reads the file and returns the contents as a LanguageModelToolResult.

Step 6: On the next iteration, the prompt includes both the tool call AND its result, as a tool role message. The model can see: "I asked to read this file, and here's what it contained."

Step 7: Now the LLM responds with another tool call — or several: maybe create_file, replace_string, multiple edits. Each one gets executed.

Step 8: This continues until the LLM produces a response with no tool calls — it generates explanatory text saying what it did. The loop exits.

That's it. That's the whole thing.
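
[SLIDE: from tool call to tool message]

Steps 4 through 6 fit in one small function, so here's a sketch. The tool table and the fake file map are hypothetical, and sending errors back to the model as text is my assumption about a reasonable design, not something I've shown from the source. The tool call JSON is the one from step 3.

```typescript
// Sketch: take a tool call, run the tool, and produce the tool-role
// message that goes into the next prompt.

interface ToolCallJson {
    type: "function";
    id: string;
    function: { name: string; arguments: string }; // arguments is a JSON string
}

interface ToolMessage { role: "tool"; tool_call_id: string; content: string }

type ToolFn = (args: Record<string, unknown>) => string;

function executeToolCall(call: ToolCallJson, tools: { [name: string]: ToolFn | undefined }): ToolMessage {
    let content: string;
    try {
        const tool = tools[call.function.name];
        if (!tool) {
            throw new Error(`unknown tool ${call.function.name}`);
        }
        const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
        content = tool(args);
    } catch (error) {
        // Assumption: errors are returned as text so the model can react next iteration.
        content = `Error: ${error instanceof Error ? error.message : String(error)}`;
    }
    return { role: "tool", tool_call_id: call.id, content };
}

const fakeFiles: Record<string, string> = { "/src/auth/index.ts": "export function login() {}" };
const toolTable: { [name: string]: ToolFn | undefined } = {
    read_file: (args) => fakeFiles[String(args["filePath"])] ?? "Error: file not found",
};

const toolMessage = executeToolCall(
    {
        type: "function",
        id: "call_abc123",
        function: { name: "read_file", arguments: '{"filePath": "/src/auth/index.ts"}' },
    },
    toolTable,
);
console.log(toolMessage);
// { role: "tool", tool_call_id: "call_abc123", content: "export function login() {}" }
```

The id is what ties the result back to the request. On the next iteration the model sees both, side by side.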

SECTION 10: HOW THIS CHANGES HOW YOU WORK

[ON CAMERA]

Okay. So now that you know all this, what do you actually do differently?

[SLIDE: "Practical implications for prompt writing"]

First: Be explicit about scope early. The model processes your request on the first call without any knowledge of your codebase. The more specific you are upfront — "refactor the UserAuthService class in src/services/auth.ts to use JWT, keeping the existing IAuthService interface intact" — the better the first tool calls will be. Vague prompts lead to exploratory tool calls that waste iterations and tokens.

Second: Think in tool calls. When you break down a task, think about how many tool calls it will require. Reading 10 files + editing 5 + running tests = at least 16 tool calls, and close to that many iterations if the model issues them one at a time. That's normal. But if you're asking for something that would require reading your entire codebase, you might be setting up for a context window explosion. Give the model pointers: "@workspace what's the auth module structure?" first, then ask for refactoring.

Third: Understand what "agent mode" means for cost. In agent mode, each user message can trigger 5-20 LLM calls, each with a growing prompt. This is why prompt caching exists and why it matters. If you're building your own agent, structure your prompts to maximize cache reuse — stable context at the top, dynamic context at the bottom.

Fourth: Your custom instructions are baked in every time. Those .github/copilot-instructions.md files? They're included in every single prompt. If they're long and rambling, they're expensive. Keep them focused. Use them for things the model genuinely needs to know on every call — project conventions, preferred libraries, coding style. Not a history of your project.

Fifth: Tool descriptions are model-facing documentation. If you're building a VS Code extension that registers tools, your modelDescription is not for users — it's for the model. Write it the way you'd write a docstring for a very literal reader who has to decide in milliseconds whether to call this function. Include: what the tool does, when to use it, what the parameters mean, edge cases to be aware of.

[SLIDE: "Writing a good tool description"]

Bad:

description: 'Runs the tests'

Good:

description: 'Run the project test suite using the configured test runner. ' +
  'Use this after making code changes to verify correctness. ' +
  'Returns pass/fail counts, failed test names, and stdout/stderr. ' +
  'If no testPattern is specified, runs all tests. ' +
  'Running tests can take 30-60 seconds — only call this when needed.'

The extra detail prevents the model from calling the tool randomly and helps it understand when not to call it too.
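
[SLIDE: the description plus its schema]

A description only works if it lines up with the schema. Here's a hypothetical tool definition to go with the good example. The run_tests name, the testPattern parameter, and the little check at the end are all mine, for illustration; the shape mirrors the read_file definition from earlier.

```typescript
// Hypothetical tool definition plus a simple consistency check.

interface ToolDefinition {
    name: string;
    description: string;
    inputSchema: {
        type: "object";
        required: string[];
        properties: Record<string, { type: string; description: string }>;
    };
}

const runTestsTool: ToolDefinition = {
    name: "run_tests",
    description:
        "Run the project test suite using the configured test runner. " +
        "Use this after making code changes to verify correctness. " +
        "Returns pass/fail counts, failed test names, and stdout/stderr. " +
        "If no testPattern is specified, runs all tests. " +
        "Running tests can take 30-60 seconds, so only call this when needed.",
    inputSchema: {
        type: "object",
        required: [], // testPattern is optional
        properties: {
            testPattern: {
                type: "string",
                description: "Optional: only run tests whose name matches this pattern.",
            },
        },
    },
};

/** Lists schema parameters that the description never mentions by name. */
function undocumentedParameters(tool: ToolDefinition): string[] {
    return Object.keys(tool.inputSchema.properties).filter(
        parameter => !tool.description.includes(parameter),
    );
}

console.log(undocumentedParameters(runTestsTool)); // []
```

A check like that is crude, but it catches the most common mistake: adding a parameter to the schema and never telling the model it exists.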

SECTION 11: AUTOPILOT, STOP HOOKS, AND THE TASK COMPLETE TOOL

[ON CAMERA]

One more thing I want to show you because it's genuinely interesting.

[DEMO: back to toolCallingLoop.ts, look for task_complete and autopilot logic]

In autopilot mode, there's a special tool called task_complete. The model is supposed to call this when it believes the task is actually done. The harness looks for it:

// If the model produced productive tool calls after being nudged,
// reset the stop hook flag
if (this.autopilotStopHookActive && result.round.toolCalls.length 
    && !result.round.toolCalls.some(
        tc => tc.name === ToolCallingLoop.TASK_COMPLETE_TOOL_NAME
    )) {
    this.autopilotStopHookActive = false;
    this.autopilotIterationCount = 0;
}

And there are "stop hooks" — a hook system that runs before the loop terminates. Extensions can register a stop hook that says "wait, you're not done yet." The model then gets this message:

function formatHookContext(reasons: readonly string[]): string {
    return `You were about to complete but a hook blocked you with the 
            following message: "${reasons[0]}". 
            Please address this requirement before completing.`;
}

This is how .agents files work — they can define hooks that enforce criteria before an agent is considered done. Like: "you must run the tests and they must pass." The model will keep going until those hooks don't block anymore.

It's a neat pattern. The loop is not just LLM → tools → done. There are gates.
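
[SLIDE: a gate in front of "done"]

Here's the gate idea as a sketch. StopHook and checkStopHooks are hypothetical names, not the extension's API, and the hook just searches a transcript of strings for a passing test run.

```typescript
// Hypothetical sketch of a stop-hook gate.

interface HookDecision { block: boolean; reason?: string }
type StopHook = (transcript: readonly string[]) => HookDecision;

/** Returns the reasons from every hook that wants the loop to keep going. */
function checkStopHooks(hooks: readonly StopHook[], transcript: readonly string[]): string[] {
    const reasons: string[] = [];
    for (const hook of hooks) {
        const decision = hook(transcript);
        if (decision.block) {
            reasons.push(decision.reason ?? "a hook blocked completion");
        }
    }
    return reasons;
}

// Example hook: the agent may not finish until a passing test run is in the transcript.
const testsMustPass: StopHook = (transcript) =>
    transcript.some(line => line.includes("tests passed"))
        ? { block: false }
        : { block: true, reason: "Run the tests and make sure they pass." };

console.log(checkStopHooks([testsMustPass], ["edited auth.ts"]));
// [ "Run the tests and make sure they pass." ]
console.log(checkStopHooks([testsMustPass], ["edited auth.ts", "tool_result: 12 tests passed"]));
// []
```

When that list comes back non-empty, the harness turns the reasons into a message for the model and goes around the loop again instead of exiting.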

WRAP UP

[ON CAMERA]

Let me leave you with the mental model I want you to carry:

[SLIDE: final summary diagram]

An LLM is a stateless token-prediction machine. It has no memory, no agency, no ability to act. What makes it feel agentic is a harness that:

Maintains state (the conversation history, the tool results)
Defines a vocabulary of actions (the tools, with their descriptions and schemas)
Runs a loop: build prompt → call model → execute tools → repeat
Handles the operational concerns: caching, limits, hooks, retries

The model is powerful. But it's a component, not an agent. The harness makes it an agent.

When you write prompts, you're writing inputs to that component. When you attach files, you're filling the context window. When you use agent mode, you're kicking off that loop. Understanding this doesn't make the magic go away — it just means you can use it intentionally instead of randomly.

The code we looked at today lives at:

microsoft/vscode-copilot-chat — src/extension/intents/node/toolCallingLoop.ts (the loop)
microsoft/vscode-copilot-chat — src/extension/intents/node/cacheBreakpoints.ts (caching strategy)
microsoft/vscode — src/vs/workbench/contrib/chat/common/tools/ (tool service and registration)
microsoft/vscode-copilot-chat — src/extension/tools/node/ (tool implementations)

Go read them. They're not scary. They're just TypeScript.

See you in the next one.

DEMO FILE REFERENCE

For the on-screen demos, here are the exact files and key line numbers:

| Demo | Repo | File | Lines of Interest |
| --- | --- | --- | --- |
| The main loop | `vscode-copilot-chat` | `src/extension/intents/node/toolCallingLoop.ts` | `_runLoop` method: the `while(true)` starting around line 783 |
| `runOne` method | `vscode-copilot-chat` | `src/extension/intents/node/toolCallingLoop.ts` | `runOne` method starting around line 1037 |
| Cache breakpoints | `vscode-copilot-chat` | `src/extension/intents/node/cacheBreakpoints.ts` | Full file (it's short, ~90 lines) |
| Tool registration | `vscode` | `src/vs/workbench/contrib/chat/common/tools/builtinTools/tools.ts` | `BuiltinToolsContribution` class |
| Tool data interface | `vscode` | `src/vs/workbench/contrib/chat/common/tools/languageModelToolsService.ts` | `IToolData` interface around line 50 |
| Tool implementation | `vscode-copilot-chat` | `src/extension/tools/node/readFileTool.tsx` | `readFileV2Description` at top + implementation |
| Agent prompt structure | `vscode-copilot-chat` | `src/extension/prompts/node/agent/agentPrompt.tsx` | `render()` method showing prompt composition |

SLIDE DECK NOTES

Slide 1 — Tokens: Show a sentence broken into tokens with visual separators. Use tiktoken playground screenshot or recreate it. Good example: a few lines of TypeScript to show how dense code tokenizes.

Slide 2 — Autoregressive generation: Simple animation or still image. Left-to-right sequence, each step adds one token, arrow pointing forward.

Slide 3 — Attention: A simple matrix-style diagram is enough. Show one token "attending" to multiple earlier tokens with varying line weights.

Slide 4 — Prompt caching: Stacked block diagram. Stable block (green, ✓ cached) on top. Dynamic block (yellow, ✗ not cached) on bottom. Arrow showing "new each call" vs "reused."

Slide 5 — Agent Loop Flowchart:

[User message] 
     ↓
[Build prompt]
     ↓
[Call LLM]
     ↓
[Tool calls in response?] — NO → [Done: return to user]
     ↓ YES
[Execute tools, collect results]
     ↓
[Add results to prompt history]
     ↓
[Back to Build prompt]

Slide 6 — Prompt anatomy: Stacked layout with labels and cache breakpoint markers. Makes the "what the model sees" concept concrete.

Slide 7 — Final mental model: Clean summary diagram. Three boxes: "Stateless LLM" + "Tool Definitions" + "Loop/Harness" = "Agent Behavior."

Total estimated runtime: 17-20 minutes with demos

Good for: YouTube tutorial, conference talk (cut demos for 12 min version)
