@rohitg00
Created April 14, 2026 09:54

Pro Workflow: a system for working with AI coding agents that compounds across sessions. Lessons from building pro-workflow with patterns from Karpathy, Boris Cherny, Lance Martin, HumanLayer, and others.

Pro Workflow

A system for working with AI coding agents that compounds across sessions. Distilled from running pro-workflow in production for several months: shipped initially as a Twitter thread distillation, rewritten three times since, currently 24 skills, 8 agents, 21 commands, 29 hook scripts across 24 hook events, with a SQLite-backed learning store underneath.

This document is what we wish we'd had on day one. Not the marketing version. The honest version: what worked, what we got wrong, who we learned from, and why the system today looks nothing like the system we started with.

The foundation is Andrej Karpathy's observation that AI writes 80% of his code and the other 20% of the work is reviewing and correcting it. The ratio is correct. The interesting question is what happens to the 20%, and the boring answer most workflows ship is "nothing." Corrections come in, mistakes get patched, the next session starts at zero. We've been chasing a different answer: the 20% is the most valuable signal in the whole loop, and a workflow that doesn't capture it is leaking the only thing that compounds.

What Karpathy got right, and what nobody else acted on

The 80/20 framing is a year old now. It got widely retweeted. Most reactions were either "wow great quote" or "he's wrong, more like 95/5." Both miss the point.

The actionable insight isn't the ratio. It's the asymmetry. Code generation is cheap and fast. Correction is expensive and slow. If you can shift work from the expensive side to the cheap side, you win. Every minute of correction prevented is worth ten minutes of generation, because correction breaks flow and generation doesn't.

Pro-workflow exists to act on that asymmetry. Every pattern in it either prevents corrections (so you spend less of the expensive 20%) or captures corrections (so you don't pay for the same one twice). Nothing else in the system matters as much as those two things.

The follow-up insight, also from Karpathy, came in his October post on coding pitfalls: models silently pick interpretations, overcomplicate diffs, refactor adjacent code they weren't asked to touch, and remove things they don't understand. He named the failure modes precisely. We encoded them as rules in a recent version.

The keystone: self-correction loop

If you adopt nothing else from pro-workflow, adopt this. It's the pattern the entire system orbits.

When the agent makes a mistake, three things happen in order. First, it acknowledges what specifically went wrong. Not "I made an error" but "I edited src/utils.ts when you meant src/lib/utils.ts." Specificity is non-negotiable because vague acknowledgements teach nothing. Second, it proposes a rule that would have prevented the mistake, formatted for capture: [LEARN] Navigation: Confirm full path before editing files with common names. Third, it waits for your approval before persisting anything.

The approval gate is the part most implementations get wrong. We got it wrong first too. The first version of pro-workflow auto-saved every correction the agent flagged. Within a week the rules file was full of contradictions, overcorrections, and misreadings of what the user actually meant. By month two we had to add manual approval, which dropped the noise rate by about 90%. The lesson generalized: any system that converts user feedback into permanent state needs a human in the loop, every time, no exceptions. Auto-capture is how you train the agent to be wrong.
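A minimal sketch of the capture-then-approve step, assuming the `[LEARN] Category: rule` line format from the example above. The names `Proposal`, `extract_proposals`, and `persist_approved` are hypothetical and are not taken from pro-workflow's code; the point is only that nothing reaches the store unless the approval callback says yes.

```python
import re
from dataclasses import dataclass

# Assumed format, based on the example in the text: "[LEARN] Category: rule text"
LEARN_PATTERN = re.compile(
    r"^\[LEARN\][ \t]*(?P<category>[^:\n]+):[ \t]*(?P<rule>.+)$",
    re.MULTILINE,
)


@dataclass(frozen=True)
class Proposal:
    category: str
    rule: str


def extract_proposals(message: str) -> list[Proposal]:
    """Find every [LEARN] line in an assistant message."""
    return [
        Proposal(m["category"].strip(), m["rule"].strip())
        for m in LEARN_PATTERN.finditer(message)
    ]


def persist_approved(proposals, approve, store):
    """Append only the proposals the human approves; return what was saved."""
    saved = []
    for proposal in proposals:
        if approve(proposal):  # the approval gate: no callback, no write
            store.append(proposal)
            saved.append(proposal)
    return saved
```

In an interactive session, `approve` would prompt the user; in the sketch it is any callable that returns True or False.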

The second thing we got wrong was storage. Version 1 stored learnings as a markdown section in CLAUDE.md. That works up to maybe 30 rules. Past that, the file becomes too long for the agent to parse efficiently, and you can't search it. Version 2 split learnings into their own file. Version 3 moved them into SQLite with FTS5 full-text search, which is the current architecture. The migration was painful and worth it. With 100+ rules accumulated across projects, search isn't optional. You need to be able to ask "what did I learn about testing?" and get the relevant five rules in context, not all hundred.
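To make the storage point concrete, here is a small sketch of a searchable rule store using Python's built-in `sqlite3` module and an FTS5 virtual table. The table name `learnings`, its two columns, and the helper names are assumptions for illustration, not pro-workflow's actual schema, and the sketch assumes the local SQLite build includes FTS5 (most do).

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a rule store backed by an FTS5 full-text index."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS learnings USING fts5(category, rule)"
    )
    return conn


def add_rule(conn: sqlite3.Connection, category: str, rule: str) -> None:
    conn.execute(
        "INSERT INTO learnings (category, rule) VALUES (?, ?)", (category, rule)
    )
    conn.commit()


def search(conn: sqlite3.Connection, query: str, limit: int = 5):
    """Return the best-matching rules for an FTS5 query, most relevant first.

    The query uses FTS5 syntax, so pass keywords ("testing"), not a full
    natural-language question.
    """
    rows = conn.execute(
        "SELECT category, rule FROM learnings "
        "WHERE learnings MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Asking "what did I learn about testing?" then becomes `search(conn, "testing")`, which returns at most five rows rather than the whole table.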

The mechanics of the loop sound simple but they break in subtle ways. Three traps:

Auto-saving without approval generates noise within a week. An approval gate filters that noise out before it is stored.

Vague rules teach nothing. "Be more careful" is not a rule. "Confirm full path before editing files with common names" is. The agent should be coached to write specific rules and you should reject vague ones with /learn-rule until it learns the format.

Capture friction kills the habit. If recording a learning requires switching tools, opening a different file, breaking flow, you'll stop. Pro-workflow inlines capture into the same conversation where the mistake happened, with a Stop hook that auto-detects [LEARN] blocks in the assistant's last message. Friction matters more than features.

Pre-flight discipline: catching mistakes before they happen

Self-correction is the lower half of the loop. It catches things after they break. The upper half is preventing the break in the first place.

The four pre-flight rules (a recent addition, encoded directly from Karpathy's observations on LLM coding pitfalls) target the specific failure modes that no amount of post-hoc correction prevents:

Surface, don't assume. State assumptions before implementing. If multiple interpretations exist, present them. If a simpler approach exists than what was asked, say so. The cost of one clarifying question is always less than the cost of unwinding 200 lines down the wrong interpretation.

Minimum viable code. No features beyond what was asked. No abstractions for single-use. No error handling for impossible scenarios. The senior-engineer test: would a senior engineer call this overcomplicated? If yes, simplify before showing.

Stay in your lane. Every changed line must trace to the user's request. No drive-by refactors. No "improvements" to adjacent code. Spotted dead code? Mention it but don't delete it. The reason isn't that improvements are bad, it's that bundled improvements make diffs unreviewable, and unreviewable diffs ship bugs.

Verifiable goals. Convert "fix the bug" into "write a failing test that reproduces it, then make it pass." Strong success criteria let the agent loop independently. Weak criteria require constant re-clarification.

These pair with self-correction in a specific way: pre-flight stops the mistake, self-correction captures the lesson when one slips through. Each pattern fixes the other's blind spot. Without pre-flight, self-correction has too much to capture and the rule store becomes a junk drawer. Without self-correction, pre-flight has no learning loop and stays static while the project evolves.
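The "verifiable goals" rule is easiest to see on a toy case. The sketch below uses a made-up `slugify` bug (both functions are hypothetical and exist only for illustration): the test is written first, fails against the buggy version, and passing it becomes the success criterion the agent can loop on without asking again.

```python
import re


def slugify_buggy(title: str) -> str:
    # Hypothetical original: repeated spaces produce repeated hyphens.
    return title.lower().replace(" ", "-")


def slugify(title: str) -> str:
    # Fix: collapse any run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_slugify_collapses_repeated_spaces():
    # Written before the fix. Against slugify_buggy this fails,
    # because the buggy output is "hello--world".
    assert slugify("Hello  World") == "hello-world"
```

"Fix the bug" has become "make `test_slugify_collapses_repeated_spaces` pass", which either happens or doesn't.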

Hooks: the enforcement layer

Rules in markdown are suggestions. The agent might follow them, might not. Hooks turn rules into enforcement that doesn't depend on the agent remembering.

Pro-workflow started with 8 hook events in v1. By v2 it was 18. Today it's 24, with 29 scripts, including the recently shipped LLM gate hooks (type: "prompt") that use a second LLM call to validate things like commit messages and secret detection. The progression matters because each new hook event was a response to a specific failure mode that nudges and rules couldn't cover.

A few examples of what each layer does:

SessionStart loads accumulated learnings from SQLite into the agent's context. Without this hook, every session restarts with no memory. With it, the agent at month 6 starts every session with a hundred rules already in its head.

PreToolUse on Edit/Write tracks edit count. At 5 edits, soft reminder to run quality gates. At 10, harder reminder. Not blocking, just nudging at intervals where the cost of remembering is low and the cost of forgetting is high.

PostToolUse on code edits scans the diff in real time for console.log, debugger statements, hardcoded secrets. Catches things before they hit version control without running a separate linter.

Stop uses the agent's last assistant message to give context-aware reminders. If the agent just edited five files, it nudges toward a quality check. If it just ran tests, it nudges toward a commit. The nudges feel intelligent because they're based on what the agent actually did, not a generic checklist.

LLM gates (the v3.2 leap) use a separate LLM call to validate things humans would normally catch in review. Commit message quality, secret leakage, dangerous bash patterns. The model that writes the code is not the model that gates the code. Same family, different prompts, different context. This is one of the few places where adding LLM calls is unambiguously net positive: the cost is small, the catch rate is high, and the failure mode (false positive) is cheap to override.

PermissionDenied tracks denial patterns. After a week, the /permission-tuner skill analyzes denials and proposes optimized allow/deny rules. Most prompt fatigue in Claude Code comes from the same five denials repeating. Track them, batch them, fix them once.
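The denial-tracking idea reduces to counting. A rough sketch, assuming denials are logged as plain command strings (the log shape and both function names are hypothetical, not the format the PermissionDenied hook or /permission-tuner actually uses):

```python
from collections import Counter


def top_denials(denied_commands, n=5):
    """Return the n most frequently denied commands with their counts."""
    return Counter(denied_commands).most_common(n)


def propose_allow_rules(denied_commands, min_count=3):
    """Suggest commands denied at least min_count times.

    These are proposals for a human to review and batch-approve,
    not rules to apply automatically.
    """
    counts = Counter(denied_commands)
    return sorted(cmd for cmd, count in counts.items() if count >= min_count)
```

One-off denials stay below the threshold and never show up as proposals, which keeps the review list short.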

The hook philosophy is non-blocking by default, blocking only on genuinely destructive operations. Force-pushes to main, rm -rf against unfamiliar directories, DROP TABLE. The asymmetry is correct: nagging on minor things teaches you to disable hooks, but blocking on destructive things saves you from yourself. Get the asymmetry right and hooks feel like a partner. Get it wrong and they feel like a bureaucracy.
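The nudge-versus-block asymmetry can be sketched as two small decision functions. The thresholds (5 and 10 edits) and the three destructive examples come from the text; the function names, message wording, and matching logic are assumptions, and the `rm -rf` check is simplified to flag every recursive forced delete rather than only those against unfamiliar directories.

```python
import re
from typing import Optional

SOFT_REMINDER_AT = 5
HARD_REMINDER_AT = 10
DROP_TABLE = re.compile(r"\bdrop\s+table\b", re.IGNORECASE)


def edit_reminder(edit_count: int) -> Optional[str]:
    """Non-blocking nudge at fixed intervals; None means stay silent."""
    if edit_count == SOFT_REMINDER_AT:
        return "Consider running the quality gates."
    if edit_count == HARD_REMINDER_AT:
        return "10 edits without a check: run the quality gates now."
    return None


def check_command(command: str) -> tuple[str, str]:
    """Return ("block", reason) for destructive commands, else ("allow", "")."""
    tokens = command.split()
    if (
        tokens[:2] == ["git", "push"]
        and ("--force" in tokens or "-f" in tokens)
        and "main" in tokens
    ):
        return ("block", "force-push to main")
    # Collect single-dash flags, so "-rf", "-fr", and "-r -f" all count.
    flags = "".join(
        t[1:] for t in tokens if t.startswith("-") and not t.startswith("--")
    )
    if "rm" in tokens and "r" in flags and "f" in flags:
        return ("block", "recursive forced delete")
    if DROP_TABLE.search(command):
        return ("block", "DROP TABLE")
    return ("allow", "")
```

Everything that is not on the short destructive list passes through, so the only hard stops are the ones worth the interruption.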

Orchestration: Command, Agent, Skill

For features that touch many files or need architecture decisions, the right pattern is three layers, each with a single job.

A command is the user-facing entry point. /develop, /wrap-up, /learn-rule. Commands are thin. They route to agents.

An agent is the executor. Constrained tools, specific instructions, preloaded skills. Pro-workflow ships eight: planner (read-only, approval-gated), reviewer (security and logic), scout (background exploration), orchestrator (multi-phase), debugger (hypothesis-driven), context-engineer (window analysis), permission-analyst, cost-analyst. Each has one job and does it well. Generic "do everything" agents always lose to specialized ones.

A skill is domain knowledge, loaded into the agent at startup. Skills are reusable across agents. The same api-conventions skill can inform the planner, the reviewer, and the orchestrator, each of which uses it differently. Skills are how you give an agent context about your specific project without polluting global state.

The pattern matters because each layer changes independently. You can swap the workflow (command), the executor (agent), or the knowledge (skill) without rewriting everything else. Without the layering, you end up with monolithic prompts that try to be everything, and they get unwieldy fast.

The canonical example is multi-phase development. /develop launches the orchestrator agent, which runs four phases: research (explores the codebase, scores confidence 0-100), plan (presents approach, files to change, risks; waits for approval), implement (executes the plan with quality gates every five edits), review (the reviewer agent checks the work). You never skip phases. You never proceed without approval between them. The structure prevents the most common failure mode: ask for a feature, get a 500-line diff back, have no idea whether it's right.
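The phase structure above amounts to a loop with a gate after each step. A minimal sketch, where `run_phase` and `approve` are hypothetical callables standing in for the agent and the human, and the phase names are the four listed in the text:

```python
from typing import Callable

PHASES = ["research", "plan", "implement", "review"]


def run_phases(
    run_phase: Callable[[str], str],
    approve: Callable[[str, str], bool],
) -> list[str]:
    """Run phases in order; stop at the first result that is not approved.

    No phase is skipped, and a later phase never starts unless every
    earlier one was approved.
    """
    completed = []
    for phase in PHASES:
        result = run_phase(phase)
        if not approve(phase, result):
            break
        completed.append(phase)
    return completed
```

If the plan is rejected, `implement` is never called, which is exactly the property that keeps a surprise 500-line diff from appearing.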

This pattern came from HumanLayer's Research-Plan-Implement workflow, which they call RPI, and from conversations with people running similar setups. Boris Cherny (Claude Code's creator) has made the same point repeatedly: "If you do something more than once a day, turn it into a skill or command." We took that literally and ended up with 21 commands.

Parallel worktrees and agent teams

Most AI coding sessions have long pauses. The agent thinks, runs tests, builds, waits for CI. The natural human response is to context-switch to Slack. The better response is to start another agent on a different problem.

Native worktrees in Claude Code 2.1.49 made this easy. claude --worktree creates an isolated copy of the repo, the agent works there, your main checkout stays untouched. While Agent A waits on tests, Agent B starts a new feature. While Agent B waits on a build, Agent C debugs an unrelated issue. Three agents in flight, zero conflict.

The mechanism isn't the hard part. The discipline is. If you treat parallel sessions as exotic, you'll start one and forget. If you treat them as default, you'll naturally reach for claude -w whenever you sense a long operation coming. Pro-workflow's parallel-worktrees skill encodes the decision rules, and the WorktreeCreate and WorktreeRemove hooks track the lifecycle so you don't end up with 12 abandoned worktrees and no idea which is which.

Agent teams (Claude Code's experimental multi-instance feature) take this further. One lead session coordinates, multiple teammates work independently, they message each other directly. We were skeptical of this at first because most "AI coordination" features add overhead without value. Agent teams turn out to be different: they actually save time on parallel reviews, competing hypotheses, and cross-layer changes where the layers can be debugged independently. They don't help on monolithic features, and they hurt when the coordination overhead exceeds the parallelism gain. The decision rule we settled on: use teams when the work decomposes into 3+ genuinely independent threads. Otherwise stick with worktrees and one session per worktree.

Context engineering

Lance Martin's Write/Select/Compress/Isolate framework is the best mental model we've found for managing the agent's context window. We adapted it as the context-engineering skill in v3.0 and the docs guide in docs/context-engineering.md.

Write means producing context: research, planning notes, summaries the agent generates as it works. Most workflows treat writing as a side effect. The framework treats it as a first-class activity that needs structure.

Select means choosing what to load: which files, which docs, which prior conversations. Default behavior loads too much. The discipline is loading the minimum needed for the current step.

Compress means reducing context without losing signal. Summarize a long file into its interface. Extract the relevant 50 lines from a 500-line file. Replace verbose tool outputs with structured summaries.

Isolate means keeping context boundaries clean. Subagents are the primary tool here: spawn a subagent for high-volume work (test runs, log analysis, dependency exploration), let it produce a summary, discard the verbose intermediate state. The main session never sees the noise.
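As one example of the Compress step, "summarize a long file into its interface" can be done mechanically for Python sources with the standard `ast` module. This is a sketch of the idea, not the context-engineering skill itself; `summarize_interface` is a hypothetical helper and it only looks at top-level definitions.

```python
import ast


def summarize_interface(source: str) -> list[str]:
    """Return one line per top-level function or class, with bodies dropped."""
    lines = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            # ast.unparse (Python 3.9+) renders the argument list back to text.
            lines.append(f"{prefix} {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
    return lines
```

A 500-line module collapses to a handful of signature lines, which is often all the agent needs in order to call into it.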

The 200K context window feels generous until you actually use it. Without context engineering, a single feature implementation can consume 80% of the budget and force a compaction that loses the important parts. With it, the same feature uses 30% and leaves headroom for the next thing. The compounding effect of disciplined context use across a session is larger than any single optimization.

Wrap-up ritual and split memory

Sessions need endings. Without one, you stop when you get tired or when the context window fills, and the next session starts with no memory of what happened.

The wrap-up ritual takes about three minutes and covers five things: what changed (modified files, uncommitted work), the state of the world (git status, tests, lint), what was learned (mistakes worth capturing, patterns worth keeping), what's next (open threads, blockers), and a one-paragraph summary.

The summary is the part most people skip and the part that does the most work. It compresses a session into something the next session can read in two seconds. Without it, resuming work means re-loading context from raw history, which is slow and lossy. With it, you can resume a project a week later in under a minute.
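The wrap-up can be treated as a small structured record that renders to text. The sketch below is one possible shape, assuming the five parts described above; the `WrapUp` class and its field names are hypothetical and not the format /wrap-up actually writes.

```python
from dataclasses import dataclass, field


@dataclass
class WrapUp:
    summary: str
    changed: list[str] = field(default_factory=list)
    state: str = ""
    learned: list[str] = field(default_factory=list)
    next_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, summary first for quick resuming."""

        def section(title: str, items: list[str]) -> str:
            body = "\n".join(f"- {item}" for item in items) or "- none"
            return f"{title}:\n{body}"

        return "\n\n".join(
            [
                f"Summary: {self.summary}",
                section("What changed", self.changed),
                f"State of the world: {self.state or 'unknown'}",
                section("What was learned", self.learned),
                section("What's next", self.next_steps),
            ]
        )
```

Putting the summary on the first line means the next session can read one line and decide whether it needs the rest.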

For complex projects, CLAUDE.md eventually outgrows itself and you split it: CLAUDE.md stays small as the entry point, AGENTS.md holds workflow rules, SOUL.md holds style preferences, LEARNED.md is auto-populated by the self-correction loop. Each file has a different lifecycle: CLAUDE.md changes rarely, AGENTS.md changes when your workflow changes, SOUL.md when you change your mind about style, LEARNED.md grows continuously. Putting them in one file means every change touches everything, which means you stop changing anything.

A monorepo can take this further: each package gets its own scoped CLAUDE.md, plus a root one that applies globally. The agent loads them in order, more specific overriding less specific. This is how you handle a codebase where frontend conventions differ from backend conventions without confusing the agent.
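The "root first, most specific last" ordering can be sketched as a walk from the repository root down to the working directory. This illustrates the ordering described above under the assumption that each directory may hold one `CLAUDE.md`; it is not a description of how Claude Code itself resolves these files, and `scoped_instruction_files` is a hypothetical helper.

```python
from pathlib import Path


def scoped_instruction_files(
    repo_root: Path, working_dir: Path, name: str = "CLAUDE.md"
) -> list[Path]:
    """List instruction files from the repo root down to working_dir.

    The result is ordered least specific to most specific, so a caller
    that applies them in order lets package rules override root rules.
    """
    root = repo_root.resolve()
    # relative_to raises ValueError if working_dir is outside the repo.
    relative = working_dir.resolve().relative_to(root)
    directories = [root]
    for part in relative.parts:
        directories.append(directories[-1] / part)
    return [d / name for d in directories if (d / name).is_file()]
```

Directories without a `CLAUDE.md` are simply skipped, so a deep package path costs nothing unless it actually defines its own conventions.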

The tooling layer (v3.2)

The recent additions aren't core patterns. They're the tooling that makes core patterns work in production at scale.

Cost tracker surfaces session costs in real time and benchmarks them against budgets. We didn't think we needed this until we started seeing $50 sessions on routine work and realized we had no idea where the cost went. Now we know, and we can correct.

MCP audit analyzes MCP server token overhead per request. Most teams enable too many MCPs (15+) and get hit by per-request token costs they didn't budget for. The audit shows the cost per server, you disable the expensive ones that aren't pulling weight, the budget recovers. Rule of thumb: under 10 MCPs, under 80 tools.

Permission tuner analyzes denial patterns. Most prompt fatigue comes from the same handful of denials repeating. Pro-workflow tracks them via the PermissionDenied hook, the tuner proposes optimized rules, you batch-approve once, the fatigue stops.

Compact guard protects context through compaction cycles. The default behavior loses important state during compaction. The guard saves a budget (50K) of critical context before compaction and re-injects it after. This is the difference between a session that survives a compact and one that effectively restarts.

Auto setup detects project type on init and configures quality gates automatically. Without it, every new project starts with a manual configuration step that gets skipped. With it, you get sensible defaults from the first edit.

File watcher triggers reactive workflows on config and dependency changes. Updated package.json? Run npm install. Updated .env? Reload the dev server. Updated CI config? Validate the workflow file. Removes a category of "I forgot to do the followup step" mistakes.
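The file-watcher idea is essentially a lookup table from changed paths to follow-up steps. A small sketch using the three examples from the text; the table contents beyond those examples are assumptions (in particular, the CI path pattern assumes GitHub Actions), and the function returns descriptions of the follow-ups rather than executing anything.

```python
from fnmatch import fnmatch
from pathlib import PurePath

# (pattern, follow-up). Patterns without "/" match on the file name only.
FOLLOW_UPS = [
    ("package.json", "npm install"),
    (".env", "reload the dev server"),
    (".github/workflows/*.yml", "validate the workflow file"),  # assumed CI layout
]


def follow_ups_for(changed_paths: list[str]) -> list[str]:
    """Return the de-duplicated follow-up steps for a set of changed files."""
    actions = []
    for path in changed_paths:
        pure = PurePath(path)
        for pattern, action in FOLLOW_UPS:
            target = pure.as_posix() if "/" in pattern else pure.name
            if fnmatch(target, pattern) and action not in actions:
                actions.append(action)
    return actions
```

Keeping the rules in a plain table makes it cheap to add a project-specific follow-up without touching the matching logic.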

None of these are the headline feature. All of them collectively account for most of the day-to-day quality-of-life improvements over the past two months.

Cross-agent: SkillKit

The patterns above aren't Claude-specific. They work in Cursor, Codex, Gemini CLI, Windsurf, OpenCode, and 27 other agents. The challenge is that each agent has its own format for skills, rules, and configurations.

SkillKit is the translation layer. You write a skill once in a canonical format. SkillKit translates it to the format each target agent expects. One source, many targets. Update the source, re-run translation, every agent updates.

This matters because lock-in is the silent tax of AI tooling. If your habits and rules only work in one tool, switching costs are prohibitive. If they travel, you can use the right tool for the job (Cursor for tab completion, Claude Code for hard problems, Codex for one-shot scripts) without losing the cumulative work you put into your workflow.

Pro-workflow ships as a SkillKit package. npx skillkit install pro-workflow and npx skillkit translate pro-workflow --agent cursor is the entire setup for a Cursor user. Same patterns, same hooks adapted to the target's hook model, same self-correction loop.

What we learned from leaders, in their own words

The system isn't original work in the deepest sense. The patterns came from people who were generous enough to share what was working for them.

Andrej Karpathy gave us the 80/20 framing and the four upstream pitfalls. Both shaped pro-workflow more than any other single influence.

Boris Cherny (Claude Code's creator) has been steadily posting tactical advice that we've folded in directly. "Use subagents to throw more compute at a problem, offload tasks to keep your main context clean." "If you do something more than once a day, turn it into a skill or command." "Write detailed specs and reduce ambiguity before handing work off, the more specific you are, the better the output." Each of those is encoded somewhere in pro-workflow.

Lance Martin wrote the Write/Select/Compress/Isolate framework that became the basis for our context engineering skill and reference doc.

HumanLayer (Dex Horthy) shipped the Research-Plan-Implement workflow that became our /develop multi-phase pattern with validation gates between phases.

Addy Osmani writes long-form posts about LLM coding workflows from inside the Chrome team. His patterns on spec-first development and review checkpoints were direct inputs to our 80/20 review checkpoint pattern.

Thariq Shihipar made one specific observation that changed how we write skill descriptions: "Skill description field is a trigger, not a summary. Write it for the model." Every skill in pro-workflow now has a description optimized for trigger detection rather than human readability. The model picks up the right skill at the right time more reliably as a result.

obra shipped Superpowers, which gave us several patterns for advanced Claude Code customization that we adapted.

Trail of Bits published a security-focused Claude Code config that informed how we think about permission rules and dangerous-operation gates.

The point of listing these isn't credit for credit's sake. It's that pro-workflow is the integration of work that other people did first. Our contribution is the arrangement: taking patterns that work individually and arranging them so they reinforce each other instead of competing. The patterns are theirs. The system is the integration.

What we got wrong

Three things, in case anyone is following the same path:

We auto-captured everything in v1. Every correction went into the rules file without approval. Within a week the file was a contradictory mess. Adding the approval gate dropped noise by ~90% and is non-negotiable now.

We trusted markdown for storage too long. v1 kept learnings in a CLAUDE.md section and v2 moved them to a separate markdown file. That worked up to ~30 rules. Past that, search became impossible and the file became too long for the agent to load efficiently. Migrating to SQLite + FTS5 in v3 was painful and should have happened sooner.

We over-indexed on hooks blocking things. Early hooks blocked aggressively. Users (us) learned to disable them. We rewrote the philosophy to be non-blocking by default, blocking only on genuinely destructive operations. Reminder hooks teach users to listen. Blocking hooks teach users to disable.

There are smaller mistakes too: writing skills that were too long for the model to load efficiently, building a permission system that was too granular, shipping a v2 with 18 hook events when 12 would have done the job. The pattern across all of them is the same: we added complexity in places where simplicity would have worked, and we paid for it in adoption friction. Karpathy's "minimum viable code" rule applies to workflow tools too.

Why this compounds

The patterns add up arithmetically. Each one saves time in its specific domain. The interesting effect is geometric: they reinforce each other in ways the individual patterns don't predict.

Self-correction generates rules. Rules are loaded by SessionStart hooks. Hooks change the agent's behavior in the next session, which produces fewer mistakes, which means fewer corrections, which means better-targeted rules. Pre-flight discipline catches the mistakes self-correction would have logged, which means LEARNED.md grows more slowly over time but with higher signal. Wrap-up captures session summaries that feed split memory. Split memory gives future sessions better starting context. Better context means less time burned reloading. Less time reloading means more time on actual work. More work means more corrections. More corrections means more rules. The loop closes and it tightens with each turn.

The number that matters isn't how much code the agent writes. It's how much your interaction with the agent improves over time. With no system, the answer is zero. The agent at month 6 behaves the same as the agent at month 1, because nothing carries over. With pro-workflow, the agent at month 6 has 100+ rules it didn't have at month 1, a wrap-up history it can resume from, hooks that catch its specific failure modes, a memory architecture that loads the right context for the project you're in, and a pre-flight discipline that prevents most of the mistakes you used to correct.

That's what compounding means in practice. Not better models. Better integration with the model you already have.

Where to start

You don't need everything on day one. The patterns are independent enough to adopt incrementally.

Week one: install pro-workflow, copy the minimal CLAUDE.md template, enable the self-correction loop. Stop here if nothing else. The keystone pattern alone is worth the install.

Week two: enable pre-flight discipline as an always-on rule. Start running /wrap-up at the end of every session. Both habits form fast and the friction is low.

Month one: split CLAUDE.md into AGENTS.md, SOUL.md, LEARNED.md. Enable the core hooks (SessionStart, Stop, PreToolUse on Edit). Notice which patterns the agent violates most often and let LEARNED.md grow.

Month two: adopt parallel worktrees for any work where you'd otherwise wait. Use /develop for features that touch more than five files. Try the LLM gate hooks for commit validation.

Month three: enable the full hook set (24 events). Run /permission-tuner and /mcp-audit. Translate your setup to other agents via SkillKit if you use more than one.

Month six and beyond: notice that you're correcting the agent less, that sessions resume cleanly, that you've forgotten what it was like to retype the same conventions every Monday. That's the compounding effect.

The full system is about 100 files. The minimal viable version is three. Pick your entry point.

Why this matters

Every six months a new model ships and the discourse pretends the previous one was useless. The actual bottleneck is rarely model capability. It's the systems we build around the models. A team running last year's model with a serious workflow will out-ship a team running the latest model with no workflow. We've watched this play out across enough projects to be sure of it.

The interesting work in AI coding isn't waiting for better models. It's figuring out how to get more out of the ones we already have. Pro-workflow is one answer. There are others (Superpowers, everything-claude-code, gstack, GSD, SpecKit). The mistake is having no answer at all.

The patterns work. The compounding is real. The cost of adoption is low. Start with the keystone, add the rest as you need it, listen to the leaders who are still figuring it out alongside you.


Pro-workflow is open source at github.com/rohitg00/pro-workflow, MIT licensed. It works with Claude Code, Cursor, and 30+ other AI coding agents via SkillKit. Patterns are adapted from Karpathy, Boris Cherny, Lance Martin, HumanLayer, Addy Osmani, Thariq Shihipar, obra, and the broader Claude Code community. Adapt freely, attribute generously.
