Yes to Autonomy and Control

A Decision Record Workflow for AI Coding

May 11, 2026

This article is about the origins of Sundial, a tool for improving control over autonomous agents using Decision Records.

I had been struggling with the tradeoff between autonomy and control for AI coding agents. More autonomy is attractive because it raises my productivity multiplier. More control is good because I am a better engineer than these agents, and their decision-making is suspect. Also, agents are only as good as their context; given a prompt that outlines exactly what we want done, they are getting better and better at doing it well.

Here is how frustrating control is in practice: I click “yes” on nearly every bash command the agent asks me to approve, while my “no’s” don’t even come from an approval step; they come from me noticing the agent’s reasoning stream going off the rails. It just isn’t scalable to watch agent reasoning streams as if I am monitoring the Matrix. At the same time, I do love control; I’m a bit of a control freak. I am a staff-level engineer with 20 years of experience, and I know how to do this software thing.

My Status Quo in the Before Times (a month ago)

A stack of context files: CLAUDE.md, AGENTS.md, CONVENTIONS.md, UI-DESIGN.md, ARCHITECTURE.md, IMPLEMENTATION-GOTCHAS.md, TESTING_STRATEGY.md and so on. The pattern was that if the agent did something I didn’t want, I would add a rule to some instruction file, if I remembered to; nothing was systematized, so it was best effort on my part. These files are referred to by the agent skills that work in each domain. Long term, if I come back to an older project, I need to relearn that project’s particular organization of all this context, which is a big switching cost.

What I Built

I’m calling it Sundial. It’s a Node CLI tool (npm install @arcridge/sundial) plus a VS Code plugin. The decision records live in .sundial/drs/ as Markdown files alongside the code, partitioned by status (candidates/, accepted/, rejected/, retired/) and tagged. It integrates with Claude Code and Codex CLI as an agent skill (sundial init) and a managed section of CLAUDE.md / AGENTS.md, so the agent retrieves precedent before consequential coding and writes candidate records when it confronts a choice without one.
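To make the on-disk layout concrete, here is a minimal sketch that walks a .sundial/drs/ directory and groups record files by status. It is an illustration, not Sundial's own code: the function names and the example record file names are hypothetical, and only the directory names come from the description above.

```python
from pathlib import Path
import tempfile

# The four status directories named in the article.
STATUSES = ("candidates", "accepted", "rejected", "retired")


def list_records(project_root: Path) -> dict[str, list[str]]:
    """Return {status: sorted Markdown file names} for .sundial/drs/."""
    drs = project_root / ".sundial" / "drs"
    return {
        status: sorted(p.name for p in (drs / status).glob("*.md"))
        for status in STATUSES
    }


def build_example(project_root: Path) -> None:
    """Create a throwaway store; the record names are hypothetical."""
    drs = project_root / ".sundial" / "drs"
    for status in STATUSES:
        (drs / status).mkdir(parents=True)
    (drs / "accepted" / "use-jwt-for-sessions.md").write_text("placeholder\n")
    (drs / "candidates" / "prefer-named-exports.md").write_text("placeholder\n")


def demo() -> dict[str, list[str]]:
    """Build the example store in a temp directory and list it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_example(root)
        return list_records(root)
```

Because the store is just directories of Markdown files, the same listing could be produced with ls or a file explorer, which is the point of the plain-text substrate.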

When I am working with Sundial, agents retrieve relevant decision records (more on this below) and consider which are in scope for the design or implementation underway. They know how to create new decision records, and there is a meta-prompt pushing them in the direction of doing so. After DRs have been proposed, I can review them in a VS Code sidebar and either accept or reject them, and I can revisit the repository’s existing DRs and manage them through the sidebar (or any text editor) at any point.

The plain-text substrate means any other UI is straightforward to build. I’m sure an agent could do it in an hour (and more will likely follow in the open source repo) and vim works just as well. There’s no MCP server, no embedding index, no cloud service.

Community Ideas

There’s a recent ETH Zurich study (cited in the Augment Code AGENTS.md guide) that found bloated, auto-generated context files actually reduced task success rates by a few percentage points and added 20%+ to inference cost. This anti-pattern has been variously named “markdown museum for confused bots,” “Decision Documentation Theater,” “SDD fatigue,” and “ball of mud” for context files. Skills come to the rescue here with better conventions around compartmentalization and discoverability, and Sundial complements this approach.

A paper that came out in March 2026 (arXiv 2603.15566) calls this the “Decision Shadow”: each commit captures a code diff but discards the reasoning, and the constraints and rejected alternatives that shaped the decision are gone. That paper’s prescription is to repurpose git commit messages as the decision record, which is a cool idea as well. Sundial instead pulls this type of decision into a “latest state” snapshot of all decisions and puts the human engineer in the driver’s seat.

The memory infrastructure side of the space (Mem0, Zep, and Letta, for example) offers precision-optimized retrieval layers for long conversational histories. I also want continuity across sessions, but I am after governed precedent within a project, and I want the precedent to be plain text an engineering team can read, edit, and version-control. The infrastructure approaches felt heavy, and they hid the thing I most wanted to see and manage myself: what is the agent being told?

My Principles

There are many small but important decisions I’ve made in assembling this solution. With a few rounds of iteration, I came to the Sundial approach by combining some practices and patterns from my own latent space.

The act of correcting the agent should automatically produce a durable artifact. Any decisions and redirects for a given session / task should be remembered.

The unit isn’t exactly an architecture decision record (ADR), but it shares its DNA. ADRs are culturally coded as rare and architecturally significant: twenty to fifty over a project’s lifetime, written by humans, reviewed by committee. What I wanted was the agent-scale endpoint of that pattern: hundreds of decisions, including implementation patterns and conventions, not just architecture. I’m not the first person to want this. MADR has been drifting toward “Any Decision Records” since 2022 and recommends a decisions/ folder. Sundial repurposes this originally human-targeted record and optimizes it for agents.

The review surface should scale to agent-scale volume, so it looks more like an inbox. A candidate goes in; I accept or reject it, or retire it later. The filesystem layout (candidates/, accepted/, rejected/, retired/) mirrors those states.
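One way to picture the inbox is as a small state machine where a review action is just a file move between status directories. The sketch below is a guess at that shape, not Sundial's implementation: the allowed transitions (candidates to accepted or rejected, accepted to retired) are an assumption read off the paragraph above, and move_record and the example file name are hypothetical.

```python
from pathlib import Path
import shutil
import tempfile

# Assumed transitions: a candidate is accepted or rejected;
# an accepted record can be retired later.
ALLOWED = {
    "candidates": {"accepted", "rejected"},
    "accepted": {"retired"},
}


def move_record(drs: Path, name: str, src: str, dst: str) -> Path:
    """Move one record file between status directories, enforcing ALLOWED."""
    if dst not in ALLOWED.get(src, set()):
        raise ValueError(f"cannot move a record from {src} to {dst}")
    source = drs / src / name
    if not source.is_file():
        raise FileNotFoundError(source)
    (drs / dst).mkdir(parents=True, exist_ok=True)
    target = drs / dst / name
    shutil.move(str(source), str(target))
    return target


def demo() -> list[str]:
    """Accept a candidate, retire it later, and report where the file ends up."""
    with tempfile.TemporaryDirectory() as tmp:
        drs = Path(tmp) / ".sundial" / "drs"
        (drs / "candidates").mkdir(parents=True)
        (drs / "candidates" / "example-decision.md").write_text("placeholder\n")
        move_record(drs, "example-decision.md", "candidates", "accepted")
        move_record(drs, "example-decision.md", "accepted", "retired")
        return sorted(p.relative_to(drs).as_posix() for p in drs.rglob("*.md"))
```

Since status is encoded in the path, the review history of a record shows up as ordinary renames in version control.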

The artifact should be terse, because the agent will load it into context. Many small dense records load more cheaply than a few long ones. Anthropic’s recent Skills authoring guidance pushes in the same direction (short, structured, retrieval-anchored), and there’s a paper called EASYTOOL (arXiv 2401.06201) on transforming verbose tool documentation into “unified and concise tool instructions” that found the same dynamic for tool descriptions. Whatever is in the record should be “surprising” to the LLM.

The artifact should also contain rationale for humans, but in a separate section. Humans want the “why” for review, for onboarding, for the next staff engineer who joins the project two years from now. This stuff isn’t included in the prompt most of the time; the LLM doesn’t need it.
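As a rough sketch of that split, assume a record is a Markdown file with a terse section for the agent and a separate rationale section for humans. The heading names ("## Decision", "## Rationale"), the sample record, and the helper functions are all hypothetical; the article does not specify Sundial's actual record format.

```python
# A hypothetical record: terse directive for the agent, rationale for humans.
RECORD = """\
# Use JWT for session tokens

## Decision
Sign session tokens as JWTs; never store sessions server-side.

## Rationale
Written for reviewers and future teammates: what was considered,
what was rejected, and why.
"""


def split_sections(text: str) -> dict[str, str]:
    """Split a record into {lowercased heading: body} on '## ' headings."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


def prompt_text(text: str) -> str:
    """Return only the agent-facing part; the rationale stays out of the prompt."""
    return split_sections(text).get("decision", "")
```

Here prompt_text(RECORD) yields just the one-line directive, while the rationale remains in the file for review and onboarding.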

I Expect Retrieval to Be the Hardest Problem to Get Right

So far I have only let my DRs grow to be around 100 for my personal projects, and this is easily manageable. But if there is a place for this solution to fall down, it is in this area. Here is how I have biased the initial solution:

Retrieval should favor recall over precision. The argument: at current long-context performance, pulling a slightly wider slice of the decision corpus and letting the model triage is cheaper than building a sophisticated index, tuning it, and risking missing a relevant record on a score threshold. There’s a Google paper from January 2025 (“Is Long Context All You Need?”, arXiv 2501.12372) that makes this argument explicitly for NL2SQL: unless retrieval is highly accurate, increasing recall by including more information, even at the expense of lower precision, is a beneficial strategy.
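A toy comparison shows how a score threshold can drop a relevant record that a wider net would keep. This is only a sketch under stated assumptions: it models records as sets of tags and scores them with Jaccard similarity, neither of which is claimed to be how Sundial retrieves, and the record names and tags are made up.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_precise(records: dict[str, set[str]], query: set[str],
                     threshold: float = 0.5) -> list[str]:
    """Precision-biased: keep only records scoring at or above a threshold."""
    return [name for name, tags in records.items()
            if jaccard(tags, query) >= threshold]


def retrieve_wide(records: dict[str, set[str]], query: set[str]) -> list[str]:
    """Recall-biased: keep every record sharing at least one tag with the query."""
    return [name for name, tags in records.items() if tags & query]


# Hypothetical corpus of decision records and their tags.
RECORDS = {
    "jwt-sessions": {"auth", "tokens"},
    "password-hashing": {"auth", "security", "crypto", "storage"},
    "index-naming": {"database", "schema"},
}
QUERY = {"auth", "tokens"}
```

With this data, retrieve_precise(RECORDS, QUERY) returns only jwt-sessions, because password-hashing scores 1/5 and falls under the threshold, while retrieve_wide(RECORDS, QUERY) returns both auth records and leaves the triage to the model.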

Decision context should be appropriately scoped. A decision about JWT-vs-OAuth should be in scope when I’m working on auth, but not when I’m working on the database schema. The cleanest answer I found without sophisticated retrieval was a domain tree (backend.auth, for instance) with inheritance up the tree (ancestors give general guidance) and down the tree (descendants give more specific precedent), but no inheritance from siblings or collaterals. A DR can also be scoped to all domains as an option. How well this works in practice depends on a few key assumptions and practices: (1) the LLM does a good job of proposing the appropriate domain(s) at query time and at DR proposal time; (2) the human understands the implications and importance of domain assignment at review time; and (3) collectively, the LLM and human create the “right” domain taxonomy.
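The scoping rule can be sketched as a prefix check on dotted domain paths: a record is in scope if its domain is the working domain, an ancestor of it, or a descendant of it. This is a minimal illustration, not Sundial's code; the "*" marker for all-domain records and the domain names other than backend.auth are assumptions.

```python
# Hypothetical marker for a record scoped to every domain.
ALL_DOMAINS = "*"


def in_scope(record_domain: str, working_domain: str) -> bool:
    """True if the record's domain equals, is an ancestor of,
    or is a descendant of the working domain."""
    if record_domain == ALL_DOMAINS:
        return True
    rec = record_domain.split(".")
    work = working_domain.split(".")
    # One path must be a prefix of the other; siblings diverge before that.
    shorter = min(len(rec), len(work))
    return rec[:shorter] == work[:shorter]


def select(records: dict[str, str], working_domain: str) -> list[str]:
    """Return the names of records whose domain is in scope."""
    return [name for name, domain in records.items()
            if in_scope(domain, working_domain)]


# Hypothetical records mapped to their assigned domains.
RECORDS = {
    "api-error-shape": "backend",
    "jwt-vs-oauth": "backend.auth",
    "token-rotation": "backend.auth.jwt",
    "index-naming": "backend.db",
    "commit-style": ALL_DOMAINS,
}
```

Working in backend.auth, select(RECORDS, "backend.auth") keeps the ancestor, the exact match, the descendant, and the all-domain record, and drops the sibling backend.db record.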

A Quick Survey of Related Work

Decision records as a genre go back to Michael Nygard’s 2011 essay. The genre-broadening I described above is mirrored in the MADR project. Cline Memory Bank and Roo Code Memory Bank have a decisionLog.md as one of about five core files. Spec Kit has gated phases for feature specs. Skills are, I think, the closest pattern for “LLM-native artifact with structured metadata and progressive disclosure,” but Skills are procedural (“how to do X”) rather than decision-shaped. Mem0, Zep, and Letta optimize for precision, cross-session continuity, and infrastructure-as-a-service. Maybe that sophistication would help this project, but none of it is integrated yet.

Feel Free to Try It

I’ve used Sundial on a handful of projects. That’s all the data I have. I’d love it if others were interested and tried it out. The project is at the MVP stage, with many open questions.

How do things scale as the corpus broadens? How model-dependent is the terseness invariant? If Codex and Claude have different assumptions in the same space, one might need more explicit guidance than the other.

How well does the LLM actually route through the domain tree? How often does it miss precedent? Can we audit and automate the auditing process? Can we measure?

What does review fatigue look like at scale? The candidate inbox solves one problem (PR review doesn’t scale to agent volume) but introduces another (someone still has to triage candidates). Does the LLM over- or under-propose new DRs?

How does it hold up on older, larger, weirder projects? How hard is it to retroactively create the DR store? Do we need to? Everything I’ve used it on so far has been a greenfield project where everything is already in my head.

Ben Jackson
