Measuring a loop of AI spiral development

2026/05/22 artificial-intelligence agentic-coding

I finally got around to measuring how AI-based development (no touching the code) works for me.

I find some enjoyment in seeing people track and measure things they build - so long as the measurements are actionable. Enumerating what one builds (which can later be translated into how long it took to build) is crucial for any kind of external visibility. But with the Pareto³ Law of agentic development, the distance between what and how long becomes complex, to say the least. So last weekend I decided to get the numbers on two cycles of my current build approach on a standalone app. If AI development resembles a nautilus shell - but starting at the outermost chamber and iterating inwards - this exercise covered the first two chambers.

Context:

I implemented a full-stack application that handles communication via an API and has generation capacity, covering both reactive and proactive comms.
Think of a Slack app with a configurable bot with external lookup that periodically prompts users for action based on past interactions and the state of the external database.

Stack:

OpenAI's Codex for conceptualisation, initial spec builds, and implementation critique
Anthropic's Claude for spec critique (4.7 Opus) and the majority of the implementation (mostly Sonnet 4.6)
OpenCode backed by OpenRouter's Qwen3.6 35 A3B as a backup for when I run out of limits and credits.

Method:

Iterative human-in-the-loop spec development, including secondary model critique
No parallel implementation (which I may need to revisit at some point)
Deriving phased TODO lists from the spec, given the rest of the codebase
Phase-by-phase implementation (tests, minimal human comprehension between phases).
Optional: reset and re-implementation
Check of evals / verification / tests
Iterative, Pareto³-driven adjustment, regression tracking and incorporating omissions into the spec
Automatic review of the spec in the light of implementation, removal of TODOs
Commit and push
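
A rough sketch of how the TODO step in this method could be cross-checked outside the model: collect every open TODO in the markdown spec, grouped by the heading it sits under. This is not the tooling I used (the phase lists came from the models). It assumes the spec marks open work with a literal "TODO" and uses "#" headings; the function name and the sample spec are made up for illustration.

```python
import re

# Assumption: open work is marked with a literal "TODO" and sections
# use markdown "#" headings. Both patterns are illustrative.
TODO_RE = re.compile(r"\bTODO\b[:\s]*(.*)")
HEADING_RE = re.compile(r"^#{1,6}\s+(.*)")


def todos_by_section(spec_text: str) -> dict[str, list[str]]:
    """Group TODO items by the nearest preceding heading."""
    sections: dict[str, list[str]] = {}
    current = "(preamble)"  # TODOs that appear before any heading
    for line in spec_text.splitlines():
        heading = HEADING_RE.match(line)
        if heading:
            current = heading.group(1).strip()
            continue
        todo = TODO_RE.search(line)
        if todo:
            sections.setdefault(current, []).append(todo.group(1).strip())
    return sections


# Hypothetical spec fragment, only to show the shape of the result.
sample_spec = """# Auth
- TODO: add OAuth redirect handler
Some prose that is not a task.
# Storage
- TODO: define message table
- TODO add retention policy
"""

open_items = todos_by_section(sample_spec)
for section, items in open_items.items():
    print(section, items)
```

Run after the final spec review, an empty result would suggest the TODO removal step really removed everything; run before implementation, the groups give a first draft of the phases to compare against what the model proposes.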

Steps and timing (roughly, immersive work):

Conceptualisation. Checked whether I actually needed to build it, what alternatives existed, and whether existing tools could already cover the use case. ChatGPT, 1.5 hours.
API feasibility check. Used ChatGPT in an extended thinking session, heavily relying on the llms.txt docs, to check API feasibility: authentication, message visibility, limits, read/write access, and likely blockers. 30 minutes. Scoping ends.
Initial backend spec. Generated an initial spec focused on authentication and backend structure. I used ChatGPT and Claude Opus against each other for critique and contrarian points, then manually inspected the result. This produced a single ~400-line markdown spec after about 3 hours.
Initial OAuth/API build. Used Codex to assist with the initial build and infrastructure for OAuth redirects, then manually verified API connectivity. Around 2 hours.
Phased backend implementation. Scanned the spec for actionable TODOs, split the work into phases, and implemented them with minimal manual adjustment. Around 2 hours.
Current tests inspection. Added tests, manually inspected the behaviour, and used automated improvements where useful. Around 1 hour. First loop ends.
UX spec. Generated a UX spec covering required interactions, user flows, and design. ChatGPT and Claude Opus for critique, 1.5 hours.
Merged UX and backend specs. Used Claude Opus to merge the UX spec back into the core spec and produce a phased UI task list. Around 30 minutes.
Initial UI implementation. Implemented the UI in phases using Claude Sonnet. Around 2 hours.
Rollback, scope reduction, reimplementation. Rolled back the first UI direction, simplified the scope, and pruned the UX spec and implementation list manually and with AI assistance. Around 2 hours.
Iterative UI/backend improvements. Continued improving the UI and related backend/API calls, dealing with regressions and the usual final-20-percent problems. Around 5 hours. Second loop ends.
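
For reference, the per-loop totals can be added up from the step list above. The hours are the rough figures from the list; assigning the two scoping steps to the first loop is my reading of where the loop markers fall.

```python
# (step, hours, loop) - hours are the approximate figures from the list above.
# Counting the scoping steps as part of loop 1 is an assumption.
steps = [
    ("Conceptualisation", 1.5, 1),
    ("API feasibility check", 0.5, 1),
    ("Initial backend spec", 3.0, 1),
    ("Initial OAuth/API build", 2.0, 1),
    ("Phased backend implementation", 2.0, 1),
    ("Current tests inspection", 1.0, 1),
    ("UX spec", 1.5, 2),
    ("Merged UX and backend specs", 0.5, 2),
    ("Initial UI implementation", 2.0, 2),
    ("Rollback, scope reduction, reimplementation", 2.0, 2),
    ("Iterative UI/backend improvements", 5.0, 2),
]


def hours_per_loop(rows):
    """Sum the hours of each loop."""
    totals = {}
    for _name, hours, loop in rows:
        totals[loop] = totals.get(loop, 0.0) + hours
    return totals


totals = hours_per_loop(steps)
print(totals)                # {1: 10.0, 2: 11.0}
print(sum(totals.values()))  # 21.0
```

So roughly 10 hours for the first loop and 11 for the second, about 21 in total, with the final improvement step alone taking about 45% of the second loop.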

Two 'chambers' are already enough to observe diminishing returns - both within a loop and between the chambers.

Notes:

Secrets live locally in .env (covered by aiignore etc.) and are verified in a central config module.
Deployed early on DigitalOcean to verify the production path: containerisation, environment variables, secrets, and deployment behaviour.
I tend to briefly check the direction of the tests, the spec and some data models to catch and prevent rampant scope inflation and over-specification.
I use a good amount of persona-based verification (e.g. "as SVP of UX, evaluate and suggest improvements for..").
Periodically used Codex to update the spec based on changes, regressions, and omissions I found.
Commits at the 'chamber' level. I'm not a frequent committer, and this helps me track history.
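
To make the note on secrets more concrete, here is a minimal sketch of what a central config module that verifies them might look like. It is not the app's actual module: the variable names, the Settings class and the error type are all hypothetical, and it assumes the .env file has already been loaded into the process environment by whatever runs the app.

```python
import os
from dataclasses import dataclass

# Hypothetical variable names; the real app's settings are not listed here.
REQUIRED = ("CHAT_API_TOKEN", "OAUTH_CLIENT_SECRET", "DATABASE_URL")


class MissingSecretError(RuntimeError):
    """Raised at startup when a required setting is absent or empty."""


@dataclass(frozen=True)
class Settings:
    chat_api_token: str
    oauth_client_secret: str
    database_url: str


def load_settings(env=None) -> Settings:
    """Read required settings from a mapping (os.environ by default).

    Fails fast and names every missing variable, without printing values.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise MissingSecretError("missing required settings: " + ", ".join(missing))
    return Settings(
        chat_api_token=env["CHAT_API_TOKEN"],
        oauth_client_secret=env["OAUTH_CLIENT_SECRET"],
        database_url=env["DATABASE_URL"],
    )
```

Keeping the check in one place means the rest of the code imports a validated Settings object instead of reading the environment directly, and the same check runs locally and in the deployed container.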