ADHD: Fixing Premature Convergence in Coding Agents with Parallel Divergent Reasoning
Give an agent an open-ended problem and it shows one predictable flaw: it lands on the first plausible plan and then keeps elaborating that single path. The README has a real example. A CLI calls an LLM and occasionally hangs for 90 seconds, and the agent is asked to design a retry and timeout strategy. A single-shot agent gives a very standard answer: a 15-second first-token timeout, a 30-second between-token timeout, a 90-second hard cap, plus one automatic retry, citing chapter 22 of the Google SRE book. Nothing is wrong with it, but it never asks whether the problem was framed wrongly in the first place: perhaps the slow call should not be retried at all but re-run against a faster model.
That is the gap ADHD is built to fill. It is the npm package adhd-agent, MIT licensed, built on the Claude and Codex Agent SDK. It fans a single problem out in parallel across several isolated cognitive frames to produce dozens of ideas, then runs a separate critic phase to score them, cluster them, prune the traps, and deepen the best few. In this local-LLM series it plays an unusual role: it does not address where your model runs, but whether your agent thinks widely enough at a key decision point.
The verdict first: it fixes premature convergence
Premature convergence is a structural problem in autoregressive models. The model generates one token at a time, and each new token is pulled along by what it has already written. So its first plan tends to be the most typical, most textbook one in the training data. That answer is usually correct, but usually unoriginal, and worse, it is often a trap that looks right precisely because it is familiar.
When does this actually bite you? On architecture decisions like storage layering, sharding, auth models, and queue topology; on naming functions, products, and environment variables; and on a fuzzy bug with no clear root cause, where you first need to list a few classes of hypotheses. What these share is that there is no standard answer, and missing the non-obvious but viable option can cost you a rewrite months later.
The opposite case: anything with a clear answer should not use it. Looking up how to call an API, fixing a bug whose root cause you already found, anything one search away, all waste money and time here. A one-line test: if a junior would Google it and find the answer, let the baseline answer; the moment a senior would stop and say “I need to think about this differently for a minute” is when ADHD earns its place.
The mechanism: a hard wall between two phases
ADHD is a two-phase loop with a hard partition between the phases. The author stresses that mixing divergence and review is exactly what destroys idea quality, because the critic strangles the generator on the spot.

Diverge: N branches that cannot see each other
The first phase picks N cognitive frames and fires N concurrent Agent SDK calls, each a fresh, stateless session. Each branch sees only three things: the original problem, one frame’s vantage prompt, and a system prompt that forbids evaluation and ranking. The key is that the branches cannot see each other. The branch running the “regulator” vantage never reads what the “speedrunner” branch wrote. With no shared context, anchoring is not suppressed by prompting; it simply does not exist by construction.
Focus: a separate critic phase
The critic comes online only in the second phase and does three things. First it scores every idea on novelty, viability, and fit from 0 to 10, and tags each trap with a mechanistic reason, such as “shelve is not thread-safe under multi-writer load,” rather than a vague “this is risky.” Then it clusters by underlying angle instead of surface keywords, so you can see the shape of the whole design space. Finally it deepens the top-K, 3 by default, producing a sketch, the load-bearing risk, the first concrete step, and 3 to 5 child ideas.
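The scoring, clustering, and top-K steps can be sketched roughly as follows. This is a minimal illustration, not the project's code. The idea object shape (`angle`, `novelty`, `viability`, `fit`, `trapReason`), the helper names, the sample scores, and the choice to rank by the plain sum of the three scores are all assumptions. The source only says that ideas are scored 0 to 10 on three axes, that traps carry a mechanistic reason, that clustering follows the underlying angle, and that the top 3 are deepened by default.

```javascript
// Hypothetical idea shape: { text, angle, novelty, viability, fit, trapReason }
// Scores run 0-10; trapReason is a string for traps and null otherwise.

// Assumption: rank by the plain sum of the three scores.
function totalScore(idea) {
  return idea.novelty + idea.viability + idea.fit;
}

// Group ideas by their underlying angle (the cluster label).
function clusterByAngle(ideas) {
  const clusters = new Map();
  for (const idea of ideas) {
    if (!clusters.has(idea.angle)) clusters.set(idea.angle, []);
    clusters.get(idea.angle).push(idea);
  }
  return clusters;
}

// Drop flagged traps, then keep the K best remaining ideas (3 by default).
function pickTopK(ideas, k = 3) {
  return ideas
    .filter((idea) => idea.trapReason === null)
    .sort((a, b) => totalScore(b) - totalScore(a))
    .slice(0, k);
}

// Illustrative data only; the scores and the trap reason are made up.
const scored = [
  { text: "retry with backoff", angle: "resilience", novelty: 2, viability: 9, fit: 8, trapReason: null },
  { text: "cancel and re-submit to a faster model", angle: "model choice", novelty: 8, viability: 8, fit: 9, trapReason: null },
  { text: "stream tokens in reverse", angle: "streaming", novelty: 9, viability: 1, fit: 2, trapReason: "needs the full response before anything can be shown" },
  { text: "hard cap with a visible countdown", angle: "resilience", novelty: 4, viability: 7, fit: 7, trapReason: null },
];

const topIdeas = pickTopK(scored, 2);
console.log(topIdeas.map((idea) => idea.text));
console.log([...clusterByAngle(scored).keys()]);
```

Filtering traps before ranking matters in this sketch: a trap can score high on novelty, so ranking first would let it crowd out a viable idea.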
There is a design choice that is easy to miss here: the split between generator and critic is mechanical, built from two different API calls with two opposite system prompts, not a promise made to the same session to “diverge first, then converge.” The isolated-branch call roughly looks like this:
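    // One stateless call per frame; no branch sees another branch's output.
    const branches = await Promise.all(
      frames.map((frame) =>
        // withSemaphore caps how many calls are in flight at once.
        withSemaphore(concurrency, () =>
          callLLM({
            systemPrompt: `${frame.vantage}\n\nFORBIDDEN: evaluation, ranking, hedging. JSON array out.`,
            userPrompt: `${problem}\n\n${context ?? ""}`,
          })
        )
      )
    );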
Token cost grows linearly with the number of branches rather than quadratically, because later branches never re-read what earlier ones wrote. Concurrency is governed by a semaphore, 4 by default.
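The snippet above leaves `withSemaphore` undefined, so here is one way a concurrency cap of this kind could be written. It is a sketch under assumptions, not the project's implementation: `createSemaphore`, `fakeCall`, and `diverge` are hypothetical names, and the helper takes a slightly different shape from `withSemaphore` (it returns a shared `run` function) so that all branches visibly share the same counter.

```javascript
// Hypothetical helper: returns run(task), which lets at most `limit`
// tasks execute at once. Extra tasks wait in a FIFO queue.
function createSemaphore(limit) {
  let active = 0;
  const waiting = [];

  return async function run(task) {
    if (active < limit) {
      active += 1; // take a free slot
    } else {
      // Wait here; a finishing task hands its slot over directly.
      await new Promise((resolve) => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // pass the slot on, so `active` stays unchanged
      else active -= 1; // nobody is waiting, so free the slot
    }
  };
}

// Stand-in for an LLM call; it just waits a few milliseconds.
const fakeCall = (frame) =>
  new Promise((resolve) => setTimeout(() => resolve(`ideas from ${frame}`), 10));

// Fire one call per frame, with at most `concurrency` in flight (4 by default).
async function diverge(frames, concurrency = 4) {
  const run = createSemaphore(concurrency);
  return Promise.all(frames.map((frame) => run(() => fakeCall(frame))));
}

diverge(["hardware", "on-call", "biology", "logistics", "game design"])
  .then((results) => console.log(results.length));
```

Handing the slot straight to the next waiter, rather than decrementing and re-incrementing the counter, keeps a newly arriving task from slipping in between and pushing the count past the limit.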
What you get from a run is not a wall of prose but a structured result: the full clustered idea set, a shortlist of 2 to 4 candidates, one explicitly flagged “non-obvious but viable” pick, a trap list with a reason for each trap, the deepened branches, and one wildcard provocation. Back on that retry problem, the single shot only gave a textbook hybrid; the option ADHD surfaced out of 30-plus ideas was “a button that pulses hotter the longer you wait, one click cancels and re-submits to a faster Haiku-class model,” while it flagged ideas that sound fun but are traps, like “stream tokens in reverse” and “patience-token billing,” before they cost engineering time.
How it differs from Chain-of-Thought and Tree-of-Thought
These three are the easiest to conflate, but they are structurally different.
| Dimension | Chain-of-Thought | Tree-of-Thought | ADHD |
|---|---|---|---|
| Threads | one, linear | one tree, walked | N parallel, isolated |
| Branches share context | yes | yes (one session) | no, each branch is its own query |
| Generation vs review | same step | same model, alternating | separate phases, separate calls, opposite postures |
| Branching driver | none | next-step variations | cognitive frames, re-asking the whole question |
| Parallelism | sequential | mostly sequential | true concurrency |
| Right for | math, multi-step logic | search, planning, puzzles | open-ended design and ideation |
Three load-bearing differences are worth pulling out. First, it is isolation, not search. CoT and ToT branches share one context window, so by step four the model is anchored on its first three steps; ADHD branches cannot see each other during divergence, so anchoring is eliminated by construction. Second, it varies the frame, not the next step. ToT branches usually vary the next move (try this number, try that one), while ADHD varies the vantage on the whole problem, in effect asking the model to “re-ask this as a hardware problem,” which produces structurally different ideas rather than nearby tweaks. Third, the generator-critic split is mechanical rather than promised.
One clarification: ADHD is itself a Tree-of-Thought variant, and the deepen phase does expand top-K nodes. What is new is that branching is driven by frames, and that the review split rests on two genuinely separate calls.
What a frame is: changing the vantage on the whole problem
A frame is not role-play. It is a vantage operator, a system prompt that re-poses the entire problem from a particular cognitive position. Unlike persona-prompt research with its “you are a 34-year-old engineer named Alex,” a frame does not make the model play a character; it forces the model into a thinking corner it would not naturally drift toward.
The project ships 15 frames, and with codeMode on (the default) it biases toward engineering vantages. A few concrete ones: the hardware engineer thinks in latency, memory layout, and physical constraints; the 3am on-call thinks about “what design would keep me from getting paged”; the remove-the-load-bearing-assumption frame asks “what is possible if the framework, the database, and the network are all gone.” Each run picks frames deterministically by seed and always reserves one wild slot so divergence stays a little weird.
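Seed-based selection with a reserved wild slot could look something like the sketch below. Everything here is an assumption made for illustration: the split into a regular pool and a wild pool, the placeholder wild frame names, the `pickFrames` signature, and the use of a mulberry32 generator with a Fisher-Yates shuffle. The source only states that frames are picked deterministically by seed and that one wild slot is always reserved.

```javascript
// Small seeded pseudo-random generator (mulberry32); returns values in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded generator; returns a new array.
function seededShuffle(items, rand) {
  const out = items.slice();
  for (let i = out.length - 1; i > 0; i -= 1) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Pick n frames: n - 1 from the regular pool plus one reserved wild frame.
function pickFrames(regular, wild, n, seed) {
  const rand = mulberry32(seed);
  const chosen = seededShuffle(regular, rand).slice(0, n - 1);
  const wildPick = wild[Math.floor(rand() * wild.length)];
  return [...chosen, wildPick];
}

// Regular names come from the article; the wild names are placeholders.
const regularFrames = ["hardware engineer", "3am on-call", "biology", "logistics", "game design"];
const wildFrames = ["wild placeholder A", "wild placeholder B"];

console.log(pickFrames(regularFrames, wildFrames, 4, 42));
console.log(pickFrames(regularFrames, wildFrames, 4, 42)); // same seed, same frames
```

Determinism is the useful property: re-running with the same seed reproduces the same set of vantages, which makes two runs on the same problem comparable.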
A few cross-domain frames are especially good at prying out transplantable ideas: biology draws on immune systems, neural plasticity, and cell signaling; logistics thinks in queues, batching, just-in-time delivery, and hub-and-spoke; game design thinks in loops, rewards, friction, save-states, and speedrun tricks. The genuinely good answer to an open problem often lives outside any single domain’s playbook and has to be transplanted from elsewhere, which is exactly why cross-domain frames exist.
Writing your own frame takes about 5 lines of code. A good frame meets at least two of three bars: vocabulary no other frame uses, a posture distinct from the others (adversarial, constructive, naive, maximalist), and a reproducible distortion that surfaces ideas the other frames miss. Just renaming the domain while saying the same thing does not pass.
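To make the idea of a custom frame concrete, here is a hedged sketch. The object shape (`name`, `vantage`), the forensic-accountant frame itself, and the `sharedVocabulary` helper are hypothetical and not taken from the project. Only the system-prompt template is grounded, since it is copied from the isolated-branch snippet earlier in the article. The hardware vantage text paraphrases the article's description of that frame.

```javascript
// Hypothetical custom frame: a name plus a vantage prompt that re-poses the problem.
const accountantFrame = {
  name: "forensic-accountant",
  vantage:
    "Re-pose the problem as a ledger: what is spent, owed, double-counted, " +
    "or never reconciled? Propose designs that make every cost auditable.",
};

// Paraphrase of the article's hardware-engineer frame, used for comparison.
const hardwareFrame = {
  name: "hardware-engineer",
  vantage: "Think in latency, memory layout, and physical constraints.",
};

// Builds the branch system prompt with the template from the article's snippet.
function buildSystemPrompt(frame) {
  return `${frame.vantage}\n\nFORBIDDEN: evaluation, ranking, hedging. JSON array out.`;
}

// Crude check for the "vocabulary no other frame uses" bar:
// lists words of five or more letters that two vantage prompts share.
function sharedVocabulary(frameA, frameB) {
  const words = (frame) =>
    new Set(frame.vantage.toLowerCase().match(/[a-z]{5,}/g) ?? []);
  const a = words(frameA);
  return [...words(frameB)].filter((word) => a.has(word));
}

console.log(buildSystemPrompt(accountantFrame));
console.log(sharedVocabulary(accountantFrame, hardwareFrame));
```

A word-overlap check like this only approximates the first of the three bars. Posture and reproducible distortion still need a human read of the ideas each frame actually produces.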
When to use it, and when not to
Treat it as a decision-point tool, not a per-keystroke one. The table below is something you can apply directly.
| Scenario | Use ADHD | Reason |
|---|---|---|
| Architecture, sharding, auth, queue topology design | Use | open-ended, premature convergence is costly |
| API / SDK / CLI surface design and naming | Use | you need a non-obvious but viable option |
| Fuzzy debugging with no clear root cause | Use | you need several hypothesis classes first |
| Migration, refactor planning, widening a code review | Use | more angles surface traps earlier |
| Looking up an API or a doc | Skip | one search away, a single shot is faster |
| Fixing a bug with a known root cause | Skip | the answer is unique |
| Inner loops, per-keystroke, low-latency use | Skip | one run takes 30 to 90 seconds |
Be clear about cost. A default run is roughly 10 LLM calls: 5 divergence calls by default, plus 1 scoring, 1 clustering, and 3 deepen calls. That adds up to about 5 to 10 times the cost of a single shot, with 30 to 90 seconds of wall-clock time. The author’s framing is practical: spend around $0.30 to widen a $50k architecture decision, and run it at decision points rather than on every keystroke. One honest caveat: inside a Claude Code session carrying a large CLAUDE.md and tool context, every branch re-loads that base substrate, so real token cost is closer to “branches times (base plus branch),” higher than the bare algorithmic cost.
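The arithmetic behind those figures is simple enough to write down. The function names and the token counts in this sketch are illustrative assumptions. Only the call breakdown (branches, one scoring call, one clustering call, one deepen call per top-K idea) and the "branches times base plus branch" relationship come from the text above.

```javascript
// Calls per run: one per divergence branch, one scoring call,
// one clustering call, and one deepen call per top-K idea.
function callsPerRun(branches = 5, topK = 3) {
  return branches + 1 + 1 + topK;
}

// Divergence input tokens when every branch re-loads the same base context
// (for example a large CLAUDE.md plus tool definitions).
function divergenceInputTokens(branches, baseTokens, branchTokens) {
  return branches * (baseTokens + branchTokens);
}

console.log(callsPerRun());                        // 10
console.log(divergenceInputTokens(5, 0, 500));     // 2500 (bare algorithmic cost)
console.log(divergenceInputTokens(5, 20000, 500)); // 102500 (with a 20k-token base context)
```

With made-up numbers like these, the base context dominates the bill. That is why the caveat about large session context matters more than the branch count itself.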
How to install and trigger it
Installing is one command that auto-detects your agent across Claude Code, Cursor, Antigravity, Codex, Cline, Gemini CLI, Windsurf, and about 50 more.
    npx skills add UditAkhourii/adhd
Once installed, trigger it explicitly with /adhd "your problem", or let it auto-trigger on ideation intents. Codex has its own discovery path, so when the universal command fails to register, force the target:
    npx skills add UditAkhourii/adhd -a codex -g
You can also copy SKILL.md into Codex’s skills directory at ~/.codex/skills/adhd/ by hand; after a restart, /adhd "design a rate limiter" routes through the skill. CLI and library installs exist too: npm install -g adhd-agent for the CLI, npm install adhd-agent to use it as a library.
Before installing any third-party skill, read its SKILL.md to see what it asks the agent to do, especially whether it calls external commands. The OpenClaw skill security review guide is a useful reference here; spending 5 minutes on the permission boundaries beats cleaning up afterward.
About local models, to be straight with you: ADHD is built on the Agent SDK and defaults to Claude-class models, so it is not a local out-of-the-box tool. Wiring in a local model such as one served by Ollama means writing your own adapter at the call layer, and the project does not promise that path is smooth. Its place in this local-LLM series is about its thinking at the agent-reasoning layer, not a claim that you can feed it straight to a small local model.
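For a sense of what an adapter at the call layer might involve, here is a minimal sketch that wraps Ollama's `/api/chat` endpoint in a function with the same `{ systemPrompt, userPrompt }` input as the `callLLM` call in the article's snippet. Whether ADHD exposes a seam for swapping in such a function is not confirmed by the source. `makeOllamaCaller` and the injectable `fetchImpl` parameter are hypothetical, and the model name in the usage comment is only an example.

```javascript
// Returns a callLLM-style function backed by a local Ollama server.
// fetchImpl is injectable so the adapter can be exercised without a server.
function makeOllamaCaller({ model, baseUrl = "http://localhost:11434", fetchImpl = fetch }) {
  return async function callLLM({ systemPrompt, userPrompt }) {
    const response = await fetchImpl(`${baseUrl}/api/chat`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        stream: false, // ask for one JSON object instead of a token stream
        messages: [
          { role: "system", content: systemPrompt },
          { role: "user", content: userPrompt },
        ],
      }),
    });
    if (!response.ok) {
      throw new Error(`Ollama request failed with status ${response.status}`);
    }
    const data = await response.json();
    return data.message.content; // the assistant's reply text
  };
}

// Usage (needs a running Ollama server and a pulled model; the name is an example):
// const callLLM = makeOllamaCaller({ model: "llama3.2" });
// callLLM({ systemPrompt: "...", userPrompt: "..." }).then(console.log);
```

Even with an adapter like this, a small local model may not reliably follow the "JSON array out" instruction, so output parsing would likely need more defensive handling than with a Claude-class model.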
Conclusion
Treat ADHD as a tool you pull out at a decision point, not something that takes over your whole workflow. Its value is not “thinking more” but “thinking differently,” plus a separate review pass that calls out the traps. Run it once on an architecture decision you already have an answer for, compare the non-obvious option it returns against your own plan, and then decide whether to wire it into your agent loop.
To keep reading, see the 2026 AI coding tools panorama for where it sits in the tooling landscape, or the DeepAgents architecture breakdown for how sub-agents and planning tools organize reasoning.
10 min read · Published on: Jun 8, 2026 · Modified on: Jun 8, 2026