Goal mode: the evaluator loop inside harnext

Type /goal and harnext stops being a single agent. A smart model plans the work and grades every result; a faster executor does the hands-on coding. Here's how that planner–generator–evaluator loop works, and which models to drop into each seat.

One agent has a ceiling

A normal harnext run is a single agent: one model reads the code, decides what to do, writes the edit, runs it, and judges whether it's finished — all in one context window. It is both the author and the reviewer of its own work. For a quick fix, that's exactly what you want.

On a longer, multi-step task it's also where things drift. The model loses the thread of its original plan, marks a half-finished step "done," or carries a subtle regression forward because nothing independent ever checked it. The author is too close to the work to catch its own mistakes.

The fix isn't a bigger model. It's a second opinion — separating the hands that write the code from the eyes that review it.

The pattern: planner, generator, evaluator

That separation has a name. The planner–generator–evaluator pattern splits an agentic task into three roles, usually filled by two models:

  • Planner — breaks the task into a concrete, ordered plan.
  • Generator (the executor) — does each step: reads files, makes edits, runs commands.
  • Evaluator (the supervisor) — checks each result against the plan and the goal. Pass, and the work moves on. Fail, and it goes back with notes.

People call this GAN-inspired for a reason. In a generative adversarial network, a generator proposes and a discriminator critiques, and the generator gets better precisely because it has to satisfy an adversary. Goal mode borrows the shape — not the training. The executor proposes a diff, the evaluator critiques it against the goal, and a rejected diff goes straight back for another pass. The pressure of an independent reviewer is what lifts quality, the same reason code review works on human teams.

The planner and the evaluator are usually the same smart model wearing two hats. It wrote the plan, so it's the right judge of whether a step actually met it.

What /goal actually does

When a prompt starts with /goal, harnext runs the two-model loop instead of a single agent:

  1. Plan. The smart model reads the task and turns it into an ordered plan.
  2. Delegate. Step by step, it hands work to the executor.
  3. Build. The executor reads, edits, and runs code in the worktree to satisfy the step.
  4. Evaluate. The smart model checks the result — the diff — against the plan before it ever reaches you.
  5. Loop. If the smart model rejects the result, it goes back to the executor to fix, automatically. Only approved work surfaces.

[Diagram: /goal <task> enters the smart model (planner and evaluator, claude-opus-4-8), which plans and reviews the work. It delegates each step to the executor model (generator, claude-sonnet-4-6), which writes and runs the code and returns the diff. A rejected diff goes back for a retry; an approved diff goes to you.]

The executor does the token-heavy work; the smart model is invoked at the edges — once to plan, then once per result to review.

That division of labor matters. The executor does the grunt work — opening files, making edits, running tests — and burns most of the tokens. The smart model is invoked only at the edges: once to plan, then once per result to review. You see the diff after it has passed review, not before.
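The five steps above can be sketched as a small control loop. This is a minimal illustration, not harnext's implementation: the `plan`, `execute`, and `evaluate` callables are hypothetical stand-ins for model calls, and the retry cap is an assumption added so the sketch always terminates.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    approved: bool
    notes: str = ""  # reviewer feedback handed back to the executor on rejection


# Hypothetical signatures standing in for model calls.
Planner = Callable[[str], List[str]]            # task -> ordered steps
Executor = Callable[[str, str], str]            # (step, reviewer notes) -> diff
Evaluator = Callable[[str, str, str], Verdict]  # (task, step, diff) -> verdict


def run_goal(task: str, plan: Planner, execute: Executor,
             evaluate: Evaluator, max_retries: int = 3) -> List[str]:
    """Plan once, then build and review each step until it is approved."""
    approved_diffs: List[str] = []
    for step in plan(task):                       # smart model: one planning call
        notes = ""
        for _ in range(max_retries + 1):
            diff = execute(step, notes)           # executor: the token-heavy work
            verdict = evaluate(task, step, diff)  # smart model: one call per result
            if verdict.approved:
                approved_diffs.append(diff)       # only approved work surfaces
                break
            notes = verdict.notes                 # rejected: retry with the notes
        else:
            # Assumed behavior: give up on a step that never passes review.
            raise RuntimeError(f"step never passed review: {step}")
    return approved_diffs
```

Note where the smart model appears: one call to `plan`, then one call to `evaluate` per executor result. Everything inside `execute` runs on the cheaper seat.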

Tune it in harnext Desktop

In harnext Desktop the loop is yours to configure. Pick the model for the smart seat and the model for the executor seat independently — and because every call routes through the same provider layer, the two seats don't even have to be the same provider. Run Opus as the planner and a local model as the executor; pair a frontier reviewer with a cheap, fast coder; whatever fits the task and the budget.
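As a rough mental model, the two seats are two independent (provider, model) pairs. The sketch below is only an illustration of that shape: `Seat`, `GoalConfig`, the provider labels, and the local model name are all hypothetical, and harnext Desktop's actual settings format is not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seat:
    provider: str            # hypothetical provider label
    model: str               # model identifier passed to that provider
    effort: str = "default"  # hypothetical effort setting


@dataclass(frozen=True)
class GoalConfig:
    smart: Seat     # planner and evaluator
    executor: Seat  # generator


# A frontier model plans and reviews; a local model does the editing.
config = GoalConfig(
    smart=Seat(provider="anthropic", model="claude-opus-4-8", effort="high"),
    executor=Seat(provider="local", model="my-local-coder"),  # placeholder name
)
```

Because each seat carries its own provider, nothing ties the reviewer and the coder to the same vendor.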

The two seats have genuinely different jobs, so they want different models.

Which models go where

The smart seat — planner & evaluator

This is the judgment seat. It decomposes the task and — more importantly — has to catch the bug the executor missed. That's the hardest job in the loop, and it's where a stronger model pays off most. Default to Claude Opus 4.8 (claude-opus-4-8); reach for Claude Fable 5 (claude-fable-5) on the hardest, longest-horizon tasks. Run it at high or x-high effort and give it the full goal up front — frontier models plan best when they can see the whole task at once. (Goal mode is essentially that advice turned into a feature.)

The executor seat — generator

This seat needs to be a strong, fast coder, but it works inside a tight spec the planner already wrote, with a reviewer backing it up, so it doesn't need to be the smartest model in the room. Claude Sonnet 4.6 (claude-sonnet-4-6) is the sweet spot: quick, capable at edits and tool use, and three-fifths of the default planner's price ($3 / $15 versus $5 / $25 per 1M tokens). For mechanical or high-volume edits, Claude Haiku 4.5 (claude-haiku-4-5), or a local model, drops the cost further.

Model                            Fits the…                      Context   Input / Output ($ per 1M tokens)
claude-opus-4-8 (Opus 4.8)       Smart seat (default)           1M        $5 / $25
claude-fable-5 (Fable 5)         Smart seat (hardest work)      1M        $10 / $50
claude-sonnet-4-6 (Sonnet 4.6)   Executor (default)             1M        $3 / $15
claude-haiku-4-5 (Haiku 4.5)     Executor (cheap / mechanical)  200K      $1 / $5

Why this split is cheaper than it looks

Counterintuitively, putting your most expensive model in the loop can lower the bill. The planner runs a handful of times — one plan, one review per step. The executor runs constantly and burns most of the tokens. Pair an expensive reviewer with a cheaper executor and you spend the premium only where judgment matters, while the bulk editing runs on the cheaper seat. A smart reviewer also fails fast: it catches a wrong turn at step two instead of letting the executor build three more steps on a broken foundation.
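To make the arithmetic concrete, here is a back-of-the-envelope calculation using the prices from the table above. The token counts are invented for illustration (an executor that uses ten times the tokens of the reviewer); real runs will differ, and the sketch ignores retries and caching.

```python
# (input $, output $) per 1M tokens, taken from the pricing table in this post.
PRICES = {
    "claude-opus-4-8": (5.0, 25.0),
    "claude-sonnet-4-6": (3.0, 15.0),
}

# Assumed token usage per seat for one /goal run (illustrative numbers only).
SMART_TOKENS = (200_000, 40_000)         # (input, output): one plan plus reviews
EXECUTOR_TOKENS = (2_000_000, 400_000)   # (input, output): reading, editing, testing


def seat_cost(model: str, tokens: tuple) -> float:
    """Dollar cost of one seat given its (input, output) token counts."""
    in_price, out_price = PRICES[model]
    in_tokens, out_tokens = tokens
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def run_cost(smart_model: str, executor_model: str) -> float:
    """Total dollar cost of a run with the given model in each seat."""
    return (seat_cost(smart_model, SMART_TOKENS)
            + seat_cost(executor_model, EXECUTOR_TOKENS))


if __name__ == "__main__":
    print(f"Opus reviews, Sonnet executes: ${run_cost('claude-opus-4-8', 'claude-sonnet-4-6'):.2f}")
    print(f"Opus in both seats:            ${run_cost('claude-opus-4-8', 'claude-opus-4-8'):.2f}")
    print(f"Sonnet in both seats:          ${run_cost('claude-sonnet-4-6', 'claude-sonnet-4-6'):.2f}")
```

Under these assumed counts the split run costs $14.00, against $22.00 with Opus in both seats and $13.20 with Sonnet in both. The stronger reviewer adds $0.80 to the run, because the premium applies only to the small share of tokens the smart seat consumes.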

Rule of thumb
Never make the evaluator weaker than the executor. A reviewer that can't out-think the coder just rubber-stamps bad diffs — and you've paid for a loop that does nothing.

When to reach for /goal

  • Multi-step, well-specified tasks: a migration, a feature with clear acceptance criteria, a refactor that spans files. The clearer the goal, the better the planner plans and the sharper the evaluator's verdict.
  • Skip it for quick local edits. A one-line fix doesn't need a plan-and-review loop; a single agent is faster and cheaper.
  • Write the goal like a spec. "Make it faster" gives the evaluator nothing to check. "Cut p95 latency on /search below 200ms and keep every test green" gives it a finish line.
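One way to see why the second goal works: it reduces to a check that returns true or false. The sketch below is an assumption about what such a check could look like, not how harnext's evaluator measures anything; the function names, the nearest-rank percentile method, and the sample data are all illustrative.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def goal_met(latencies_ms: Sequence[float], tests_passed: bool,
             budget_ms: float = 200.0) -> bool:
    """'Cut p95 latency below 200ms and keep every test green' as a predicate."""
    return tests_passed and p95(latencies_ms) < budget_ms


if __name__ == "__main__":
    samples = [150.0] * 19 + [500.0]  # one slow outlier in twenty requests
    print(goal_met(samples, tests_passed=True))
```

"Make it faster" has no equivalent function: there is no threshold to compare against, so a reviewer can only offer an opinion.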
