deepagent-hermes on Terminal-Bench 2.0

LLMs
Agents
Benchmarks
LangGraph
Wiring a reflection-loop agent (deepagent-hermes) into Harbor and running it against the Terminal-Bench 2.0 reproduction benchmark. What the score actually measures, what fails, and how it compares to the leaderboard.
Published June 4, 2026

Why bother

deepagent-hermes is a faithful reproduction of Nous Research’s Hermes Agent — closed reflection loop, agentskills.io-spec skills, frozen-snapshot memory, SQLite-FTS5 session search — built on top of LangGraph + deepagents. It runs locally, ships skills the agent can read and write at runtime, and has 442 passing tests for the moving parts.

What it didn’t have until this week was a number.

Tests verify wiring; they don’t tell you whether the agent can do anything useful. Terminal-Bench 2.0 is the closest external check — 89 tasks that exercise real terminal workflows (build a MIPS interpreter, recover a corrupted git repo, configure nginx, train a tiny model) inside ephemeral Docker containers. Every top agent on the leaderboard publishes its score there, so once hermes runs, it can be compared honestly against the field rather than against my own expectations.

This note covers what it took to wire the adapter, what the result looks like, and where the loop breaks.

What had to change

Harbor (the framework that backs Terminal-Bench) loads agents by import path. An agent is anything that subclasses harbor.agents.base.BaseAgent and implements an async run(instruction, environment, context) method. The only primitive Harbor offers the agent is environment.exec("...") — a coroutine that runs a shell command inside the task container and returns stdout/stderr/return code.

Hermes, by contrast, runs on the host. It owns its own filesystem tools (via deepagents’ BackendProtocol), its own SQLite checkpointer, its own iteration-counting reflection middleware. The mismatch isn’t surface-deep: hermes assumes synchronous file I/O against a local workspace; Harbor assumes the agent shells everything through one async primitive.

The bridge that works is a BaseSandbox subclass — HarborSandboxBackend — that implements deepagents’ sync execute() / upload_files() / download_files() and proxies each call through asyncio.run_coroutine_threadsafe(env.exec(...), loop). The agent runs on a worker thread (via asyncio.to_thread(graph.invoke, ...)); inside the agent, each tool call lands in our sync wrapper, which schedules the real env.exec coroutine back onto Harbor’s event loop and blocks for the result. The whole hermes middleware stack — reflection, memory, skills, FTS5 recorder, compression — runs unchanged. The container is just a different filesystem.

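import asyncio

# Import paths assume a recent deepagents release.
from deepagents.backends.protocol import ExecuteResponse
from deepagents.backends.sandbox import BaseSandbox

class HarborSandboxBackend(BaseSandbox):
    def execute(self, command, *, timeout=None):
        limit = timeout or 300  # seconds
        # Runs on the agent's worker thread: schedule env.exec on Harbor's
        # event loop (self._loop) and block until it returns.
        fut = asyncio.run_coroutine_threadsafe(
            self._env.exec(command, timeout_sec=limit),
            self._loop,
        )
        # 30 s of slack so the container-side timeout fires first.
        result = fut.result(timeout=limit + 30)
        return ExecuteResponse(
            output=(result.stdout or "") + (result.stderr or ""),
            exit_code=result.return_code,
        )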
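
The same sync-over-async pattern can be tried without Harbor or deepagents installed. The sketch below is a stand-alone illustration, not the adapter's code: FakeEnv, SyncBridge and sync_agent are hypothetical stand-ins for Harbor's environment, the sandbox backend and graph.invoke.

```python
import asyncio


class FakeEnv:
    """Hypothetical stand-in for Harbor's environment: one async exec()."""

    async def exec(self, command: str) -> str:
        await asyncio.sleep(0.01)  # pretend the container is doing work
        return f"ran: {command}"


class SyncBridge:
    """Sync facade that forwards each call to a coroutine on another loop."""

    def __init__(self, env: FakeEnv, loop: asyncio.AbstractEventLoop) -> None:
        self._env = env
        self._loop = loop

    def execute(self, command: str, timeout: float = 5.0) -> str:
        # Must be called from a thread other than the one running self._loop,
        # otherwise fut.result() would block the loop and deadlock.
        fut = asyncio.run_coroutine_threadsafe(self._env.exec(command), self._loop)
        return fut.result(timeout=timeout)


def sync_agent(bridge: SyncBridge) -> list[str]:
    """Plays the role of graph.invoke: plain blocking code, no awaits."""
    return [bridge.execute("ls /app"), bridge.execute("make test")]


async def run_bridge_demo() -> list[str]:
    loop = asyncio.get_running_loop()
    bridge = SyncBridge(FakeEnv(), loop)
    # The event loop stays free to service exec() while the agent blocks
    # on a worker thread.
    return await asyncio.to_thread(sync_agent, bridge)


bridge_demo_output = asyncio.run(run_bridge_demo())
```

The one rule that matters is in the comment on execute(): the blocking call has to happen off the loop's own thread, which is what asyncio.to_thread guarantees here.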

Two non-obvious things bit during the smoke test:

LangGraph’s SqliteSaver is sync-only. await graph.ainvoke(...) against it crashes inside aget_tuple with NotImplementedError. The async checkpointer is a separate import (langgraph.checkpoint.sqlite.aio.AsyncSqliteSaver) that pulls in aiosqlite. Wrapping graph.invoke in asyncio.to_thread was simpler than adding a dep and a per-host switch.

Harbor hands the agent an AgentContext whose metadata field is None. The first crash logged through the error path immediately TypeError’d on context.metadata["error"] = .... Adding if context.metadata is None: context.metadata = {} at the top of run() is the whole fix.
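
A minimal sketch of that guard, using a hypothetical StubContext in place of Harbor's AgentContext (only the metadata attribute is modelled, and record_error is an illustrative helper, not a name from the adapter):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StubContext:
    """Hypothetical stand-in for Harbor's AgentContext."""

    metadata: Optional[dict[str, Any]] = None


def record_error(context: StubContext, exc: BaseException) -> None:
    # Normalise None to a dict before the first write so the error path
    # cannot mask the original exception with a TypeError.
    if context.metadata is None:
        context.metadata = {}
    context.metadata["error"] = f"{type(exc).__name__}: {exc}"


ctx = StubContext()
try:
    raise RuntimeError("tool call failed")
except RuntimeError as exc:
    record_error(ctx, exc)
```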

The adapter is one file — examples/terminal_bench.py — about 300 lines including the docstring. The create_hermes_agent factory gained one optional kwarg, backend, so the bridge plugs in without forking the core.

What the run actually does

Each task is a Docker compose stack. Harbor brings it up, hands the agent the task instruction string (e.g. “In /app/source, write mips_interp.py so python mips_interp.py prog.asm runs the MIPS assembly and prints register state to stdout.”), waits up to a per-task timeout, then runs the task’s verifier (a separate verifier container with its own script) to score pass/fail.

A fresh HERMES_HOME is minted per task under /tmp/hermes-bench/. That keeps tasks isolated — no skill or memory bleed — but also means hermes can’t yet learn across tasks during a suite. The skill mutation log I shipped this week would let me change that later (warm-start each task from the best-performing snapshot of a previous run, audit-logged so I can see what carried over). For this first number, isolated is fairer.

Inside each task, the loop is exactly the deepagent-hermes loop:

  • system prompt assembled fresh per turn (skill index + bundled toolset docs + frozen MEMORY.md/USER.md)
  • model call (Claude Sonnet 4.5 via the direct Anthropic API for this run; the earlier smoke test went through OpenRouter)
  • tool dispatch (file ops + env.exec shell, all proxied)
  • reflection middleware counts tool iterations; at 10, the review subagent forks with the same prefix cache and writes a SKILL.md if a pattern repeated
  • iteration budget tightens after 10 user turns; the run terminates when the agent declares done or the budget runs out
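
The iteration-counting step in the list above can be pictured with a toy counter. This is a hypothetical sketch of the idea only: ReflectionCounter and its callback are illustrative names, not the hermes middleware, and the real review subagent does far more than receive a list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ReflectionCounter:
    """Hypothetical counter: calls `review` once every `interval` tool calls."""

    review: Callable[[list[str]], None]
    interval: int = 10
    history: list[str] = field(default_factory=list)

    def on_tool_call(self, name: str) -> None:
        self.history.append(name)
        if len(self.history) % self.interval == 0:
            # Hand the reviewer the most recent window of tool calls.
            self.review(self.history[-self.interval:])


review_windows: list[list[str]] = []
counter = ReflectionCounter(review=review_windows.append)
for i in range(25):
    counter.on_tool_call(f"tool_{i}")
```

With 25 tool calls and an interval of 10, the reviewer is invoked twice, after calls 10 and 20, and each time sees only the latest ten calls.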

The number

10-task sample from Terminal-Bench 2.0, Anthropic Claude Sonnet 4.5 (direct API), n-concurrent=4, 46 min wall, $4.84 in API.

| Metric | deepagent-hermes |
| --- | --- |
| Resolved | 2 / 10 (20%) |
| Resolved, excluding the 3 credit-exhausted trials | 2 / 7 (29%) |
| Total cost | $4.84 |
| Mean cost / task | $0.48 |
| Mean wall / task | 844 s (≈14 min) |
| Total tokens | 9.7M in / 91K out |
| Cache hit rate | 97.9% (9.5M of 9.7M input tokens were cache_read) |

Per-task breakdown

| Task | Resolved | Wall (s) | Cost ($) | Failure mode |
| --- | --- | --- | --- | --- |
| break-filter-js-from-html | yes | 961 | 0.93 | |
| build-pov-ray | yes | 1051 | 0.77 | |
| circuit-fibsqrt | no | 1535 | 2.58 | wrong answer (verifier said no) |
| compile-compcert | no | 792 | 0.00 | credit exhausted (Anthropic billing) |
| distribution-search | no | 1579 | 0.00 | credit exhausted |
| make-mips-interpreter | no | 291 | 0.00 | adapter: argv too long on a large upload |
| overfull-hbox | no | | 0.00 | timeout (agent kept recompiling LaTeX) |
| path-tracing | no | 991 | 0.00 | credit exhausted |
| protein-assembly | no | 5 | 0.04 | model returned empty content, agent loop exited |
| video-processing | no | 396 | 0.52 | wrong answer |

All 10 trials completed, but only 7 actually exercised the agent. The other 3 hit Anthropic’s “credit balance too low” mid-suite: my account ran dry, not a hermes bug. Of the 7, one hit a real adapter bug (argv overflow on a >200 KB file write; the fix is a chunked upload) and one hit a model-side oddity: protein-assembly got an empty assistant response from Sonnet on the first turn and the loop exited five seconds in.
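
One way a chunked upload can work is sketched below. It is an illustration under stated assumptions, not the adapter's code: chunked_write_commands, the .b64.part temp-file suffix and the 64 KiB chunk size are all hypothetical, and the container is assumed to have printf and a base64 binary that accepts -d.

```python
import base64
import shlex


def chunked_write_commands(dest: str, data: bytes, chunk_chars: int = 64 * 1024) -> list[str]:
    """Build shell commands that write `data` to `dest` without any single
    command carrying more than `chunk_chars` characters of payload."""
    encoded = base64.b64encode(data).decode("ascii")
    tmp = dest + ".b64.part"  # hypothetical temp-file naming
    q_tmp, q_dest = shlex.quote(tmp), shlex.quote(dest)
    commands = [f": > {q_tmp}"]  # create or truncate the temp file
    for start in range(0, len(encoded), chunk_chars):
        chunk = encoded[start:start + chunk_chars]
        # The base64 alphabet (A-Z a-z 0-9 + / =) has no shell
        # metacharacters, so single quotes are enough.
        commands.append(f"printf %s '{chunk}' >> {q_tmp}")
    commands.append(f"base64 -d {q_tmp} > {q_dest} && rm -f {q_tmp}")
    return commands


payload = bytes(range(256)) * 800  # 204,800 bytes, about the size that failed
upload_commands = chunked_write_commands("/app/out.js", payload)
```

Each command would then go through the exec primitive one at a time, so no single command line comes near the kernel's per-argument limit.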

Where 20% lands on the leaderboard

Top of the public Terminal-Bench 2.0 leaderboard, sampled the same day:

| Rank | Agent | Model | Score |
| --- | --- | --- | --- |
| 1 | vix | Claude Opus 4.7 | 90.2% |
| 2 | JJAgent | Multiple | 87.1% |
| 6 | Codex CLI | GPT-5.5 | 82.2% |
| 13 | Meta-Harness (Stanford IRIS) | Claude Opus 4.6 | 76.4% |
| 16 | Capy | Claude Opus 4.6 | 75.3% |
| 19 | Terminus-KIRA | Claude Opus 4.6 | 74.7% |
| | deepagent-hermes (this run) | Claude Sonnet 4.5 | 20% (2/10 sample) |

Three caveats before reading too much into the gap:

  • Model gap matters more than harness gap. The frontier agents (#1, #6, #13) all use the strongest available model for their family — Opus 4.7, GPT-5.5, Opus 4.6. Sonnet 4.5 is a tier weaker; the right same-model peer on the leaderboard would also need to be running Sonnet, not Opus.
  • Optimization gap matters too. vix and JJAgent ship months of harness-level tuning — tmux session management, scratchpad disciplines, output-truncation policies, prompt engineering. The deepagent-hermes adapter is v1 and runs with hermes defaults. Closing that gap is a separate project.
  • The sample is 10 tasks, not 89. Two random successes in 10 puts the true rate somewhere in 5–50% with high confidence; calling it “20%” is a point estimate, not a precise number. A full 89-task run is the only way to put a real number next to the leaderboard, but my Anthropic credits ran out three trials short on this sample — see the failure table above.
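
To make the sample-size caveat concrete, here is a small sketch that computes a 95% Wilson score interval for 2 successes in 10 trials. The choice of the Wilson interval is mine; other interval methods give somewhat different bounds.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


low, high = wilson_interval(2, 10)
```

For 2 of 10 this gives roughly 6% to 51%, in line with the 5–50% range quoted above.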

What the number is good for: a public, reproducible reference point at v1. Anyone with the repo, an Anthropic key, and docker can re-run examples/run_terminal_bench.sh 10 and get a comparable result in roughly an hour.

What broke and what didn’t

A reproduction-of-an-existing-system benchmark answers two questions: did the architecture port faithfully? and what did the port surface that pure unit tests missed? This run answered both.

Adapter bugs caught by Harbor that 442 passing unit tests missed:

  1. argv overflow on large upload_files writes. HarborSandboxBackend.upload_files was base64-encoding file content into a single shell command line. Linux caps any single argument string at 128 KiB (MAX_ARG_STRLEN), and base64 inflates a ~200 KB file to roughly 270 KB. When the agent tried to write a JavaScript file of that size mid-task during make-mips-interpreter, the docker exec call blew up with OSError(7) "Argument list too long", and the trial crashed after 291 s with no reward and no token totals. Fix: chunked write through a temp file when the base64 exceeds 64 KB.

  2. AgentContext.metadata arrives as None. Harbor passes the agent a freshly-built context where metadata isn’t an empty dict but None. The error path immediately TypeErrored when trying to log a crash, swallowing the actual error. Trivial fix (one-line init), but exactly the kind of integration-contract assumption a unit test doesn’t have a reason to challenge.

  3. SqliteSaver doesn’t implement aget_tuple. The agent’s LangGraph checkpointer is the sync SqliteSaver. await graph.ainvoke(...) against it raises NotImplementedError deep in the loop’s __aenter__. Every existing call site uses .invoke(), so no test caught it. The adapter routes through asyncio.to_thread(graph.invoke, ...) — the agent runs sync on a worker thread; the HarborSandboxBackend bridges back to the event loop via run_coroutine_threadsafe to call env.exec.

  4. Cost-estimate unit bug. My adapter read prices from HERMES_BENCH_COST_PER_INPUT_KTOK-style env vars and divided token counts by 1_000, but the values I set were per-megatoken prices, so the divisor should have been 1_000_000. It also charged all input tokens at the fresh-input rate instead of breaking out cache reads. The first job dutifully reported cost_usd = $30,575 for a $4.84 run. Renamed to _MTOK, split fresh-vs-cached, life makes sense again.
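
The arithmetic behind bug 4 is easy to reproduce. The sketch below is illustrative only: the per-megatoken prices are assumptions to be checked against the provider's current price list, the token counts are the rounded figures from the results table, cache-write pricing is ignored, and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricePerMTok:
    """USD per million tokens. Defaults are illustrative assumptions."""

    fresh_input: float = 3.00
    cache_read: float = 0.30
    output: float = 15.00


def estimate_cost(total_input: int, cache_read: int, output: int,
                  price: PricePerMTok = PricePerMTok()) -> float:
    fresh = total_input - cache_read  # only uncached input pays full price
    per_tok = 1_000_000               # prices are quoted per megatoken
    return (fresh * price.fresh_input
            + cache_read * price.cache_read
            + output * price.output) / per_tok


def buggy_estimate(total_input: int, output: int,
                   price: PricePerMTok = PricePerMTok()) -> float:
    # Reproduces both mistakes: divisor of 1_000 and no cache split.
    return (total_input * price.fresh_input + output * price.output) / 1_000


good = estimate_cost(9_700_000, 9_500_000, 91_000)
bad = buggy_estimate(9_700_000, 91_000)
```

With these assumed prices and rounded counts, the corrected estimate comes to about $4.8 and the buggy one to about $30,000, the same orders of magnitude as the two figures reported above.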

Cost surprise (not strictly a bug, but a deployment trap):

The first smoke ran through OpenRouter. cache_read=0 on every model call: OpenRouter’s proxy strips Anthropic’s cache_control markers from the request body, so hermes’s ~30k-token system prompt (skill index + bundled toolset docs + frozen MEMORY.md/USER.md) paid full input price every turn. One task, make-mips-interpreter, burned 5 M input tokens (~$15.75) before finishing with reward 0.0.

Switching the same agent to direct Anthropic cut per-task cost from ~$15 to ~$0.48, a ~30× drop. The next 10-task run hit 97.9% cache-read rate (9.5 M of 9.7 M input tokens), which is what hermes’s AnthropicCachingS3Middleware is supposed to deliver and what OpenRouter quietly broke. If you’re running long-prompt agents through a router, verify cache_read > 0 within the first few model calls before scaling (the very first call on a cold cache only writes the prefix, so reads show up from the second call on).
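
A quick pre-flight check along those lines might look like the sketch below. It assumes each model call's usage is available as a dict with the Anthropic Messages API field names (input_tokens, cache_creation_input_tokens, cache_read_input_tokens); how you collect those dicts depends on your client, and both helper names are hypothetical.

```python
def cache_read_ratio(usages: list[dict]) -> float:
    """Fraction of all input tokens that were served from cache."""
    cached = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    fresh = sum(u.get("input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = cached + fresh + written
    return cached / total if total else 0.0


def caching_is_live(usages: list[dict]) -> bool:
    # The first call on a cold cache writes the prefix; reads should show
    # up from the second call onward.
    return any(u.get("cache_read_input_tokens", 0) > 0 for u in usages[1:])


# Two illustrative two-call traces: one with working caching, one where
# the cache fields never appear (as in the routed smoke run).
healthy = [
    {"input_tokens": 500, "cache_creation_input_tokens": 30_000, "cache_read_input_tokens": 0},
    {"input_tokens": 700, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 30_000},
]
routed = [{"input_tokens": 30_500}, {"input_tokens": 30_700}]
```

Running caching_is_live on the first two or three calls of a single task is a cheap check to make before launching a full suite.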

What the reflection loop did:

Zero skill mutations recorded across all 10 trials. That’s consistent with the system’s design — the reflection middleware fires the review subagent at every 10 tool iterations, but the subagent only writes a SKILL.md when it sees a repeated procedure worth saving, and the per-task HERMES_HOME isolation means no procedure has a chance to repeat across tasks. Inside a single task, the same procedure rarely repeats either; the agent edits one file, runs one compile, observes one error. The closed loop closes only when the agent (or the dataset) shows up with multi-task arcs.

The audit log captured everything anyway. After the run, each task’s state.db is browsable via hermes audit log — message history, reflection invocations, tool-call timings — without re-running anything. That’s a useful side-channel for post-mortems that the leaderboard doesn’t surface.

What’s next

The interesting follow-ups are about the reflection loop, not the score:

  • The full 89. Top up credits, fix the four bugs surfaced here (argv, metadata-None, async-saver, cost-unit), and run the full Terminal-Bench 2.0 dataset. That gives a real number to put on the leaderboard rather than a 10-task sample.
  • Warm-start across tasks. The reflection loop produced zero skill mutations because per-task isolation gives nothing a chance to repeat. With the audit log, I can dump the best-performing skill snapshot from one run and load it as the starting state for the next — and ask: does the agent’s self-curated skill library on a fresh suite produce a measurably different score than running with zero skills? If yes, hermes’s value proposition has empirical support beyond “tests pass.”
  • Same-model Terminus-2 comparison. Harbor ships Terminus-2 (a tmux-driven ReAct agent) as the reference. Running both Terminus-2 and deepagent-hermes on the same tasks with the same model isolates adapter and reflection-loop overhead from model strength — that’s the comparison that actually answers “did the architecture port well?”

Repo: https://github.com/dkedar7/deepagent-hermes. The adapter is at examples/terminal_bench.py; the launcher is examples/run_terminal_bench.sh.