deepagent-hermes on Terminal-Bench 2.0
Why bother
deepagent-hermes is a faithful reproduction of Nous Research’s Hermes Agent — closed reflection loop, agentskills.io-spec skills, frozen-snapshot memory, SQLite-FTS5 session search — built on top of LangGraph + deepagents. It runs locally, ships skills the agent can read and write at runtime, and has 442 passing tests for the moving parts.
What it didn’t have until this week was a number.
Tests verify wiring; they don’t tell you whether the agent can do anything useful. Terminal-Bench 2.0 is the closest external check — 89 tasks that exercise real terminal workflows (build a MIPS interpreter, recover a corrupted git repo, configure nginx, train a tiny model) inside ephemeral Docker containers. Every top agent on the leaderboard publishes its score there, so once hermes runs, it can be compared honestly against the field rather than against my own expectations.
This note covers what it took to wire the adapter, what the result looks like, and where the loop breaks.
What had to change
Harbor (the framework that backs Terminal-Bench) loads agents by import path. An agent is anything that subclasses harbor.agents.base.BaseAgent and implements an async run(instruction, environment, context) method. The only primitive Harbor offers the agent is environment.exec("...") — a coroutine that runs a shell command inside the task container and returns stdout/stderr/return code.
Hermes, by contrast, runs on the host. It owns its own filesystem tools (via deepagents’ BackendProtocol), its own SQLite checkpointer, its own iteration-counting reflection middleware. The mismatch isn’t surface-deep: hermes assumes synchronous file I/O against a local workspace; Harbor assumes the agent shells everything through one async primitive.
The bridge that works is a BaseSandbox subclass — HarborSandboxBackend — that implements deepagents’ sync execute() / upload_files() / download_files() and proxies each call through asyncio.run_coroutine_threadsafe(env.exec(...), loop). The agent runs on a worker thread (via asyncio.to_thread(graph.invoke, ...)); inside the agent, each tool call lands in our sync wrapper, which schedules the real env.exec coroutine back onto Harbor’s event loop and blocks for the result. The whole hermes middleware stack — reflection, memory, skills, FTS5 recorder, compression — runs unchanged. The container is just a different filesystem.
```python
import asyncio

class HarborSandboxBackend(BaseSandbox):
    def execute(self, command, *, timeout=None):
        # Schedule Harbor's async exec on its event loop from this worker thread.
        fut = asyncio.run_coroutine_threadsafe(
            self._env.exec(command, timeout_sec=timeout or 300),
            self._loop,
        )
        # Block until the command finishes; allow 30 s of slack over the exec timeout.
        result = fut.result(timeout=(timeout or 300) + 30)
        return ExecuteResponse(
            output=(result.stdout or "") + (result.stderr or ""),
            exit_code=result.return_code,
        )
```

Two non-obvious things bit during smoke:
- LangGraph’s SqliteSaver is sync-only. await graph.ainvoke(...) against it crashes inside aget_tuple with NotImplementedError. The async checkpointer is a separate import (langgraph.checkpoint.sqlite.aio.AsyncSqliteSaver) that pulls in aiosqlite. Wrapping graph.invoke in asyncio.to_thread was simpler than adding a dep and a per-host switch.
- Harbor hands the agent an AgentContext whose metadata field is None. The first crash that went through the error path immediately raised a TypeError on context.metadata["error"] = .... Adding if context.metadata is None: context.metadata = {} at the top of run() is the whole fix.
The adapter is one file — examples/terminal_bench.py — about 300 lines including the docstring. The create_hermes_agent factory gained one optional kwarg, backend, so the bridge plugs in without forking the core.
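The thread-to-event-loop bridge described above can be tried in isolation. The following is a minimal sketch, not the adapter itself: fake_exec stands in for Harbor's env.exec, sync_agent stands in for graph.invoke, and SyncBridge is a hypothetical name for the role HarborSandboxBackend plays.

```python
import asyncio


async def fake_exec(command: str) -> str:
    """Hypothetical stand-in for Harbor's async env.exec (no container involved)."""
    await asyncio.sleep(0.01)
    return f"ran: {command}"


class SyncBridge:
    """Expose an async coroutine as a blocking call for code on a worker thread."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop

    def execute(self, command: str, timeout: float = 5.0) -> str:
        # Schedule the coroutine on the main event loop, then block this
        # worker thread until the result is available.
        fut = asyncio.run_coroutine_threadsafe(fake_exec(command), self._loop)
        return fut.result(timeout=timeout)


def sync_agent(bridge: SyncBridge) -> list:
    """Hypothetical stand-in for graph.invoke: plain sync code calling tools."""
    return [bridge.execute("ls"), bridge.execute("pwd")]


async def run() -> list:
    bridge = SyncBridge(asyncio.get_running_loop())
    # The sync agent runs on a worker thread, so the event loop stays free
    # to service the coroutines the bridge schedules onto it.
    return await asyncio.to_thread(sync_agent, bridge)


outputs = asyncio.run(run())
print(outputs)
```

The design point is that the event loop must never be the thread that blocks on fut.result(); calling the bridge directly from a coroutine on that loop would deadlock.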
What the run actually does
Each task is a Docker compose stack. Harbor brings it up, hands the agent the task instruction string (e.g. “In /app/source, write mips_interp.py so python mips_interp.py prog.asm runs the MIPS assembly and prints register state to stdout.”), waits up to a per-task timeout, then runs the task’s verifier (a separate verifier container with its own script) to score pass/fail.
A fresh HERMES_HOME is minted per task under /tmp/hermes-bench/. That keeps tasks isolated — no skill or memory bleed — but also means hermes can’t yet learn across tasks during a suite. The skill mutation log I shipped this week would let me change that later (warm-start each task from the best-performing snapshot of a previous run, audit-logged so I can see what carried over). For this first number, isolated is fairer.
Inside each task, the loop is exactly the deepagent-hermes loop:
- system prompt assembled fresh per turn (skill index + bundled toolset docs + frozen MEMORY.md/USER.md)
- model call (Anthropic Claude Sonnet 4.5 via the direct Anthropic API for the scored run; the first smoke went through OpenRouter)
- tool dispatch (file ops + env.exec shell, all proxied)
- reflection middleware counts tool iterations; at 10, the review subagent forks with the same prefix cache and writes a SKILL.md if a pattern repeated
- iteration budget tightens after 10 user turns; the run terminates when the agent declares done or the budget runs out
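The iteration-counting step in that loop can be pictured with a toy counter. This is a sketch under loud assumptions: ReflectionCounter, on_tool_call, and the review callback are hypothetical names for illustration, not hermes's middleware API, and the real review subagent does far more than receive a list of tool names.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ReflectionCounter:
    """Hypothetical counter: trigger a review every `every` tool calls."""

    review: Callable[[List[str]], None]
    every: int = 10
    _calls: List[str] = field(default_factory=list)

    def on_tool_call(self, name: str) -> None:
        self._calls.append(name)
        if len(self._calls) % self.every == 0:
            # Hand the most recent window of tool calls to the reviewer.
            self.review(list(self._calls[-self.every:]))


reviews: List[List[str]] = []
counter = ReflectionCounter(review=reviews.append)
for i in range(25):
    counter.on_tool_call(f"tool_{i}")

print(len(reviews))  # reviews fired after call 10 and call 20
```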
The number
10-task sample from Terminal-Bench 2.0, Anthropic Claude Sonnet 4.5 (direct API), n-concurrent=4, 46 min wall, $4.84 in API.
| Metric | deepagent-hermes |
|---|---|
| Resolved | 2 / 10 (20%) |
| Resolved excluding credit-exhausted trials | 2 / 7 (29%) |
| Total cost | $4.84 |
| Mean cost / task | $0.48 |
| Mean wall / task | 844 s (≈14 min) |
| Total tokens | 9.7M in / 91K out |
| Cache hit rate | 97.9% (9.5M of 9.7M input tokens were cache_read) |
Per-task breakdown
| Task | Resolved | Wall (s) | Cost ($) | Failure mode |
|---|---|---|---|---|
| break-filter-js-from-html | ✅ | 961 | 0.93 | — |
| build-pov-ray | ✅ | 1051 | 0.77 | — |
| circuit-fibsqrt | ❌ | 1535 | 2.58 | wrong answer (verifier said no) |
| compile-compcert | ❌ | 792 | 0.00 | credit exhausted (Anthropic billing) |
| distribution-search | ❌ | 1579 | 0.00 | credit exhausted |
| make-mips-interpreter | ❌ | 291 | 0.00 | adapter: argv too long on a large upload |
| overfull-hbox | ❌ | — | 0.00 | timeout (agent kept recompiling LaTeX) |
| path-tracing | ❌ | 991 | 0.00 | credit exhausted |
| protein-assembly | ❌ | 5 | 0.04 | model returned empty content, agent loop exited |
| video-processing | ❌ | 396 | 0.52 | wrong answer |
All 10 trials completed, but only 7 ran with API credit available. The other 3 hit Anthropic’s “credit balance too low” mid-suite: my account ran dry, which is not a hermes bug. Of the 7, one hit a real adapter bug (argv overflow on a ~200 KB file write; the fix is a chunked upload), one timed out, and two returned wrong answers. One was a model-side oddity: protein-assembly got an empty assistant response from Sonnet on the first turn and the loop exited five seconds in.
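The summary figures can be re-derived from the per-task table. The sketch below copies the table values verbatim; the failure category strings are my own shorthand, and treating overfull-hbox as having no recorded wall time is an assumption based on the dash in its row.

```python
# (task, resolved, wall seconds or None, cost USD, failure category or None)
TRIALS = [
    ("break-filter-js-from-html", True, 961, 0.93, None),
    ("build-pov-ray", True, 1051, 0.77, None),
    ("circuit-fibsqrt", False, 1535, 2.58, "wrong-answer"),
    ("compile-compcert", False, 792, 0.00, "credit"),
    ("distribution-search", False, 1579, 0.00, "credit"),
    ("make-mips-interpreter", False, 291, 0.00, "adapter"),
    ("overfull-hbox", False, None, 0.00, "timeout"),
    ("path-tracing", False, 991, 0.00, "credit"),
    ("protein-assembly", False, 5, 0.04, "empty-response"),
    ("video-processing", False, 396, 0.52, "wrong-answer"),
]

resolved = sum(1 for t in TRIALS if t[1])
# Trials that did not die on an exhausted credit balance.
funded = [t for t in TRIALS if t[4] != "credit"]
total_cost = round(sum(t[3] for t in TRIALS), 2)
# Mean wall time over the trials that recorded one.
walls = [t[2] for t in TRIALS if t[2] is not None]
mean_wall = sum(walls) / len(walls)

print(f"resolved: {resolved}/{len(TRIALS)}")
print(f"resolved among funded trials: {resolved}/{len(funded)}")
print(f"total cost: ${total_cost:.2f}")
print(f"mean wall over {len(walls)} timed trials: {mean_wall:.1f} s")
```

Under that assumption the mean wall time comes out near 844.6 s over nine trials, in line with the 844 s in the summary table.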
Where 20% lands on the leaderboard
Top of the public Terminal-Bench 2.0 leaderboard, sampled the same day:
| Rank | Agent | Model | Score |
|---|---|---|---|
| 1 | vix | Claude Opus 4.7 | 90.2% |
| 2 | JJAgent | Multiple | 87.1% |
| 6 | Codex CLI | GPT-5.5 | 82.2% |
| 13 | Meta-Harness (Stanford IRIS) | Claude Opus 4.6 | 76.4% |
| 16 | Capy | Claude Opus 4.6 | 75.3% |
| 19 | Terminus-KIRA | Claude Opus 4.6 | 74.7% |
| — | deepagent-hermes (this run) | Claude Sonnet 4.5 | 20% (2/10 sample) |
Three caveats before reading too much into the gap:
- Model gap matters more than harness gap. The frontier agents (#1, #6, #13) all use the strongest available model for their family — Opus 4.7, GPT-5.5, Opus 4.6. Sonnet 4.5 is a tier weaker; the right same-model peer on the leaderboard would also need to be running Sonnet, not Opus.
- Optimization gap matters too. vix and JJAgent ship months of harness-level tuning — tmux session management, scratchpad disciplines, output-truncation policies, prompt engineering. The deepagent-hermes adapter is v1 and runs with hermes defaults. Closing that gap is a separate project.
- The sample is 10 tasks, not 89. Two successes in 10 gives a 95% Wilson interval of roughly 6% to 51% for the true rate, so “20%” is a point estimate, not a precise number. A full 89-task run is the only way to put a real number next to the leaderboard, but my Anthropic credits ran out during this sample and cost three trials (see the per-task breakdown above).
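To make the small-sample caveat concrete, here is a short sketch of the Wilson score interval for 2 successes in 10 trials. The choice of the Wilson method and the 95% level (z = 1.96) are my assumptions; other interval methods give somewhat different bounds.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half


low, high = wilson_interval(2, 10)
print(f"{low:.1%} to {high:.1%}")
```

With only 10 trials the interval spans most of the range between "barely works" and "solves half the suite", which is why the full 89-task run matters.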
What the number is good for: a public, reproducible reference point at v1. Anyone with the repo, an Anthropic key, and Docker can re-run examples/run_terminal_bench.sh 10 and get a comparable result in roughly an hour.
What broke and what didn’t
A reproduction-of-an-existing-system benchmark answers two questions: did the architecture port faithfully? and what did the port surface that pure unit tests missed? This run answered both.
Adapter bugs caught by Harbor that 442 passing unit tests missed:
- argv overflow on large upload_files writes. HarborSandboxBackend.upload_files was base64-encoding file content into a single shell command line. Linux caps a single argv string at 128 KiB (MAX_ARG_STRLEN); the agent tried to write a ~200 KB JavaScript file mid-task during make-mips-interpreter, the docker exec call blew up with OSError(7) "Argument list too long", and the trial crashed after 291 s with no reward and no token totals. Fix: chunked write through a temp file when the base64 exceeds 64 KB.
- AgentContext.metadata arrives as None. Harbor passes the agent a freshly-built context where metadata isn’t an empty dict but None. The error path immediately raised a TypeError when trying to log a crash, swallowing the actual error. Trivial fix (one-line init), but exactly the kind of integration-contract assumption a unit test doesn’t have a reason to challenge.
- SqliteSaver doesn’t implement aget_tuple. The agent’s LangGraph checkpointer is the sync SqliteSaver. await graph.ainvoke(...) against it raises NotImplementedError deep in the loop’s __aenter__. Every existing call site uses .invoke(), so no test caught it. The adapter routes through asyncio.to_thread(graph.invoke, ...): the agent runs sync on a worker thread, and the HarborSandboxBackend bridges back to the event loop via run_coroutine_threadsafe to call env.exec.
- Cost-estimate unit bug. My adapter exported HERMES_BENCH_COST_PER_INPUT_KTOK env vars, but the values were per-megatoken prices applied with a per-kilotoken divisor (1_000 instead of 1_000_000), and all input tokens were charged at the fresh-input rate instead of breaking out cache reads. The first job dutifully reported cost_usd = $30,575 for a $4.84 run. Renamed to _MTOK, split fresh-vs-cached, life makes sense again.
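The chunked-write fix for the argv overflow can be sketched as a generator of shell commands, each small enough to stay under the per-argument cap. This is an illustration under assumptions, not the adapter's code: the function name, the temp-file suffix, and the use of printf and base64 -d in the container are all hypothetical choices, and this version always chunks rather than switching only above 64 KB.

```python
import base64
import shlex
from typing import Iterator

CHUNK_B64_BYTES = 64 * 1024  # well under Linux's 128 KiB per-argument cap


def chunked_write_commands(dest: str, content: bytes,
                           chunk: int = CHUNK_B64_BYTES) -> Iterator[str]:
    """Yield shell commands that recreate `content` at `dest`, with no single
    command carrying more than `chunk` bytes of base64 payload."""
    encoded = base64.b64encode(content).decode("ascii")
    tmp = dest + ".b64.part"  # hypothetical temp-file naming
    yield f": > {shlex.quote(tmp)}"  # create or truncate the temp file
    for start in range(0, len(encoded), chunk):
        piece = encoded[start:start + chunk]
        yield f"printf %s {shlex.quote(piece)} >> {shlex.quote(tmp)}"
    # Decode the reassembled base64 into place, then clean up.
    yield (f"base64 -d {shlex.quote(tmp)} > {shlex.quote(dest)}"
           f" && rm -f {shlex.quote(tmp)}")


payload = b"x" * 200_000  # roughly the size that triggered the crash
commands = list(chunked_write_commands("/app/out.js", payload))
print(len(commands), max(len(c) for c in commands))
```

Each yielded command would be passed to the sandbox's exec one at a time; the longest stays near 64 KiB regardless of file size.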
Cost surprise (not strictly a bug, but a deployment trap):
The first smoke ran through OpenRouter. cache_read=0 on every model call: OpenRouter’s proxy strips Anthropic’s cache_control markers from the request body, so hermes’s ~30k-token system prompt (skill index + bundled toolset docs + frozen MEMORY.md/USER.md) paid full input price every turn. One task, make-mips-interpreter, burned 5 M input tokens (~$15.75) before finishing with reward 0.0.
Switching the same agent to direct Anthropic cut per-task cost from ~$15 to ~$0.48 — a ~30× drop. The next 10-task run hit 97.9% cache-read rate (9.5 M of 9.7 M input tokens), which is what hermes’s AnthropicCachingS3Middleware is supposed to deliver and what OpenRouter quietly broke. If you’re running long-prompt agents through a router, verify cache_read > 0 on the first turn before scaling.
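A rough cost model shows how much the cache-read split and the divisor both matter. The per-million-token prices below are my assumptions (they are not stated in this post), the token counts are the rounded totals from the summary table, and cache-write surcharges are ignored, so the outputs are only approximate.

```python
# Assumed prices in USD per million tokens (not taken from the post).
PRICE_FRESH_INPUT_MTOK = 3.00
PRICE_CACHE_READ_MTOK = 0.30
PRICE_OUTPUT_MTOK = 15.00


def estimate_cost(input_tokens: int, cache_read_tokens: int,
                  output_tokens: int) -> float:
    """Charge cache reads at the discounted rate; prices are per megatoken."""
    fresh = input_tokens - cache_read_tokens
    total = (fresh * PRICE_FRESH_INPUT_MTOK
             + cache_read_tokens * PRICE_CACHE_READ_MTOK
             + output_tokens * PRICE_OUTPUT_MTOK)
    return total / 1_000_000


def buggy_estimate(input_tokens: int, output_tokens: int) -> float:
    """Reproduce the unit bug: per-megatoken prices, per-kilotoken divisor,
    and every input token charged at the fresh rate."""
    total = (input_tokens * PRICE_FRESH_INPUT_MTOK
             + output_tokens * PRICE_OUTPUT_MTOK)
    return total / 1_000


fixed = estimate_cost(9_700_000, 9_500_000, 91_000)
buggy = buggy_estimate(9_700_000, 91_000)
print(f"fixed: ${fixed:,.2f}  buggy: ${buggy:,.2f}")
```

Under these assumptions the fixed estimate lands a little under $5 and the buggy one a little over $30,000, the same orders of magnitude as the $4.84 and $30,575 figures reported here.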
What the reflection loop did:
Zero skill mutations recorded across all 10 trials. That’s consistent with the system’s design — the reflection middleware fires the review subagent at every 10 tool iterations, but the subagent only writes a SKILL.md when it sees a repeated procedure worth saving, and the per-task HERMES_HOME isolation means no procedure has a chance to repeat across tasks. Inside a single task, the same procedure rarely repeats either; the agent edits one file, runs one compile, observes one error. The closed loop closes only when the agent (or the dataset) shows up with multi-task arcs.
The audit log captured everything anyway. After the run, each task’s state.db is browsable via hermes audit log — message history, reflection invocations, tool-call timings — without re-running anything. That’s a useful side-channel for post-mortems that the leaderboard doesn’t surface.
What’s next
The interesting follow-ups are about the reflection loop, not the score:
- The full 89. Top up credits, fix the four bugs surfaced here (argv, metadata-None, async-saver, cost-unit), and run the full Terminal-Bench 2.0 dataset. That gives a real number to put on the leaderboard rather than a 10-task sample.
- Warm-start across tasks. The reflection loop produced zero skill mutations because per-task isolation gives nothing a chance to repeat. With the audit log, I can dump the best-performing skill snapshot from one run and load it as the starting state for the next — and ask: does the agent’s self-curated skill library on a fresh suite produce a measurably different score than running with zero skills? If yes, hermes’s value proposition has empirical support beyond “tests pass.”
- Same-model Terminus-2 comparison. Harbor ships Terminus-2 (a tmux-driven ReAct agent) as the reference. Running both Terminus-2 and deepagent-hermes on the same tasks with the same model isolates adapter and reflection-loop overhead from model strength — that’s the comparison that actually answers “did the architecture port well?”
Repo: https://github.com/dkedar7/deepagent-hermes. The adapter is at examples/terminal_bench.py; the launcher is examples/run_terminal_bench.sh.