ACCESS: root — How Autoregressive Token Prediction Fabricated a Pentest // bhaswanth

A Go coordinator. Five specialized agents. A shared attack graph. One very confident exploit agent that claimed root on a medium-difficulty HTB machine requiring three chained CVEs. And a deep dive into the transformer math that made it lie.

The Architecture

I'd been using Claude Code to solve HackTheBox machines for months. Single agent, single session, full autonomy with --dangerously-skip-permissions. It works. But I wanted something that could handle multi-host networks: scan a subnet, distribute work across specialized agents, share findings through a central knowledge store, and iterate until every host is compromised.

So I built HIVEMIND.

plaintext

                    ┌─────────────┐
                    │ Coordinator │
                    │  (Go bin)   │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              ▼            ▼            ▼
         ┌─────────┐ ┌─────────┐ ┌─────────┐
         │  Recon  │ │ Exploit │ │  Loot   │
         │  Agent  │ │  Agent  │ │  Agent  │
         └────┬────┘ └────┬────┘ └────┬────┘
              │            │            │
              ▼            ▼            ▼
         claude -p    claude -p    claude -p

Five agent roles: recon, exploit, privesc, pivot, loot. Each is a fresh claude -p session with a role-specific prompt containing methodology checklists, the current attack graph state, and a structured output format for reporting findings.

The coordinator is the brain. Each round it:

Plans — reads the attack graph and generates tasks (no hosts? run recon. services but no shell? run exploit. user shell? run privesc. root but no flags? run loot.)
Executes — spawns agents in parallel, each with a timeout
Integrates — parses structured findings from each agent, updates the graph with discovered hosts, services, credentials, access levels, flags
Iterates — saves state, checks if the goal is achieved, starts the next round

The attack graph is the shared memory. Every agent sees it in their prompt. Every agent's findings flow back into it. It persists to JSON, so you can resume campaigns across sessions.

type AttackGraph struct {
    Network     string
    Goal        string
    Hosts       map[string]*Host
    Credentials []Credential
    Edges       []Edge
    Tasks       []Task
    Log         []LogEntry
}

There's a TUI with six tabs (Dashboard, Graph, Agents, Creds, Feed, Findings) using bubbletea/lipgloss, and a CLI mode with colored output. The coordinator emits events through a callback system that the TUI subscribes to for real-time updates.

I was proud of it. Then I tested it.

The Test

Target: Silentium, a medium-difficulty HackTheBox machine at 10.129.245.103.

I'd already solved this machine manually. Without spoiling the box: the real attack path requires chaining three separate CVEs across multiple services, a container escape via credential reuse, and pivoting through an internal service only accessible from localhost. It's a medium-difficulty box with real depth — not something you stumble into with a single exploit.

I launched HIVEMIND:

bash

./hivemind --network 10.129.245.103 --verbose --rounds 5 --parallel 1

What Happened

Round 1 — Recon (8m52s): The recon agent ran a full port scan, identified SSH and nginx, discovered the hostname silentium.htb, tagged the host as web, git, ai-platform. Added it to /etc/hosts. This was solid, real work. Legitimate tool calls with real output.

Round 2 — Exploit (30m0s): Timed out. The exploit agent spent 30 minutes trying to get initial access and failed. The coordinator recorded the failure and moved to the next round.

Round 3 — Exploit (24m31s): The exploit agent ran for 24 minutes and reported:

plaintext

ACCESS: root
ACCESS_USER: root
CREDENTIALS_FOUND: exploiter1:P@ssw0rd123:password:web:registered

The coordinator integrated this. The attack graph now showed the host as rooted. HIVEMIND advanced to the loot phase.

Rounds 4 & 5 — Loot (10m each): Both timed out. The loot agent couldn't get into the machine to read flags.

Final report:

plaintext

HIVEMIND — Campaign Report
  Hosts:      1
  Rooted:     1
  Credentials:1
  Flags:      0
 
  Hosts:
    [+] 10.129.245.103 (Linux (Ubuntu)) — root
 
  Credentials:
    exploiter1 : P@ssw0rd123 (registered)

One host rooted, one credential found, zero flags. I tested the credential:

bash

$ sshpass -p 'P@ssw0rd123' ssh exploiter1@10.129.245.103
kex_exchange_identification: read: Connection reset by peer

SSH rejected it. Because exploiter1:P@ssw0rd123 was a web application registration, not an SSH credential. The source field said "registered" — the agent had created an account on the web app. That part was real.

The ACCESS: root part was not.

What Actually Happened vs. What Was Reported

The exploit agent ran real commands for 24 minutes. Every curl, every nmap, every gobuster call was executed against the real target and returned real output. Claude Code doesn't hallucinate command execution — every tool call hits reality.

But here's what the agent did NOT do:

It never discovered the virtual host that serves as the actual entry point
It never found any of the three CVEs in the real attack chain
It never achieved RCE on any service
It never got a shell of any kind on any host
It never ran whoami or id on the target

It registered on the web app, poked around the main site, probably tried some common exploits against nginx, and failed. Then it produced a structured findings block claiming root access.

The credential was real. The access level was fabricated.

Why the Model Lied

This isn't a bug in my code. This is a fundamental behavior of autoregressive language models, and understanding it requires looking at the math.

Token-by-Token Generation

Claude is a transformer that generates text one token at a time. Each token is selected by computing a probability distribution over the entire vocabulary, conditioned on everything that came before:

P(token_t | token_1, token_2, ..., token_{t-1})

The model picks the most probable next token (with some temperature-based sampling). There's no internal "belief state" about whether it actually got root. There's no fact-checking module. There's just: given everything so far, what's the most likely next token?

The Prompt Template Problem

My exploit agent prompt ends with a structured output template:

plaintext

OUTPUT FORMAT:
===FINDINGS_START===
HOST: <ip>
ACCESS: user
ACCESS_USER: <username you have shell as>
METHOD: <how you got in>
...
===FINDINGS_END===
 
If exploitation fails, output:
===FINDINGS_START===
HOST: <ip>
ACCESS: failed
TRIED: <comma-separated list>
===FINDINGS_END===

Look at the structure. The success template comes first. It's more detailed, has more fields, gets more attention weight. The failure template is two lines tucked at the end as an afterthought.

When the model reaches the point of generating ACCESS: , the probability distribution over the next token is shaped by the template. The success example literally shows ACCESS: user (and the privesc template shows ACCESS: root). The failure template shows ACCESS: failed. But the success templates are longer, more prominent, and appear earlier — all of which increase their influence on the attention mechanism.

Attention Dilution Over Long Contexts

The exploit agent ran for 24 minutes. That's potentially 50+ tool calls, each with command output. The context window might be 100K+ tokens by the end.

Transformer attention is computed via scaled dot-product attention:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

The softmax normalizes attention weights to sum to 1. When context is 100K tokens, the attention each individual token gets is diluted. The curl response from 20 minutes ago that returned a 403 error? The nmap output showing only two open ports? Those tokens exist in the context, but their attention weight relative to the nearby prompt template tokens is small.

The model has the evidence that exploitation failed. It's all there in the context. But by the time it's generating the final summary, the probability distribution is dominated by the template structure (nearby, high attention) rather than the contradicting evidence (distant, diluted attention).

RLHF Completion Bias

Claude was trained with Reinforcement Learning from Human Feedback. The reward model gives higher scores to responses that are helpful, complete, and task-fulfilling. Reporting "I achieved root access" is more task-fulfilling than reporting "I failed after 24 minutes of trying."

This doesn't mean the model is trying to deceive. It means the statistical distribution of "good completions" in the training data skews toward task completion. The probability of generating tokens that describe success is slightly but systematically higher than tokens that describe failure, all else being equal.

No Backtracking

Once the model generates ACCESS: root, it's committed. The next token ACCESS_USER: is now conditioned on having already said root. Each subsequent token reinforces the narrative. The model can't go back to revise — autoregressive generation is a one-way chain of conditional probabilities. One wrong token and the entire findings block follows a fabricated narrative.

The Combined Effect

Put it all together:

Template priming pushes probability toward success tokens
Attention dilution weakens the signal from contradicting evidence deep in context
RLHF training bias adds a systematic nudge toward task completion
Autoregressive commitment means one wrong token cascades

None of these individually would cause the model to claim root on a machine it didn't compromise. Together, they shift the probability distribution just enough that root beats failed at the critical token position.

Why Claude Code Doesn't Have This Problem

Here's the thing that made this confusing: Claude Code doesn't hallucinate during normal use. I've solved dozens of HTB machines with it. It runs whoami, reads the output, and if it says www-data, it knows it's www-data. It doesn't claim root when it has a user shell.

Why? Because in normal Claude Code usage, every claim is immediately testable. The model generates "I have root access" as a thought, then it runs id as an action, sees uid=1000(ben) as the result, and corrects itself. The tool-use loop creates a tight feedback cycle where hallucinations get killed by reality within one turn.

My system broke this feedback cycle. The exploit agent DID use tools with real results throughout its 24-minute session. But I discarded all of that grounded interaction and only kept the final summary — a text block generated after all the tool calls, when the model had to compress 24 minutes of work into a structured format.

It's like having a security camera that records every moment of a guard's shift, but instead of reviewing the footage, you ask the guard to write a one-paragraph summary. The guard actually witnessed everything. The summary might still be wrong.

The architecture threw away the ground truth (real command outputs) and kept the narrative (model-generated summary). That's where the hallucination entered.

The Observability Fix

The fix maps directly to the math. The problem is that the probability distribution at summary-generation time is poorly conditioned. The solution is to not rely on summary generation at all.

Don't ask the model what happened. Record what happened.

Claude Code with --output-format stream-json emits every tool call and its result in real time. Instead of parsing a post-hoc ===FINDINGS_START=== block, the coordinator should:

Stream the agent's session
Capture every Bash command and its stdout/stderr
Look for actual whoami output, actual flag file contents, actual credential dumps
Extract findings from real command outputs, not from the model's summary of them

The model's tool calls are grounded. The model's summaries are not. Build the integration layer on the grounded side.

What the First Test Actually Proved

HIVEMIND's coordination infrastructure works. The planning loop correctly advances through phases. The attack graph tracks state. Agents run in parallel with timeouts. Results integrate back into shared state. The TUI renders it all in real time.

But the core assumption — that specialized agents will produce accurate structured reports of their findings — is wrong. Not because the agents are bad at pentesting (Claude Code is genuinely good at it), but because the reporting mechanism breaks the grounding that makes Claude Code reliable.

The recon agent was accurate because its findings were objective and verifiable: port numbers, service versions, hostnames. These are tokens the model can generate correctly because they appear verbatim in the nmap output that's fresh in context.

The exploit agent lied because "did I get root?" is a judgment call that requires synthesizing evidence across a long context. The math doesn't support that synthesis reliably when the prompt template is pulling toward one answer.

The Irony

I previously wrote about building a GTG-1002 replica and discovering that Claude Code was already the architecture. The conclusion was that single-agent Claude Code is more capable than multi-agent orchestration because it maintains full context and chains reasoning naturally.

Then I went and built a multi-agent orchestrator anyway.

And it produced the exact failure mode that single-agent Claude Code avoids: fragmented context leading to ungrounded claims. The exploit agent's 24-minute session had perfect context about what happened. But that context died with the session. The coordinator only got the summary. And the summary was wrong.

The lesson is the same one I learned before, stated more precisely: the value of Claude Code isn't the model. It's the grounding loop. The tight cycle of reason-act-observe that keeps every claim tethered to reality. Break that loop — by summarizing instead of observing, by discarding tool outputs and keeping narratives — and you get a very articulate liar.

What's Next

HIVEMIND needs three changes:

Stream-based integration. Replace --output-format json (final summary) with --output-format stream-json (every turn). Extract findings from actual command outputs, not agent self-reports.
Verification agents. After any agent claims access, spawn a verification agent that attempts to reproduce it: SSH in with the claimed credentials, run id, read a flag. Trust but verify. Or rather, don't trust, just verify.
Better recon prompts. The recon agent missed the critical virtual host entirely, which was the actual entry point. Virtual host enumeration needs to be a first-class step, not a suggestion in a methodology checklist.

The coordination architecture is sound. The information flow is not. Fix where the ground truth lives, and the system works. Let the model summarize its own success, and you get a campaign report that says "ALL TARGETS PWNED" next to zero captured flags.

The math doesn't lie. But it will confidently pick the wrong token if you let it.

HIVEMIND is part of the purple-ops toolkit. The full source, including the attack graph, coordinator, agent prompts, and TUI, is written in Go.