After watching Claude Opus 4.6 pwn the Facts machine in 22 minutes, I wanted to see if it was a fluke or a pattern. Same setup, different machine, no changes to the agent.

The target: Interpreter, a medium-difficulty Linux (Debian 12) machine on HackTheBox. This one was released just the day before, so there were no public writeups or walkthroughs to pull from. Claude was going in completely blind.

The tool: Claude Opus 4.6 running inside Claude Code v2.1.42 with the same htb-pentester agent. Between runs, I refined the CLAUDE.md system prompt a little, adding more context about methodology and tool usage. Nothing drastic, just small tweaks based on what I observed in the first run.

The result: Both flags captured in 19 minutes and 56 seconds. Even faster than the first run.

Claude Opus 4.6 autonomously pwning the Interpreter machine

The Numbers

Metric	Value
Total Time	19 minutes 56 seconds
Tool Calls	112
Tokens Used	102,600
Human Intervention	Zero
Flags Captured	2/2

112 tool calls in under 20 minutes. That's roughly 5.6 actions per minute of continuous exploitation with no breaks, no confusion, and no human guidance.

Compared to the Facts machine:

Metric	Facts	Interpreter
Time	22m 30s	19m 56s
Tool Calls	132	112
Tokens	108,200	102,600

Fewer tool calls, fewer tokens, faster completion. The agent was more efficient this time around.

What It Did

Claude worked through the full pentesting methodology again: reconnaissance, service enumeration, vulnerability identification, exploitation, post-exploitation, and privilege escalation. All autonomously.

This machine had a healthcare integration engine exposed on the web. Claude fingerprinted the exact version through an API endpoint, identified a known pre-auth RCE, gained a shell, and started digging through config files. It pulled database credentials, enumerated internal services listening on localhost, and discovered a root-owned Python application accepting user input through the healthcare pipeline.

The approach was surgical. Instead of brute forcing its way through, it read config files, followed credential trails across services, and mapped out internal infrastructure that wasn't visible from the outside.

What Stood Out This Time

It Handled Failed Exploits Gracefully

During the attack, Claude attempted to use Metasploit for the initial foothold. The Metasploit module failed with exit code 144. Without hesitation, it pivoted to a direct exploitation approach targeting the same CVE through the application's API. The alternative path worked. No wasted time asking for help or getting stuck in a loop.

Parallel Task Management

Just like the first run, Claude spawned background tasks for service enumeration while the main exploitation chain progressed. Multiple port scans and service checks ran concurrently. By the time the main thread needed that information, the background tasks had already completed.

Organized Evidence Collection

This time I asked it to create a dedicated folder structure for the target. It immediately set up organized directories for nmap scans, loot, exploits, and screenshots before even starting the pentest. It saved the user flag to a text file the moment it got user access, exactly as instructed. After completion, all findings were documented in a structured notes file with the full walkthrough.

It Found a Creative Privilege Escalation

The path from user to root wasn't a standard GTFOBins entry or simple sudo misconfiguration. Claude discovered a Python application running as root on an internal port. It probed the application, noticed error messages leaking implementation details, and identified that user-controlled input was being passed into a dangerous evaluation context. It crafted a payload that exploited this to execute arbitrary code as root. That's the kind of finding that requires understanding application logic and recognizing subtle code injection patterns, not just running sudo -l.

The Pattern

Two machines. Two clean runs. Both under 23 minutes.

Machine	Difficulty	Time	Tool Calls	Result
Facts	Easy	22m 30s	132	Pwned
Interpreter	Medium	19m 56s	112	Pwned

The consistency is what's interesting. It's not randomly stumbling into flags. The agent follows a structured methodology every time: scan, enumerate, identify vulnerabilities, exploit, escalate. When something fails, it adapts. When it finds credentials, it reuses them. When it discovers internal services, it investigates them.

Still Unanswered

The two machines so far have been Linux targets with web-facing services. The real test will come from:

Windows / Active Directory environments where the attack surface is completely different
Hard and Insane rated machines where the path isn't straightforward
Machines with rabbit holes that waste time on false leads
Custom exploits where no public CVE exists

But two for two is a strong start. The agent isn't just running tools. It's thinking through attack chains, adapting when paths fail, and doing it all faster than most humans.

Machine: Interpreter (HackTheBox) · Difficulty: Medium · OS: Linux (Debian 12) · Status: Pwned Tool: Claude Opus 4.6 · Claude Code v2.1.42 · Fully Autonomous Agent