
Credential-Blind Agentic Pentesting — Part I: Bidirectional Tokenization of Secrets, Identities and Topology

Author: Elliot Belt
I’m Felix Billières, a pentester working under the alias Elliot Belt. I play CTFs with the Phreaks 2600 team and am currently interning as a Purple Teamer. I am passionate about Active Directory, web pentesting and bug bounty, and building offensive and defensive tools.

This is the first article in a new series. The goal of the series is narrow and, I think, important: build a pipeline that lets an LLM agent run real red team and blue team audits without ever leaking anything sensitive to the model provider. Not the credentials I hand it, not the credentials it discovers, not the hostnames, not the IPs, not the domains. And the property has to hold whatever provider sits behind the API, because in a real engagement I do not get to choose where the prompt ends up logged.

This part is the foundation. It establishes the problem, surveys what already exists, builds the core mechanism, and tests it against real machines. Later parts will push on multi provider validation, adversarial robustness, deeper attack chains, and the integration into a real tool. Everything here is reproducible, and where something failed I left the failure in.


Table of Contents

  1. Why I started this
  2. The threat model: the adversary is the recipient
  3. What already exists, and the gap
  4. The mechanism: bidirectional tokenization
  5. Building it: the vault, the detector, the proxy
  6. Experiment 1: detection on a real box
  7. Experiment 2: a real model in the loop
  8. Experiment 3: a provided credential, live
  9. Experiment 4: an autonomous agent in total substitution
  10. What we achieved
  11. What is still exposed, honestly
  12. Where the series goes next
  13. Sources

Why I started this

When you let an LLM drive a penetration test, you point a model provider’s logging, retention and training pipeline straight at the most sensitive data an engagement produces. The system prompt, the conversation history, and above all the tool output all flow to the provider. And tool output during a pentest is, by definition, full of secrets: password hashes, cracked passwords, Kerberos tickets, private keys, connection strings, plus the identities and the topology of the target.

People feel this, and they usually answer it with a policy: do not paste secrets into the chat. That is not an answer for an agent. An agent reads secretsdump output to decide what to do next. The secret is in the loop because the secret is the point.

So the question I wanted to answer is simple to state:

How do I let the model say “use the Administrator hash for a pass-the-hash over WinRM” and have that command actually run, when the model never saw the hash?

The rest of this article is the first serious attempt at that.

A note on scope. Everything here runs against retired or lab HackTheBox machines, in an authorized container, with credentials provided for the lab. The goal is never to root the box. The box is a generator of real secrets in a realistic environment, so I can measure leakage on something other than a toy fixture.


The threat model: the adversary is the recipient

The unusual thing about this problem is that the adversary is not an intruder. It is the legitimate recipient of the data: the model provider, and everything downstream of the API call.

| Adversary | Surface | Why it leaks |
| --- | --- | --- |
| Model provider | Prompt logging, retention, training, legal discovery, subprocessors | The secret is in the payload |
| Provider breach | Leaked logs or stores | Prompts retained in clear |
| Third party MCP server | Tool args and results pass through it | A curious or malicious server sees them |
| Local logs and traces | Debug files, session exports, telemetry | The secret is serialized in the transcript |
| Prompt injection | A target convinces the model to exfiltrate | The model holds the value, so it can send it |
| Subagents | Shared context propagates the secret | Inheritance |

The Hermes Agent project put the principle in one line that I keep coming back to (NousResearch hermes-agent #410):

If an agent can see it, it can leak it.

There is now academic weight behind that sentence. Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents (arXiv 2506.10171) shows that agents leak information they legitimately hold through ordinary multi turn conversation, with no jailbreak required. The attacker asks innocent questions and the protected context comes out over several turns. Their conclusion is the design north star for this whole series:

Security cannot rely on explicit restrictions. Robustness requires isolating sensitive context at the architectural level.

That is the difference between telling the model “do not reveal the password” and making sure the password is not in the context at all. The first is a behavioral defense and it fails under pressure. The second is structural. If the value is absent, there is nothing to extract.

What I explicitly do not defend against in this part: a compromised host. If an attacker is root on the operator machine, the in-memory mapping is readable and the game is over. I protect the flow to the provider, not a host that is already lost. I also do not try to hide secrets from the operator. The human runs the engagement and sees everything. It is the provider I am blinding.


What already exists, and the gap

I read the field before writing a line of code. Three families of solutions exist, and they are complementary but each incomplete for my case.

Zero knowledge injection. The secret never enters the agent context. The agent references a name, and resolution happens outside the agent at the transport layer. AgentSecrets is the clean expression of this: a local proxy injects real credential values into outgoing HTTP requests, with a domain allowlist, response body redaction, and an SDK that deliberately has no get() method. The slogan is honest: you cannot steal what was never there. The Hermes issue proposes the same idea for env vars, with an inject action that exposes a secret to the next command only. This is excellent for provided credentials, the API keys of the world.

Redaction and tokenization. The secret passes through the pipeline but is replaced by a typed placeholder before the model sees it, with a reverse map held in memory. LLM-Redactor (arXiv 2604.12064) evaluates eight techniques and shows that plain redaction (their Option B, regex plus NER) runs in under fifty milliseconds, leaks zero percent on structured PII, and actually saves tokens because placeholders are shorter than the originals. It also names the hard wall: implicit identity (role, relationships, organizational context) survives both redaction and rephrasing, because removing it destroys utility.

Defense in depth. Least privilege, scoping, short lived tokens, rotation, redaction at the IDE boundary. Knostic is the policy oriented version of this. It is orthogonal to the core and I treat it as a complementary layer, not the mechanism.

The general direction is well established. Doppler and the awesome-llm4privacy collection both converge on the same three ingredients: deterministic tokenization (same value gives the same token), consistent typed placeholders, and pre prompt redaction with secure resolution at runtime.

Here is the gap. Every one of these targets provided secrets, the ones you know in advance. Pentest is different in two ways that break the model:

  1. Most secrets are discovered, not provided. A hash in a dump, a password in a config, a ticket. They appear inside the tool output the model has to read. You cannot keep them from entering, they are already in the stream.
  2. The agent has to reuse them, in arbitrary later commands, not in a known HTTP header. It finds a hash and then wants to pass-the-hash with it.

So I need redaction for the discovered case (the model must not see the value) but it has to be bidirectional (the model must still be able to use it). Nobody in the literature covers that cleanly. That is the contribution of this part.


The mechanism: bidirectional tokenization

The core is a loop. A secret is detected on its way out to the model, replaced by a stable typed token, and the real value is stored host side. When the model later emits that token inside a command, the host substitutes the real value back before the command runs. Then the new output is scanned again, because tools love to echo the secret you just used.

   HOST (trusted)                                              PROVIDER (blind)
   --------------                                              ----------------
   tool output:  svc:1103:aad3b...:32196b...bf38:::
        |
        | (1) detect            32196b...bf38  (NTLM)
        | (2) intern            EXEGOL_SECRET_NTLM_1  ->  32196b...bf38
        | (3) redact
        v
   svc:1103:aad3b...:EXEGOL_SECRET_NTLM_1:::   ----------->   (4) model reads tokens only,
                                                              reasons: "pass-the-hash with
                                                              EXEGOL_SECRET_NTLM_1"
                                                                    |
   exec("evil-winrm -u Administrator -H EXEGOL_SECRET_NTLM_1")  <---+  (5) model emits a command
        |
        | (6) resolve in argv  ->  evil-winrm -u Administrator -H 32196b...bf38
        | (7) run for real
        | (8) new output  ->  back to step 1 (re-redact)
        v
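
The loop above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the proof of concept itself: `MiniVault`, its regex and the sample NT hash are hypothetical, and only the NT half of an `LM:NT` pair is tokenized, as in the diagram.

```typescript
// Minimal round trip: detect -> intern -> redact -> resolve -> re-redact.
// MiniVault is an illustrative name, not the article's implementation.
class MiniVault {
  private byValue = new Map<string, string>(); // real value -> token
  private byToken = new Map<string, string>(); // token -> real value
  private count = 0;

  intern(value: string): string {
    const known = this.byValue.get(value);
    if (known !== undefined) return known; // same value -> same token
    this.count += 1;
    const token = "EXEGOL_SECRET_NTLM_" + this.count;
    this.byValue.set(value, token);
    this.byToken.set(token, value);
    return token;
  }

  // Steps 1 to 3: tokenize the NT half of any LM:NT pair, then propagate
  // every already-known value by exact match (this also covers step 8).
  redact(text: string): string {
    const pairRe = /\b([0-9a-f]{32}):([0-9a-f]{32})\b/gi;
    let out = text.replace(pairRe, (_m: string, lm: string, nt: string) => lm + ":" + this.intern(nt));
    for (const [value, token] of Array.from(this.byValue.entries())) {
      out = out.split(value).join(token);
    }
    return out;
  }

  // Step 6: swap tokens back for real values, longest token first.
  resolve(text: string): string {
    const tokens = Array.from(this.byToken.keys()).sort((a, b) => b.length - a.length);
    let out = text;
    for (const t of tokens) out = out.split(t).join(this.byToken.get(t) as string);
    return out;
  }
}

const demoVault = new MiniVault();
const fakeNt = "0123456789abcdef0123456789abcdef"; // made-up hash, not a real one
const toModel = demoVault.redact("svc:1103:aad3b435b51404eeaad3b435b51404ee:" + fakeNt + ":::");
const toRun = demoVault.resolve("evil-winrm -u Administrator -H EXEGOL_SECRET_NTLM_1");
const echoed = demoVault.redact("auth with hash " + fakeNt); // tool echo, re-redacted
```

Here `toModel` carries only the token, `toRun` carries the real value and never leaves the host, and `echoed` shows the propagation step catching a bare hash that the pair regex alone would miss.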

Five design decisions fall out of the literature and out of common sense:

  • Typed, stable placeholders. Typed so the model keeps utility (it knows this is an NTLM hash for a given account). Stable so the same value maps to the same token across the whole session, which keeps multi turn reasoning coherent. LLM-Redactor and Doppler both insist on this.
  • The reverse map is host only and ephemeral. It lives in memory, so a crashed process cannot de-redact. This is the same property the LLM-Redactor shim uses.
  • Layered detection. Known value match, then structured regex (keys, hashes, tickets), then tool output aware parsers, then a Shannon entropy net for the unstructured leftovers. Fail closed: when in doubt, redact. A false positive costs context, which the operator can recover. A false negative costs a secret, which is gone.
  • Provided secrets get interned once. A credential the operator hands me does not need detection. I know its value, so I intern it at session start and it propagates everywhere by exact match.
  • An adjustable aggressiveness axis. Credentials are always redacted. Topology (domains, hostnames, IPs) is the implicit identity wall from LLM-Redactor, so it is configurable. For this article I push it all the way to total substitution because that is what the property demands, and I wanted to see if utility survives. It does, mostly, and I will show where it does not.

On the token format, I made one bet and tested it. I use EXEGOL_SECRET_NTLM_1 and EXEGOL_HOST_1, in the shape of an environment variable. The hypothesis was that models reproduce ALL_CAPS_SNAKE identifiers more faithfully than the rare Unicode brackets the paper suggests, because they treat them as code identifiers. That hypothesis held in every run below.
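
To make the token shape concrete, here is a small sketch of a formatter and a matcher. The grammar is inferred from the examples in this article (`EXEGOL_SECRET_NTLM_1`, `EXEGOL_HOST_1`, `EXEGOL_DOMAIN_1`, `EXEGOL_USER_1`), so treat it as an assumption. The function names are hypothetical.

```typescript
// Token grammar inferred from the examples in this article (an assumption).
type EntityType = "host" | "domain" | "user";

function secretToken(secretType: string, n: number): string {
  return "EXEGOL_SECRET_" + secretType.toUpperCase() + "_" + n;
}

function entityToken(type: EntityType, n: number): string {
  return "EXEGOL_" + type.toUpperCase() + "_" + n;
}

// Greedy digits plus a trailing word boundary, so EXEGOL_HOST_10 is matched
// as one token and never as EXEGOL_HOST_1 followed by a stray "0".
const TOKEN_RE = /\bEXEGOL_(?:SECRET_[A-Z0-9]+|HOST|DOMAIN|USER)_\d+\b/g;

function findTokens(text: string): string[] {
  return text.match(TOKEN_RE) ?? [];
}

// Tokens the model emitted that the vault never issued (for example a typo).
function unknownTokens(text: string, issued: Set<string>): string[] {
  return findTokens(text).filter((t) => !issued.has(t));
}
```

A check like `unknownTokens` is one way to measure reproduction fidelity: a command that contains a token the vault never issued should be rejected before it runs, not resolved.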


Building it: the vault, the detector, the proxy

The proof of concept is plain TypeScript, no dependencies, runnable with npx tsx. It is deliberately separate from the extension I want to eventually build this into, so the research stays clean.

The vault holds the mapping and does the substitution in both directions. It is idempotent by hashing the value, which is what gives stable tokens.

intern(value, type, opts = {}) {
  const h = sha256(value);
  const existing = this.byHash.get(h);
  if (existing) return existing;               // same value -> same token (coherence)
  const n = (this.counters.get(type) ?? 0) + 1; // per-type counter (assumed): NTLM_1, NTLM_2, ...
  this.counters.set(type, n);
  const { label, provenance } = opts;
  const token = this.fmt(type, n, label);      // EXEGOL_SECRET_NTLM_1 / EXEGOL_HOST_1
  this.byToken.set(token, { value, type, label, provenance });
  this.byHash.set(h, token);
  return token;
}

redactKnown(text) {                            // longest values first, no substring leak
  for (const { value, token } of this.knownValuesByLengthDesc())
    text = text.split(value).join(token);
  return text;
}

restoreText(text) {                            // token -> real value, host side only
  // longest tokens first, so EXEGOL_HOST_1 cannot clobber the prefix of EXEGOL_HOST_10
  const entries = [...this.byToken].sort((a, b) => b[0].length - a[0].length);
  for (const [token, entry] of entries)
    text = text.split(token).join(entry.value);
  return text;
}

list() { /* tokens, types, labels, provenance, but never the value */ }
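
The "longest values first" comment deserves a worked case. The sketch below is a standalone illustration, not the vault code: `redactKnownSorted`, `redactKnownUnsorted` and the second password are hypothetical, and it only shows what goes wrong when a shorter known value is a prefix of a longer one.

```typescript
type KnownPair = [string, string]; // [real value, token]

// Correct order: longest value first, so the longer secret is replaced whole.
function redactKnownSorted(text: string, known: KnownPair[]): string {
  const ordered = known.slice().sort((a, b) => b[0].length - a[0].length);
  let out = text;
  for (const [value, token] of ordered) out = out.split(value).join(token);
  return out;
}

// Naive order: insertion order, shown only to expose the leak.
function redactKnownUnsorted(text: string, known: KnownPair[]): string {
  let out = text;
  for (const [value, token] of known) out = out.split(value).join(token);
  return out;
}

const knownPairs: KnownPair[] = [
  ["s3rvice", "EXEGOL_SECRET_PASSWORD_1"],
  ["s3rvice2024", "EXEGOL_SECRET_PASSWORD_2"], // hypothetical second password
];

const safeLine = redactKnownSorted("evil-winrm -p s3rvice2024", knownPairs);
const leakyLine = redactKnownUnsorted("evil-winrm -p s3rvice2024", knownPairs);
```

In the unsorted case the shorter value matches first and the suffix `2024` survives next to the wrong token, which is both a partial leak and a broken mapping.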

The detector is layered. The interesting part is that the strongest layer is not a clever regex, it is propagation: once a value is known, every later occurrence dies by exact match. The regex and entropy layers exist only to make a value known the first time.

// 1. already known to the vault         -> recall 100% on the known
// 2. structured regex (AD + dev/cloud)  -> NTLM, krb5, PEM, JWT, AKIA, gh_, connection strings
// 3. tool-output-aware parsers          -> hashcat potfile <hash>:<plaintext>, secretsdump lines
// 4. Shannon entropy (opt-in)           -> the unstructured leftovers, off by default
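
Layer 2 can be pictured as a small registry of typed patterns. The sketch below is an assumption about the shape, not the actual detector: the pattern list is a tiny subset, the names `PATTERNS` and `detectStructured` are hypothetical, and the regexes are deliberately simple.

```typescript
interface Detection { type: string; value: string; }
interface Pattern { type: string; re: RegExp; group: number; }

// A few structured formats. Each entry says which capture group holds the secret.
const PATTERNS: Pattern[] = [
  { type: "NTLM", re: /\b[0-9a-f]{32}:([0-9a-f]{32})\b/gi, group: 1 },  // LM:NT pair, keep the NT half
  { type: "AWS_KEY_ID", re: /\b(AKIA[0-9A-Z]{16})\b/g, group: 1 },      // AWS access key ID
  { type: "JWT", re: /\b(eyJ[\w-]+\.[\w-]+\.[\w-]+)/g, group: 1 },      // three base64url segments
];

function detectStructured(text: string): Detection[] {
  const found: Detection[] = [];
  for (const p of PATTERNS) {
    p.re.lastIndex = 0; // global regexes keep state between exec() calls
    let m: RegExpExecArray | null;
    while ((m = p.re.exec(text)) !== null) {
      found.push({ type: p.type, value: m[p.group] });
    }
  }
  return found;
}
```

Each detection is then handed to the vault once. After that the regex no longer matters for that value, because layer 1 catches every later occurrence by exact match.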

The proxy is the choke point for total substitution. It adds entity detection (IP, FQDN, DOMAIN\user) on top of secret detection, and it derives short forms so a NetBIOS name does not slip past a fully qualified one. The two methods that matter:

redactInbound(text) {                  // everything heading to the model
  // for...of instead of .forEach: matchAll returns an iterator, and the regexes need the g flag
  for (const s of detectSecrets(text)) intern(s.value, s.type);   // assumes detections carry { value, type }
  for (const m of text.matchAll(IP_RE)) intern(m[0], "host");
  for (const m of text.matchAll(FQDN_RE)) intern(m[0], "domain");
  for (const m of text.matchAll(DOMAIN_USER_RE)) { intern(m[2], "user"); intern(m[1], "domain"); }
  for (const m of text.matchAll(NAME_FIELD_RE)) intern(m[1], "host");   // (name:DC01)
  for (const m of text.matchAll(NXC_USER_RE)) intern(m[1], "user");     // nxc --users table
  return vault.redactKnown(text);      // propagate, longest first
}

resolveOutbound(text) { return vault.restoreText(text); }  // commands from the model
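
The short form derivation can be sketched as follows. This is an illustration under assumptions: the rule (first DNS label, matched case-insensitively as a whole name, mapped to the same token as the long form) and the sample name `dc01.darkzero.htb` are mine, not taken from the proof of concept.

```typescript
// First DNS label of a fully qualified name, or null if there is no dot.
function shortForm(fqdn: string): string | null {
  const first = fqdn.split(".")[0];
  return first.length > 0 && first !== fqdn ? first : null;
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace a name case-insensitively, only when it is not part of a longer name.
function redactName(text: string, name: string, token: string): string {
  const re = new RegExp("(?<![A-Za-z0-9-])" + escapeRegExp(name) + "(?![A-Za-z0-9-])", "gi");
  return text.replace(re, token);
}

// Long form first, then its short form, both mapped to the same token.
function redactHost(text: string, fqdn: string, token: string): string {
  let out = redactName(text, fqdn, token);
  const short = shortForm(fqdn);
  if (short !== null) out = redactName(out, short, token);
  return out;
}

const nxcLine = "SMB dc01.darkzero.htb 445 DC01 (name:DC01)";
const redactedLine = redactHost(nxcLine, "dc01.darkzero.htb", "EXEGOL_HOST_2");
```

The order matters for the same reason as in the vault. If the short form ran first, the fully qualified name would be half replaced and the domain suffix would be left attached to a token.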

One principle runs through all of it and it cost me a rewrite: never hardcode an IP, a domain, a credential or an account. The only thing known at session start is what the operator provides. Everything else is discovered at runtime and substituted on the fly. Hardcoding an entity tunes the tool for today’s target, which is the exact opposite of the goal. I will come back to why this bit me.


Experiment 1: detection on a real box

First box: HTB Forest, a Windows domain controller. Its public walkthrough gives me three ground truth secrets: the AS-REP roastable hash of svc-alfresco, the cracked password s3rvice, and the Administrator NTLM hash recovered through DCSync. I built a realistic combined tool output containing all three and measured.

I measure four things: recall (is the secret absent from the outgoing text), exact leak (does it appear verbatim), partial leak (does a substring of four or more characters survive, the LLM-Redactor metric), and false positives (did I redact a username I should have kept).
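
The leak measures can be sketched as a small harness. This is my reading of the definitions above, not the script behind the numbers: `measure` and `hasPartialLeak` are hypothetical names, recall is taken as the share of ground truth secrets with no exact occurrence in the outgoing text, and false positives are left out because they need a list of strings that must survive.

```typescript
interface LeakReport { recall: number; exactLeaks: string[]; partialLeaks: string[]; }

// True if any substring of `secret` of at least `minLen` characters appears in `text`.
// Checking windows of exactly `minLen` is enough, since a longer match contains one.
function hasPartialLeak(text: string, secret: string, minLen: number = 4): boolean {
  for (let i = 0; i + minLen <= secret.length; i++) {
    if (text.indexOf(secret.slice(i, i + minLen)) !== -1) return true;
  }
  return false;
}

function measure(outgoing: string, groundTruth: string[]): LeakReport {
  const exactLeaks = groundTruth.filter((s) => outgoing.indexOf(s) !== -1);
  const partialLeaks = groundTruth.filter(
    (s) => exactLeaks.indexOf(s) === -1 && hasPartialLeak(outgoing, s),
  );
  const recall =
    groundTruth.length === 0 ? 1 : (groundTruth.length - exactLeaks.length) / groundTruth.length;
  return { recall, exactLeaks, partialLeaks };
}

// Made-up ground truth: one password that leaks, one hash that is fully redacted.
const report = measure("evil-winrm -p s3rvice -H EXEGOL_SECRET_NTLM_1", [
  "s3rvice",
  "0123456789abcdef0123456789abcdef",
]);
```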

| Config | Recall | Exact leak | False positives |
| --- | --- | --- | --- |
| regex v1 (dev and cloud plus AD) | 2 of 3 | 1 (s3rvice) | 0 |
| regex v2 (plus format aware crack) | 3 of 3 | 0 | 0 |
| regex v2 plus entropy | 3 of 3 | 0 | 1 |

Three findings came straight out of this.

The hashes were the easy case. NTLM and Kerberos are redacted at one hundred percent, with the account label pulled out of the structure, so the model sees EXEGOL_SECRET_NTLM_1 [ntlm] label=Administrator. It keeps the account to hash link without the value.

The cracked password was the hard case, and it is the important one. s3rvice is seven characters, low entropy, a pronounceable word. No format-based regex catches it, and entropy ignores it because it is too short. Yet it is the credential that actually works. The fix is not a better secret regex. It is to detect the secret by the structure of the tool output that reveals it, the hashcat potfile line <hash>:<plaintext>. Once it is interned, every later occurrence (evil-winrm -p, the secretsdump connection string) dies by propagation. The lesson generalizes: detection has to be tool output aware, a registry of small parsers, not one universal pattern.
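
A parser of that kind can be very small. The sketch below is a hedged illustration, not the parser used in the experiment: `parsePotfileLine` is a hypothetical name, the hash check is a rough heuristic, and it assumes the plaintext is everything after the last colon, so a plaintext that itself contains a colon is not handled.

```typescript
interface Cracked { hash: string; plaintext: string; }

// Rough heuristic: "$type$..." style hashes, or a bare hex digest of 32+ characters.
function looksLikeHash(s: string): boolean {
  return /^\$[a-z0-9-]+\$/i.test(s) || /^[0-9a-f]{32,}$/i.test(s);
}

// Parse one potfile-style line of the form <hash>:<plaintext>.
function parsePotfileLine(line: string): Cracked | null {
  const clean = line.replace(/\r?\n$/, ""); // drop the line ending, keep other whitespace
  const idx = clean.lastIndexOf(":");
  if (idx <= 0 || idx === clean.length - 1) return null;
  const hash = clean.slice(0, idx);
  if (!looksLikeHash(hash)) return null;    // an ordinary "host:port" line is not a crack result
  return { hash, plaintext: clean.slice(idx + 1) };
}

// Made-up AS-REP style line: the hash body is fake, only the shape matters.
const cracked = parsePotfileLine("$krb5asrep$23$svc-alfresco@HTB.LOCAL:deadbeef$cafebabe:s3rvice");
```

The plaintext returned here is what gets interned. The parser never needs to judge whether `s3rvice` looks like a secret, because its position in the line already says so.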

Entropy on its own was a liability. In v1 it redacted GetNPUsers.py, evil-winrm, htb.local. A structural guard (skip paths, executables, CLI flags, identifiers) removes most of that noise, but here the entropy layer still added one false positive and no additional recall. So entropy is off by default, and only earns its place against genuinely unstructured leftovers.
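
For reference, here is what an entropy check with a structural guard can look like. It is a sketch under assumptions: the thresholds (16 characters, 3.5 bits per character) and the guard rules are illustrative defaults of mine, not the values used in the experiment.

```typescript
// Shannon entropy in bits per character.
function shannonEntropy(s: string): number {
  if (s.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const ch of s.split("")) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let h = 0;
  counts.forEach((c) => {
    const p = c / s.length;
    h -= p * Math.log2(p);
  });
  return h;
}

// Structural guard: things that look random to entropy but are not secrets.
function isStructural(word: string): boolean {
  if (word.charAt(0) === "-") return true;                                  // CLI flag
  if (word.indexOf("/") !== -1 || word.indexOf("\\") !== -1) return true;   // path
  if (/\.(py|exe|sh|ps1)$/i.test(word)) return true;                        // script or executable
  if (/^[a-z0-9-]+(\.[a-z0-9-]+)+$/i.test(word)) return true;               // dotted name such as htb.local
  return false;
}

function entropyFlag(word: string, minLen: number = 16, minBits: number = 3.5): boolean {
  if (word.length < minLen || isStructural(word)) return false;
  return shannonEntropy(word) >= minBits;
}
```

The guard has a cost of its own: the path rule also skips base64 strings that contain a slash, so it trades false positives for possible false negatives. That is one more reason to keep this layer optional and behind the structured ones.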


Experiment 2: a real model in the loop

Detection is half the loop. The other half is whether a real model reasons correctly on tokens and whether the command it produces resolves back to something that runs. I fed the redacted Forest output to a real model and asked it to identify the credential that yields Administrator and to propose the pass-the-hash command, using the exact identifier from the output.

The response, verbatim:

(a) The Administrator NTLM hash recovered by secretsdump.py (identifier EXEGOL_SECRET_NTLM_1) lets you authenticate as the domain Administrator via pass-the-hash, without knowing the cleartext password. (b)

evil-winrm -i 10.129.21.9 -u Administrator -H EXEGOL_SECRET_NTLM_1

No secret in the input, no secret in the output. The model reproduced the token exactly, which confirmed the env var format bet. restoreText turns that into a real command with the real hash, host side. The model reasoned correctly about a value it never saw, and its output is actionable after resolution. That is the whole thesis in one turn.


Experiment 3: a provided credential, live

Time to stop using fixtures. New box: HTB DarkZero, a Windows Server 2025 domain controller, with a provided credential for the engagement, exactly as a real test starts. This experiment runs a real command against the real box.

The provided password is interned once at session start, provenance provided. The agent intent uses the token, the host resolves it, and the command runs through docker exec, the same path the real tool would use.

[1] agent intent (token):  nxc smb EXEGOL_HOST_1 -u john.w -p EXEGOL_SECRET_PASSWORD_1 --shares --users
[2] resolved (host only):  nxc smb 10.129.21.18 -u john.w -p <resolved> --shares --users
[3] real execution against the DC ...
[4] nxc echoes the credential, so we re-redact before the model sees it:
    SMB ... DC01  [+] darkzero.htb\john.w:EXEGOL_SECRET_PASSWORD_1

The point of this run is the echo. Almost every tool reprints the credential you used. Because it was interned at ingestion, redactKnown catches it everywhere it reappears, with no extra detection. This is the provided side of the propagation lesson from experiment one, and it is why provided credentials should come in through a masked input that goes straight to a secret store, never through the chat.
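
The resolve step in lines [1] and [2] can be sketched per argument, with a masked copy for host side logs. This is an assumption about the shape, not the actual proxy: `resolveArgv` is a hypothetical name, the password is made up, and the choice to mask only `EXEGOL_SECRET_` tokens mirrors the log excerpt above.

```typescript
interface ResolvedCommand { real: string[]; masked: string[]; }

// Resolve tokens inside each argv element. `real` goes to the executor,
// `masked` is safe to print: secrets show as <resolved>, entities in clear.
function resolveArgv(argv: string[], tokenToValue: Map<string, string>): ResolvedCommand {
  const tokens = Array.from(tokenToValue.keys()).sort((a, b) => b.length - a.length);
  const real: string[] = [];
  const masked: string[] = [];
  for (const arg of argv) {
    let r = arg;
    let m = arg;
    for (const t of tokens) {
      const value = tokenToValue.get(t) as string;
      r = r.split(t).join(value);
      m = m.split(t).join(t.indexOf("EXEGOL_SECRET_") === 0 ? "<resolved>" : value);
    }
    real.push(r);
    masked.push(m);
  }
  return { real, masked };
}

const sessionMap = new Map<string, string>([
  ["EXEGOL_HOST_1", "10.129.21.18"],
  ["EXEGOL_SECRET_PASSWORD_1", "Example-Pass!1"], // made-up password
]);
const intent = ["nxc", "smb", "EXEGOL_HOST_1", "-u", "john.w", "-p", "EXEGOL_SECRET_PASSWORD_1", "--shares"];
const resolvedCmd = resolveArgv(intent, sessionMap);
```

Keeping the command as an argument array rather than a single shell string means a resolved value with spaces or shell metacharacters stays one argument, provided the executor is a non-shell API that takes an array, such as Node's `child_process.execFile`.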


Experiment 4: an autonomous agent in total substitution

This is the one I cared about. An autonomous agent, the real model proposing one command per turn, every command executed for real against DarkZero, every input and every output passing through the substitution proxy. Total substitution this time: not just secrets, but hosts, domains and users, all pseudonymized in both directions. The agent should never see a real value anywhere, and it should still be able to make progress.

The only things seeded are what the operator genuinely provides at the start: the target IP, the domain, the username, and the password, all read from the environment, none hardcoded in the logic. Everything else has to be discovered and substituted on the fly.
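
Seeding from the environment can be sketched like this. The variable names (`PT_TARGET_IP` and so on) and the function names are hypothetical, and the environment is passed in as a plain object so the example does not depend on any runtime; in Node it would be called with `process.env`.

```typescript
interface SeedEntry { token: string; value: string; provenance: "provided"; }

// Hypothetical environment variable names, paired with the tokens used in the briefing.
const SEED_SPEC: Array<[string, string]> = [
  ["PT_TARGET_IP", "EXEGOL_HOST_1"],
  ["PT_DOMAIN", "EXEGOL_DOMAIN_1"],
  ["PT_USERNAME", "EXEGOL_USER_1"],
  ["PT_PASSWORD", "EXEGOL_SECRET_PASSWORD_1"],
];

function seedProvided(env: Record<string, string | undefined>): SeedEntry[] {
  const entries: SeedEntry[] = [];
  for (const [name, token] of SEED_SPEC) {
    const value = env[name];
    if (value === undefined || value === "") {
      throw new Error("missing required variable " + name); // fail closed: no seed, no session
    }
    entries.push({ token, value, provenance: "provided" });
  }
  return entries;
}

// The briefing is built from tokens only, so no seeded value can reach the prompt.
function buildBriefing(entries: SeedEntry[]): string {
  const t = entries.map((e) => e.token);
  return "Target host: " + t[0] + ". Active Directory domain: " + t[1] +
    ". Username " + t[2] + ", password " + t[3] + ".";
}
```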

The briefing the agent actually receives:

Authorized HTB lab engagement (enumeration only). Target host: EXEGOL_HOST_1.
Active Directory domain: EXEGOL_DOMAIN_1. You hold valid domain credentials:
username EXEGOL_USER_1, password EXEGOL_SECRET_PASSWORD_1. Enumerate methodically ...

It ran six turns: SMB enumeration, user enumeration, kerberoast, AS-REP, shares, MSSQL enumeration. Two turns of the redacted transcript, which is what the model sees:

========== TURN 1 ==========
agent (redacted) : nxc smb EXEGOL_HOST_1 -u EXEGOL_USER_1 -p EXEGOL_SECRET_PASSWORD_1
  -> resolved     : nxc smb 10.129.21.18 -u john.w -p <pass>
  cmd leak: 0 | output leak: 0
    SMB ... EXEGOL_HOST_2  Windows Server 2025 (name:EXEGOL_HOST_2) (domain:EXEGOL_DOMAIN_1)
    SMB ... EXEGOL_HOST_2  [+] EXEGOL_DOMAIN_1\EXEGOL_USER_1:EXEGOL_SECRET_PASSWORD_1

========== TURN 2 ==========
agent (redacted) : nxc smb EXEGOL_HOST_1 -u EXEGOL_USER_1 -p EXEGOL_SECRET_PASSWORD_1 --users
  cmd leak: 0 | output leak: 0
    SMB ... EXEGOL_HOST_2  EXEGOL_USER_2  2025-09-10  Built-in account for administering the computer
    SMB ... EXEGOL_HOST_2  EXEGOL_USER_3  <never>     Built-in account for guest access
    SMB ... EXEGOL_HOST_2  EXEGOL_USER_4  2025-07-29  Key Distribution Center Service Account

Read what happened in turns one and two. The hostname DC01 became EXEGOL_HOST_2. The accounts Administrator, Guest and krbtgt became EXEGOL_USER_2, 3 and 4. None of those were seeded. The agent made them appear by enumerating, and the proxy substituted them on the fly, then the agent kept reasoning about the tokens. In an earlier run a linked SQL server discovered through MSSQL was even reused by the agent in the next turn through its token, resolved host side for execution. The agent discovers, remembers and reuses sensitive entities without ever knowing their values.

The final verdict, three independent audits:

audit transcript, provided values        : 0
audit transcript, structural (IP/FQDN/secret residual) : 0
audit transcript, interned value (sanity) : 0
ip / domain / user / password present in agent context : no / no / no / no
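
The three audits can be approximated by a few lines of counting. This is a sketch of the idea, not the audit script that produced the numbers above: `auditTranscript` is a hypothetical name, and the structural pass only looks for IPv4 addresses and 32 character hex strings, leaving FQDN detection out.

```typescript
interface AuditResult { provided: number; structural: number; interned: number; }

function countOccurrences(text: string, needle: string): number {
  return needle.length === 0 ? 0 : text.split(needle).length - 1;
}

const IPV4_RE = /\b(?:\d{1,3}\.){3}\d{1,3}\b/g;
const HEX32_RE = /\b[0-9a-f]{32}\b/gi;

// provided:   exact occurrences of values the operator seeded
// structural: anything that still looks like an IP or a hash, known or not
// interned:   exact occurrences of any value the vault holds (sanity check)
function auditTranscript(transcript: string, provided: string[], interned: string[]): AuditResult {
  const sum = (values: string[]): number =>
    values.reduce((acc, v) => acc + countOccurrences(transcript, v), 0);
  const structural =
    (transcript.match(IPV4_RE) ?? []).length + (transcript.match(HEX32_RE) ?? []).length;
  return { provided: sum(provided), structural, interned: sum(interned) };
}
```

The structural pass is the one that does not trust the vault. It can catch a value that detection missed entirely, which the two exact match audits cannot do by construction.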

And the utility did not collapse. The classic worry from LLM-Redactor is that stripping names destroys reasoning. For AD enumeration that did not happen, because what matters there is the relationships (this host, this domain, this account, this linked server), and the stable typed tokens preserve the relationships. The names were never the thing the model needed.


What we achieved

Concrete, measured, on real machines:

  • A working bidirectional tokenization proxy. Secrets and identities are replaced by stable typed tokens on the way to the model, and resolved back to real values host side before execution.
  • Detection that reaches three of three on the Forest ground truth with zero exact leak and zero false positives, once it is tool output aware. The cracked password, the hard case, is caught by the structure of the hashcat output, not by its form.
  • A real model reasoning correctly on tokens and producing a command that resolves to a working pass-the-hash, with nothing sensitive in or out.
  • A provided credential used in a real command against a real domain controller, with the tool’s echo of that credential re-redacted before it reaches the model.
  • An autonomous agent running six real turns against a real DC under total substitution. Three independent audits at zero. Hosts, accounts and a linked server discovered and reused through tokens, never seen in clear.
  • The env var token format reproduced faithfully by the model in every single run.

The mechanism is provider agnostic by construction, because it operates host side before the SDK call. I have only validated it on one provider so far, which is the first thing the next part fixes.


What is still exposed, honestly

  • Topology, not values. Stable pseudonyms reveal structure. The provider sees an anonymized relationship graph of the engagement: that there are N hosts, that USER_1 is the same account across turns, who talks to whom. That is the price of utility. For an ultra sensitive audit a non stable per turn mode would break the graph, at a real cost to the agent’s reasoning. That tradeoff belongs to the operator.
  • Inference channels. A value can be substituted while a boilerplate description still disambiguates the token. The Guest account becomes EXEGOL_USER_3, but the surrounding text Built-in account for guest access still hints at what it is. Not a value leak, but an inference vector.
  • One provider. Token reproduction was perfect here, but the provider agnostic claim is not earned until I rerun this across at least two more models and compare token formats.
  • Transformation channels. If the agent runs base64 or cut on a file of secrets, the secret leaves encoded or truncated and slips past exact match. This needs normalization and a guard on transformation commands.
  • Persistence versus ephemerality. Reusing a discovered secret after a session reload means persisting the vault, which is in tension with keeping the map ephemeral. The likely answer is an encrypted at rest vault with the key in the OS secret store, but it is not built yet.
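
As a first idea for the transformation channel, a guard could flag commands that invoke an encoder or slicer before they run. This is only a sketch of a possible direction, since nothing of the sort is built yet: the tool list and the name `usesTransformation` are hypothetical, and the check is purely lexical.

```typescript
// Hypothetical starter list of tools that re-encode or slice their input.
const TRANSFORMERS: string[] = ["base64", "xxd", "cut", "rev", "tr", "gzip", "openssl"];

// True if any word of the command is one of the listed tools,
// with or without a leading path (for example /usr/bin/base64).
function usesTransformation(command: string): boolean {
  const words = command.split(/[\s|;&()]+/).filter((w) => w.length > 0);
  return words.some((w) => TRANSFORMERS.indexOf(w.replace(/^.*\//, "")) !== -1);
}
```

A lexical check like this over-triggers, since an argument that happens to be named `tr` is flagged too. That is consistent with the fail closed rule, and a flagged command could simply have its output withheld from the model until the operator reviews it.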

None of these are fatal. All of them are now written down with a plan, which is the point of doing this as research instead of as a demo.


Where the series goes next

  • Multi provider validation. Rerun the autonomous loop across several models and measure token reproduction fidelity and any drift. This is what turns “provider agnostic by construction” into “provider agnostic, measured.”
  • Detection packs by scope. The core never changes. The patterns and parsers do. AD, web, cloud and network become pluggable packs, each tested on real corpora. This is how the thing stops being an AD trick and becomes general.
  • Adversarial robustness. The Beyond Jailbreaking test, run against my own loop: a multi turn adversary trying to make the agent reveal the real value behind a token. My prediction is that it cannot, because the value is not in the context. If that holds it is the strongest result in the series.
  • Deeper chains. The full DarkZero path, discovered secrets and all (NT hash through ADCS, tickets, certificates), end to end, live, in total substitution.
  • Transformation and inference channels. Closing the base64 gap and the boilerplate inference gap.
  • Integration. Folding the vault and the proxy into the two choke points of a real agentic tool, so this stops being a proof of concept and becomes a switch you can turn on.

The destination is a pipeline where you can run a red team or a blue team engagement, hand it real credentials, let it discover more, and never leak a single secret, identity, IP or domain to whatever provider sits behind the model. Part I shows the core works on real machines. The rest of the series is about making it general, robust, and real.


Sources
