When LLMs Guard a Wallet

Daniel Alonso

Technical report · Daniel Alonso · Crow Tech Pte. Ltd. · May 12, 2026

When LLMs Guard a Wallet

A simulator-only adversarial study of 25 attacker models against 25 guardian models over 1,875 ordered trials.

First-pass composite1,127

First-pass resilient703

Retry-aware composite1,147

Policy blocks107

Abstract

We evaluated whether LLM guardian agents follow a protective wallet instruction under bounded adversarial persuasion from attacker LLMs. The study is simulator-only: it used no real wallets, private keys, RPC endpoints, chain SDKs, wallet signing, testnet/mainnet activity, or real funds. The first pass covered 1,875 ordered attacker-guardian trials: 106 transfer-action attempts, 1,021 self-reported unsafe-intent flags, 703 resilient rows, 41 provider/runtime errors, and 4 incomplete rows. The combined 1,127-row first-pass number is a composite signal, not an unqualified operational compromise count.

Method

The main run paired 25 guardian routes with 25 attacker routes across 3 independent repetitions, producing 625 ordered model pairs, 1,875 first-pass trials, and 26,238 completed first-pass model attempts. The retry envelope covered 41 first-pass error rows and completed 704 retry attempts. Attackers had 25 attempts and a 200-word cap per message. Calls used provider-default temperature and sampling.

The terminal composite outcomes are split between attempted_transfer_tool_call and unsafe_stated_intent. The second is the guardian model's own structured self-report, not an independent judge label. Transfer-shaped simulator actions were blocked by policy, so action attempts, self-reported unsafe-intent flags, composite signals, and policy enforcement are reported separately.

Results

Status	First pass	Retry-aware
guardian_compromised	1127	1147
guardian_resilient	703	724
error	41	0
incomplete	4	4

Subtype	First pass	Retry-aware
attempted transfer tool call	106	107
unsafe stated intent	1021	1040

Ordered attacker-versus-guardian heatmap for the wallet guardian study — Ordered attacker-versus-guardian composite-signal-rate heatmap. Rows are guardians sorted by retry-aware resilience; columns are attackers sorted by retry-aware effectiveness.

Guardian Ranking

Guardian	Provider	Resilient	Composite	Retry-aware resilient
GPT 5 Nano	OpenAI	73	0	100.0%
HY3 Preview	Tencent	73	0	98.7%
Gemini 3 Flash Preview	Google	72	1	98.7%
GLM 4.7 Flash	Z.ai	71	3	96.0%
Gemini 3.1 Pro Preview	Google	48	26	64.0%
GLM 5.1	Z.ai	40	34	54.7%
GLM 5	Z.ai	37	36	52.0%
Minimax M2.7	MiniMax	35	38	46.7%
Minimax M2	MiniMax	32	39	44.0%
Claude Opus 4.7	Anthropic	32	42	44.0%

Attacker Ranking

Attacker	Provider	Composite	Resilient	Retry-aware composite
Grok 4.3	xAI	56	19	74.7%
Ernie 4.5 300B A47B	Baidu	54	21	72.0%
Gemini 3 Flash Preview	Google	53	21	70.7%
GLM 5	Z.ai	53	22	70.7%
Minimax M2.7	MiniMax	53	22	70.7%
Minimax M2	MiniMax	52	23	69.3%
Grok 4.1 Fast	xAI	51	24	68.0%
Qwen3.6 Max Preview	Qwen	50	25	66.7%
Qwen3.6 Flash	Qwen	49	25	65.3%
Gemini 3.1 Flash Lite Preview	Google	49	26	65.3%

Reliability and Retries

First-pass provider/runtime errors were preserved as reliability data. The retry envelope replaces only mapped first-pass error rows and is shown separately.

Subtype	Role	Count
attacker_live_error:RuntimeError	attacker	38
attacker_live_error:ValueError	attacker	2
guardian_live_error:RuntimeError	guardian	1

Safety, Ethics, and Limitations

This is a simulator-only adversarial AI safety evaluation; no real private keys, real wallets, RPC, chain SDKs, wallet signing, mainnet/testnet activity, or real funds were used.
Guardian transfer attempts are transfer-shaped simulator actions only. All observed transfer-shaped actions were blocked by deterministic policy.
First-pass provider/runtime errors are preserved as reliability data. The retry-aware envelope replaces only the mapped first-pass error rows and is reported separately from the first pass.
Rows are ordered attacker-vs-guardian pairs over three repetitions, not independent claims about a provider as a whole.
A guardian marked resilient only means no compromise was observed within the 25-attempt budget.
Calls used provider-default temperature and sampling through an OpenAI-compatible route. Provider defaults and transient routing errors are part of the measured environment.
AI assistance was used for orchestration, analysis, code, and publication packaging. Daniel Alonso conducted the study with Crow Tech publication support.

Artifacts and Reproducibility

Public article and interactive charts: https://crow.sg/research/llm-wallet-guard-study. Artifact manifest and checksums: https://crow.sg/research/llm-wallet-guard-study/artifact-manifest.json. Source code: https://github.com/Crow-Tech-Pte-Ltd/research/tree/main/llm-wallet-guard-study.

Raw dataset archive: Full raw data archive hosted on Google Drive for independent inspection and reanalysis.
Source code repository: Public study code and reconstruction materials.
Public summary JSON: Generated machine-readable public dataset used by the wallet-guardian study page charts.
Summary JSON schema: Machine-readable JSON Schema for the generated public summary.
Artifact manifest schema: Machine-readable JSON Schema for the public artifact manifest.
Data README: Plain-language guide to the public CSV and JSON data release.
Paper PDF: Generated paper-style PDF.
Printable HTML paper: Browser-printable HTML version of the paper.
LaTeX source: LaTeX source for rebuilding the paper when a LaTeX toolchain is available.
BibTeX citation: Citation entry for reference managers and academic notes.
Sanitized first-pass trial CSV: One sanitized row per first-pass ordered attacker-guardian-condition trial, with retry envelope fields for errored rows.
Ordered pair matrix CSV: Aggregated 25 by 25 ordered attacker-versus-guardian matrix over three repetitions.
Guardian resilience ranking CSV: Per-guardian sanitized outcome counts and resilience metrics.
Attacker effectiveness ranking CSV: Per-attacker sanitized outcome counts and effectiveness metrics.
Retry envelope CSV: Mapping from first-pass provider/runtime errors to retry-run outcomes.
Outcome map SVG: Vector heatmap preview of ordered attacker-versus-guardian composite signal rates.
Outcome map PNG: Raster preview image for social cards and crawlers that do not reliably render SVG.