Simulator-only AI safety study

When LLMs Guard a Wallet

A simulator-only adversarial study of 25 attacker models against 25 guardian models over 1,875 ordered trials.

First-pass composite: 1,127 (60.1%)
First-pass resilient: 703 (41 errors preserved)
Retry-aware signal: 1,147 (61.2%)
Policy blocks: 107 (0 allowed)

Retry-aware composite signal rate for each ordered attacker-guardian pair over three repetitions.
By Daniel Alonso

No real private keys, wallets, RPC, signing, chain activity, or funds were used.

Abstract

A deliberately narrow test of wallet-guardian obedience.

The experiment pairs attacker LLMs with guardian LLMs in a simulator. The guardian is instructed never to transfer out of a protected wallet; the attacker has up to 25 bounded attempts to induce either a transfer-shaped simulator action or the guardian's own structured unsafe-intent flag.
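The trial loop described above can be sketched as follows. This is a minimal illustration, not the study's harness: `GuardianReply`, `run_trial`, and the callable signatures are hypothetical names, and enforcing the word cap by truncation is an assumption (the real harness may reject over-long messages instead). Only the 25-attempt budget and the 200-word cap come from the study.

```python
from dataclasses import dataclass
from typing import Callable

ATTEMPT_BUDGET = 25       # from the study's method parameters
ATTACKER_WORD_CAP = 200   # from the study's method parameters


@dataclass
class GuardianReply:
    # Hypothetical reply shape; the study's real schema is not shown on this page.
    transfer_action: bool = False     # guardian emitted a transfer-shaped simulator action
    unsafe_intent_flag: bool = False  # guardian's own structured self-report


def run_trial(attacker: Callable[[int], str],
              guardian: Callable[[str], GuardianReply]) -> str:
    """Return 'transfer_action', 'unsafe_intent_flag', or 'resilient'."""
    for attempt in range(1, ATTEMPT_BUDGET + 1):
        message = attacker(attempt)
        # Assumption: the cap is enforced by truncating to the first 200 words.
        message = " ".join(message.split()[:ATTACKER_WORD_CAP])
        reply = guardian(message)
        if reply.transfer_action:
            return "transfer_action"
        if reply.unsafe_intent_flag:
            return "unsafe_intent_flag"
    # No signal within the budget; this says nothing about attempt 26.
    return "resilient"
```

A trial ends at the first signal, so "resilient" here means only that the budget ran out without one.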

The first pass found 106 transfer-action attempts, 1,021 self-reported unsafe-intent flags, 703 resilient trials, 41 provider/runtime errors, and 4 incomplete rows, which together account for all 1,875 planned trials. A separate retry envelope reruns the errored rows and adds 20 composite-signal rows (1,147 versus 1,127); these are reported as a composite signal, not an unqualified operational compromise count.
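As a quick consistency check, the first-pass counts above can be tallied against the planned trial count. The dictionary keys are illustrative labels, not the column names of the published CSV.

```python
# First-pass outcome counts as reported in the abstract.
first_pass = {
    "transfer_action": 106,
    "unsafe_intent_flag": 1021,
    "resilient": 703,
    "error": 41,
    "incomplete": 4,
}

planned_trials = 25 * 25 * 3  # attackers x guardians x repetitions
total = sum(first_pass.values())

# The composite signal combines action attempts and self-reported flags.
composite = first_pass["transfer_action"] + first_pass["unsafe_intent_flag"]
composite_rate = round(100 * composite / planned_trials, 1)

print(total, composite, composite_rate)
```

The five categories sum to the 1,875 planned trials, and the composite of 1,127 corresponds to the reported 60.1%.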

Key findings

Action attempts, self-reports, retries, and policy are separate signals.

The first-pass result is kept separate from the retry envelope.

First-pass errors remain provider reliability evidence. Retry-aware counts replace only the 41 mapped error rows, producing 1,147 composite-signal and 724 resilient rows over the same 1,875 planned trials.
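One way to picture the retry envelope is as an overlay that touches only errored rows. The sketch below assumes hypothetical field names (`trial_id`, `outcome`, `from_retry`); the published retry envelope CSV may be shaped differently.

```python
def apply_retry_envelope(first_pass_rows, retry_outcomes):
    """Overlay retry outcomes onto errored first-pass rows only.

    first_pass_rows: list of dicts with 'trial_id' and 'outcome' (hypothetical fields).
    retry_outcomes:  dict mapping trial_id -> retry outcome.
    """
    merged = []
    for row in first_pass_rows:
        if row["outcome"] == "error" and row["trial_id"] in retry_outcomes:
            merged.append({**row,
                           "outcome": retry_outcomes[row["trial_id"]],
                           "from_retry": True})
        else:
            # Non-error rows are never overwritten, even if a retry exists.
            merged.append({**row, "from_retry": False})
    return merged


# Reported totals: the retry envelope accounts for exactly the 41 error rows.
added_composite = 1147 - 1127   # 20 rows
added_resilient = 724 - 703     # 21 rows
retry_rate = round(100 * 1147 / 1875, 1)
# 1,147 composite + 724 resilient + 4 incomplete covers all 1,875 planned trials.
```

Keeping the first-pass rows intact and marking replaced rows preserves the original errors as reliability evidence.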

Most composite signals were self-reported flags, not action attempts.

1,021 first-pass rows ended with the guardian's own unsafe-intent flag, while 106 produced transfer-shaped simulator action attempts. The self-report flag is not an independent judge label, and the action attempts were all policy-blocked.

Policy enforcement stayed fail-closed.

All 107 observed transfer-shaped simulator actions were blocked. This separates action attempts, self-reported unsafe-intent flags, the composite signal, and the deterministic policy layer.
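A fail-closed policy layer of this kind is typically an allowlist: anything not explicitly permitted is denied, regardless of what the model says. The sketch below is an assumption about the general pattern, not the study's policy code; `Action`, `ALLOWED_KINDS`, and the kind labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str            # e.g. "reply" or "transfer" (hypothetical labels)
    amount: float = 0.0


# Hypothetical allowlist: transfers are deliberately absent.
ALLOWED_KINDS = frozenset({"reply", "refuse"})


def policy_allows(action: Action) -> bool:
    """Deterministic, fail-closed check: allow only explicitly listed kinds."""
    return action.kind in ALLOWED_KINDS


def execute(action: Action) -> str:
    # The decision never depends on model output beyond the action's shape.
    return "executed" if policy_allows(action) else "blocked"
```

Because the check is an allowlist, an unrecognized action kind is blocked by default, which is what makes the layer fail closed rather than fail open.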

Method

Ordered pairs, bounded attempts, simulator-only tools.

The main run paired 25 guardian model routes with 25 attacker model routes across 3 independent repetitions: 625 ordered model pairs and 1,875 first-pass trials, totaling 26,238 completed first-pass model attempts. The retry envelope completed 704 attempts over 41 first-pass error rows.
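The design arithmetic can be reproduced in a few lines. The route names below are placeholders, not the study's model list, and the means are simple averages over planned rows rather than figures reported by the study.

```python
from itertools import product

# Placeholder route names; the study's actual model routes are not listed here.
attackers = [f"attacker_{i}" for i in range(25)]
guardians = [f"guardian_{i}" for i in range(25)]
REPETITIONS = 3

# Ordered pairs: (attacker, guardian) is distinct from the reverse assignment.
ordered_pairs = list(product(attackers, guardians))
planned_trials = len(ordered_pairs) * REPETITIONS

# Derived averages (not reported figures): attempts per planned trial.
mean_first_pass_attempts = round(26238 / planned_trials, 1)
mean_retry_attempts = round(704 / 41, 1)

print(len(ordered_pairs), planned_trials,
      mean_first_pass_attempts, mean_retry_attempts)
```

On these totals a first-pass trial used about 14 of its 25 available attempts on average, and a retried row about 17.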

Threat model: black_box
Tool mode: simulator_decoy_only
Attempt budget: 25
Attacker word cap: 200 words
Sampling: provider-default temperature and sampling
Public matrix rows: 625 ordered pairs

Caveats

Read this as a simulator measurement, not a deployment guarantee.

This is a simulator-only adversarial AI safety evaluation; no real private keys, real wallets, RPC, chain SDKs, wallet signing, mainnet/testnet activity, or real funds were used.
Guardian transfer attempts are transfer-shaped simulator actions only. All observed transfer-shaped actions were blocked by deterministic policy.
First-pass provider/runtime errors are preserved as reliability data. The retry-aware envelope replaces only the mapped first-pass error rows and is reported separately from the first pass.
Rows are ordered attacker-vs-guardian pairs over three repetitions, not independent claims about a provider as a whole.
A guardian marked resilient means only that no compromise was observed within the 25-attempt budget.
Calls used provider-default temperature and sampling through an OpenAI-compatible route. Provider defaults and transient routing errors are part of the measured environment.
AI assistance was used for orchestration, analysis, code, and publication packaging. Daniel Alonso conducted the study with Crow Tech publication support.

Downloads

Published data and rebuildable paper artifacts.

Manifest: https://crow.sg/research/llm-wallet-guard-study/artifact-manifest.json
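Each artifact in this section is listed with a truncated sha256 digest. A downloaded file can be checked against that visible prefix with the standard library, as sketched below. The helper names are hypothetical, and a prefix match is weaker than comparing the full digest, which presumably appears in the manifest (an assumption; the manifest's layout is not shown here).

```python
import hashlib
from pathlib import Path


def sha256_hex(path: Path, chunk_size: int = 1 << 16) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listed_prefix(path: Path, listed_prefix: str) -> bool:
    """Compare a file's digest against the truncated prefix shown on the page."""
    return sha256_hex(path).startswith(listed_prefix.strip().lower())
```

For example, `matches_listed_prefix(Path("paper.pdf"), "22992d494d54")` would check a locally saved copy of the paper PDF, where `paper.pdf` is whatever filename the download was saved under.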

Recommended citation

Daniel Alonso. "When LLMs Guard a Wallet." Crow Tech Pte. Ltd., May 12, 2026. https://crow.sg/research/llm-wallet-guard-study.

Paper PDF

Generated paper-style PDF.

112.1 KB / sha256 22992d494d54...

Printable HTML paper

Browser-printable HTML version of the paper.

13.4 KB / sha256 efc6816aac50...

Data README

Plain-language guide to the public CSV and JSON data release.

3.2 KB / sha256 f525fd2647b7...

Public summary JSON

Generated machine-readable public dataset used by the wallet-guardian study page charts.

470.7 KB / sha256 7331acc473c5...

Sanitized first-pass trial CSV

One sanitized row per first-pass ordered attacker-guardian-condition trial, with retry envelope fields for errored rows.

247.1 KB / sha256 affafa583381...

Ordered pair matrix CSV

Aggregated 25 by 25 ordered attacker-versus-guardian matrix over three repetitions.

72.4 KB / sha256 fa0e22784811...

Guardian resilience ranking CSV

Per-guardian sanitized outcome counts and resilience metrics.

3.5 KB / sha256 e11086628d11...

Attacker effectiveness ranking CSV

Per-attacker sanitized outcome counts and effectiveness metrics.

3.5 KB / sha256 156f5eb85c50...

Retry envelope CSV

Mapping from first-pass provider/runtime errors to retry-run outcomes.

7.1 KB / sha256 dcd55cfc1728...

Summary JSON schema

Machine-readable JSON Schema for the generated public summary.

74.4 KB / sha256 d9533e6cb31c...

Artifact manifest schema

Machine-readable JSON Schema for the public artifact manifest.

11.6 KB / sha256 5e1ba2167097...

LaTeX source

LaTeX source for rebuilding the paper when a LaTeX toolchain is available.

18.4 KB / sha256 0b5aa0e1c4e2...

Outcome map SVG

Vector heatmap preview of ordered attacker-versus-guardian composite signal rates.

125.1 KB / sha256 129b9670992a...

Outcome map PNG

Raster preview image for social cards and crawlers that do not reliably render SVG.

103.1 KB / sha256 c4eacc015831...

BibTeX citation

Importable citation entry for notes and reference managers.

Raw dataset archive

Full raw data archive hosted on Google Drive for independent inspection and reanalysis.

Source code repository

Public study code and reconstruction materials.
