Arbitrary Choices Are Not Random

Name: Arbitrary Choices Are Not Random public summary
Creator: Daniel Alonso
Published: 2026-05-08

Daniel Alonso

LLM forced-choice study

Arbitrary Choices Are Not Random

A forced-choice audit of 42 LLMs across 50,400 planned trials. The short version: when a workflow asks an LLM to pick between two meaningless labels, the answer is often carrying order, word-choice, and context residue.

Models

42

Planned rows

50,400

OK rows

48,316

Spend

$28.60

60.4% of OK rows selected the first displayed option. That is the warning sign, not a leaderboard.

By Daniel AlonsoMay 8, 2026AI assistance disclosed for execution and redaction

Download paper View charts Data summary

Abstract

A boring prompt that still carried order, word, and context residue.

This project studies forced binary choices where neither option is intended to be correct. I ran it because a lot of product logic quietly asks models to pick between labels and then treats the answer like neutral randomness. The goal here is narrower: measure regularities in behavior, not infer intent, consciousness, political preference, moral belief, or any other inner state. Every pair is run in normal and swapped order, with weak context variants, so word preference and first-or-second-position habits can be separated.

Key findings

The boring prompt was not boring to the models

Position was visible

60.4%

Across 48,316 successful rows, the first displayed option was selected more often than a fair split.

Order mattered

75.8%/59.3%

First-option share in bare versus swapped bare prompts. Averaging across both orders gives a pure position bias of 67.5% once option identity is balanced.

Parser was mostly clean

99.5%

48,091 OK rows were exact one-word parses; only 1 row needed manual override.

Failures concentrated

96.8%

The non-OK rows mostly came from the Baidu 21B and Reka Flash 3 caveat routes, not from a broad collapse.

sweet / bitter

90.6% sweet

Majority answer: sweet. Bare and swapped prompts combined, n=802.

smooth / rough

88.3% smooth

Majority answer: smooth. Bare and swapped prompts combined, n=804.

loud / quiet

87.1% quiet

Majority answer: quiet. Bare and swapped prompts combined, n=804.

fast / slow

86.5% fast

Majority answer: fast. Bare and swapped prompts combined, n=806.

up / down

82.5% up

Majority answer: up. Bare and swapped prompts combined, n=805.

mountain / valley

81.9% mountain

Majority answer: mountain. Bare and swapped prompts combined, n=805.

Interactive Results

Bias is easier to see when you separate the axes

Loading chart data

The public summary JSON is being fetched separately to keep the article HTML light. If the interactive charts do not load, the same artifacts remain available below.

Summary JSON Printable HTML Paper PDF

Method

Small prompts, many controlled repeats

The dataset uses 30 ordinary word pairs such as sweet/bitter, smooth/rough, and morning/evening. Each pair was asked in a bare form and a swapped form. Context prompts add one weak sentence before the same choice.

Calls were AI-assisted and human-directed by Daniel Alonso. The model calls themselves used OpenRouter as the model API and spend platform for this study, with SQLite as the durable run log. Runners wrote raw responses, parse decisions, usage JSON, and later attempt-level audit rows. The article and public artifacts here are also AI-assisted and human-directed by Daniel Alonso.

The model taxonomy is deliberately modest: provider, family, and tier labels come from local config and route names. It is a rough descriptive grouping, not a benchmark claim about who is currently winning.

Setup snapshot

Temperature

0.7

Repetitions

10 per condition

Conditions

bare, bare swapped, context, context swapped

OpenAI subjects

GPT-5 Nano, GPT-5.4

Inferred context targets

55 of 60

Caveats

The run was real-world messy

Failures are preserved

Final counts: 48,316 ok, 126 invalid, 98 error, 724 rate-limited, and 1,136 model-removed rows.

Two models are caveats

ERNIE 4.5 21B A3B hit repeated provider rate limits and was removed. The remaining invalid rows are preserved Reka Flash 3 caveat rows from pathological high-token retries.

One manual override

One ERNIE 4.5 300B A47B row was manually overridden to evening after repeated explanatory answers made the final answer unambiguous.

Cost accounting is conservative

OpenRouter dashboard spend was about $28.60. Recorded attempt usage sums to $22.21 because early superseded retries were not all captured.

OpenRouter provider routing was not pinned

Calls used the bare model field with no provider.order or provider.only constraints, so OpenRouter freely routed a single nominal model id across its available backends within the run. Closed-source models effectively hit one backend each, but open-weight routes hit many - DeepSeek v3.2 was served by 9 providers, Llama 3.3 70B by 13, DeepSeek V4 Pro by 6. The first-option rate within one open-weight model varies up to about 8 pp across its serving providers, so per-model rankings on open-weight routes carry routing noise on top of the model itself. The served provider is recorded per attempt for slicing.

Per-cell statistical power is small

With 10 reps and four conditions per (model, pair) cell, the binomial 95% CI half-width at p=0.5 is roughly ±15 pp. Aggregate findings (overall position bias across 48 k rows, per-model first-option share across 1,200 rows, per-pair preference across ~800 rows) are robust. Single (model, pair, condition) cells are not - any narrow claim like 'model X flips on pair P under context C' should be treated as exploratory.

Word preferences are not just corpus frequency

Regressing per-pair order-balanced preference on the wordfreq Zipf log-frequency gap gives Pearson r = 0.235, R² = 0.055, p ≈ 0.21. The more frequent token wins only 19 of 30 pairs. The simplest 'models pick the more common word' explanation does not fit - clear counter-frequency examples include sharp/mellow (mellow wins 70% despite being far rarer), smooth/rough, warm/cold, and circle/square. This rules out the simplest unigram-frequency confound but does not rule out richer frequency-based stories.

Downloads

Public artifacts

File sizes and SHA-256 prefixes come from the generated manifest. It also indexes the public summary JSON so the data shape, summary schema, and manifest schema are inspectable before opening the full file.

The public machine-readable release in this web package is the generated summary JSON, with schemas and the manifest below.

Cite this study

Daniel Alonso. "Arbitrary Choices Are Not Random." Independent research, May 8, 2026. https://crow.sg/research/llm-arbitrary-choice-study.

The citation file is generated with the other public artifacts so the title, author, date, and artifact URL stay aligned.

BibTeX citation

Citation entry for reference managers and academic notes.

application/x-bibtex · 519 B · SHA-256 27f904adb2311f54...

Hashed files

9

Hashed payload

615.5 KB

Model rows

42

Pair rows

30

Context rows

55

Dictionary

13 keys / 14 metrics

Source provenance

Article date: May 8, 2026
Trial log updated: 2026-05-08 10:06:11 UTC
Attempt log updated: 2026-05-08 10:06:11 UTC
Source database: source SQLite run log; public release is study-summary.json and generated artifacts

Manifest notes

The full raw SQLite row log is not included in this repository's public site artifacts.
The uploadable public machine-readable release is study-summary.json, accompanied by its JSON Schema and this manifest.
Study code is published at https://github.com/Crow-Tech-Pte-Ltd/research/tree/main/llm-arbitrary-choice-study.
SHA-256 hashes cover the public files listed in this manifest at generation time.
The manifest is not self-hashed because writing the manifest changes its own bytes.

Summary JSON data dictionary

study

object · 18 fields

Study metadata, source timestamps, run configuration, and public headline counts.

Fields: articleDate, attemptsUpdatedAt, author, conditions, contexts, dashboardSpendUsd, inferredContextTargets, models, pairs, plannedTrials, repetitions, slug, sourceDatabase, sourceUpdatedAt, subtitle, taxonomyNote, temperature, title

statusCounts

array · 5 records

Trial rows grouped by final execution or parse status.

Fields: count, status

conditionSummary

array · 4 records

Prompt-condition totals, OK rates, and first-displayed-option shares.

Fields: condition, firstShare, okRate, okTrials, totalTrials

parseStatusCounts

array · 6 records

Parser decision totals across all trial rows.

Fields: count, parseStatus

attemptsByMaxTokens

array · 10 records

Attempt outcomes grouped by status and max_tokens setting.

Fields: count, maxTokens, status

cost

object · 7 fields

Dashboard spend, captured usage spend, and token accounting from recorded attempts.

Fields: attemptCompletionTokens, attemptPromptTokens, attemptReasoningTokens, attemptUsageObservedUsd, dashboardSpendUsd, note, trialUsageObservedUsd

models

array · 42 records

Per-model reliability, bias, context-sensitivity, taxonomy, and captured usage summary.

Fields: attemptCostUsd, bareFirstShare, completionTokens, contextSensitivity, family, firstShare, label, modelId, okRate, okTrials, origin, positionBiasStrength, promptTokens, provider, reasoningTokens, researchClass, semanticPreferenceStrength, status, tier, totalTrials, usageRows

providerSummary

array · 22 records

Per-provider rollup of trial status counts, OK rate, model count, spend, and mean bias metrics.

Fields: attemptCostUsd, counts, meanFirstShare, meanSemanticPreferenceStrength, models, okRate, provider, totalTrials

pairBias

array · 30 records

Aggregate option preference for each word pair over bare and swapped prompts.

Fields: category, majorityOption, n, optionA, optionACount, optionAShare, optionB, optionBShare, pairId, preferenceStrength

modelPairBias

array · 1,230 records

Per-model option preference for each word pair over bare and swapped prompts.

Fields: category, firstShare, majorityOption, modelId, n, optionA, optionAShare, optionB, pairId, preferenceStrength

contextEffects

array · 55 records

Context-prompt lift versus the same model and pair under bare prompts.

Fields: absMeanLift, baselineShare, contextId, contextShare, intendedAssociation, intendedOption, meanLift, models, n, pairId, text

insights

object · 14 fields

Derived headline values and selected examples used by the public article and charts.

Fields: acceptedNonExactOkRows, exactParseOkRows, exactParseOkShare, manualOverrideRows, mostFirstOptionModel, mostNeutralPair, mostSecondOptionModel, overallFirstShare, rekaFlashTokenShare, strongestPositiveContext, strongestReverseContext, topNonOkModels, topTwoNonOkShare, totalNonOkRows

caveats

array · 9 records

Public caveats, operational limits, and AI-assistance disclosure.

Metric definitions

Public summary JSON metric definitions
Metric	Definition
`okRate`	Share of planned trial rows with final status ok for the grouping.
`firstShare`	Share of OK rows where parsed_choice matched the first displayed option.
`optionAShare`	Share of OK bare and swapped rows where parsed_choice matched optionA.
`optionBShare`	Share of OK bare and swapped rows where parsed_choice matched optionB.
`preferenceStrength`	Absolute distance from an even option split, scaled from 0 to 1.
`positionBiasStrength`	Absolute distance from a 50 percent first-option share, scaled from 0 to 1.
`semanticPreferenceStrength`	Mean word-pair preference strength for a model over pairs with enough bare and swapped data.
`contextSensitivity`	Mean absolute context lift for a model over context rows with data.
`baselineShare`	Share for the intended option under bare prompts for the same model and pair.
`contextShare`	Share for the intended option under context prompts for the same model and pair.
`meanLift`	Mean per-model change in intended-option share under context prompts versus bare baseline.
`absMeanLift`	Absolute value of meanLift, used for sorting strongest context effects.
`attemptCostUsd`	Captured OpenRouter usage cost from recorded attempt rows for the grouping.
`dashboardSpendUsd`	Displayed OpenRouter dashboard spend for the study run.

Full artifact checksums

Full SHA-256 checksums and file sizes for public study artifacts
Artifact	Type	Size	SHA-256
Uploadable public summary JSON	application/json	316.4 KB	`f1a19f6f1c4d3fa2903dca4646a8f7d9867e94165f728f062a8eace5d654fd85`
Summary JSON schema	application/schema+json	31.1 KB	`be5aaa09ff6ed7428820a926b36f4953f16ee2d75c551ee4f2d33d53df8857bf`
Artifact manifest schema	application/schema+json	9.6 KB	`e17c91fcb6c1832e6f811ddc7e7aa67fe55090c8fe9bb7c5d0f2477863be3567`
Paper PDF	application/pdf	128.0 KB	`7274fe5c7a85739fb9953ef5a8cae73dbe1e3664e307cab96d12a6324fb5dca7`
Printable HTML	text/html	29.9 KB	`300ce4f206756fa1c17332bf7087751e75f76aa2251cdffca2a01a1668ef5213`
LaTeX source	application/x-tex	23.9 KB	`9308e3097ee678a7226136186b9d0718a287f28b93077f8c5b3b1fd0acdc86fc`
BibTeX citation	application/x-bibtex	519 B	`27f904adb2311f54609083e8970e91cb34f0552886890fbf6f01ef3b1b3701d8`
Bias map SVG	image/svg+xml	6.5 KB	`3e7de8ad2fb22b916623f5a6951b7200830efbcea07b738a683a838c861fc8ca`
Bias map PNG	image/png	69.7 KB	`0e80e34f49ae182f8e83f15419e0057217b2216648b2fb4e5d552ea015f99714`

Paper PDF

Generated paper-style PDF.

application/pdf · 128.0 KB · SHA-256 7274fe5c7a85739f...

Printable HTML

Browser-printable fallback version.

text/html · 29.9 KB · SHA-256 300ce4f206756fa1...

Uploadable public summary JSON

Generated machine-readable public dataset used by the study page charts; suitable for upload into analysis tools.

application/json · 316.4 KB · SHA-256 f1a19f6f1c4d3fa2...

Summary JSON schema

Machine-readable JSON Schema for the generated public summary.

application/schema+json · 31.1 KB · SHA-256 be5aaa09ff6ed742...

Artifact manifest schema

Machine-readable JSON Schema for the public artifact manifest.

application/schema+json · 9.6 KB · SHA-256 e17c91fcb6c1832e...

Bias map SVG

Vector visual summary of the strongest word-level preferences.

image/svg+xml · 6.5 KB · SHA-256 3e7de8ad2fb22b91...

Bias map PNG

Raster preview image for social cards and crawlers that do not reliably render SVG.

image/png · 69.7 KB · SHA-256 0e80e34f49ae182f...

LaTeX source

Source for rebuilding the paper when a LaTeX toolchain is available.

application/x-tex · 23.9 KB · SHA-256 9308e3097ee678a7...

Artifact manifest

Self-indexing manifest with file sizes, SHA-256 hashes, record counts, and metric definitions.

Source code repository

Public study code and reconstruction materials.