LLM forced-choice study
Arbitrary Choices Are Not Random
A forced-choice audit of 42 LLMs across 50,400 planned trials. The short version: when a workflow asks an LLM to pick between two meaningless labels, the answer is often carrying order, word-choice, and context residue.
Models
42
Planned rows
50,400
OK rows
48,316
Spend
$28.60
Abstract
A boring prompt that still carried order, word, and context residue.
This project studies forced binary choices where neither option is intended to be correct. I ran it because a lot of product logic quietly asks models to pick between labels and then treats the answer like neutral randomness. The goal here is narrower: measure regularities in behavior, not infer intent, consciousness, political preference, moral belief, or any other inner state. Every pair is run in normal and swapped order, with weak context variants, so word preference and first-or-second-position habits can be separated.
Key findings
The boring prompt was not boring to the models
Position was visible
60.4%
Across 48,316 successful rows, the first displayed option was selected more often than a fair split.
Order mattered
75.8%/59.3%
First-option share in bare versus swapped bare prompts. Averaging across both orders gives a pure position bias of 67.5% once option identity is balanced.
Parser was mostly clean
99.5%
48,091 OK rows were exact one-word parses; only 1 row needed manual override.
Failures concentrated
96.8%
The non-OK rows mostly came from the Baidu 21B and Reka Flash 3 caveat routes, not from a broad collapse.
sweet / bitter
90.6% sweetMajority answer: sweet. Bare and swapped prompts combined, n=802.
smooth / rough
88.3% smoothMajority answer: smooth. Bare and swapped prompts combined, n=804.
loud / quiet
87.1% quietMajority answer: quiet. Bare and swapped prompts combined, n=804.
fast / slow
86.5% fastMajority answer: fast. Bare and swapped prompts combined, n=806.
up / down
82.5% upMajority answer: up. Bare and swapped prompts combined, n=805.
mountain / valley
81.9% mountainMajority answer: mountain. Bare and swapped prompts combined, n=805.
Interactive Results
Bias is easier to see when you separate the axes
Loading chart data
The public summary JSON is being fetched separately to keep the article HTML light. If the interactive charts do not load, the same artifacts remain available below.
Method
Small prompts, many controlled repeats
The dataset uses 30 ordinary word pairs such as sweet/bitter, smooth/rough, and morning/evening. Each pair was asked in a bare form and a swapped form. Context prompts add one weak sentence before the same choice.
Calls were AI-assisted and human-directed by Daniel Alonso. The model calls themselves used OpenRouter as the model API and spend platform for this study, with SQLite as the durable run log. Runners wrote raw responses, parse decisions, usage JSON, and later attempt-level audit rows. The article and public artifacts here are also AI-assisted and human-directed by Daniel Alonso.
The model taxonomy is deliberately modest: provider, family, and tier labels come from local config and route names. It is a rough descriptive grouping, not a benchmark claim about who is currently winning.
Setup snapshot
Temperature
0.7
Repetitions
10 per condition
Conditions
bare, bare swapped, context, context swapped
OpenAI subjects
GPT-5 Nano, GPT-5.4
Inferred context targets
55 of 60
Caveats
The run was real-world messy
Failures are preserved
Final counts: 48,316 ok, 126 invalid, 98 error, 724 rate-limited, and 1,136 model-removed rows.
Two models are caveats
ERNIE 4.5 21B A3B hit repeated provider rate limits and was removed. The remaining invalid rows are preserved Reka Flash 3 caveat rows from pathological high-token retries.
One manual override
One ERNIE 4.5 300B A47B row was manually overridden to evening after repeated explanatory answers made the final answer unambiguous.
Cost accounting is conservative
OpenRouter dashboard spend was about $28.60. Recorded attempt usage sums to $22.21 because early superseded retries were not all captured.
OpenRouter provider routing was not pinned
Calls used the bare model field with no provider.order or provider.only constraints, so OpenRouter freely routed a single nominal model id across its available backends within the run. Closed-source models effectively hit one backend each, but open-weight routes hit many - DeepSeek v3.2 was served by 9 providers, Llama 3.3 70B by 13, DeepSeek V4 Pro by 6. The first-option rate within one open-weight model varies up to about 8 pp across its serving providers, so per-model rankings on open-weight routes carry routing noise on top of the model itself. The served provider is recorded per attempt for slicing.
Per-cell statistical power is small
With 10 reps and four conditions per (model, pair) cell, the binomial 95% CI half-width at p=0.5 is roughly ±15 pp. Aggregate findings (overall position bias across 48 k rows, per-model first-option share across 1,200 rows, per-pair preference across ~800 rows) are robust. Single (model, pair, condition) cells are not - any narrow claim like 'model X flips on pair P under context C' should be treated as exploratory.
Word preferences are not just corpus frequency
Regressing per-pair order-balanced preference on the wordfreq Zipf log-frequency gap gives Pearson r = 0.235, R² = 0.055, p ≈ 0.21. The more frequent token wins only 19 of 30 pairs. The simplest 'models pick the more common word' explanation does not fit - clear counter-frequency examples include sharp/mellow (mellow wins 70% despite being far rarer), smooth/rough, warm/cold, and circle/square. This rules out the simplest unigram-frequency confound but does not rule out richer frequency-based stories.
Downloads
Public artifacts
File sizes and SHA-256 prefixes come from the generated manifest. It also indexes the public summary JSON so the data shape, summary schema, and manifest schema are inspectable before opening the full file.
The public machine-readable release in this web package is the generated summary JSON, with schemas and the manifest below.
Cite this study
Daniel Alonso. "Arbitrary Choices Are Not Random." Independent research, May 8, 2026. https://crow.sg/research/llm-arbitrary-choice-study.
The citation file is generated with the other public artifacts so the title, author, date, and artifact URL stay aligned.
BibTeX citation
Citation entry for reference managers and academic notes.
application/x-bibtex · 519 B · SHA-256 27f904adb2311f54...
Hashed files
9
Hashed payload
615.5 KB
Model rows
42
Pair rows
30
Context rows
55
Dictionary
13 keys / 14 metrics
Source provenance
- Article date
- May 8, 2026
- Trial log updated
- 2026-05-08 10:06:11 UTC
- Attempt log updated
- 2026-05-08 10:06:11 UTC
- Source database
- source SQLite run log; public release is study-summary.json and generated artifacts
Manifest notes
- The full raw SQLite row log is not included in this repository's public site artifacts.
- The uploadable public machine-readable release is study-summary.json, accompanied by its JSON Schema and this manifest.
- Study code is published at https://github.com/Crow-Tech-Pte-Ltd/research/tree/main/llm-arbitrary-choice-study.
- SHA-256 hashes cover the public files listed in this manifest at generation time.
- The manifest is not self-hashed because writing the manifest changes its own bytes.
Summary JSON data dictionary
study
object · 18 fieldsStudy metadata, source timestamps, run configuration, and public headline counts.
Fields: articleDate, attemptsUpdatedAt, author, conditions, contexts, dashboardSpendUsd, inferredContextTargets, models, pairs, plannedTrials, repetitions, slug, sourceDatabase, sourceUpdatedAt, subtitle, taxonomyNote, temperature, title
statusCounts
array · 5 recordsTrial rows grouped by final execution or parse status.
Fields: count, status
conditionSummary
array · 4 recordsPrompt-condition totals, OK rates, and first-displayed-option shares.
Fields: condition, firstShare, okRate, okTrials, totalTrials
parseStatusCounts
array · 6 recordsParser decision totals across all trial rows.
Fields: count, parseStatus
attemptsByMaxTokens
array · 10 recordsAttempt outcomes grouped by status and max_tokens setting.
Fields: count, maxTokens, status
cost
object · 7 fieldsDashboard spend, captured usage spend, and token accounting from recorded attempts.
Fields: attemptCompletionTokens, attemptPromptTokens, attemptReasoningTokens, attemptUsageObservedUsd, dashboardSpendUsd, note, trialUsageObservedUsd
models
array · 42 recordsPer-model reliability, bias, context-sensitivity, taxonomy, and captured usage summary.
Fields: attemptCostUsd, bareFirstShare, completionTokens, contextSensitivity, family, firstShare, label, modelId, okRate, okTrials, origin, positionBiasStrength, promptTokens, provider, reasoningTokens, researchClass, semanticPreferenceStrength, status, tier, totalTrials, usageRows
providerSummary
array · 22 recordsPer-provider rollup of trial status counts, OK rate, model count, spend, and mean bias metrics.
Fields: attemptCostUsd, counts, meanFirstShare, meanSemanticPreferenceStrength, models, okRate, provider, totalTrials
pairBias
array · 30 recordsAggregate option preference for each word pair over bare and swapped prompts.
Fields: category, majorityOption, n, optionA, optionACount, optionAShare, optionB, optionBShare, pairId, preferenceStrength
modelPairBias
array · 1,230 recordsPer-model option preference for each word pair over bare and swapped prompts.
Fields: category, firstShare, majorityOption, modelId, n, optionA, optionAShare, optionB, pairId, preferenceStrength
contextEffects
array · 55 recordsContext-prompt lift versus the same model and pair under bare prompts.
Fields: absMeanLift, baselineShare, contextId, contextShare, intendedAssociation, intendedOption, meanLift, models, n, pairId, text
insights
object · 14 fieldsDerived headline values and selected examples used by the public article and charts.
Fields: acceptedNonExactOkRows, exactParseOkRows, exactParseOkShare, manualOverrideRows, mostFirstOptionModel, mostNeutralPair, mostSecondOptionModel, overallFirstShare, rekaFlashTokenShare, strongestPositiveContext, strongestReverseContext, topNonOkModels, topTwoNonOkShare, totalNonOkRows
caveats
array · 9 recordsPublic caveats, operational limits, and AI-assistance disclosure.
Metric definitions
| Metric | Definition |
|---|---|
okRate | Share of planned trial rows with final status ok for the grouping. |
firstShare | Share of OK rows where parsed_choice matched the first displayed option. |
optionAShare | Share of OK bare and swapped rows where parsed_choice matched optionA. |
optionBShare | Share of OK bare and swapped rows where parsed_choice matched optionB. |
preferenceStrength | Absolute distance from an even option split, scaled from 0 to 1. |
positionBiasStrength | Absolute distance from a 50 percent first-option share, scaled from 0 to 1. |
semanticPreferenceStrength | Mean word-pair preference strength for a model over pairs with enough bare and swapped data. |
contextSensitivity | Mean absolute context lift for a model over context rows with data. |
baselineShare | Share for the intended option under bare prompts for the same model and pair. |
contextShare | Share for the intended option under context prompts for the same model and pair. |
meanLift | Mean per-model change in intended-option share under context prompts versus bare baseline. |
absMeanLift | Absolute value of meanLift, used for sorting strongest context effects. |
attemptCostUsd | Captured OpenRouter usage cost from recorded attempt rows for the grouping. |
dashboardSpendUsd | Displayed OpenRouter dashboard spend for the study run. |
Full artifact checksums
| Artifact | Type | Size | SHA-256 |
|---|---|---|---|
| Uploadable public summary JSON | application/json | 316.4 KB | f1a19f6f1c4d3fa2903dca4646a8f7d9867e94165f728f062a8eace5d654fd85 |
| Summary JSON schema | application/schema+json | 31.1 KB | be5aaa09ff6ed7428820a926b36f4953f16ee2d75c551ee4f2d33d53df8857bf |
| Artifact manifest schema | application/schema+json | 9.6 KB | e17c91fcb6c1832e6f811ddc7e7aa67fe55090c8fe9bb7c5d0f2477863be3567 |
| Paper PDF | application/pdf | 128.0 KB | 7274fe5c7a85739fb9953ef5a8cae73dbe1e3664e307cab96d12a6324fb5dca7 |
| Printable HTML | text/html | 29.9 KB | 300ce4f206756fa1c17332bf7087751e75f76aa2251cdffca2a01a1668ef5213 |
| LaTeX source | application/x-tex | 23.9 KB | 9308e3097ee678a7226136186b9d0718a287f28b93077f8c5b3b1fd0acdc86fc |
| BibTeX citation | application/x-bibtex | 519 B | 27f904adb2311f54609083e8970e91cb34f0552886890fbf6f01ef3b1b3701d8 |
| Bias map SVG | image/svg+xml | 6.5 KB | 3e7de8ad2fb22b916623f5a6951b7200830efbcea07b738a683a838c861fc8ca |
| Bias map PNG | image/png | 69.7 KB | 0e80e34f49ae182f8e83f15419e0057217b2216648b2fb4e5d552ea015f99714 |
Paper PDF
Generated paper-style PDF.
application/pdf · 128.0 KB · SHA-256 7274fe5c7a85739f...
Printable HTML
Browser-printable fallback version.
text/html · 29.9 KB · SHA-256 300ce4f206756fa1...
Uploadable public summary JSON
Generated machine-readable public dataset used by the study page charts; suitable for upload into analysis tools.
application/json · 316.4 KB · SHA-256 f1a19f6f1c4d3fa2...
Summary JSON schema
Machine-readable JSON Schema for the generated public summary.
application/schema+json · 31.1 KB · SHA-256 be5aaa09ff6ed742...
Artifact manifest schema
Machine-readable JSON Schema for the public artifact manifest.
application/schema+json · 9.6 KB · SHA-256 e17c91fcb6c1832e...
Bias map SVG
Vector visual summary of the strongest word-level preferences.
image/svg+xml · 6.5 KB · SHA-256 3e7de8ad2fb22b91...
Bias map PNG
Raster preview image for social cards and crawlers that do not reliably render SVG.
image/png · 69.7 KB · SHA-256 0e80e34f49ae182f...
LaTeX source
Source for rebuilding the paper when a LaTeX toolchain is available.
application/x-tex · 23.9 KB · SHA-256 9308e3097ee678a7...
Artifact manifest
Self-indexing manifest with file sizes, SHA-256 hashes, record counts, and metric definitions.
Source code repository
Public study code and reconstruction materials.