Research

LLM forced-choice study

Arbitrary Choices Are Not Random

A forced-choice audit of 42 LLMs across 50,400 planned trials. The short version: when a workflow asks an LLM to pick between two meaningless labels, the answer is often carrying order, word-choice, and context residue.

Models

42

Planned rows

50,400

OK rows

48,316

Spend

$28.60

60.4% of OK rows selected the first displayed option. That is the warning sign, not a leaderboard.
By Daniel AlonsoAI assistance disclosed for execution and redaction

Abstract

A boring prompt that still carried order, word, and context residue.

This project studies forced binary choices where neither option is intended to be correct. I ran it because a lot of product logic quietly asks models to pick between labels and then treats the answer like neutral randomness. The goal here is narrower: measure regularities in behavior, not infer intent, consciousness, political preference, moral belief, or any other inner state. Every pair is run in normal and swapped order, with weak context variants, so word preference and first-or-second-position habits can be separated.

Key findings

The boring prompt was not boring to the models

Position was visible

60.4%

Across 48,316 successful rows, the first displayed option was selected more often than a fair split.

Order mattered

75.8%/59.3%

First-option share in bare versus swapped bare prompts. Averaging across both orders gives a pure position bias of 67.5% once option identity is balanced.

Parser was mostly clean

99.5%

48,091 OK rows were exact one-word parses; only 1 row needed manual override.

Failures concentrated

96.8%

The non-OK rows mostly came from the Baidu 21B and Reka Flash 3 caveat routes, not from a broad collapse.

sweet / bitter

90.6% sweet

Majority answer: sweet. Bare and swapped prompts combined, n=802.

smooth / rough

88.3% smooth

Majority answer: smooth. Bare and swapped prompts combined, n=804.

loud / quiet

87.1% quiet

Majority answer: quiet. Bare and swapped prompts combined, n=804.

fast / slow

86.5% fast

Majority answer: fast. Bare and swapped prompts combined, n=806.

up / down

82.5% up

Majority answer: up. Bare and swapped prompts combined, n=805.

mountain / valley

81.9% mountain

Majority answer: mountain. Bare and swapped prompts combined, n=805.

Interactive Results

Bias is easier to see when you separate the axes

Loading chart data

The public summary JSON is being fetched separately to keep the article HTML light. If the interactive charts do not load, the same artifacts remain available below.

Method

Small prompts, many controlled repeats

The dataset uses 30 ordinary word pairs such as sweet/bitter, smooth/rough, and morning/evening. Each pair was asked in a bare form and a swapped form. Context prompts add one weak sentence before the same choice.

Calls were AI-assisted and human-directed by Daniel Alonso. The model calls themselves used OpenRouter as the model API and spend platform for this study, with SQLite as the durable run log. Runners wrote raw responses, parse decisions, usage JSON, and later attempt-level audit rows. The article and public artifacts here are also AI-assisted and human-directed by Daniel Alonso.

The model taxonomy is deliberately modest: provider, family, and tier labels come from local config and route names. It is a rough descriptive grouping, not a benchmark claim about who is currently winning.

Setup snapshot

Temperature

0.7

Repetitions

10 per condition

Conditions

bare, bare swapped, context, context swapped

OpenAI subjects

GPT-5 Nano, GPT-5.4

Inferred context targets

55 of 60

Caveats

The run was real-world messy

Failures are preserved

Final counts: 48,316 ok, 126 invalid, 98 error, 724 rate-limited, and 1,136 model-removed rows.

Two models are caveats

ERNIE 4.5 21B A3B hit repeated provider rate limits and was removed. The remaining invalid rows are preserved Reka Flash 3 caveat rows from pathological high-token retries.

One manual override

One ERNIE 4.5 300B A47B row was manually overridden to evening after repeated explanatory answers made the final answer unambiguous.

Cost accounting is conservative

OpenRouter dashboard spend was about $28.60. Recorded attempt usage sums to $22.21 because early superseded retries were not all captured.

OpenRouter provider routing was not pinned

Calls used the bare model field with no provider.order or provider.only constraints, so OpenRouter freely routed a single nominal model id across its available backends within the run. Closed-source models effectively hit one backend each, but open-weight routes hit many - DeepSeek v3.2 was served by 9 providers, Llama 3.3 70B by 13, DeepSeek V4 Pro by 6. The first-option rate within one open-weight model varies up to about 8 pp across its serving providers, so per-model rankings on open-weight routes carry routing noise on top of the model itself. The served provider is recorded per attempt for slicing.

Per-cell statistical power is small

With 10 reps and four conditions per (model, pair) cell, the binomial 95% CI half-width at p=0.5 is roughly ±15 pp. Aggregate findings (overall position bias across 48 k rows, per-model first-option share across 1,200 rows, per-pair preference across ~800 rows) are robust. Single (model, pair, condition) cells are not - any narrow claim like 'model X flips on pair P under context C' should be treated as exploratory.

Word preferences are not just corpus frequency

Regressing per-pair order-balanced preference on the wordfreq Zipf log-frequency gap gives Pearson r = 0.235, R² = 0.055, p ≈ 0.21. The more frequent token wins only 19 of 30 pairs. The simplest 'models pick the more common word' explanation does not fit - clear counter-frequency examples include sharp/mellow (mellow wins 70% despite being far rarer), smooth/rough, warm/cold, and circle/square. This rules out the simplest unigram-frequency confound but does not rule out richer frequency-based stories.

Downloads

Public artifacts

File sizes and SHA-256 prefixes come from the generated manifest. It also indexes the public summary JSON so the data shape, summary schema, and manifest schema are inspectable before opening the full file.

The public machine-readable release in this web package is the generated summary JSON, with schemas and the manifest below.

Cite this study

Daniel Alonso. "Arbitrary Choices Are Not Random." Independent research, May 8, 2026. https://crow.sg/research/llm-arbitrary-choice-study.

The citation file is generated with the other public artifacts so the title, author, date, and artifact URL stay aligned.

BibTeX citation

Citation entry for reference managers and academic notes.

application/x-bibtex · 519 B · SHA-256 27f904adb2311f54...

Hashed files

9

Hashed payload

615.5 KB

Model rows

42

Pair rows

30

Context rows

55

Dictionary

13 keys / 14 metrics

Source provenance

Article date
May 8, 2026
Trial log updated
2026-05-08 10:06:11 UTC
Attempt log updated
2026-05-08 10:06:11 UTC
Source database
source SQLite run log; public release is study-summary.json and generated artifacts

Manifest notes

  • The full raw SQLite row log is not included in this repository's public site artifacts.
  • The uploadable public machine-readable release is study-summary.json, accompanied by its JSON Schema and this manifest.
  • Study code is published at https://github.com/Crow-Tech-Pte-Ltd/research/tree/main/llm-arbitrary-choice-study.
  • SHA-256 hashes cover the public files listed in this manifest at generation time.
  • The manifest is not self-hashed because writing the manifest changes its own bytes.
Summary JSON data dictionary

study

object · 18 fields

Study metadata, source timestamps, run configuration, and public headline counts.

Fields: articleDate, attemptsUpdatedAt, author, conditions, contexts, dashboardSpendUsd, inferredContextTargets, models, pairs, plannedTrials, repetitions, slug, sourceDatabase, sourceUpdatedAt, subtitle, taxonomyNote, temperature, title

statusCounts

array · 5 records

Trial rows grouped by final execution or parse status.

Fields: count, status

conditionSummary

array · 4 records

Prompt-condition totals, OK rates, and first-displayed-option shares.

Fields: condition, firstShare, okRate, okTrials, totalTrials

parseStatusCounts

array · 6 records

Parser decision totals across all trial rows.

Fields: count, parseStatus

attemptsByMaxTokens

array · 10 records

Attempt outcomes grouped by status and max_tokens setting.

Fields: count, maxTokens, status

cost

object · 7 fields

Dashboard spend, captured usage spend, and token accounting from recorded attempts.

Fields: attemptCompletionTokens, attemptPromptTokens, attemptReasoningTokens, attemptUsageObservedUsd, dashboardSpendUsd, note, trialUsageObservedUsd

models

array · 42 records

Per-model reliability, bias, context-sensitivity, taxonomy, and captured usage summary.

Fields: attemptCostUsd, bareFirstShare, completionTokens, contextSensitivity, family, firstShare, label, modelId, okRate, okTrials, origin, positionBiasStrength, promptTokens, provider, reasoningTokens, researchClass, semanticPreferenceStrength, status, tier, totalTrials, usageRows

providerSummary

array · 22 records

Per-provider rollup of trial status counts, OK rate, model count, spend, and mean bias metrics.

Fields: attemptCostUsd, counts, meanFirstShare, meanSemanticPreferenceStrength, models, okRate, provider, totalTrials

pairBias

array · 30 records

Aggregate option preference for each word pair over bare and swapped prompts.

Fields: category, majorityOption, n, optionA, optionACount, optionAShare, optionB, optionBShare, pairId, preferenceStrength

modelPairBias

array · 1,230 records

Per-model option preference for each word pair over bare and swapped prompts.

Fields: category, firstShare, majorityOption, modelId, n, optionA, optionAShare, optionB, pairId, preferenceStrength

contextEffects

array · 55 records

Context-prompt lift versus the same model and pair under bare prompts.

Fields: absMeanLift, baselineShare, contextId, contextShare, intendedAssociation, intendedOption, meanLift, models, n, pairId, text

insights

object · 14 fields

Derived headline values and selected examples used by the public article and charts.

Fields: acceptedNonExactOkRows, exactParseOkRows, exactParseOkShare, manualOverrideRows, mostFirstOptionModel, mostNeutralPair, mostSecondOptionModel, overallFirstShare, rekaFlashTokenShare, strongestPositiveContext, strongestReverseContext, topNonOkModels, topTwoNonOkShare, totalNonOkRows

caveats

array · 9 records

Public caveats, operational limits, and AI-assistance disclosure.

Metric definitions
Public summary JSON metric definitions
MetricDefinition
okRateShare of planned trial rows with final status ok for the grouping.
firstShareShare of OK rows where parsed_choice matched the first displayed option.
optionAShareShare of OK bare and swapped rows where parsed_choice matched optionA.
optionBShareShare of OK bare and swapped rows where parsed_choice matched optionB.
preferenceStrengthAbsolute distance from an even option split, scaled from 0 to 1.
positionBiasStrengthAbsolute distance from a 50 percent first-option share, scaled from 0 to 1.
semanticPreferenceStrengthMean word-pair preference strength for a model over pairs with enough bare and swapped data.
contextSensitivityMean absolute context lift for a model over context rows with data.
baselineShareShare for the intended option under bare prompts for the same model and pair.
contextShareShare for the intended option under context prompts for the same model and pair.
meanLiftMean per-model change in intended-option share under context prompts versus bare baseline.
absMeanLiftAbsolute value of meanLift, used for sorting strongest context effects.
attemptCostUsdCaptured OpenRouter usage cost from recorded attempt rows for the grouping.
dashboardSpendUsdDisplayed OpenRouter dashboard spend for the study run.
Full artifact checksums
Full SHA-256 checksums and file sizes for public study artifacts
ArtifactTypeSizeSHA-256
Uploadable public summary JSONapplication/json316.4 KBf1a19f6f1c4d3fa2903dca4646a8f7d9867e94165f728f062a8eace5d654fd85
Summary JSON schemaapplication/schema+json31.1 KBbe5aaa09ff6ed7428820a926b36f4953f16ee2d75c551ee4f2d33d53df8857bf
Artifact manifest schemaapplication/schema+json9.6 KBe17c91fcb6c1832e6f811ddc7e7aa67fe55090c8fe9bb7c5d0f2477863be3567
Paper PDFapplication/pdf128.0 KB7274fe5c7a85739fb9953ef5a8cae73dbe1e3664e307cab96d12a6324fb5dca7
Printable HTMLtext/html29.9 KB300ce4f206756fa1c17332bf7087751e75f76aa2251cdffca2a01a1668ef5213
LaTeX sourceapplication/x-tex23.9 KB9308e3097ee678a7226136186b9d0718a287f28b93077f8c5b3b1fd0acdc86fc
BibTeX citationapplication/x-bibtex519 B27f904adb2311f54609083e8970e91cb34f0552886890fbf6f01ef3b1b3701d8
Bias map SVGimage/svg+xml6.5 KB3e7de8ad2fb22b916623f5a6951b7200830efbcea07b738a683a838c861fc8ca
Bias map PNGimage/png69.7 KB0e80e34f49ae182f8e83f15419e0057217b2216648b2fb4e5d552ea015f99714
WhatsApp