Independent research

Arbitrary Choices Are Not Random

Daniel Alonso, AI assistance disclosed. May 8, 2026.

A forced-choice audit of 42 LLMs across 50,400 planned trials and 48,316 successful one-word responses. The study measures behavioral regularities in arbitrary binary choices, not belief, intent, or consciousness.

OK rows: 48,316
First-option share: 60.4%
Exact parses: 99.5%
Spend: USD 28.60
Figure (bar chart): Top word-level preferences after combining bare and swapped prompts: sweet 90.6% over bitter; smooth 88.3% over rough; quiet 87.1% over loud; fast 86.5% over slow.

Setup

The run used 30 word pairs, 60 context snippets, normal and swapped order controls, ten repetitions, and temperature 0.7. Calls were routed through OpenRouter, which also served as the spend dashboard, and results were stored in SQLite together with raw responses, parser decisions, usage rows, and attempt history.
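
The planned trial count follows from these design figures. A minimal sketch of the arithmetic, assuming a full grid in which every model sees every pair in each of the four prompt conditions ten times. The constant and function names are illustrative, and how the 60 context snippets are spread across repetitions is not modeled:

```python
# Design constants taken from the text; the names are illustrative.
N_MODELS = 42
N_PAIRS = 30
N_CONDITIONS = 4      # bare, bare swapped, context, context swapped
N_REPETITIONS = 10


def planned_trials(models: int = N_MODELS) -> int:
    """Planned trials for a given number of models under a full grid."""
    return models * N_PAIRS * N_CONDITIONS * N_REPETITIONS


per_model = planned_trials(models=1)
total = planned_trials()
per_condition = total // N_CONDITIONS
print(per_model, total, per_condition)  # 1200 50400 12600
```

Under these assumptions the totals match the 1,200 planned rows per model and the 12,600 planned rows per condition reported in the tables.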

The practical point is narrow: if product logic asks a model to choose between arbitrary labels, the answer should not be treated as a neutral random draw unless the workflow measures and controls for order, wording, and context.

Prompt Design

Bare prompts used the template "Choose one: first or second. Reply with exactly one word.", where first and second stand for the two words of a pair in display order. Context prompts prepended one weak sentence to the same choice. Normal and swapped orders were both collected.

Prompt condition design.

| Condition | Order | Context | Shape | OK / planned |
| --- | --- | --- | --- | --- |
| bare | original order | no | bare template | 12,081 / 12,600 |
| bare swapped | swapped order | no | bare template | 12,072 / 12,600 |
| context | original order | yes | context sentence + bare template | 12,074 / 12,600 |
| context swapped | swapped order | yes | context sentence + bare template | 12,089 / 12,600 |
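
The four conditions can be sketched as one prompt builder. This is an illustration, not the study's code: the `{first}` and `{second}` placeholders, the `build_prompt` name, and joining the context sentence to the choice with a single space are all assumptions.

```python
from typing import Optional, Tuple

# Template wording from the text; the placeholder syntax is an assumption.
BARE_TEMPLATE = "Choose one: {first} or {second}. Reply with exactly one word."


def build_prompt(pair: Tuple[str, str], swapped: bool = False,
                 context: Optional[str] = None) -> str:
    """Render one trial prompt for a word pair.

    swapped=True reverses the display order; context, when given,
    is prepended as a single sentence before the choice.
    """
    first, second = (pair[1], pair[0]) if swapped else pair
    choice = BARE_TEMPLATE.format(first=first, second=second)
    return f"{context} {choice}" if context else choice


pair = ("fast", "slow")
print(build_prompt(pair))
print(build_prompt(pair, swapped=True))
print(build_prompt(pair, context="The runners waited for the signal at the start line."))
```

Keeping the order swap inside the builder means the same pair definition feeds both the normal and the swapped control.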

Model Coverage

The pool covered 42 models across 22 providers and 21 model families. Provider, family, and tier labels are descriptive route metadata, not rankings.

Model coverage by tier.

| Tier | Description | Models | OK rate | Mean first | Mean semantic |
| --- | --- | --- | --- | --- | --- |
| flagship | flagship / frontier-class | 10 | 99.6% | 60.4% | 56.7% |
| mid | current mid-tier | 10 | 100.0% | 57.8% | 49.9% |
| open | open-weight or open-route | 5 | 100.0% | 62.2% | 41.1% |
| small | small or fast tier | 16 | 89.4% | 61.0% | 49.4% |
| unknown | uncategorized | 1 | 99.9% | 63.3% | 49.0% |

Model coverage by provider.

| Provider | Models | OK / planned | OK rate | Mean first | Status mix |
| --- | --- | --- | --- | --- | --- |
| Alibaba/Qwen | 3 | 3,580 / 3,600 | 99.4% | 57.9% | error 18; ok 3,580; rate limited 2 |
| Amazon | 1 | 1,200 / 1,200 | 100.0% | 57.1% | ok 1,200 |
| Anthropic | 3 | 3,600 / 3,600 | 100.0% | 61.8% | ok 3,600 |
| Baidu | 2 | 1,200 / 2,400 | 50.0% | 52.1% | error 34; model removed 444; ok 1,200; rate limited 722 |
| DeepSeek | 3 | 3,599 / 3,600 | 100.0% | 61.6% | error 1; ok 3,599 |
| Google | 3 | 3,599 / 3,600 | 100.0% | 53.6% | error 1; ok 3,599 |
| IBM Granite | 1 | 1,200 / 1,200 | 100.0% | 65.8% | ok 1,200 |
| Inception Labs | 1 | 1,200 / 1,200 | 100.0% | 67.5% | ok 1,200 |
| Liquid AI | 1 | 1,198 / 1,200 | 99.8% | 82.4% | error 2; ok 1,198 |
| Meta | 3 | 3,598 / 3,600 | 99.9% | 65.2% | error 2; ok 3,598 |
| Microsoft | 1 | 1,198 / 1,200 | 99.8% | 48.6% | error 2; ok 1,198 |
| MiniMax | 3 | 3,598 / 3,600 | 99.9% | 66.9% | error 2; ok 3,598 |
| Mistral AI | 2 | 2,400 / 2,400 | 100.0% | 64.5% | ok 2,400 |
| NVIDIA | 2 | 2,400 / 2,400 | 100.0% | 65.2% | ok 2,400 |
| Nous Research | 2 | 2,400 / 2,400 | 100.0% | 50.5% | ok 2,400 |
| OpenAI | 2 | 2,400 / 2,400 | 100.0% | 48.0% | ok 2,400 |
| Perplexity | 1 | 1,199 / 1,200 | 99.9% | 49.1% | error 1; ok 1,199 |
| Reka AI | 1 | 382 / 1,200 | 31.8% | 51.0% | invalid 126; model removed 692; ok 382 |
| Tencent | 1 | 1,196 / 1,200 | 99.7% | 74.6% | error 4; ok 1,196 |
| Z.ai | 3 | 3,599 / 3,600 | 100.0% | 56.8% | error 1; ok 3,599 |
| tencent | 1 | 1,199 / 1,200 | 99.9% | 63.3% | error 1; ok 1,199 |
| xAI | 2 | 2,371 / 2,400 | 98.8% | 65.8% | error 29; ok 2,371 |

Model coverage by family.

| Family | Models | Providers | Tiers | Mean semantic |
| --- | --- | --- | --- | --- |
| Claude | 3 | Anthropic | flagship, mid, small | 54.1% |
| DeepSeek | 3 | DeepSeek | flagship, mid, small | 48.0% |
| GLM | 3 | Z.ai | mid, small | 65.2% |
| Gemini | 3 | Google | flagship, small | 76.9% |
| Llama | 3 | Meta | open | 36.6% |
| MiniMax | 3 | MiniMax | flagship | 38.6% |
| Qwen | 3 | Alibaba/Qwen | flagship, mid, small | 67.1% |
| ERNIE | 2 | Baidu | small | 64.3% |
| GPT / OpenAI | 2 | OpenAI | mid, small | 45.5% |
| Grok | 2 | xAI | flagship, mid | 39.1% |
| Hermes | 2 | Nous Research | flagship, open | 61.0% |
| Hunyuan | 2 | Tencent, tencent | small, unknown | 36.0% |
| Mistral | 2 | Mistral AI | mid, small | 38.8% |
| Nemotron | 2 | NVIDIA | open, small | 47.0% |
| Granite | 1 | IBM Granite | small | 48.3% |
| LFM | 1 | Liquid AI | small | 5.5% |
| Mercury | 1 | Inception Labs | mid | 50.3% |
| Nova | 1 | Amazon | flagship | 47.7% |
| Phi | 1 | Microsoft | small | 60.2% |
| Reka | 1 | Reka AI | small | 39.4% |
| Sonar | 1 | Perplexity | mid | 50.3% |

Word-Pair Design

The word set used short, ordinary labels rather than factual questions. The target assigned to each context snippet is a hypothesis about which word the snippet favors, which is why backfires are reported separately.

Word-pair categories.

| Category | Pairs | Pair labels |
| --- | --- | --- |
| abstract | 2 | narrow/wide; simple/complex |
| color | 3 | blue/red; green/purple; yellow/gray |
| density | 1 | light/heavy |
| direction | 2 | left/right; up/down |
| material | 2 | glass/stone; wood/metal |
| motion | 1 | fast/slow |
| nature | 1 | river/forest |
| object | 3 | candle/lamp; cup/plate; key/coin |
| shape | 2 | circle/square; triangle/oval |
| size | 1 | small/large |
| sound | 2 | loud/quiet; sharp/mellow |
| taste | 2 | salty/sour; sweet/bitter |
| temperature | 1 | warm/cold |
| terrain | 1 | mountain/valley |
| texture | 2 | smooth/rough; soft/hard |
| time | 2 | early/late; morning/evening |
| weather | 2 | humid/dry; windy/still |

Condition Summary

Completion and first-position share by prompt condition.

| Condition | OK / planned | OK rate | First-option share |
| --- | --- | --- | --- |
| bare | 12,081 / 12,600 | 95.9% | 75.8% |
| bare swapped | 12,072 / 12,600 | 95.8% | 59.3% |
| context | 12,074 / 12,600 | 95.8% | 51.1% |
| context swapped | 12,089 / 12,600 | 95.9% | 55.5% |
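
First-option share can be computed directly from trial rows. The sketch below assumes hypothetical row dictionaries with `status`, `first`, and `choice` keys, which are not the study's schema. It also averages the share over both display orders: if picks depended only on the words, the shares in original and swapped order would sum to 100%, so an order-balanced average above 50% points to a position effect.

```python
def first_option_share(rows):
    """Share of OK rows whose choice was the first-displayed option."""
    ok = [r for r in rows if r["status"] == "ok"]
    if not ok:
        return None
    return sum(r["choice"] == r["first"] for r in ok) / len(ok)


def balanced_first_share(share_original, share_swapped):
    """Average first-option share over both display orders."""
    return (share_original + share_swapped) / 2


# Toy rows, invented for illustration.
rows = [
    {"status": "ok", "first": "up", "choice": "up"},
    {"status": "ok", "first": "up", "choice": "down"},
    {"status": "ok", "first": "down", "choice": "down"},
    {"status": "error", "first": "up", "choice": None},
]
print(first_option_share(rows))  # 2 of 3 OK rows picked the first option

# Bare-prompt shares reported in the table: about 67.55% when order-balanced.
print(round(balanced_first_share(75.8, 59.3), 2))
```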

Strongest Word Preferences

Strongest aggregate word preferences after combining bare and swapped prompts.

| Pair | Majority option | Majority share |
| --- | --- | --- |
| sweet / bitter | sweet | 90.6% |
| smooth / rough | smooth | 88.3% |
| loud / quiet | quiet | 87.1% |
| fast / slow | fast | 86.5% |
| up / down | up | 82.5% |
| mountain / valley | mountain | 81.9% |
| early / late | early | 81.5% |
| warm / cold | warm | 81.2% |
| simple / complex | simple | 78.5% |
| blue / red | blue | 74.4% |
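
One way to read "combining bare and swapped prompts" is to pool the chosen words from both display orders, so that each word appears first equally often. A minimal sketch under that assumption follows. The `word_preference` name and the toy counts are invented, and the study's exact aggregation may differ.

```python
from collections import Counter


def word_preference(choices):
    """Return (majority_word, majority_share) over pooled choices for one pair."""
    if not choices:
        raise ValueError("no choices to aggregate")
    counts = Counter(choices)
    word, n = counts.most_common(1)[0]
    return word, n / len(choices)


# Toy data: ten bare trials and ten swapped trials for one pair.
bare = ["sweet"] * 9 + ["bitter"] * 1
bare_swapped = ["sweet"] * 8 + ["bitter"] * 2
word, share = word_preference(bare + bare_swapped)
print(word, share)  # sweet 0.85
```

Pooling both orders keeps a pure position habit from masquerading as a word preference, provided the two orders contribute similar numbers of OK rows.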

Strongest Position Skews

Models with the strongest first-displayed-option skew.

| Model | Provider | First share | Semantic strength |
| --- | --- | --- | --- |
| LFM2-24B-A2B | Liquid AI | 82.4% | 5.5% |
| Hunyuan A13B Instruct | Tencent | 74.6% | 23.0% |
| Llama 3.3 70B Instruct | Meta | 71.8% | 15.3% |
| MiniMax M2.7 | MiniMax | 69.4% | 35.7% |
| Mercury 2 | Inception Labs | 67.5% | 50.3% |
| Mistral Medium 3.5 | Mistral AI | 67.2% | 35.7% |
| Nemotron 3 Nano 30B A3B | NVIDIA | 67.2% | 41.3% |
| Claude Sonnet 4.6 | Anthropic | 67.1% | 33.0% |
| Grok 4.3 | xAI | 66.5% | 40.5% |
| DeepSeek V4 Flash | DeepSeek | 66.4% | 40.7% |

Largest Context Lifts

Largest mean context lifts toward the inferred associated option.

| Context | Target | Mean lift |
| --- | --- | --- |
| ctx_black_coffee | bitter | 89.2 pp |
| ctx_old_rope | rough | 87.8 pp |
| ctx_concert_line | loud | 86.9 pp |
| ctx_old_turtle | slow | 85.6 pp |
| ctx_stairs_basement | down | 77.8 pp |
| ctx_suitcase | heavy | 74.1 pp |
| ctx_sunset_walk | red | 73.3 pp |
| ctx_corridor | narrow | 72.7 pp |
| ctx_dinner_lights | evening | 72.3 pp |
| ctx_machine_diagram | complex | 68.0 pp |
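
A lift in percentage points can be sketched as the difference between the target word's share with the context sentence and its share in bare prompts. That definition is an assumption made for illustration. The function names and toy data are invented, and the study's metric definitions in study-summary.json take precedence.

```python
def target_share(choices, target):
    """Fraction of choices equal to the target word."""
    if not choices:
        raise ValueError("no choices to aggregate")
    return sum(c == target for c in choices) / len(choices)


def context_lift_pp(bare_choices, context_choices, target):
    """Context share minus bare share for the target word, in percentage points."""
    return 100.0 * (target_share(context_choices, target)
                    - target_share(bare_choices, target))


# Toy data: the target is rarely picked bare, often picked with context.
bare = ["sweet"] * 9 + ["bitter"] * 1
with_context = ["sweet"] * 1 + ["bitter"] * 9
lift = context_lift_pp(bare, with_context, target="bitter")
print(round(lift, 1))  # 80.0
```

A negative value under this definition corresponds to a backfire: the cue moved choices away from its inferred target.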

Context Backfires

Not every human-written cue behaved as intended. These rows moved away from the inferred target, which is useful evidence that weak contextual labels are not self-validating.

Context cues that moved away from their inferred target.

| Context | Target | Mean lift | Text |
| --- | --- | --- | --- |
| ctx_bakery_case | sweet | -51.7 pp | The bakery case was almost empty by the time the queue moved. |
| ctx_alarm_clock | morning | -25.4 pp | The alarm was set before the bag was packed. |
| ctx_race_start | fast | -16.4 pp | The runners waited for the signal at the start line. |
| ctx_morning_window | yellow | -9.0 pp | The room was quiet when light first came through the window. |

Data Quality and Accounting

48,091 of 48,316 OK rows were exact parses. Final status rows, parser rows, usage rows, and retry caps are retained so operational problems remain visible.
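
The parser status labels reported in this section (exact, single token in text, repeated single option, invalid, none) suggest a tiered classifier. The sketch below is one plausible set of rules, not the study's parser. Stripping surrounding punctuation before the exact match and treating a response that names both options or neither as invalid are assumptions.

```python
import re


def parse_choice(response, first, second):
    """Return (status, choice) under illustrative rules, not the study's parser."""
    text = (response or "").strip().lower()
    if not text:
        return "none", None
    options = (first.lower(), second.lower())
    # Assumption: surrounding punctuation does not break an exact match.
    bare = text.strip(".,!?;:\"'")
    if bare in options:
        return "exact", bare
    tokens = re.findall(r"[a-z]+", text)
    counts = {opt: tokens.count(opt) for opt in options}
    present = [opt for opt, n in counts.items() if n > 0]
    if len(present) != 1:
        return "invalid", None  # both options named, or neither
    choice = present[0]
    if counts[choice] > 1:
        return "repeated single option", choice
    return "single token in text", choice


print(parse_choice("Sweet.", "sweet", "bitter"))
print(parse_choice("I would pick bitter here", "sweet", "bitter"))
print(parse_choice("sweet or bitter", "sweet", "bitter"))
```

Recording the status alongside the choice, as the study does with its parser rows, keeps lenient recoveries separable from exact one-word answers.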

Final trial and parser status.

| Trial status | Rows | Parser status | Rows |
| --- | --- | --- | --- |
| error | 98 | exact | 48,091 |
| invalid | 126 | none | 1,266 |
| model removed | 1,136 | invalid | 818 |
| ok | 48,316 | single token in text | 118 |
| rate limited | 724 | repeated single option | 106 |
| | | manual override | 1 |

Cost and token accounting.

| Metric | Value | Note |
| --- | --- | --- |
| OpenRouter dashboard spend | USD 28.60 | External dashboard total used as the public spend figure. |
| Trial usage rows | USD 20.78 | Usage JSON attached to final trial rows. |
| Attempt usage rows | USD 22.21 | Captured provider attempts, including retries after attempt logging was added. |
| Prompt tokens | 2,270,506 | Captured attempt-level prompt tokens. |
| Completion tokens | 10,908,206 | Captured visible plus provider-reported completion tokens. |
| Reasoning tokens | 10,471,809 | Provider-reported hidden reasoning tokens when present in usage details. |

Operational Concentration

96.8% of non-OK rows came from the two preserved caveat models. Reka Flash 3 alone accounts for 36.2% of captured completion tokens despite only 382 OK rows.

Where non-OK rows concentrated operationally.

| Model | Provider | Non-OK rows | Share of all non-OK |
| --- | --- | --- | --- |
| ERNIE 4.5 21B A3B | Baidu | 1,200 | 57.6% |
| Reka Flash 3 | Reka AI | 818 | 39.2% |
| Grok 4.3 | xAI | 29 | 1.4% |
| Qwen3.6 Max Preview | Alibaba/Qwen | 20 | 1.0% |
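
The concentration figure can be reproduced from the reported status counts. This is a small consistency check on the published numbers rather than study code, and the variable names are illustrative.

```python
# Final trial status counts as reported in this document.
trial_status = {
    "error": 98,
    "invalid": 126,
    "model removed": 1_136,
    "ok": 48_316,
    "rate limited": 724,
}

planned = sum(trial_status.values())
non_ok = planned - trial_status["ok"]

# Non-OK rows for the two preserved caveat models.
caveat_non_ok = 1_200 + 818
caveat_share = 100 * caveat_non_ok / non_ok

print(planned, non_ok, round(caveat_share, 1))  # 50400 2084 96.8
```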

Attempt Retry Envelope

Recorded attempts are grouped by their captured max_tokens setting so final trial status is not the only operational signal. The high-token rows show why retry behavior is treated as a caveat instead of being folded into model preference results.

Recorded attempts grouped by max_tokens retry cap.

| Token cap | Recorded attempts | OK attempts | Non-OK mix |
| --- | --- | --- | --- |
| 0 | 1 | 1 | none |
| 512 | 49,956 | 46,964 | invalid 2,173; error 95; rate limited 724 |
| 3,000 | 1,262 | 465 | invalid 797 |
| 20,000 | 1,043 | 886 | invalid 154; error 3 |
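
Since attempts were stored in SQLite, a grouping like the one above can be sketched as a single query. The `attempts` table, its column names, and the toy rows are hypothetical. The study's actual schema is not shown in this document.

```python
import sqlite3


def attempts_by_cap(conn):
    """Count recorded and OK attempts for each max_tokens cap."""
    query = """
        SELECT max_tokens,
               COUNT(*) AS recorded,
               SUM(CASE WHEN status = 'ok' THEN 1 ELSE 0 END) AS ok_attempts
        FROM attempts
        GROUP BY max_tokens
        ORDER BY max_tokens
    """
    return conn.execute(query).fetchall()


# In-memory database with invented rows; the schema is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attempts (trial_id INTEGER, max_tokens INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO attempts VALUES (?, ?, ?)",
    [(1, 512, "ok"), (2, 512, "invalid"), (2, 3000, "invalid"), (2, 20000, "ok")],
)
print(attempts_by_cap(conn))  # [(512, 2, 1), (3000, 1, 0), (20000, 1, 1)]
```

Grouping at the attempt level rather than the trial level keeps retries visible even when the final trial row ends up OK.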

Caveats

ERNIE 4.5 21B A3B and Reka Flash 3 were preserved as operational caveats. One ERNIE 4.5 300B A47B row was manually overridden with an audit trail. Dashboard spend was about USD 28.60; recorded attempt usage sums to only USD 22.21 because early superseded attempts were not all captured.

Limitations

Data Availability

The public machine-readable release in this web package is study-summary.json, with a JSON Schema and artifact manifest that expose record counts, field names, metric definitions, byte sizes, and SHA-256 hashes. The study code is published in the linked GitHub repository. The public summary includes model rows, provider rollups, pair bias rows, per-model pair bias rows, context effects, status counts, parse counts, attempt-token groups, cost fields, caveats, a data dictionary, and metric definitions.
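
Because the manifest exposes SHA-256 hashes, a downloaded artifact can be checked locally. The sketch below uses only the Python standard library. How the expected hash is read out of the manifest is left to the caller, since the manifest's layout is not described here, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_manifest(path, expected_sha256):
    """Compare a file's digest with the hash listed for it in the manifest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A call such as `matches_manifest("study-summary.json", expected)` would then return True only when the local copy is byte-identical to the published file, with `expected` copied from the manifest entry.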

References and Artifacts