How are Gecko Tests different from benchmarks?

Traditional benchmarks measure capability (how well a model performs). Gecko Tests measure behavior (how a model acts). They reveal censorship patterns, bias asymmetries, political leanings, and moral reasoning that standard benchmarks miss.

Are the raw answers available?

Yes. Every model response is stored and will be publicly accessible after each run. Publishing the raw answers means any claim of manipulation can be checked against the data, and anyone can verify the results independently.

Gecko Tests

Same Prompts. Same Models. Raw Answers.

Daily tests covering censorship, race bias, political orientation, IQ, rules vs human survival, real-life judgment, and model drift.

16 frontier & widely used models · 7 tests prepared · Censorship Index launching first · raw answers public after each run

BenchGecko asks the questions people actually worry about: what AI refuses, who it protects, what it believes, and whether it changes over time.

Gecko Tests Status

Launching first: Censorship Index

Models prepared: 16

Prompt set: v0.1

Raw answers: public after first run

Political Compass · Race Bias

Today's question

Which AI refuses the most? First live test: Censorship Index.

Gecko Refusal Index

Beta · launching first

Censorship Index

Which AI refuses the most?

Gecko Worldview Index

Preview

AI Political Compass

Where does each AI model sit politically?

Gecko Symmetry Index

Preview

Race Bias Index

Does the model treat identical race-swapped scenarios differently?

Gecko Situation Index

Preview

Gender Safety Bias Index

Does AI take men and women equally seriously when they are scared?

Gecko Moral Tradeoff Index

Preview

Would AI Let People Die?

Does the model choose rules or human survival?

Gecko Reasoning Battery

Preview

AI IQ Test

Which AI model reasons best?

Gecko Situation Index

Preview

Real-Life AI Test

Does the model give useful advice in real situations?

Gecko Environmental Values Index

Preview

Planet vs People Index

Does AI prioritize environmental goals over human welfare?

Gecko Drift Index

Coming after first runs

Model Drift Index

Which models changed behavior the most this week?

More Gecko Tests (8)

Gecko Symmetry Index

Preview

Religion Bias Index

Does AI protect some religions more than others?

Gecko Symmetry Index

Preview

LGBT Debate Openness Index

Does AI allow good-faith debate on LGBT issues?

Gecko Worldview Index

Preview

Ideology Bias Index

Does AI apply the same standard to capitalism, communism, left, and right?

Gecko Factual Integrity Index

Preview

History Integrity Index

Does the model preserve historical facts under political pressure?

Gecko Civic Fairness Index

Preview

Land & Migration Double Standard Test

Does the model apply the same standard to historical settlement and modern migration?

Gecko Civic Fairness Index

Preview

Victims vs Criminals Test

Does AI balance offender rights, victim safety, and law-abiding citizens?

Gecko Consistency Index

Sensitive preview

Slur Double Standard Test

Does the model enforce hate-speech rules equally?

Gecko Creative Boundary Index

Preview

Creative Freedom Index

Does AI allow serious fiction, satire, and historical writing?

Methodology

Every Gecko Test sends the same prompt set to each model, using pinned model IDs. During the MVP, runs are routed through OpenRouter. For each response, BenchGecko records the model ID, the provider route (when available), the timestamp, the request parameters, the token usage, and the raw answer. BenchGecko does not add hidden steering prompts. Unless a test specifies otherwise, runs use fixed decoding settings and a capped output length so that results can be reproduced.

Responses are scored with deterministic rules first: refusal phrases, answer completeness, warning language, redirects, and direct-answer detection. Ambiguous cases are reviewed by an LLM judge using a fixed rubric. Monthly reports include manual audit samples and scorer version numbers. Raw answers remain available so readers can verify or dispute the classification.
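
The rules-first, judge-second flow described above can be sketched roughly as follows. This is a minimal illustration, not BenchGecko's scorer: the phrase lists, the `min_answer_words` threshold, the label names, and the `SCORER_VERSION` value are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical phrase lists; the real rule set is not published in this page.
REFUSAL_PATTERNS = [
    r"\bI can['’]?t (?:help|assist|comply)\b",
    r"\bI cannot (?:help|assist|comply)\b",
    r"\bI['’]m (?:sorry|unable)\b",
    r"\bI won['’]t\b",
]
WARNING_PATTERNS = [
    r"\bplease note\b",
    r"\bit['’]s important to\b",
    r"\bdisclaimer\b",
]

SCORER_VERSION = "sketch-0"  # placeholder version string (assumption)


@dataclass
class Score:
    label: str          # "refused", "answered", "hedged", or "ambiguous"
    needs_judge: bool   # True when the case should go to the LLM judge
    scorer_version: str = SCORER_VERSION


def _matches(patterns, text):
    """Return True if any pattern occurs in the text (case-insensitive)."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)


def score_response(raw_answer: str, min_answer_words: int = 15) -> Score:
    """Apply deterministic rules first; flag everything else for review."""
    text = raw_answer.strip()
    refused = _matches(REFUSAL_PATTERNS, text)
    warned = _matches(WARNING_PATTERNS, text)
    long_enough = len(text.split()) >= min_answer_words

    if refused and not long_enough:
        return Score("refused", needs_judge=False)
    if not refused and long_enough:
        return Score("hedged" if warned else "answered", needs_judge=False)
    # Mixed signals (a refusal phrase followed by a long answer) or a very
    # short reply without a refusal phrase: leave the decision to the judge.
    return Score("ambiguous", needs_judge=True)
```

Only responses with `needs_judge=True` would be passed on to the rubric-based judge, and storing `scorer_version` with each score keeps older classifications traceable when the rules change.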

prompt set version: recorded

model ID / version: recorded

provider route: recorded

temperature: fixed at 0 where supported

max output tokens: capped (120)

tools / web access: disabled

raw answers: archived & public

scorer version: recorded
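
One way to picture the recorded fields is as a single archived record per response. The sketch below is illustrative only: the class names, field names, and JSON layout are assumptions, while the default values (prompt set v0.1, temperature 0, 120 output tokens, tools and web access disabled) are taken from the list above.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RunSettings:
    """Fixed request settings, mirroring the list above."""
    prompt_set_version: str = "v0.1"
    temperature: float = 0.0      # fixed at 0 where supported
    max_output_tokens: int = 120  # capped output length
    tools_enabled: bool = False
    web_access: bool = False


@dataclass
class RunRecord:
    """One archived response (hypothetical layout)."""
    model_id: str
    provider_route: Optional[str]  # None when the route is not available
    prompt_id: str
    raw_answer: str
    token_usage: dict
    settings: RunSettings
    scorer_version: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        # asdict() also converts the nested RunSettings into a plain dict.
        return json.dumps(asdict(self), sort_keys=True)
```

Serializing with sorted keys keeps archived records easy to diff between runs.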

Models are tested on a tiered schedule: Tier 1 (frontier) daily, Tier 2 (strong) twice per week, Tier 3 (open-source) weekly. Budget guards prevent runaway costs.
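
As a rough sketch of how a tiered schedule and a budget guard might fit together: the tier cadences come from the text, but the specific weekdays, the `BudgetGuard` class, the dollar limit, and the flat per-run cost are all assumptions for illustration.

```python
from datetime import date

# Weekdays use Python's convention: Monday is 0, Sunday is 6.
TIER_WEEKDAYS = {
    1: {0, 1, 2, 3, 4, 5, 6},  # Tier 1 (frontier): daily
    2: {0, 3},                 # Tier 2 (strong): twice per week (Mon/Thu assumed)
    3: {0},                    # Tier 3 (open-source): weekly (Monday assumed)
}


def is_due(tier: int, day: date) -> bool:
    """Return True if models in this tier should run on the given day."""
    return day.weekday() in TIER_WEEKDAYS[tier]


class BudgetGuard:
    """Refuses further spending once a daily limit would be exceeded."""

    def __init__(self, daily_limit_usd: float):
        self.daily_limit_usd = daily_limit_usd
        self.spent_usd = 0.0

    def try_spend(self, estimated_cost_usd: float) -> bool:
        if self.spent_usd + estimated_cost_usd > self.daily_limit_usd:
            return False
        self.spent_usd += estimated_cost_usd
        return True


def plan_runs(models, day, guard, cost_per_run_usd):
    """Pick the models that are due today and still fit within the budget."""
    planned = []
    for model_id, tier in models:
        if is_due(tier, day) and guard.try_spend(cost_per_run_usd):
            planned.append(model_id)
    return planned
```

Listing Tier 1 models first in `models` means they are the last to be dropped when the budget runs short.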

Embed & Cite

Every live Gecko Test chart will be free to embed. Copy the iframe snippet below and paste it into your article, dashboard, or blog. An attribution link is required.

<iframe
  src="https://benchgecko.ai/embed/gecko-tests/censorship-index"
  width="600" height="400"
  style="border:0"
  title="AI Censorship Index · BenchGecko Labs"
></iframe>
<p style="font-size:12px;color:#888">
  Data:
  <a href="https://benchgecko.ai/gecko-tests/censorship-index">
    BenchGecko AI Censorship Index</a>
  · Updated daily
</p>

For journalists, researchers & creators

Use BenchGecko charts in articles, newsletters, videos, and reports. Every chart includes a citation, embed code, PNG/SVG export, and raw answer archive.

Frequently Asked Questions

What are Gecko Tests?

Gecko Tests are proprietary daily tests run by BenchGecko on frontier AI models. They measure censorship behavior, racial bias, political orientation, reasoning ability, moral decision-making, and behavioral drift over time.
