RESEARCH · AIXYS · 2026-03-18 · 15 MIN READ

REPORT · PLATFORM RESEARCH

Grounded AI Benchmark · 2026

How 47 AI features across 14 operating tools performed when asked questions about their own live data.

  • 47 AI features tested across 14 tools
  • 68% avg accuracy
  • 31% avg citation resolve
  • 2.6 minutes median drift

ABSTRACT

What the data says.

We ran 47 AI-powered features from 14 leading operating tools against the same set of 120 operational questions. Answers were graded on accuracy, citation quality, and drift over time. The result challenges the assumption that model quality is the dominant variable.

METHODOLOGY

How we measured.

Each AI feature was given identical account seeding (same records, same history) and the same 120 questions. We graded answers along three axes: factual accuracy (versus ground truth), citation resolvability (can a human follow the citation and verify the source), and drift (how quickly answers go stale after the underlying data changes).
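The three grading axes can be sketched as a per-answer record plus an aggregation step. The field and function names below are illustrative, not the benchmark's actual harness:

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass
class GradedAnswer:
    feature_id: str          # which of the 47 AI features produced the answer
    accurate: bool           # matches ground truth for the seeded account
    citation_resolves: bool  # a human could follow the citation to a record
    drift_minutes: float     # minutes until the answer went stale after a change

def summarize(grades: list[GradedAnswer]) -> dict:
    """Roll per-question grades up into the headline metrics."""
    return {
        "accuracy": mean(g.accurate for g in grades),
        "citation_resolve": mean(g.citation_resolves for g in grades),
        "median_drift_minutes": median(g.drift_minutes for g in grades),
    }
```

Treating the two boolean axes as 0/1 and averaging is what yields percentage-style scores like the 68% accuracy figure; drift is summarized with a median because stale-time distributions are heavy-tailed.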

BENCHMARKS · KEY MEASUREMENTS

The receipts.

  • 47 AI features tested across 14 tools
  • 68% avg accuracy
  • 31% avg citation resolve
  • 2.6 minutes median drift

VISUALIZATION · INTERACTIVE

See the trend.

[Chart] Citation resolvability vs answer accuracy across the 47 AI features, score 0–100, one series per tool (Tool 01 through Tool 07 shown).


FINDINGS · WHAT WE LEARNED

The pattern.

  1. Citation resolvability is the strongest signal of real grounding.

    Features whose citations resolved to an inspectable underlying record scored 1.9× higher on accuracy than features that emitted string-only citations.

  2. Event-sourced backends dominate drift metrics.

    Tools with event-sourced data layers produced answers that stayed accurate for a median of 12 minutes after underlying changes. Tools with eventual-consistency backends drifted in under 3 minutes.

  3. Model choice correlates weakly with outcome.

    Across the 47 features, variance in accuracy explained by model family was under 9%. Variance explained by data-layer architecture was over 62%.

  4. AI that cannot cite should not be called grounded.

    We propose a simple definition: an AI answer is grounded if, and only if, its citation is a resolvable pointer to the underlying memory. Everything else is persuasive prose.
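The definition in finding 4 reduces to a resolvability check: a citation is a pointer you can dereference, or it is just text. A minimal sketch, assuming a hypothetical in-memory record store and pointer scheme (neither is from the benchmark):

```python
from typing import Optional

# Hypothetical record store keyed by citation pointer, for illustration only.
STORE = {
    "record://invoices/1042": {"amount": 310.0, "status": "paid"},
}

def resolve(pointer: str) -> Optional[dict]:
    """Follow a citation pointer to the underlying record, or return None."""
    return STORE.get(pointer)

def is_grounded(citation: str) -> bool:
    """Grounded iff the citation resolves to an inspectable record."""
    return resolve(citation) is not None
```

Under this test, `"record://invoices/1042"` is grounded, while a free-text citation like `"Invoice 1042, March"` is persuasive prose: a human has nothing to dereference.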

APPENDIX · SOURCES

Where the numbers came from.

  1. Aixys Grounded AI Benchmark, 2026
  2. Anthropic Tool-Use Evaluation Methodology
  3. MLCommons Grounding Test Battery

NEXT STEP

Want the raw data?

AIXYS · REPORT · GROUNDED-AI-BENCHMARK-2026