Ai Design Readiness — Deployment Readiness Index — Nexus Ai™

The hardest part of Ai isn't building it. It's understanding what it should do, for whom, in what context, and — critically — when it should stop.

The vibe coding moment has created a cultural assumption that because the friction of building has been reduced, the friction of thinking carefully about what to build and for whom has also been reduced. It hasn't.

The Deployment Readiness Index is a scoring lens — not a checklist — for agentic Ai systems. Apply it with judgement. It surfaces what's been assumed rather than designed. It exposes the questions a prototype never has to face, but a deployed service always does.

Each dimension represents a class of failure that engineering and product alone cannot prevent. Each one is a design problem in the oldest and most serious sense of the word.

The Framework

Five dimensions. One gap.

Each dimension targets a specific class of real-world deployment failure: failure modes that more engineering cannot remove, because they are fundamentally design problems.

Dimension 01

Contextual Fit

"Does this work for this person, in this place, in this operational reality — not the controlled environment where it was built?"

Agents that perform flawlessly in testing routinely encounter users who don't behave like the persona in the scenario. Real operating environments are messier, faster, more emotionally charged, and more varied than any sandbox. Contextual fit asks whether the system has been designed for reality rather than the demo room.

Dimension 02

Failure Legibility

"When this goes wrong — and it will — can the person on the receiving end understand what happened and recover gracefully?"

A prototype never needs to answer this. A deployed service always does. Failure legibility isn't about preventing failure; agentic systems will always have failure modes. It's about whether those failure states were designed, or whether they're just whatever the model outputs when it doesn't know what to do.
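One way to make "designed failure states" concrete is to treat each known failure as an entry someone has deliberately written, and to list the ones nobody has written yet. The sketch below is illustrative only: the `FailureKind` taxonomy, the `DesignedFailure` fields, and the booking example are all hypothetical, not part of the Deployment Readiness Index.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    # Hypothetical taxonomy; a real one comes from the team's own failure mapping.
    TOOL_UNAVAILABLE = "tool_unavailable"
    REQUEST_NOT_UNDERSTOOD = "request_not_understood"
    OUT_OF_SCOPE = "out_of_scope"


@dataclass(frozen=True)
class DesignedFailure:
    what_happened: str  # plain-language account for the user
    what_was_kept: str  # what state, if any, was preserved
    next_step: str      # the recovery path on offer


# Each entry is a failure state that somebody deliberately wrote.
# OUT_OF_SCOPE is left out on purpose to show an undesigned gap.
DESIGNED_FAILURES = {
    FailureKind.TOOL_UNAVAILABLE: DesignedFailure(
        what_happened="I couldn't reach the booking system, so nothing was changed.",
        what_was_kept="Your original booking is still in place.",
        next_step="You can try again in a few minutes or ask for a person to help.",
    ),
    FailureKind.REQUEST_NOT_UNDERSTOOD: DesignedFailure(
        what_happened="I'm not sure what you'd like me to do.",
        what_was_kept="Nothing has been changed.",
        next_step="You can rephrase, or pick from the options I can help with.",
    ),
}


def undesigned_failures() -> list[FailureKind]:
    """Return failure kinds that have no designed state yet."""
    return [kind for kind in FailureKind if kind not in DESIGNED_FAILURES]


def render_failure(kind: FailureKind) -> str:
    """Build the user-facing message for a failure that has a designed state."""
    failure = DESIGNED_FAILURES[kind]
    return " ".join([failure.what_happened, failure.what_was_kept, failure.next_step])
```

Calling `undesigned_failures()` here returns the one kind with no entry, which is the point: the gap is visible before a user finds it.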

Dimension 03

Trust Calibration

"Is the autonomy this agent exercises proportional to the trust the user has actually been given reason to extend?"

Over-autonomy without earned trust is a major deployment failure mode that almost nobody is talking about. Trust is not assumed. It is earned, calibrated, and maintained over time. Engineers can build autonomy. Only designers can tell you whether that autonomy has been warranted by the relationship.
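As a rough illustration of autonomy that is proportional to earned trust, the sketch below gates each action on two inputs: how risky the task is, and how much history the user has with the agent. The scales, the threshold formula, and the names `decide_autonomy` and `Action` are assumptions made for the example; a real trust model would be designed with users, not derived from this.

```python
from enum import Enum


class Action(Enum):
    ACT = "act autonomously"
    CONFIRM = "ask the user to confirm first"


def decide_autonomy(task_risk: int, earned_trust: int) -> Action:
    """Pick how much autonomy to exercise for one task.

    task_risk:    0 (trivial, reversible) to 3 (high stakes, irreversible)
    earned_trust: number of similar tasks this user has approved without correction
    Both scales and the thresholds below are illustrative assumptions.
    """
    if task_risk >= 3:
        # Highest-risk tasks always go back to the user in this sketch,
        # however much trust has accumulated.
        return Action.CONFIRM
    required_trust = task_risk * 5  # riskier tasks need more history: 0, 5, 10
    if earned_trust >= required_trust:
        return Action.ACT
    return Action.CONFIRM
```

With these assumed thresholds, a new user gets autonomy only on trivial tasks, and confirmation steps drop away as approvals accumulate.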

Dimension 04

Edge Case Humanity

"What happens at the margins — the frustrated user, the unusual request, the moment the agent hits its ceiling?"

The margins are where the experience is revealed. Any system can be designed for the happy path. Edge case humanity asks whether the awkward moments, the escalations, the misunderstandings, and the limits were designed or ignored. It's where vulnerable users encounter agentic systems, and where the consequences of poor design are highest.

Dimension 05

Accountability Legibility

"When something goes wrong at scale, who is responsible — and can anyone explain it to the person it affected in language they can act on?"

This is where regulatory thinking meets agentic Ai deployment. Systems that affect people's lives at scale carry an obligation to explain themselves. Accountability legibility isn't about legal compliance — it's about whether a real person, in a moment of confusion or distress, can understand what happened to them and what their options are. It is almost never designed. It is the dimension that keeps lawyers awake and designers employed.

A note on evals

Design questions belong inside the eval process — not after it.

Evals are having their moment, and rightly so. Systematic evaluation frameworks are now standard practice for serious agentic Ai development: automated test suites that measure whether an agent does what it's supposed to do, handles tool use correctly, and stays within expected output bounds. If you're not running evals, you should be.

The problem is what most eval suites leave out. Scenarios are written for the happy path. Edge cases, when they appear at all, tend to be adversarial — jailbreaks, prompt injection, off-topic requests. The human experience edge cases — the frustrated user, the ambiguous request, the moment the agent hits its ceiling — are almost never in the suite, because nobody defined what success looks like there.

That's where design questions come in. Not as a parallel framework you run separately, but as a required upstream step in how you construct your evals. What scenarios go into your suite? The DRI dimensions surface them. What does a passing response look like when a user pushes back three times? A designer answers that before an engineer writes the test.
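The pushback scenario above can be sketched as an eval case in which the pass criterion is written down before the test is. Everything here is hypothetical: the `Scenario` shape, the `offers_human_handoff` criterion, and `stub_agent`, which stands in for a real agent call. The criterion itself (offer a route to a person after three pushbacks) is an assumed design decision, not one the framework prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    dimension: str                 # which DRI dimension surfaced this scenario
    user_turns: tuple[str, ...]    # what the user says, in order
    passes: Callable[[str], bool]  # success criterion, defined by design


def offers_human_handoff(final_reply: str) -> bool:
    # Assumed criterion: after repeated pushback, the agent stops
    # re-explaining and offers a route to a person.
    reply = final_reply.lower()
    return "person" in reply or "human" in reply


PUSHBACK = Scenario(
    dimension="Edge Case Humanity",
    user_turns=(
        "That's not what I asked for.",
        "No, that's still wrong.",
        "You're not listening to me.",
    ),
    passes=offers_human_handoff,
)


def run_scenario(agent: Callable[[list[str]], str], scenario: Scenario) -> bool:
    """Feed the user turns to the agent one by one and grade only the final reply."""
    history: list[str] = []
    reply = ""
    for turn in scenario.user_turns:
        history.append(turn)
        reply = agent(list(history))
    return scenario.passes(reply)


def stub_agent(history: list[str]) -> str:
    # Stand-in for a real agent call: escalates on the third user turn.
    if len(history) >= 3:
        return "I'm not getting this right. Would you like me to connect you to a person?"
    return "Let me try that again."
```

`run_scenario(stub_agent, PUSHBACK)` passes, while an agent that only ever retries fails. Tagging each scenario with its dimension also makes it easy to see which dimensions have no scenarios at all.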

Design questions don't replace evals. They make them complete.

Interactive Tool

Score your system

Score each dimension honestly from 0 (none) to 4 (proven). Each rubric cell describes what that score looks like in practice. The discomfort you feel is the point: it surfaces what's been assumed rather than designed. When you're done, generate your Deployment Readiness Report.

01 Contextual Fit

Does the system work for the real user, in the real environment, not just the test scenario?

0 (none): Built for one scenario. No real-user research. Tested by the team that built it.

1: Some user research. Mostly lab-based. Limited operational exposure.

2: Research with real users in realistic conditions. Some gaps identified.

3: Extensive field research. Edge cases mapped. Multiple operational contexts covered.

4 (proven): Validated in live deployment across diverse user groups. Continuous feedback loop in place.

02 Failure Legibility

When the system fails, can the person on the receiving end understand it and recover?

0 (none): Failures produce generic errors or silent misbehaviour. No recovery path designed.

1: Some error states acknowledged. Recovery paths inconsistent or technical in language.

2: Common failures handled gracefully. Some edge failures still unaddressed.

3: Failure taxonomy documented. Legible recovery paths for all major failure modes.

4 (proven): Failure states user-tested. Recovery paths optimised. Failure experience reviewed regularly.

03 Trust Calibration

Is the autonomy the agent exercises proportional to the trust it has earned?

0 (none): Full autonomy assumed. No trust-building design. No human-in-the-loop consideration.

1: Some override mechanisms exist but weren't designed as trust mechanisms.

2: Trust scaffolding present. Autonomy increases over time but not formally modelled.

3: Trust model documented. Autonomy calibrated to task risk and earned user confidence.

4 (proven): Dynamic trust model. Autonomy adapts based on demonstrated performance. User agency preserved.

04 Edge Case Humanity

Were the margins — the frustrated user, the unusual request, the ceiling moment — designed?

0 (none): Only the happy path was designed. Edge cases produce uncontrolled outputs.

1: Some edge cases identified. Escalation paths exist but were not designed intentionally.

2: Edge case mapping done. Escalation paths designed. Vulnerable user scenarios partially addressed.

3: Comprehensive edge case library. Escalation experience designed. Vulnerable user needs explicitly considered.

4 (proven): Edge cases user-tested including with vulnerable users. Escalation paths optimised. Monitored in deployment.

05 Accountability Legibility

When something goes wrong at scale, can anyone explain it to the person it affected?

0 (none): No accountability model. Black box decision-making. No explainability for affected users.

1: Internal accountability exists. No user-facing explanation. Affected users have no recourse path.

2: Some explainability features. Accountability model documented internally. Partial user-facing recourse.

3: Clear accountability chain. User-facing explanations available. Recourse paths designed and accessible.

4 (proven): Full accountability model. Plain-language explanation available to any affected user. Recourse regularly tested and improved.

Your Deployment Readiness Report

The report gives a total score out of 20 and an individual score for each dimension: Contextual Fit, Failure Legibility, Trust Calibration, Edge Case Humanity, and Accountability Legibility.
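The arithmetic behind the report is simple: five dimensions, each scored 0 to 4, for a total out of 20. A minimal sketch follows, assuming a plain dictionary of scores. The function name `readiness_report`, the example scores, and the "weakest" field are additions for illustration, and the total is a prompt for judgement rather than a pass mark.

```python
DIMENSIONS = (
    "Contextual Fit",
    "Failure Legibility",
    "Trust Calibration",
    "Edge Case Humanity",
    "Accountability Legibility",
)
MAX_PER_DIMENSION = 4


def readiness_report(scores: dict[str, int]) -> dict:
    """Validate 0-4 scores for all five dimensions and total them out of 20."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    for name in DIMENSIONS:
        if not 0 <= scores[name] <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} must be scored 0 to {MAX_PER_DIMENSION}")
    lowest = min(scores[d] for d in DIMENSIONS)
    return {
        "total": sum(scores[d] for d in DIMENSIONS),
        "out_of": MAX_PER_DIMENSION * len(DIMENSIONS),
        # Dimensions tied for the lowest score: an assumed way to flag
        # where to look first, not part of the original tool.
        "weakest": [d for d in DIMENSIONS if scores[d] == lowest],
    }


# Made-up scores for a hypothetical system.
example = {
    "Contextual Fit": 2,
    "Failure Legibility": 1,
    "Trust Calibration": 3,
    "Edge Case Humanity": 1,
    "Accountability Legibility": 0,
}
report = readiness_report(example)
```

For these example scores the report totals 7 out of 20 and flags Accountability Legibility as the weakest dimension.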

Does your product pass the motorway test?