AI methodology

Evaluating an AI prototype: a checklist.

A working demo is not the same as a working product. The gap between the two is filled with edge cases the demo did not show, costs the demo did not surface, and failure modes that only appear at scale. Eight questions help close that gap, and we use a version of them ourselves when reviewing prototypes. None of them require technical depth from the asker; all of them require honest answers from the team that built the prototype. A team that can answer them clearly is usually further along than a team whose demo looks slicker.

1. What does it do well, and what does it refuse to do?

The first question is about the operating envelope. A good prototype has a clear edge — the team can articulate the cases it handles reliably and the cases it deliberately does not, without retreating to "anything you ask it". The envelope is usually narrower than the team initially claims.

A prototype that claims to handle everything probably handles nothing reliably. The honest answer names the inputs the team has tested, the inputs out of scope, and the boundary in between. If the team cannot draw that boundary, the prototype has only been demoed against the cases that work.

2. What's the failure mode?

There are three worth telling apart. The first is silent wrong: the model produces a plausible but incorrect answer, presented with the same confidence as a correct one. The second is loud refusal: the model declines to help, or returns an empty result, in a way the user cannot miss. The third is visible uncertainty: the model returns an answer flagged as low-confidence, with the flagging surfaced to the user.

Silent wrong is the dangerous one, because the user has no way to tell it from a real answer until they check downstream. Ask the team which mode they have optimised for and how they tell which one is happening in production. A team that has not thought about this will not have an answer; a team that has will usually have a graph.
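One way a team might "tell which one is happening" is to label a sample of logged calls and count them by mode. The sketch below assumes a hypothetical log record with three fields (whether the model answered, whether the answer was flagged, and whether a later check found it correct); none of these names come from a real logging library.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoggedCall:
    """One production call, labelled after the fact (hypothetical schema)."""
    answered: bool                # False if the model refused or returned nothing
    flagged_low_confidence: bool  # True if the answer was shown with a warning
    correct: Optional[bool]       # None until someone has checked the answer


def failure_mode(call: LoggedCall) -> str:
    """Map one logged call onto the three modes, plus the non-failure cases."""
    if not call.answered:
        return "loud_refusal"
    if call.flagged_low_confidence:
        return "visible_uncertainty"
    if call.correct is None:
        return "unchecked"  # confident answer that nobody has verified yet
    return "correct" if call.correct else "silent_wrong"


def tally(calls: list[LoggedCall]) -> Counter:
    """Count calls per mode; the silent_wrong count is the one to watch."""
    return Counter(failure_mode(c) for c in calls)
```

Note that silent wrong can only be counted among calls somebody has checked, which is why the "unchecked" bucket is kept separate rather than assumed correct.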

3. How was it evaluated?

Three sub-questions, in order. Was evaluation done by eyeball, by automated check, or by human grader? What was the sample size? And how was the sample selected? A prototype evaluated on five hand-picked examples has not been evaluated; it has been demoed. A sample of a few hundred examples drawn at random from production-shaped data carries signal, and a team that can name both the sample size and the sampling method is one that has done the work.

Eyeball checks are not nothing, but they do not generalise. The team should be able to describe how their evaluation set was assembled, whether it covers the cases that matter most to the business, and how often it gets re-run. The last point most often distinguishes a prototype from a product.
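To see why sample size matters, it helps to put an interval around a pass rate. The sketch below draws a reproducible random sample and computes a Wilson score interval, a standard approximation for a proportion; the function names are illustrative, and the choice of a 95% interval is an assumption.

```python
import math
import random


def draw_sample(records: list, n: int, seed: int = 0) -> list:
    """Draw up to n records uniformly at random, reproducibly."""
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

With five passes out of five, the interval runs from roughly 0.57 to 1.0, which is compatible with a system that fails four times in ten. With 270 passes out of 300, it narrows to roughly 0.86 to 0.93. The interval only means something if the sample was drawn at random; it says nothing about hand-picked examples.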

4. What's the latency budget?

For the user, not for the engineer. A demo that takes seven seconds in front of a friendly audience is a prototype that takes seven seconds in front of a paying customer, and latency tolerated in a demo is rarely tolerated in production. The figure that matters is the one the user experiences end-to-end, including retrieval, tool calls, and rendering — not the model's response time in isolation.

Ask what the user-facing latency target is and how the current prototype compares. If the target is two seconds and the prototype runs in seven, the team has more work ahead of them than the demo suggests, and the cost of that work is worth understanding before commitments are made.
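A minimal sketch of that comparison, assuming each request is logged as a set of per-stage timings in seconds. The stage names and the choice of the 95th percentile are assumptions for illustration; a team may prefer a different percentile or a median.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def end_to_end(stages: dict[str, float]) -> float:
    """What the user waits for: the sum of every stage, not the model alone."""
    return sum(stages.values())


def budget_report(requests: list[dict[str, float]], target_s: float,
                  pct: float = 95) -> dict:
    """Compare the observed end-to-end percentile with the latency target."""
    observed = percentile([end_to_end(r) for r in requests], pct)
    return {
        "observed_s": observed,
        "target_s": target_s,
        "over_budget_s": max(0.0, observed - target_s),
        "within_budget": observed <= target_s,
    }
```

Summing stages assumes they run one after another; if retrieval and tool calls overlap, the logged end-to-end wall-clock time is the better input.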

5. What's the per-call cost?

And how it scales. Costs that are negligible at prototype scale — a few hundred calls a day across a small internal user base — can become material at production scale, where a few hundred thousand calls a day is unremarkable. The team should be able to give a per-call estimate and a back-of-envelope monthly figure at the volume they are aiming for.

The number to listen for is the one that includes the cases the demo did not show: long inputs, retries on failure, multi-turn conversations that accumulate context. A team that quotes a per-call cost from the happy path alone is quoting the floor, not the average. The average is what shows up on the bill.
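The difference between the floor and the average can be made concrete with a traffic mix. The sketch below is a back-of-envelope calculator under stated assumptions: token-based pricing, a hypothetical set of scenarios with made-up traffic shares, and retries modelled as an average number of attempts per call. All prices passed in are placeholders, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    share: float           # fraction of traffic, all shares must sum to 1
    input_tokens: int
    output_tokens: int
    attempts: float = 1.0  # average attempts per call, including retries


def per_call_cost(s: Scenario, price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one call in this scenario, with retries included."""
    one_attempt = (s.input_tokens / 1000 * price_in_per_1k
                   + s.output_tokens / 1000 * price_out_per_1k)
    return one_attempt * s.attempts


def blended_cost(scenarios: list[Scenario], price_in_per_1k: float,
                 price_out_per_1k: float) -> float:
    """Traffic-weighted average cost per call across all scenarios."""
    if abs(sum(s.share for s in scenarios) - 1.0) > 1e-9:
        raise ValueError("scenario shares must sum to 1")
    return sum(s.share * per_call_cost(s, price_in_per_1k, price_out_per_1k)
               for s in scenarios)


def monthly_cost(cost_per_call: float, calls_per_day: int, days: int = 30) -> float:
    """Back-of-envelope monthly bill at a given daily volume."""
    return cost_per_call * calls_per_day * days
```

Feeding in only the happy-path scenario gives the floor; adding long inputs and retried calls with realistic shares gives the figure that will appear on the bill.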

6. What's the data dependency?

Three layers. The model itself — which provider, which version, hosted where. Any fine-tuning or adapter the team has built on top. And any retrieved corpus the prototype relies on at query time, whether that is a vector store, a document index, or a database the model reads from through a tool.

Each layer is a dependency that can change unpredictably and break the prototype. The model can be deprecated, the adapter can drift as the underlying model shifts, the corpus can fall out of date. The team should know all three layers and have a view on what to do if any of them changes. The view does not have to be elaborate, but the absence of one is a tell.
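Knowing the three layers can be as simple as writing them down in one place. The sketch below is one possible shape for such a record, with two basic checks; every field name and the 30-day staleness threshold are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DependencyManifest:
    """The three layers, recorded in one place (hypothetical schema)."""
    model_provider: str
    model_version: str
    hosting: str
    adapter_id: Optional[str] = None          # None if there is no fine-tune
    adapter_base_model: Optional[str] = None  # version the adapter was built on
    corpus_name: Optional[str] = None         # None if nothing is retrieved
    corpus_refreshed: Optional[date] = None


def dependency_warnings(m: DependencyManifest, today: date,
                        max_corpus_age_days: int = 30) -> list[str]:
    """Flag an adapter built on another model version and a stale corpus."""
    warnings = []
    if m.adapter_id and m.adapter_base_model != m.model_version:
        warnings.append("adapter was built against a different model version")
    if m.corpus_name and m.corpus_refreshed is not None:
        age = (today - m.corpus_refreshed).days
        if age > max_corpus_age_days:
            warnings.append(f"corpus is {age} days old")
    return warnings
```

The checks are deliberately crude. The point is that a team with a record like this can say what it depends on, and a team without one usually cannot.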

7. What changes when the model changes?

Frontier models are released roughly every six months, and each release changes prompt behaviour subtly. Outputs get longer or shorter, refusals get more or less frequent, formatting drifts, edge cases that the previous prompt handled cleanly start to misfire. A prototype that depends on a specific model's exact behaviour will need re-tuning when the model changes.

Ask whether the evaluation harness can re-run automatically against a new model and produce a comparison. If it can, the team has a tool for adapting; if it cannot, every model upgrade will be a small project in itself. The cost of that project, repeated twice a year, is usually larger than the cost of building the harness once.
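At its smallest, such a harness is a fixed set of cases run against two models, with the differences listed. The sketch below treats a model as any function from prompt to output and a case as a prompt paired with a pass/fail check; both are simplifying assumptions, and real checks are often fuzzier than a boolean.

```python
from typing import Callable

Model = Callable[[str], str]   # prompt in, output out
Check = Callable[[str], bool]  # output in, pass/fail out
Case = tuple[str, Check]


def run_suite(model: Model, cases: list[Case]) -> list[bool]:
    """Run every case against one model and record pass/fail."""
    return [check(model(prompt)) for prompt, check in cases]


def compare(old: Model, new: Model, cases: list[Case]) -> dict:
    """Run the same cases against both models and list what changed."""
    if not cases:
        raise ValueError("no cases to run")
    old_results = run_suite(old, cases)
    new_results = run_suite(new, cases)
    pairs = list(zip(cases, old_results, new_results))
    return {
        "old_pass_rate": sum(old_results) / len(cases),
        "new_pass_rate": sum(new_results) / len(cases),
        # Cases the old model passed and the new one fails.
        "regressions": [c[0] for c, o, n in pairs if o and not n],
        # Cases the old model failed and the new one passes.
        "fixes": [c[0] for c, o, n in pairs if n and not o],
    }
```

The list of regressions matters more than the headline pass rate: two models can score the same overall while failing on different cases.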

8. What does "in production" actually mean here?

Three things, at minimum. Observability: what is logged when a real user asks a real question. Regression detection: what would alert the team if quality dropped — slowly, over a week, or sharply, after a deploy. Rollback: what happens if a release goes wrong, how quickly can the team revert, and whether the old version is still available to serve traffic while the new one is being investigated.

A prototype with no answer to any of these is not ready for production; it is ready for a demo. The gap between the two is most of the engineering work, and it is the work that is least visible from the outside. A team that can describe their plan for all three is usually closer to shipping than one whose prototype looks more polished but has none of the surrounding scaffolding.
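Regression detection has two shapes, matching the two ways quality drops. The sketch below assumes the team already produces one quality score per day between 0 and 1 (for example, the pass rate of a nightly evaluation run); the window and both thresholds are placeholder values to be tuned, not recommendations.

```python
from statistics import mean


def quality_alerts(daily_scores: list[float], window: int = 7,
                   sharp_drop: float = 0.10, slow_drop: float = 0.05) -> list[str]:
    """Return alerts for the latest day; daily_scores is ordered oldest first."""
    alerts = []
    # Sharp drop: the latest day is much worse than the day before,
    # the pattern expected after a bad deploy.
    if len(daily_scores) >= 2 and daily_scores[-2] - daily_scores[-1] >= sharp_drop:
        alerts.append("sharp_drop")
    # Slow drift: the most recent window averages lower than the window
    # before it, even though no single day looked alarming.
    if len(daily_scores) >= 2 * window:
        previous = mean(daily_scores[-2 * window:-window])
        recent = mean(daily_scores[-window:])
        if previous - recent >= slow_drop:
            alerts.append("slow_drift")
    return alerts
```

A day-over-day check alone misses a decline of a couple of points a day, which is why the windowed comparison is kept as a separate rule.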

Eight questions. None of them require technical sophistication from the asker; all of them are answerable in a sentence or two by anyone who actually built the prototype. A team that cannot answer one of them is not necessarily building a bad prototype, but it is building one that has not been thought through yet, and that is something a non-technical stakeholder can usefully flag before the prototype becomes a commitment that is harder to walk back.