Evaluation and testing - Zelix Glossary

What it means

AI evaluation is not the same as software testing. A software test passes or fails; a model output is a graded answer that might be 80 percent right. Evaluation captures that with an eval set: 100 to 500 representative inputs, each paired with a model output and scored against a rubric (accuracy, tone, completeness, format).
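To make the idea of rubric scoring concrete, here is a minimal Python sketch. The rubric criteria come from the text above; the 0.0 to 1.0 scale, the `ScoredCase` and `summarize` names, and the two sample cases are assumptions made up for illustration, not part of any particular tool.

```python
from dataclasses import dataclass
from statistics import mean

# Rubric criteria named in the text; the 0.0-1.0 scale is an assumption.
RUBRIC = ("accuracy", "tone", "completeness", "format")


@dataclass
class ScoredCase:
    """One eval-set entry: an input, the model's output, and its graded scores."""
    input_text: str
    model_output: str
    scores: dict[str, float]  # one graded score per rubric criterion


def summarize(cases: list[ScoredCase]) -> dict[str, float]:
    """Average each rubric criterion across the eval set, plus an overall mean."""
    summary = {c: mean(case.scores[c] for case in cases) for c in RUBRIC}
    summary["overall"] = mean(summary[c] for c in RUBRIC)
    return summary


# Hypothetical cases; a real eval set would hold 100 to 500 of these.
cases = [
    ScoredCase(
        "Can I move my appointment?",
        "Sure, which day suits you?",
        {"accuracy": 1.0, "tone": 1.0, "completeness": 0.5, "format": 1.0},
    ),
    ScoredCase(
        "What are your opening hours?",
        "We open at 9.",
        {"accuracy": 0.5, "tone": 1.0, "completeness": 0.5, "format": 1.0},
    ),
]

print(summarize(cases))
```

The per-criterion averages matter as much as the overall number: a single headline score can hide a drop in one criterion that is offset by a gain in another.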

You run evaluations at three moments: before launch (does the agent meet the bar?), after every prompt or model change (did anything regress?), and periodically in production (is the agent still as good as it was?). The discipline is having the eval set in the first place.
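The "did anything regress?" check can be sketched as a comparison between a stored baseline and a fresh run. This is a rough Python illustration under stated assumptions: the `find_regressions` name, the 0.02 tolerance, and the scores are all hypothetical, and both runs are assumed to report the same criteria.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,  # assumed noise margin; tune for your eval set
) -> dict[str, float]:
    """Return each criterion whose score dropped by more than `tolerance`,
    mapped to the size of the drop."""
    regressions = {}
    for name, old_score in baseline.items():
        drop = old_score - candidate[name]
        if drop > tolerance:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical scores from before and after a prompt change.
baseline = {"accuracy": 0.91, "tone": 0.85}
candidate = {"accuracy": 0.86, "tone": 0.88}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

The same function serves all three moments: compare against the launch bar before launch, against the previous run after a change, and against the last known-good run in production.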

Why it matters

Without evaluation, an AI deployment runs on vibes. Somebody tested it on a Monday and it seemed fine, so it shipped. Two months later it has quietly degraded and nobody noticed until a customer complained. Evaluation is what gives you the early-warning signal.

It is also what makes model swaps safe. A new model lands, you run the eval set, you compare the new score against the old one, and you only promote the new model if it wins. Without an eval set you are guessing.

Example

A clinic builds an eval set of 220 historical patient queries, each paired with the correct reply. Every prompt change re-runs the set. When OpenAI ships a new model, the clinic runs the same set and sees that the new model scores higher on tone but lower on appointment-time accuracy. The clinic keeps the old model for booking flows and uses the new model for general-question flows.
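A decision like the clinic's can be sketched as a per-flow comparison. The Python below is only an illustration: it assumes each flow is scored on the criterion that matters most to it (appointment-time accuracy for booking, tone for general questions), and the function name, model labels, and scores are all hypothetical.

```python
def pick_model_per_flow(
    scores: dict[str, dict[str, float]],
    incumbent: str,
) -> dict[str, str]:
    """Given scores[flow][model], choose a model for each flow.
    The incumbent stays unless another model scores strictly higher."""
    routing = {}
    for flow, by_model in scores.items():
        best = max(by_model, key=by_model.get)
        routing[flow] = best if by_model[best] > by_model[incumbent] else incumbent
    return routing


# Hypothetical eval results, one score per flow and model.
scores = {
    "booking": {"old_model": 0.94, "new_model": 0.88},
    "general_questions": {"old_model": 0.81, "new_model": 0.90},
}

routing = pick_model_per_flow(scores, incumbent="old_model")
print(routing)
```

Keeping the incumbent on ties is a deliberate choice in this sketch: a swap carries some risk, so the new model has to win outright to be promoted.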
