The Evaluate tab helps you systematically test and measure your agent’s performance across a set of test questions. Use it to monitor response quality over time and to gauge the impact of changes such as prompt updates, data additions, or configuration adjustments. For basic interactive testing, see Testing Your Agent.
Use the “Add Test Case(s)” button to create test questions. You can optionally provide expected answers for reference, though these are not required for evaluation.
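If you keep a working copy of your test questions outside the product, it can help to think of a test case as a question plus an optional expected answer. The sketch below is purely illustrative: the `EvalCase` class, its field names, and the sample questions are hypothetical, and none of them reflect Credal's internal data model or any export format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalCase:
    """Hypothetical local record of one evaluation question (not a Credal type)."""
    question: str
    expected_answer: Optional[str] = None  # optional, kept for reference only


def missing_expected_answers(cases: List[EvalCase]) -> List[str]:
    """Return the questions that have no reference answer attached."""
    return [c.question for c in cases if c.expected_answer is None]


# Made-up sample data for illustration.
cases = [
    EvalCase("What is our refund window?", "30 days from delivery"),
    EvalCase("Summarize the onboarding checklist."),
]
print(missing_expected_answers(cases))  # ['Summarize the onboarding checklist.']
```

A list like this is only a planning aid; the test cases themselves are still created with the “Add Test Case(s)” button.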

In addition to manually adding test cases, you can use AI to automatically generate a diverse set of evaluation questions with scoring rubrics. Click the “Generate Test Cases” button (marked with a sparkle icon) to open the generation modal.

The generator examines your agent’s full configuration — including its system prompt, available actions, connected document sources, and MCP servers — and uses an ensemble of AI models to produce 8–12 high-quality test cases. Each generated test case includes:
The generated questions are designed to cover a range of scenarios: straightforward factual retrieval, complex multi-step reasoning, action-triggering workflows, ambiguous or vaguely worded inputs, and adversarial or out-of-scope requests.
When generating test cases, you can choose which document sources inform the questions:
Once generation starts, the models analyze your agent’s configuration and produce the test cases. This may take up to a minute.

Once generated, the test cases appear in your Evaluate tab alongside any manually created ones. You can edit the questions and rubrics, or delete any that don’t fit your needs.
You can turn real conversations from your agent’s Monitor tab into test cases. This is useful when you spot interactions worth tracking — good responses you want to preserve as golden examples, or bad ones you want to catch in future regressions.
The test cases will appear in your Evaluate tab, ready to be scored in your next evaluation run.
For each test question, you can define a custom rubric that specifies what makes a good answer. Rubrics allow you to:
Click “Generate AI Answers” to test your agent against all questions. The agent generates a response to each question, and each response is automatically scored against the rubric you defined for that question.
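To make the idea of rubric-based scoring concrete, here is a minimal sketch of how per-criterion judgments could be combined into a single score. It is an assumption for illustration only: the `Criterion` class, the weights, and the `rubric_score` function are hypothetical, and the source does not describe how the product actually computes scores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One hypothetical rubric line: what a good answer must do."""
    description: str
    weight: float = 1.0  # assumed relative importance, not a product setting


def rubric_score(criteria: List[Criterion], verdicts: List[bool]) -> float:
    """Weighted fraction of criteria judged as met, in the range [0, 1]."""
    if len(criteria) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("total weight must be positive")
    met = sum(c.weight for c, ok in zip(criteria, verdicts) if ok)
    return met / total


criteria = [
    Criterion("States the correct policy", weight=2.0),
    Criterion("Cites the source document"),
    Criterion("Stays under 100 words"),
]
# Verdicts would come from a grader; these are made-up values.
verdicts = [True, True, False]
print(rubric_score(criteria, verdicts))  # 0.75
```

Breaking a rubric into separately judged criteria, as assumed here, makes it easier to see which expectation a response missed instead of only seeing a lower overall score.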

After a run completes, the Agent Reliability Score (ARS) provides a composite 0–100 reliability score across dimensions like accuracy, consistency, source fidelity, and cost/latency efficiency.
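The exact ARS formula is not described here. As a rough mental model of what a composite score is, the sketch below assumes a weighted average of per-dimension scores scaled to 0–100; the function name, the dimension keys, the equal default weights, and the sample values are all hypothetical.

```python
from typing import Dict, Optional


def composite_score(
    dimension_scores: Dict[str, float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of per-dimension scores (each in [0, 1]), scaled to 0-100.

    Illustrative assumption only; this is not the product's ARS formula.
    """
    if not dimension_scores:
        raise ValueError("at least one dimension is required")
    if weights is None:
        # Assume equal weighting when none is given.
        weights = {name: 1.0 for name in dimension_scores}
    for name, score in dimension_scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must be between 0 and 1")
    total_weight = sum(weights[name] for name in dimension_scores)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted = sum(dimension_scores[n] * weights[n] for n in dimension_scores)
    return 100.0 * weighted / total_weight


# Made-up per-dimension scores for illustration.
scores = {
    "accuracy": 0.75,
    "consistency": 0.5,
    "source_fidelity": 1.0,
    "efficiency": 0.25,
}
print(composite_score(scores))  # 62.5
```

Under this assumption, a low score in one dimension pulls the composite down in proportion to its weight, which is why it is worth checking the individual dimensions and not only the headline number.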
To compare how different model configurations perform against the same test cases, use the Benchmark tab.
Note: The Evaluate tab currently runs tests against the Preview version of your agent.
We’re actively developing additional evaluation capabilities. If you’d like early access to any of these features, please contact us at support@credal.ai: