Create Your First Experiment
This guide walks you through creating an Experiment custom resource that evaluates an agent deployed in your cluster. The Testbench operator reconciles the Experiment into the underlying Testkube TestWorkflow, dataset ConfigMap, and (optional) TestTrigger for you — so a single YAML is all you write.
Prerequisites
- Testbench installed (see Install the Testbench)
- An Agent reachable via the A2A protocol, deployed in the cluster
- An AiGateway deployed in the cluster (used as the LLM-as-a-judge endpoint)
- An OTLP collector endpoint reachable from the testkube namespace
- Testkube CLI installed (only needed for the optional manual run in Step 3)
Understand the model
An Experiment is a custom resource that describes what to test. The operator translates it into the resources Testkube needs to run the evaluation pipeline:
- agentRef: which agent to evaluate
- aiGatewayRef: which AI Gateway provides the judge model
- dataset: the scenarios and metrics to evaluate (inline, URL, or S3)
- env: extra environment variables for the pipeline pods (e.g. OTEL_EXPORTER_OTLP_ENDPOINT)
- trigger (optional): re-run automatically when the referenced agent is redeployed
- schedule (optional): re-run on a cron schedule
Each Experiment reconciles into one TestWorkflow plus a generated dataset ConfigMap. You do not need to write either of them by hand.
Step 1: Define your Experiment
Scenarios and metrics live under dataset.inline. The hierarchy is three levels:
- Experiment: top-level configuration (judge model, default threshold)
- Scenario: a named group of steps (e.g., "Weather in New York")
- Step: a single query with expected reference data and metrics to evaluate
Create the Experiment:
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: example-experiment  # (1)
  namespace: testkube
spec:
  agentRef:  # (2)
    name: weather-agent
    namespace: sample-agents
  aiGatewayRef:  # (3)
    name: ai-gateway
    namespace: ai-gateway
  env:  # (4)
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://lgtm.monitoring.svc.cluster.local:4318"
  dataset:
    inline:  # (5)
      llmAsAJudgeModel: gemini-2.5-flash-lite
      defaultThreshold: 0.9
      scenarios:
        - name: "Weather in New York"
          steps:
            - input: "What is the weather like in New York right now?"
              reference:
                toolCalls:
                  - name: get_weather
                    args:
                      city: "New York"
                topics:
                  - weather
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy
                - metricName: TopicAdherence
                  parameters:
                    mode: precision
| 1 | A unique name — used as the prefix for the generated TestWorkflow and ConfigMap. |
| 2 | The A2A-capable Agent to evaluate. The operator resolves its endpoint automatically. |
| 3 | The AiGateway providing the LLM judge. The operator injects its base URL into the evaluate phase. |
| 4 | Pipeline-pod environment variables. The publish phase reads OTEL_EXPORTER_OTLP_ENDPOINT from here. |
| 5 | Inline dataset. Fields use Kubernetes camelCase (llmAsAJudgeModel, toolCalls, metricName). |
Apply it:
kubectl apply -f example-experiment.yaml
The operator immediately reconciles the Experiment into a TestWorkflow named <experiment-name>-<experiment-namespace>-workflow in the testkube namespace, plus a sibling ConfigMap holding the rendered Experiment JSON. For the example above, the workflow name is example-experiment-testkube-workflow.
Available metrics
The following table lists commonly used metrics provided by RAGAS, the default framework adapter. All metrics are resolved through the GenericMetricsRegistry, which supports pluggable adapters — you can extend the system with custom metrics by implementing your own FrameworkAdapter.
| Metric | Description | Required reference fields |
|---|---|---|
| AgentGoalAccuracyWithoutReference | Whether the agent achieved its goal, judged without a reference answer | None |
| ToolCallAccuracy | Whether the agent called the correct tools with the correct arguments | toolCalls |
| TopicAdherence | Whether the response stays on the specified topics | topics |
| | Whether the response is grounded in the retrieved context (no hallucination) | |
Step 2: Verify the generated resources
Confirm the operator produced a TestWorkflow for your Experiment:
kubectl get testworkflows -n testkube -l testbench.agentic-layer.ai/experiment=example-experiment
Inspect the Experiment status if anything looks off — the operator records reconcile errors as events:
kubectl describe experiment example-experiment -n testkube
Step 3: Run the workflow
Trigger the generated TestWorkflow once, on demand:
kubectl testkube run testworkflow <generated-workflow-name> --watch
Replace <generated-workflow-name> with the value from Step 2. If you configured trigger or schedule in the Experiment (see Auto-trigger on agent deployment and Run on a cron schedule), the operator runs the workflow automatically — you can skip this manual invocation.
Step 4: View results
Grafana dashboards
If you installed the Grafana dashboard ConfigMap (see Install the Testbench), open Grafana and look for the Testkube Evaluation dashboard. It displays per-metric scores filtered by workflow name.
HTML report artifact
The visualize phase produces a self-contained HTML report as a workflow artifact. Download it with:
kubectl testkube download artifacts <generated-workflow-name>-1
The report includes:
- Summary cards with total samples and metrics count
- Horizontal bar charts showing mean score per metric
- Metric distribution histograms with statistics
- A searchable, sortable results table with all evaluations
Load a dataset from S3/MinIO
Swap dataset.inline for dataset.s3 to pull the dataset from an S3-compatible store at run time. The operator injects the MinIO credentials into the setup phase via env:
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: example-experiment
  namespace: testkube
spec:
  agentRef:
    name: weather-agent
    namespace: sample-agents
  aiGatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://lgtm.monitoring.svc.cluster.local:4318"
    - name: MINIO_ENDPOINT
      value: "http://minio.storage:9000"
    - name: MINIO_ROOT_USER
      value: "minioadmin"
    - name: MINIO_ROOT_PASSWORD
      value: "minioadmin"
  dataset:
    s3:
      bucket: datasets
      key: weather.csv
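The S3 example above embeds the MinIO credentials in plain text. If the env entries follow the standard Kubernetes EnvVar schema (an assumption; this guide only shows name/value pairs), the credentials could instead be read from a Secret. The Secret name minio-credentials and its keys are hypothetical, and the Secret would need to exist in the namespace where the pipeline pods run:

```yaml
spec:
  env:
    - name: MINIO_ROOT_USER
      valueFrom:
        secretKeyRef:
          name: minio-credentials   # hypothetical Secret name
          key: root-user            # hypothetical key
    - name: MINIO_ROOT_PASSWORD
      valueFrom:
        secretKeyRef:
          name: minio-credentials   # hypothetical Secret name
          key: root-password        # hypothetical key
```

Check the Experiment CRD schema before relying on this; if valueFrom is not accepted, the plain value form shown earlier is the documented option.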
For HTTP datasets, use dataset.url: "https://example.com/dataset.csv" in place of dataset.s3.
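As a minimal sketch of the HTTP variant, the fragment below reuses the example URL from the sentence above and assumes the rest of the spec (agentRef, aiGatewayRef, env) stays as in the earlier examples:

```yaml
spec:
  dataset:
    # Replaces the dataset.s3 block; the URL is the placeholder from the text.
    url: "https://example.com/dataset.csv"
```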
Auto-trigger on agent deployment
Add a trigger block to re-evaluate the agent every time it is redeployed. The operator creates and manages the underlying Testkube TestTrigger for you:
spec:
  trigger:
    enabled: true
    concurrencyPolicy: Allow  # Allow | Forbid | Replace
Set enabled: false to pause auto-execution without deleting the Experiment.
Run on a cron schedule
Add a schedule block to run the evaluation on a fixed cadence:
spec:
  schedule:
    cron: "0 3 * * *"  # daily at 03:00
    timezone: "Europe/Berlin"
trigger and schedule are independent — combine them to run on both events and a regular cadence.
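A combined configuration might look like the following sketch, which simply places the two blocks shown above side by side under spec:

```yaml
spec:
  trigger:
    enabled: true               # re-run when the referenced agent is redeployed
    concurrencyPolicy: Allow    # Allow | Forbid | Replace
  schedule:
    cron: "0 3 * * *"           # daily at 03:00
    timezone: "Europe/Berlin"
```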
Multi-turn conversation scenarios
To test multi-turn conversations, add multiple steps to a single scenario. The A2A protocol maintains conversation context across steps via a contextId:
spec:
  dataset:
    inline:
      llmAsAJudgeModel: gemini-2.5-flash-lite
      defaultThreshold: 0.9
      scenarios:
        - name: "Weather then time in New York"
          steps:
            - input: "What is the weather like in New York right now?"
              reference:
                toolCalls:
                  - name: get_weather
                    args: { city: "New York" }
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy
            - input: "What time is it in New York?"
              reference:
                toolCalls:
                  - name: get_current_time
                    args: { city: "New York" }
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy
Each step within the scenario is sent sequentially to the agent. The agent receives the same contextId for all steps, allowing it to maintain conversation state.
Next steps
- For local CLI evaluations without Kubernetes, see Run Testbench Standalone.
- Learn about the internal data flow and evaluation architecture in Building Block View.
- Explore cross-cutting concerns like tracing and observability in Cross-cutting Concepts.