Create Your First Experiment

This guide walks you through creating an Experiment custom resource that evaluates an agent deployed in your cluster. The Testbench operator reconciles the Experiment into the underlying Testkube TestWorkflow, the dataset ConfigMap, and an optional TestTrigger, so a single YAML manifest is all you write.

Prerequisites

  • Testbench installed (see Install the Testbench)

  • An Agent reachable via the A2A protocol deployed in the cluster

  • An AiGateway deployed in the cluster (used as the LLM-as-a-judge endpoint)

  • An OTLP collector endpoint reachable from the testkube namespace

  • Testkube CLI installed (only needed for the optional manual run in Step 3 and the artifact download in Step 4)

Understand the model

An Experiment is a custom resource that describes what to test. The operator translates it into the resources Testkube needs to run the evaluation pipeline. Its spec has the following fields:

  • agentRef — which agent to evaluate

  • aiGatewayRef — which AI Gateway provides the judge model

  • dataset — the scenarios and metrics to evaluate (inline, URL, or S3)

  • env — extra environment variables for the pipeline pods (e.g. OTEL_EXPORTER_OTLP_ENDPOINT)

  • trigger (optional) — re-run automatically when the referenced agent is redeployed

  • schedule (optional) — re-run on a cron schedule

Each Experiment reconciles into one TestWorkflow plus a generated dataset ConfigMap. You do not need to write either of them by hand.

Step 1: Define your Experiment

Scenarios and metrics live under dataset.inline. The hierarchy has three levels:

  • Experiment — top-level configuration (judge model, default threshold)

    • Scenario — a named group of steps (e.g., "Weather in New York")

      • Step — a single query with expected reference data and metrics to evaluate

Create the Experiment:

apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: example-experiment # (1)
  namespace: testkube
spec:
  agentRef: # (2)
    name: weather-agent
    namespace: sample-agents
  aiGatewayRef: # (3)
    name: ai-gateway
    namespace: ai-gateway
  env: # (4)
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://lgtm.monitoring.svc.cluster.local:4318"
  dataset:
    inline: # (5)
      llmAsAJudgeModel: gemini-2.5-flash-lite
      defaultThreshold: 0.9
      scenarios:
        - name: "Weather in New York"
          steps:
            - input: "What is the weather like in New York right now?"
              reference:
                toolCalls:
                  - name: get_weather
                    args:
                      city: "New York"
                topics:
                  - weather
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy
                - metricName: TopicAdherence
                  parameters:
                    mode: precision
(1) A unique name, used as the prefix for the generated TestWorkflow and ConfigMap.
(2) The A2A-capable Agent to evaluate. The operator resolves its endpoint automatically.
(3) The AiGateway providing the LLM judge. The operator injects its base URL into the evaluate phase.
(4) Pipeline-pod environment variables. The publish phase reads OTEL_EXPORTER_OTLP_ENDPOINT from here.
(5) Inline dataset. Fields use Kubernetes camelCase (llmAsAJudgeModel, toolCalls, metricName).

Apply it:

kubectl apply -f example-experiment.yaml

The operator immediately reconciles the Experiment into a TestWorkflow named <experiment-name>-<experiment-namespace>-workflow in the testkube namespace, plus a sibling ConfigMap holding the rendered Experiment JSON.
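Because the workflow name follows a fixed pattern, you can derive it up front instead of looking it up. The helper below is a minimal sketch based only on the pattern stated above; the function name workflow_name is illustrative and not part of Testbench or Testkube.

```shell
# Build the generated TestWorkflow name from the documented pattern:
#   <experiment-name>-<experiment-namespace>-workflow
# Usage: workflow_name <experiment-name> <experiment-namespace>
workflow_name() {
  printf '%s-%s-workflow\n' "$1" "$2"
}

WORKFLOW="$(workflow_name example-experiment testkube)"
echo "$WORKFLOW"   # prints example-experiment-testkube-workflow

# Look the workflow up directly (requires cluster access):
# kubectl get testworkflow "$WORKFLOW" -n testkube
```

The name of the sibling ConfigMap is not spelled out here, so the sketch does not try to guess it.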

Available metrics

The following table lists commonly used metrics provided by RAGAS, the default framework adapter. All metrics are resolved through the GenericMetricsRegistry, which supports pluggable adapters — you can extend the system with custom metrics by implementing your own FrameworkAdapter.

| Metric | Description | Required reference fields |
|---|---|---|
| AgentGoalAccuracyWithoutReference | Whether the agent achieved its goal, judged without a reference answer | None |
| ToolCallAccuracy | Whether the agent called the correct tools with the correct arguments | reference.toolCalls |
| TopicAdherence | Whether the response stays on the specified topics | reference.topics |
| Faithfulness | Whether the response is grounded in the retrieved context (no hallucination) | reference.retrievedContexts |

Step 2: Verify the generated resources

Confirm the operator produced a TestWorkflow for your Experiment:

kubectl get testworkflows -n testkube -l testbench.agentic-layer.ai/experiment=example-experiment

Inspect the Experiment status if anything looks off — the operator records reconcile errors as events:

kubectl describe experiment example-experiment -n testkube
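If the describe output is long, you can list only the events recorded for the Experiment. The sketch below uses standard kubectl event field selectors; the function name experiment_events is illustrative, and it assumes the operator attaches its events to the Experiment object in the Experiment's own namespace.

```shell
# List the events attached to one Experiment, sorted by last timestamp.
# Usage: experiment_events <experiment-name> [namespace]
experiment_events() {
  name="$1"
  namespace="${2:-testkube}"   # defaults to the namespace used in this guide
  kubectl get events -n "$namespace" \
    --field-selector "involvedObject.kind=Experiment,involvedObject.name=$name" \
    --sort-by=.lastTimestamp
}

# Example (requires cluster access):
# experiment_events example-experiment
```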

Step 3: Run the workflow

Trigger the generated TestWorkflow once, on demand:

kubectl testkube run testworkflow <generated-workflow-name> --watch

Replace <generated-workflow-name> with the value from Step 2. If you configured trigger or schedule in the Experiment (see Auto-trigger on agent deployment and Run on a cron schedule), the operator runs the workflow automatically — you can skip this manual invocation.
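To avoid copying the name by hand, the lookup from Step 2 and the run command can be chained. This is a sketch, assuming the label selector from Step 2 matches exactly one TestWorkflow; the function name run_experiment_workflow is illustrative.

```shell
# Resolve the generated TestWorkflow via the Experiment label, then run it.
# Usage: run_experiment_workflow <experiment-name> [namespace]
run_experiment_workflow() {
  experiment="$1"
  namespace="${2:-testkube}"

  # Take the first TestWorkflow carrying the Experiment label (see Step 2).
  workflow="$(kubectl get testworkflows -n "$namespace" \
    -l "testbench.agentic-layer.ai/experiment=$experiment" \
    -o jsonpath='{.items[0].metadata.name}')"

  if [ -z "$workflow" ]; then
    echo "no TestWorkflow found for experiment $experiment" >&2
    return 1
  fi

  kubectl testkube run testworkflow "$workflow" --watch
}

# Example (requires cluster access and the Testkube CLI):
# run_experiment_workflow example-experiment
```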

Step 4: View results

Grafana dashboards

If you installed the Grafana dashboard ConfigMap (see Install the Testbench), open Grafana and look for the Testkube Evaluation dashboard. It displays per-metric scores filtered by workflow name.

HTML report artifact

The visualize phase produces a self-contained HTML report as a workflow artifact. Download it by execution name, which is the workflow name followed by the execution number (-1 for the first run):

kubectl testkube download artifacts <generated-workflow-name>-1

The report includes:

  • Summary cards with total samples and metrics count

  • Horizontal bar charts showing mean score per metric

  • Metric distribution histograms with statistics

  • A searchable, sortable results table with all evaluations

Load a dataset from S3/MinIO

Swap dataset.inline for dataset.s3 to pull the dataset from an S3-compatible store at run time. The operator injects the MinIO credentials into the setup phase via env:

apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: example-experiment
  namespace: testkube
spec:
  agentRef:
    name: weather-agent
    namespace: sample-agents
  aiGatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://lgtm.monitoring.svc.cluster.local:4318"
    - name: MINIO_ENDPOINT
      value: "http://minio.storage:9000"
    - name: MINIO_ROOT_USER
      value: "minioadmin"
    - name: MINIO_ROOT_PASSWORD
      value: "minioadmin"
  dataset:
    s3:
      bucket: datasets
      key: weather.csv

To load the dataset over HTTP instead, set dataset.url (for example, "https://example.com/dataset.csv") in place of dataset.s3.
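For reference, a complete manifest using the URL source might look like the following. This is a sketch that assumes dataset.url takes a plain string, as written above, and that the rest of the spec is unchanged from the earlier examples; the file name is arbitrary.

```shell
# Write an Experiment manifest that loads its dataset from a URL.
cat > example-experiment-url.yaml <<'EOF'
apiVersion: testbench.agentic-layer.ai/v1alpha1
kind: Experiment
metadata:
  name: example-experiment
  namespace: testkube
spec:
  agentRef:
    name: weather-agent
    namespace: sample-agents
  aiGatewayRef:
    name: ai-gateway
    namespace: ai-gateway
  env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: "http://lgtm.monitoring.svc.cluster.local:4318"
  dataset:
    url: "https://example.com/dataset.csv"
EOF

# Apply it once the file looks right (requires cluster access):
# kubectl apply -f example-experiment-url.yaml
```

No MinIO variables are needed in env for this variant, since they are only described for the S3 source.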

Auto-trigger on agent deployment

Add a trigger block to re-evaluate the agent every time it is redeployed. The operator creates and manages the underlying Testkube TestTrigger for you:

spec:
  trigger:
    enabled: true
    concurrencyPolicy: Allow # Allow | Forbid | Replace

Set enabled: false to pause auto-execution without deleting the Experiment.
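If you prefer not to edit and re-apply the manifest, the flag can also be flipped in place with a merge patch. This is a sketch using the standard kubectl patch command; the function name set_trigger_enabled is illustrative.

```shell
# Toggle spec.trigger.enabled on an existing Experiment.
# Usage: set_trigger_enabled <experiment-name> <true|false> [namespace]
set_trigger_enabled() {
  name="$1"
  enabled="$2"
  namespace="${3:-testkube}"

  # Reject anything that is not a JSON boolean literal.
  case "$enabled" in
    true|false) ;;
    *) echo "enabled must be true or false" >&2; return 2 ;;
  esac

  kubectl patch experiment "$name" -n "$namespace" --type merge \
    -p "{\"spec\":{\"trigger\":{\"enabled\":$enabled}}}"
}

# Example (requires cluster access):
# set_trigger_enabled example-experiment false
```

A later kubectl apply of a manifest that sets a different value overwrites the patched field.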

Run on a cron schedule

Add a schedule block to run the evaluation on a fixed cadence:

spec:
  schedule:
    cron: "0 3 * * *"    # daily at 03:00
    timezone: "Europe/Berlin"

The trigger and schedule blocks are independent. Combine them to run the evaluation both on agent redeployments and on a regular cadence.
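A combined configuration can be added to an existing Experiment as a merge patch. The sketch below reuses the field values from the two snippets above; the patch file name and the function name apply_automation_patch are illustrative.

```shell
# Write a patch that enables both the redeploy trigger and the cron schedule.
cat > experiment-automation-patch.yaml <<'EOF'
spec:
  trigger:
    enabled: true
    concurrencyPolicy: Allow
  schedule:
    cron: "0 3 * * *"
    timezone: "Europe/Berlin"
EOF

# Merge the patch into an Experiment.
# Usage: apply_automation_patch <experiment-name> [namespace]
apply_automation_patch() {
  kubectl patch experiment "$1" -n "${2:-testkube}" \
    --type merge --patch-file experiment-automation-patch.yaml
}

# Example (requires cluster access):
# apply_automation_patch example-experiment
```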

Multi-turn conversation scenarios

To test multi-turn conversations, add multiple steps to a single scenario. The A2A protocol maintains conversation context across steps via a contextId:

spec:
  dataset:
    inline:
      llmAsAJudgeModel: gemini-2.5-flash-lite
      defaultThreshold: 0.9
      scenarios:
        - name: "Weather then time in New York"
          steps:
            - input: "What is the weather like in New York right now?"
              reference:
                toolCalls:
                  - name: get_weather
                    args: { city: "New York" }
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy
            - input: "What time is it in New York?"
              reference:
                toolCalls:
                  - name: get_current_time
                    args: { city: "New York" }
              metrics:
                - metricName: AgentGoalAccuracyWithoutReference
                - metricName: ToolCallAccuracy

Each step within the scenario is sent sequentially to the agent. The agent receives the same contextId for all steps, allowing it to maintain conversation state.

Next steps