Create Your First TestWorkflow

This guide walks you through defining an experiment, creating a TestWorkflow, and running it against an agent deployed in your cluster.

Prerequisites

  • Testbench installed (see Install the Testbench)

  • An agent with an A2A protocol endpoint deployed in the cluster

  • An AI Gateway deployed in your cluster

  • An OTLP collector endpoint reachable from the testkube namespace

  • Testkube CLI installed

Understand the pipeline

The Testbench evaluates agents through a pipeline. Each phase is a reusable TestWorkflowTemplate:

  1. Run — sends queries to the agent via the A2A protocol and records responses

  2. Evaluate — scores responses using LLM-as-a-judge metrics

  3. Publish — sends evaluation scores to an OTLP-compatible observability backend

  4. Visualize — generates a self-contained HTML report as a workflow artifact

This guide uses a ConfigMap-based experiment, which is the simplest way to get started. For loading datasets from external sources, see Load a dataset from S3/MinIO (alternative to Step 1) below.
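Conceptually, the four phases chain together like functions, each consuming the previous phase's output. A minimal sketch for orientation only (function names and data shapes are illustrative, not the actual template implementation):

```python
# Illustrative sketch of the four-phase pipeline. The real phases are
# TestWorkflowTemplates; names and data shapes here are simplified.

def run(experiment):
    # Send each step's input to the agent and record responses.
    return [{"input": s["input"], "response": "..."}
            for sc in experiment["scenarios"] for s in sc["steps"]]

def evaluate(responses):
    # Score each response with LLM-as-a-judge metrics (placeholder score).
    return [{**r, "score": 1.0} for r in responses]

def publish(scores):
    # Export scores to an OTLP backend (no-op in this sketch).
    return scores

def visualize(scores):
    # Produce an HTML report summarizing the scores.
    return f"<html><body>{len(scores)} samples</body></html>"

experiment = {"scenarios": [{"name": "demo", "steps": [{"input": "hi"}]}]}
report = visualize(publish(evaluate(run(experiment))))
```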

Step 1: Define your experiment

An experiment is a JSON document that describes what to test. It follows a three-level hierarchy:

  • Experiment — top-level configuration (LLM model, default threshold)

    • Scenario — a named group of steps (e.g., "Weather in New York")

      • Step — a single query with expected reference data and metrics to evaluate
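The hierarchy maps naturally onto nested records. A minimal sketch, assuming illustrative class names (they are not part of the Testbench API), that serializes to the same shape as experiment.json:

```python
from dataclasses import dataclass, field, asdict

# Illustrative dataclasses mirroring the experiment.json hierarchy.
@dataclass
class Step:
    input: str
    reference: dict = field(default_factory=dict)
    metrics: list = field(default_factory=list)

@dataclass
class Scenario:
    name: str
    steps: list

@dataclass
class Experiment:
    llm_as_a_judge_model: str
    default_threshold: float
    scenarios: list

exp = Experiment(
    llm_as_a_judge_model="gemini-2.5-flash-lite",
    default_threshold=0.9,
    scenarios=[Scenario("Weather in New York", [
        Step("What is the weather like in New York right now?",
             reference={"topics": ["weather"]},
             metrics=[{"metric_name": "TopicAdherence"}]),
    ])],
)
doc = asdict(exp)  # nested dict, ready to dump as experiment.json
```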

Create a ConfigMap containing your experiment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: experiment
  namespace: testkube
data:
  experiment.json: |
    {
      "llm_as_a_judge_model": "gemini-2.5-flash-lite",
      "default_threshold": 0.9,
      "scenarios": [
        {
          "name": "Weather in New York",
          "steps": [
            {
              "input": "What is the weather like in New York right now?",
              "reference": {
                "tool_calls": [
                  {
                    "name": "get_weather",
                    "args": {
                      "city": "New York"
                    }
                  }
                ],
                "topics": ["weather"]
              },
              "metrics": [
                {
                  "metric_name": "AgentGoalAccuracyWithoutReference"
                },
                {
                  "metric_name": "ToolCallAccuracy"
                },
                {
                  "metric_name": "TopicAdherence",
                  "parameters": {
                    "mode": "precision"
                  }
                }
              ]
            }
          ]
        }
      ]
    }

Apply it:

kubectl apply -f experiment.yaml
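A malformed experiment only fails once the workflow runs, so it can be worth sanity-checking the JSON locally first. A quick optional script, assuming only the three-level structure described above:

```python
import json

def check_experiment(doc: dict) -> bool:
    """Raise AssertionError if the three-level hierarchy is incomplete."""
    assert doc.get("scenarios"), "experiment needs at least one scenario"
    for sc in doc["scenarios"]:
        assert sc.get("name"), "scenario needs a name"
        assert sc.get("steps"), f"scenario {sc['name']!r} has no steps"
        for step in sc["steps"]:
            assert step.get("input"), "step needs an input"
            assert step.get("metrics"), "step needs at least one metric"
    return True

sample = json.loads("""
{"llm_as_a_judge_model": "gemini-2.5-flash-lite",
 "default_threshold": 0.9,
 "scenarios": [{"name": "Weather in New York",
                "steps": [{"input": "What is the weather like in New York right now?",
                           "metrics": [{"metric_name": "ToolCallAccuracy"}]}]}]}
""")
ok = check_experiment(sample)
```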

Available metrics

The following table lists commonly used metrics provided by RAGAS, the default framework adapter. All metrics are resolved through the GenericMetricsRegistry, which supports pluggable adapters — you can extend the system with custom metrics by implementing your own FrameworkAdapter.

| Metric | Description | Required reference fields |
| --- | --- | --- |
| AgentGoalAccuracyWithoutReference | Whether the agent achieved its goal, judged without a reference answer | None |
| ToolCallAccuracy | Whether the agent called the correct tools with the correct arguments | reference.tool_calls |
| TopicAdherence | Whether the response stays on the specified topics | reference.topics |
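Because some metrics require reference data, a pre-flight check can flag steps that request a metric without supplying its field. A small sketch (the mapping mirrors the metric descriptions above; the function is illustrative, not part of the Testbench):

```python
# Required reference field per metric, as documented above.
REQUIRED_FIELDS = {
    "AgentGoalAccuracyWithoutReference": None,
    "ToolCallAccuracy": "tool_calls",
    "TopicAdherence": "topics",
}

def missing_references(step: dict) -> list:
    """Return metric names whose required reference field is absent."""
    ref = step.get("reference", {})
    problems = []
    for m in step.get("metrics", []):
        needed = REQUIRED_FIELDS.get(m["metric_name"])
        if needed and needed not in ref:
            problems.append(m["metric_name"])
    return problems

step = {
    "input": "What is the weather like in New York right now?",
    "reference": {"topics": ["weather"]},  # tool_calls deliberately omitted
    "metrics": [{"metric_name": "ToolCallAccuracy"},
                {"metric_name": "TopicAdherence"}],
}
missing = missing_references(step)
```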

Step 2: Configure the OTLP endpoint

Create a ConfigMap that tells the pipeline where to send metrics:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
  namespace: testkube
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://lgtm.monitoring.svc.cluster.local:4318"

Apply it:

kubectl apply -f otel-config.yaml
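A common pitfall here is mixing up the conventional OTLP collector ports: 4318 serves OTLP over HTTP, while 4317 serves OTLP over gRPC. A tiny check like this can flag a likely protocol mismatch before the publish phase fails (illustrative only):

```python
from urllib.parse import urlparse

def otlp_protocol_hint(endpoint: str) -> str:
    """Guess the OTLP transport from the conventional port number."""
    port = urlparse(endpoint).port
    if port == 4318:
        return "http"   # matches an OTLP/HTTP exporter
    if port == 4317:
        return "grpc"   # needs a gRPC exporter, not HTTP
    return "unknown"

hint = otlp_protocol_hint("http://lgtm.monitoring.svc.cluster.local:4318")
```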

Step 3: Create the TestWorkflow

The TestWorkflow ties everything together. It mounts the experiment ConfigMap, injects the OTLP endpoint, and chains the pipeline templates:

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: example-workflow (1)
  namespace: testkube
  labels:
    testkube.io/test-category: ragas-evaluation
    app: testworkflows

spec:
  content:
    files:
      - path: /data/datasets/experiment.json (2)
        contentFrom:
          configMapKeyRef:
            name: experiment
            key: experiment.json

  container:
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT (3)
        valueFrom:
          configMapKeyRef:
            name: otel-config
            key: OTEL_EXPORTER_OTLP_ENDPOINT

  use:
  - name: run-template (4)
    config:
      agentUrl: "http://weather-agent.sample-agents:8000" (5)
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template
1 A unique name for your workflow
2 Mounts the experiment JSON from the ConfigMap into the shared data volume
3 Injects the OTLP endpoint as an environment variable for the publish phase
4 Templates are executed in order: run → evaluate → publish → visualize
5 The A2A endpoint of the agent you want to evaluate

Apply it:

kubectl apply -f example-workflow.yaml

Step 4: Run and monitor the workflow

Start the workflow:

kubectl testkube run testworkflow example-workflow --watch

View the execution details and logs after completion (executions are numbered, so the first run is example-workflow-1):

kubectl testkube get testworkflowexecution example-workflow-1

Step 5: View results

Grafana dashboards

If you installed the Grafana dashboard ConfigMap (see Install the Testbench), open Grafana and look for the Testkube Evaluation dashboard. It displays per-metric scores filtered by workflow name.

HTML report artifact

The visualize phase produces a self-contained HTML report as a workflow artifact. Download it with:

kubectl testkube download artifacts example-workflow-1

The report includes:

  • Summary cards with total samples and metrics count

  • Horizontal bar charts showing mean score per metric

  • Metric distribution histograms with statistics

  • A searchable, sortable results table with all evaluations
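The per-metric mean shown in the bar charts is a plain average over all evaluated samples. For reference, the same aggregation in a few lines (the record shape is illustrative):

```python
from collections import defaultdict

def mean_scores(results: list) -> dict:
    """Average score per metric across all evaluation records."""
    by_metric = defaultdict(list)
    for r in results:
        by_metric[r["metric_name"]].append(r["score"])
    return {m: sum(v) / len(v) for m, v in by_metric.items()}

results = [
    {"metric_name": "ToolCallAccuracy", "score": 1.0},
    {"metric_name": "ToolCallAccuracy", "score": 0.5},
    {"metric_name": "TopicAdherence", "score": 0.9},
]
means = mean_scores(results)
```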

Load a dataset from S3/MinIO (alternative to Step 1)

Instead of embedding the experiment in a ConfigMap, you can load a dataset from an S3-compatible store using the setup-template. Omit the content.files section and prepend setup-template to the use list:

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: s3-workflow
  namespace: testkube
spec:
  container:
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        valueFrom:
          configMapKeyRef:
            name: otel-config
            key: OTEL_EXPORTER_OTLP_ENDPOINT
      - name: MINIO_ENDPOINT
        value: "http://minio.storage:9000"
      - name: MINIO_ROOT_USER
        value: "minioadmin"
      - name: MINIO_ROOT_PASSWORD
        value: "minioadmin"

  use:
  - name: setup-template
    config:
      datasetUrl: "http://data-server.data-server:8000/dataset.csv"
  - name: run-template
    config:
      agentUrl: "http://weather-agent.sample-agents:8000"
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template

The setup-template downloads the dataset from the configured datasetUrl and stages it for the run phase, taking the place of the ConfigMap mount from Step 1.
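The exact dataset schema the setup-template expects is not shown here; as an illustration of the general idea, a tabular row can be mapped to an experiment step like this (the column names are assumptions, not the real schema):

```python
import csv
import io

# Hypothetical CSV columns: input, expected_tool, expected_city.
CSV_DATA = """input,expected_tool,expected_city
What is the weather like in New York right now?,get_weather,New York
"""

def rows_to_steps(text: str) -> list:
    """Map each CSV row to an experiment step with a tool-call reference."""
    steps = []
    for row in csv.DictReader(io.StringIO(text)):
        steps.append({
            "input": row["input"],
            "reference": {"tool_calls": [
                {"name": row["expected_tool"],
                 "args": {"city": row["expected_city"]}},
            ]},
            "metrics": [{"metric_name": "ToolCallAccuracy"}],
        })
    return steps

steps = rows_to_steps(CSV_DATA)
```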

Auto-trigger on agent deployment (optional)

You can automatically run the evaluation workflow whenever the agent under test is redeployed. Create a TestTrigger:

apiVersion: tests.testkube.io/v1
kind: TestTrigger
metadata:
  name: example-workflow-trigger
  namespace: testkube
spec:
  resource: deployment
  resourceSelector:
    name: weather-agent
    namespace: sample-agents
  event: modified
  action: run
  execution: testworkflow
  concurrencyPolicy: allow
  testSelector:
    name: example-workflow
    namespace: testkube
  disabled: false

This trigger watches the weather-agent Deployment in the sample-agents namespace and runs the workflow on every modification.

Multi-turn conversation scenarios

To test multi-turn conversations, add multiple steps to a single scenario. The A2A protocol maintains conversation context across steps via a context_id:

{
  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
  "default_threshold": 0.9,
  "scenarios": [
    {
      "name": "Weather then time in New York",
      "steps": [
        {
          "input": "What is the weather like in New York right now?",
          "reference": {
            "tool_calls": [
              { "name": "get_weather", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        },
        {
          "input": "What time is it in New York?",
          "reference": {
            "tool_calls": [
              { "name": "get_current_time", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        }
      ]
    }
  ]
}

Each step within the scenario is sent sequentially to the agent. The agent receives the same context_id for all steps, allowing it to maintain conversation state.
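To make the context sharing concrete, here is a sketch of how a client might build sequential message payloads that reuse one context_id (the payload shape is simplified for illustration, not the full A2A schema):

```python
import uuid

def build_turns(scenario: dict) -> list:
    """Build one payload per step, all sharing a single context_id."""
    context_id = str(uuid.uuid4())  # generated once per scenario
    return [
        {"context_id": context_id, "message": step["input"]}
        for step in scenario["steps"]
    ]

scenario = {
    "name": "Weather then time in New York",
    "steps": [
        {"input": "What is the weather like in New York right now?"},
        {"input": "What time is it in New York?"},
    ],
}
turns = build_turns(scenario)  # send these sequentially to the agent
```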

Next steps