Create Your First TestWorkflow

This guide walks you through defining an experiment, creating a TestWorkflow, and running it against an agent deployed in your cluster.

Prerequisites

  • Testbench installed (see Install the Testbench)

  • An agent with an A2A protocol endpoint deployed in the cluster

  • An AI Gateway deployed in your cluster

  • An OTLP collector endpoint reachable from the testkube namespace

  • Testkube CLI installed

Understand the pipeline

The Testbench evaluates agents through a pipeline. Each phase is a reusable TestWorkflowTemplate:

  1. Run — sends queries to the agent via the A2A protocol and records responses

  2. Evaluate — scores responses using LLM-as-a-judge metrics

  3. Publish — sends evaluation scores to an OTLP-compatible observability backend

  4. Visualize — generates a self-contained HTML report as a workflow artifact

This guide uses a ConfigMap-based experiment, which is the simplest way to get started. For loading datasets from external sources, see Load a dataset from S3/MinIO (alternative to Step 1) below.
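Conceptually, the four phases chain together like functions, each consuming the previous phase's output. A minimal sketch for orientation only (function names and data shapes are illustrative, not the actual template implementation):

```python
# Illustrative sketch of the four-phase pipeline. The real phases are
# TestWorkflowTemplates; names and data shapes here are simplified.

def run(experiment):
    # Send each step's input to the agent and record responses.
    return [{"input": s["input"], "response": "..."}
            for sc in experiment["scenarios"] for s in sc["steps"]]

def evaluate(responses):
    # Score each response with LLM-as-a-judge metrics (placeholder score).
    return [{**r, "score": 1.0} for r in responses]

def publish(scores):
    # Export scores to an OTLP backend (no-op in this sketch).
    return scores

def visualize(scores):
    # Produce an HTML report summarizing the scores.
    return f"<html><body>{len(scores)} samples</body></html>"

experiment = {"scenarios": [{"name": "demo", "steps": [{"input": "hi"}]}]}
report = visualize(publish(evaluate(run(experiment))))
```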

Step 1: Define your experiment

An experiment is a JSON document that describes what to test. It follows a three-level hierarchy:

  • Experiment — top-level configuration (LLM model, default threshold)

    • Scenario — a named group of steps (e.g., "Weather in New York")

      • Step — a single query with expected reference data and metrics to evaluate
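The hierarchy maps naturally onto nested records. A minimal sketch, assuming illustrative class names (they are not part of the Testbench API), that serializes to the same shape as experiment.json:

```python
from dataclasses import dataclass, field, asdict

# Illustrative dataclasses mirroring the experiment.json hierarchy.
@dataclass
class Step:
    input: str
    reference: dict = field(default_factory=dict)
    metrics: list = field(default_factory=list)

@dataclass
class Scenario:
    name: str
    steps: list

@dataclass
class Experiment:
    llm_as_a_judge_model: str
    default_threshold: float
    scenarios: list

exp = Experiment(
    llm_as_a_judge_model="gemini-2.5-flash-lite",
    default_threshold=0.9,
    scenarios=[Scenario("Weather in New York", [
        Step("What is the weather like in New York right now?",
             reference={"topics": ["weather"]},
             metrics=[{"metric_name": "TopicAdherence"}]),
    ])],
)
doc = asdict(exp)  # nested dict, ready to dump as experiment.json
```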

Create a ConfigMap containing your experiment:

apiVersion: v1
kind: ConfigMap
metadata:
  name: experiment
  namespace: testkube
data:
  experiment.json: |
    {
      "llm_as_a_judge_model": "gemini-2.5-flash-lite",
      "default_threshold": 0.9,
      "scenarios": [
        {
          "name": "Weather in New York",
          "steps": [
            {
              "input": "What is the weather like in New York right now?",
              "reference": {
                "tool_calls": [
                  {
                    "name": "get_weather",
                    "args": {
                      "city": "New York"
                    }
                  }
                ],
                "topics": ["weather"]
              },
              "metrics": [
                {
                  "metric_name": "AgentGoalAccuracyWithoutReference"
                },
                {
                  "metric_name": "ToolCallAccuracy"
                },
                {
                  "metric_name": "TopicAdherence",
                  "parameters": {
                    "mode": "precision"
                  }
                }
              ]
            }
          ]
        }
      ]
    }

Apply it:

kubectl apply -f experiment.yaml
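A malformed experiment only fails once the workflow runs, so it can be worth sanity-checking the JSON locally first. A quick optional script, assuming only the three-level structure described above:

```python
import json

def check_experiment(doc: dict) -> bool:
    """Raise AssertionError if the three-level hierarchy is incomplete."""
    assert doc.get("scenarios"), "experiment needs at least one scenario"
    for sc in doc["scenarios"]:
        assert sc.get("name"), "scenario needs a name"
        assert sc.get("steps"), f"scenario {sc['name']!r} has no steps"
        for step in sc["steps"]:
            assert step.get("input"), "step needs an input"
            assert step.get("metrics"), "step needs at least one metric"
    return True

sample = json.loads("""
{"llm_as_a_judge_model": "gemini-2.5-flash-lite",
 "default_threshold": 0.9,
 "scenarios": [{"name": "Weather in New York",
                "steps": [{"input": "What is the weather like in New York right now?",
                           "metrics": [{"metric_name": "ToolCallAccuracy"}]}]}]}
""")
ok = check_experiment(sample)
```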

Available metrics

The following table lists commonly used metrics provided by RAGAS, the default framework adapter. All metrics are resolved through the GenericMetricsRegistry, which supports pluggable adapters — you can extend the system with custom metrics by implementing your own FrameworkAdapter.

| Metric | Description | Required reference fields |
| --- | --- | --- |
| AgentGoalAccuracyWithoutReference | Whether the agent achieved its goal, judged without a reference answer | None |
| ToolCallAccuracy | Whether the agent called the correct tools with the correct arguments | reference.tool_calls |
| TopicAdherence | Whether the response stays on the specified topics | reference.topics |
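Because some metrics require reference data, a pre-flight check can flag steps that request a metric without supplying its field. A small sketch (the mapping mirrors the metric descriptions above; the function is illustrative, not part of the Testbench):

```python
# Required reference field per metric, as documented above.
REQUIRED_FIELDS = {
    "AgentGoalAccuracyWithoutReference": None,
    "ToolCallAccuracy": "tool_calls",
    "TopicAdherence": "topics",
}

def missing_references(step: dict) -> list:
    """Return metric names whose required reference field is absent."""
    ref = step.get("reference", {})
    problems = []
    for m in step.get("metrics", []):
        needed = REQUIRED_FIELDS.get(m["metric_name"])
        if needed and needed not in ref:
            problems.append(m["metric_name"])
    return problems

step = {
    "input": "What is the weather like in New York right now?",
    "reference": {"topics": ["weather"]},  # tool_calls deliberately omitted
    "metrics": [{"metric_name": "ToolCallAccuracy"},
                {"metric_name": "TopicAdherence"}],
}
missing = missing_references(step)
```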

Step 2: Configure the OTLP endpoint

Create a ConfigMap that tells the pipeline where to send metrics:

apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
  namespace: testkube
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://lgtm.monitoring.svc.cluster.local:4318"

Apply it:

kubectl apply -f otel-config.yaml
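A common pitfall here is mixing up the conventional OTLP collector ports: 4318 serves OTLP over HTTP, while 4317 serves OTLP over gRPC. A tiny check like this can flag a likely protocol mismatch before the publish phase fails (illustrative only):

```python
from urllib.parse import urlparse

def otlp_protocol_hint(endpoint: str) -> str:
    """Guess the OTLP transport from the conventional port number."""
    port = urlparse(endpoint).port
    if port == 4318:
        return "http"   # matches an OTLP/HTTP exporter
    if port == 4317:
        return "grpc"   # needs a gRPC exporter, not HTTP
    return "unknown"

hint = otlp_protocol_hint("http://lgtm.monitoring.svc.cluster.local:4318")
```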

Step 3: Create the TestWorkflow

The TestWorkflow ties everything together. It mounts the experiment ConfigMap, injects the OTLP endpoint, and chains the pipeline templates:

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: example-workflow (1)
  namespace: testkube
  labels:
    testkube.io/test-category: ragas-evaluation
    app: testworkflows

spec:
  content:
    files:
      - path: /data/datasets/experiment.json (2)
        contentFrom:
          configMapKeyRef:
            name: experiment
            key: experiment.json

  container:
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT (3)
        valueFrom:
          configMapKeyRef:
            name: otel-config
            key: OTEL_EXPORTER_OTLP_ENDPOINT

  use:
  - name: run-template (4)
    config:
      agentUrl: "http://weather-agent.sample-agents:8000" (5)
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template
1 A unique name for your workflow
2 Mounts the experiment JSON from the ConfigMap into the shared data volume
3 Injects the OTLP endpoint as an environment variable for the publish phase
4 Templates are executed in order: run → evaluate → publish → visualize
5 The A2A endpoint of the agent you want to evaluate

Apply it:

kubectl apply -f example-workflow.yaml

Step 4: Run and monitor the workflow

Start the workflow:

kubectl testkube run testworkflow example-workflow --watch

View the execution details and logs after completion (executions are numbered, so the first run is example-workflow-1):

kubectl testkube get testworkflowexecution example-workflow-1

Step 5: View results

Grafana dashboards

If you installed the Grafana dashboard ConfigMap (see Install the Testbench), open Grafana and look for the Testkube Evaluation dashboard. It displays per-metric scores filtered by workflow name.

HTML report artifact

The visualize phase produces a self-contained HTML report as a workflow artifact. Download it with:

kubectl testkube download artifacts example-workflow-1

The report includes:

  • Summary cards with total samples and metrics count

  • Horizontal bar charts showing mean score per metric

  • Metric distribution histograms with statistics

  • A searchable, sortable results table with all evaluations
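The per-metric mean shown in the bar charts is a plain average over all evaluated samples. For reference, the same aggregation in a few lines (the record shape is illustrative):

```python
from collections import defaultdict

def mean_scores(results: list) -> dict:
    """Average score per metric across all evaluation records."""
    by_metric = defaultdict(list)
    for r in results:
        by_metric[r["metric_name"]].append(r["score"])
    return {m: sum(v) / len(v) for m, v in by_metric.items()}

results = [
    {"metric_name": "ToolCallAccuracy", "score": 1.0},
    {"metric_name": "ToolCallAccuracy", "score": 0.5},
    {"metric_name": "TopicAdherence", "score": 0.9},
]
means = mean_scores(results)
```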

Load a dataset from S3/MinIO (alternative to Step 1)

Instead of embedding the experiment in a ConfigMap, you can load a dataset from an S3-compatible store using the setup-template. Omit the content.files section and prepend setup-template to the use list:

apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: s3-workflow
  namespace: testkube
spec:
  container:
    env:
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        valueFrom:
          configMapKeyRef:
            name: otel-config
            key: OTEL_EXPORTER_OTLP_ENDPOINT
      - name: MINIO_ENDPOINT
        value: "http://minio.storage:9000"
      - name: MINIO_ROOT_USER
        value: "minioadmin"
      - name: MINIO_ROOT_PASSWORD
        value: "minioadmin"

  use:
  - name: setup-template
    config:
      datasetUrl: "http://data-server.data-server:8000/dataset.csv"
  - name: run-template
    config:
      agentUrl: "http://weather-agent.sample-agents:8000"
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template

The setup-template downloads the dataset from the configured datasetUrl and stages it for the run phase, taking the place of the ConfigMap mount from Step 1.
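The exact dataset schema the setup-template expects is not shown here; as an illustration of the general idea, a tabular row can be mapped to an experiment step like this (the column names are assumptions, not the real schema):

```python
import csv
import io

# Hypothetical CSV columns: input, expected_tool, expected_city.
CSV_DATA = """input,expected_tool,expected_city
What is the weather like in New York right now?,get_weather,New York
"""

def rows_to_steps(text: str) -> list:
    """Map each CSV row to an experiment step with a tool-call reference."""
    steps = []
    for row in csv.DictReader(io.StringIO(text)):
        steps.append({
            "input": row["input"],
            "reference": {"tool_calls": [
                {"name": row["expected_tool"],
                 "args": {"city": row["expected_city"]}},
            ]},
            "metrics": [{"metric_name": "ToolCallAccuracy"}],
        })
    return steps

steps = rows_to_steps(CSV_DATA)
```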

Auto-trigger on agent deployment (optional)

You can automatically run the evaluation workflow whenever the agent under test is redeployed. Create a TestTrigger:

apiVersion: tests.testkube.io/v1
kind: TestTrigger
metadata:
  name: example-workflow-trigger
  namespace: testkube
spec:
  resource: deployment
  resourceSelector:
    name: weather-agent
    namespace: sample-agents
  event: modified
  action: run
  execution: testworkflow
  concurrencyPolicy: allow
  testSelector:
    name: example-workflow
    namespace: testkube
  disabled: false

This trigger watches the weather-agent Deployment in the sample-agents namespace and runs the workflow on every modification.

Multi-turn conversation scenarios

To test multi-turn conversations, add multiple steps to a single scenario. The A2A protocol maintains conversation context across steps via a context_id:

{
  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
  "default_threshold": 0.9,
  "scenarios": [
    {
      "name": "Weather then time in New York",
      "steps": [
        {
          "input": "What is the weather like in New York right now?",
          "reference": {
            "tool_calls": [
              { "name": "get_weather", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        },
        {
          "input": "What time is it in New York?",
          "reference": {
            "tool_calls": [
              { "name": "get_current_time", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        }
      ]
    }
  ]
}

Each step within the scenario is sent sequentially to the agent. The agent receives the same context_id for all steps, allowing it to maintain conversation state.
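To make the context sharing concrete, here is a sketch of how a client might build sequential message payloads that reuse one context_id (the payload shape is simplified for illustration, not the full A2A schema):

```python
import uuid

def build_turns(scenario: dict) -> list:
    """Build one payload per step, all sharing a single context_id."""
    context_id = str(uuid.uuid4())  # generated once per scenario
    return [
        {"context_id": context_id, "message": step["input"]}
        for step in scenario["steps"]
    ]

scenario = {
    "name": "Weather then time in New York",
    "steps": [
        {"input": "What is the weather like in New York right now?"},
        {"input": "What time is it in New York?"},
    ],
}
turns = build_turns(scenario)  # send these sequentially to the agent
```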

Next steps