Create Your First TestWorkflow
This guide walks you through defining an experiment, creating a TestWorkflow, and running it against an agent deployed in your cluster.
Prerequisites
- Testbench installed (see Install the Testbench)
- An agent with an A2A protocol endpoint deployed in the cluster
- An AI Gateway deployed in your cluster
- An OTLP collector endpoint reachable from the testkube namespace
- Testkube CLI installed
Understand the pipeline
The Testbench evaluates agents through a pipeline. Each phase is a reusable TestWorkflowTemplate:
- Run — sends queries to the agent via the A2A protocol and records responses
- Evaluate — scores responses using LLM-as-a-judge metrics
- Publish — sends evaluation scores to an OTLP-compatible observability backend
- Visualize — generates a self-contained HTML report as a workflow artifact
Note: This guide uses a ConfigMap-based experiment, which is the simplest way to get started. For loading datasets from external sources, see Load a dataset from S3/MinIO (alternative to Step 1) below.
Step 1: Define your experiment
An experiment is a JSON document that describes what to test. It follows a three-level hierarchy:
- Experiment — top-level configuration (LLM model, default threshold)
- Scenario — a named group of steps (e.g., "Weather in New York")
- Step — a single query with expected reference data and metrics to evaluate
Create a ConfigMap containing your experiment:
apiVersion: v1
kind: ConfigMap
metadata:
  name: experiment
  namespace: testkube
data:
  experiment.json: |
    {
      "llm_as_a_judge_model": "gemini-2.5-flash-lite",
      "default_threshold": 0.9,
      "scenarios": [
        {
          "name": "Weather in New York",
          "steps": [
            {
              "input": "What is the weather like in New York right now?",
              "reference": {
                "tool_calls": [
                  {
                    "name": "get_weather",
                    "args": {
                      "city": "New York"
                    }
                  }
                ],
                "topics": ["weather"]
              },
              "metrics": [
                {
                  "metric_name": "AgentGoalAccuracyWithoutReference"
                },
                {
                  "metric_name": "ToolCallAccuracy"
                },
                {
                  "metric_name": "TopicAdherence",
                  "parameters": {
                    "mode": "precision"
                  }
                }
              ]
            }
          ]
        }
      ]
    }
Apply it:
kubectl apply -f experiment.yaml
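To make the three-level hierarchy concrete, here is a short Python sketch that parses an experiment document and walks its scenarios and steps. This is purely illustrative — the Testbench itself parses this file; the sketch only shows the shape of the data:

```python
import json

# A one-step version of the experiment document shown above.
experiment = json.loads("""
{
  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
  "default_threshold": 0.9,
  "scenarios": [
    {
      "name": "Weather in New York",
      "steps": [
        {
          "input": "What is the weather like in New York right now?",
          "reference": {
            "tool_calls": [{"name": "get_weather", "args": {"city": "New York"}}],
            "topics": ["weather"]
          },
          "metrics": [{"metric_name": "ToolCallAccuracy"}]
        }
      ]
    }
  ]
}
""")

# Walk the Experiment -> Scenario -> Step hierarchy.
for scenario in experiment["scenarios"]:
    for step in scenario["steps"]:
        metric_names = [m["metric_name"] for m in step["metrics"]]
        print(f"{scenario['name']}: {step['input']!r} -> {metric_names}")
```

Each step carries its own reference data and metric list, while the model and default threshold apply experiment-wide.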
Available metrics
The following table lists commonly used metrics provided by RAGAS, the default framework adapter. All metrics are resolved through the GenericMetricsRegistry, which supports pluggable adapters — you can extend the system with custom metrics by implementing your own FrameworkAdapter.
| Metric | Description | Required reference fields |
|---|---|---|
| AgentGoalAccuracyWithoutReference | Whether the agent achieved its goal, judged without a reference answer | None |
| ToolCallAccuracy | Whether the agent called the correct tools with the correct arguments | tool_calls |
| TopicAdherence | Whether the response stays on the specified topics | topics |
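The pluggable-adapter idea can be pictured with a minimal sketch. All names below (`FrameworkAdapter`, `GenericMetricsRegistry`, their methods) are assumptions for illustration only — consult the Testbench source for the actual interfaces:

```python
from typing import Protocol

class FrameworkAdapter(Protocol):
    """Hypothetical adapter interface: maps a metric name to a scoring routine."""
    def supports(self, metric_name: str) -> bool: ...
    def score(self, metric_name: str, response: dict, reference: dict) -> float: ...

class ExactToolCallAdapter:
    """Toy adapter: scores ToolCallAccuracy as exact match of tool-call names."""
    def supports(self, metric_name: str) -> bool:
        return metric_name == "ToolCallAccuracy"

    def score(self, metric_name: str, response: dict, reference: dict) -> float:
        expected = [c["name"] for c in reference.get("tool_calls", [])]
        actual = [c["name"] for c in response.get("tool_calls", [])]
        return 1.0 if expected == actual else 0.0

class GenericMetricsRegistry:
    """Resolves each metric name to the first registered adapter that supports it."""
    def __init__(self):
        self._adapters = []

    def register(self, adapter):
        self._adapters.append(adapter)

    def evaluate(self, metric_name: str, response: dict, reference: dict) -> float:
        for adapter in self._adapters:
            if adapter.supports(metric_name):
                return adapter.score(metric_name, response, reference)
        raise KeyError(f"no adapter for metric {metric_name!r}")

registry = GenericMetricsRegistry()
registry.register(ExactToolCallAdapter())
score = registry.evaluate(
    "ToolCallAccuracy",
    response={"tool_calls": [{"name": "get_weather"}]},
    reference={"tool_calls": [{"name": "get_weather"}]},
)
print(score)  # 1.0
```

A custom adapter would plug into the registry the same way the RAGAS adapter does: implement the interface, register it, and reference its metric names from the experiment JSON.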
Step 2: Configure the OTLP endpoint
Create a ConfigMap that tells the pipeline where to send metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-config
  namespace: testkube
data:
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://lgtm.monitoring.svc.cluster.local:4318"
Apply it:
kubectl apply -f otel-config.yaml
Step 3: Create the TestWorkflow
The TestWorkflow ties everything together. It mounts the experiment ConfigMap, injects the OTLP endpoint, and chains the pipeline templates:
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: example-workflow (1)
  namespace: testkube
  labels:
    testkube.io/test-category: ragas-evaluation
    app: testworkflows
spec:
  content:
    files:
    - path: /data/datasets/experiment.json (2)
      contentFrom:
        configMapKeyRef:
          name: experiment
          key: experiment.json
  container:
    env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT (3)
      valueFrom:
        configMapKeyRef:
          name: otel-config
          key: OTEL_EXPORTER_OTLP_ENDPOINT
  use:
  - name: run-template (4)
    config:
      agentUrl: "http://weather-agent.sample-agents:8000" (5)
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template
1. A unique name for your workflow
2. Mounts the experiment JSON from the ConfigMap into the shared data volume
3. Injects the OTLP endpoint as an environment variable for the publish phase
4. Templates are executed in order: run → evaluate → publish → visualize
5. The A2A endpoint of the agent you want to evaluate
Apply it:
kubectl apply -f example-workflow.yaml
Step 4: Run and monitor the workflow
Start the workflow:
kubectl testkube run testworkflow example-workflow --watch
View logs after completion:
kubectl testkube get testworkflow example-workflow-1
Step 5: View results
Grafana dashboards
If you installed the Grafana dashboard ConfigMap (see Install the Testbench), open Grafana and look for the Testkube Evaluation dashboard. It displays per-metric scores filtered by workflow name.
HTML report artifact
The visualize phase produces a self-contained HTML report as a workflow artifact. Download it with:
kubectl testkube download artifacts example-workflow-1
The report includes:
- Summary cards with total samples and metrics count
- Horizontal bar charts showing mean score per metric
- Metric distribution histograms with statistics
- A searchable, sortable results table with all evaluations
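The per-metric bar chart boils down to a simple aggregation: group the evaluation scores by metric name and take the mean. A sketch of that computation, with hypothetical result rows (the real report reads the evaluation output produced by the pipeline):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical evaluation rows, shaped as (metric_name, score).
results = [
    ("ToolCallAccuracy", 1.0),
    ("ToolCallAccuracy", 0.0),
    ("TopicAdherence", 0.8),
    ("TopicAdherence", 1.0),
]

# Group scores by metric, then compute the mean per metric.
scores = defaultdict(list)
for metric, score in results:
    scores[metric].append(score)

means = {metric: mean(vals) for metric, vals in scores.items()}
for metric, value in means.items():
    print(f"{metric}: {value:.2f}")
```

The histograms in the report apply the same grouping, binning the raw scores instead of averaging them.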
Load a dataset from S3/MinIO (alternative to Step 1)
Instead of embedding the experiment in a ConfigMap, you can load a dataset from an S3-compatible store using the setup-template. Replace the content.files and prepend setup-template to the use list:
apiVersion: testworkflows.testkube.io/v1
kind: TestWorkflow
metadata:
  name: s3-workflow
  namespace: testkube
spec:
  container:
    env:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      valueFrom:
        configMapKeyRef:
          name: otel-config
          key: OTEL_EXPORTER_OTLP_ENDPOINT
    - name: MINIO_ENDPOINT
      value: "http://minio.storage:9000"
    - name: MINIO_ROOT_USER
      value: "minioadmin"
    - name: MINIO_ROOT_PASSWORD
      value: "minioadmin"
  use:
  - name: setup-template
    config:
      datasetUrl: "http://data-server.data-server:8000/dataset.csv"
  - name: run-template
    config:
      agentUrl: "http://weather-agent.sample-agents:8000"
  - name: evaluate-template
  - name: publish-template
  - name: visualize-template
The setup-template downloads the dataset from the configured datasetUrl and makes it available to the subsequent pipeline phases in place of the ConfigMap-mounted experiment.
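The dataset schema is defined by the setup-template; purely for illustration, assuming a simple CSV with input, expected_tool, and city columns, the conversion into experiment steps might look like this (column names and step shape are assumptions, not the template's actual contract):

```python
import csv
import io
import json

# Assumed CSV layout — the real schema is defined by the setup-template.
raw = """input,expected_tool,city
What is the weather like in New York right now?,get_weather,New York
What time is it in New York?,get_current_time,New York
"""

# Convert each CSV row into a step with a tool-call reference and one metric.
steps = []
for row in csv.DictReader(io.StringIO(raw)):
    steps.append({
        "input": row["input"],
        "reference": {
            "tool_calls": [{"name": row["expected_tool"], "args": {"city": row["city"]}}]
        },
        "metrics": [{"metric_name": "ToolCallAccuracy"}],
    })

print(json.dumps(steps[0], indent=2))
```

However the rows are shaped, the downstream run and evaluate phases consume the same step structure as the ConfigMap-based experiment.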
Auto-trigger on agent deployment (optional)
You can automatically run the evaluation workflow whenever the agent under test is redeployed. Create a TestTrigger:
apiVersion: tests.testkube.io/v1
kind: TestTrigger
metadata:
  name: example-workflow-trigger
  namespace: testkube
spec:
  resource: deployment
  resourceSelector:
    name: weather-agent
    namespace: sample-agents
  event: modified
  action: run
  execution: testworkflow
  concurrencyPolicy: allow
  testSelector:
    name: example-workflow
    namespace: testkube
  disabled: false
This trigger watches the weather-agent Deployment in the sample-agents namespace and runs the workflow on every modification.
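Conceptually, the trigger controller matches incoming resource events against the selector and runs the selected workflow on a hit. This sketch models that matching logic only — it is not Testkube's implementation:

```python
def matches(trigger: dict, event: dict) -> bool:
    """Return True when an event hits the trigger's resource selector and event type."""
    sel = trigger["resourceSelector"]
    return (
        event["resource"] == trigger["resource"]
        and event["name"] == sel["name"]
        and event["namespace"] == sel["namespace"]
        and event["type"] == trigger["event"]
    )

trigger = {
    "resource": "deployment",
    "resourceSelector": {"name": "weather-agent", "namespace": "sample-agents"},
    "event": "modified",
    "testSelector": {"name": "example-workflow"},
}

# Simulated cluster events: one matching Deployment update, one unrelated.
events = [
    {"resource": "deployment", "name": "weather-agent",
     "namespace": "sample-agents", "type": "modified"},
    {"resource": "deployment", "name": "other-app",
     "namespace": "default", "type": "modified"},
]

runs = [trigger["testSelector"]["name"] for ev in events if matches(trigger, ev)]
print(runs)  # ['example-workflow']
```

With concurrencyPolicy: allow, each matching event starts a new execution even if a previous one is still running.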
Multi-turn conversation scenarios
To test multi-turn conversations, add multiple steps to a single scenario. The A2A protocol maintains conversation context across steps via a context_id:
{
  "llm_as_a_judge_model": "gemini-2.5-flash-lite",
  "default_threshold": 0.9,
  "scenarios": [
    {
      "name": "Weather then time in New York",
      "steps": [
        {
          "input": "What is the weather like in New York right now?",
          "reference": {
            "tool_calls": [
              { "name": "get_weather", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        },
        {
          "input": "What time is it in New York?",
          "reference": {
            "tool_calls": [
              { "name": "get_current_time", "args": { "city": "New York" } }
            ]
          },
          "metrics": [
            { "metric_name": "AgentGoalAccuracyWithoutReference" },
            { "metric_name": "ToolCallAccuracy" }
          ]
        }
      ]
    }
  ]
}
Each step within the scenario is sent sequentially to the agent. The agent receives the same context_id for all steps, allowing it to maintain conversation state.
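The effect of the shared context_id can be illustrated with a stub agent that keeps per-conversation memory. The StubAgent below is a stand-in invented for this sketch, not an A2A client or the Testbench's run phase:

```python
import uuid

class StubAgent:
    """Stand-in for an A2A agent: remembers the city per conversation context."""
    def __init__(self):
        self.memory: dict = {}

    def send(self, context_id: str, message: str) -> str:
        # Naive "context": remember the last city mentioned in this conversation.
        if "New York" in message:
            self.memory[context_id] = "New York"
        city = self.memory.get(context_id, "unknown")
        return f"answering about {city}"

agent = StubAgent()
context_id = str(uuid.uuid4())  # one context_id for the whole scenario

replies = [
    agent.send(context_id, "What is the weather like in New York right now?"),
    agent.send(context_id, "What time is it?"),  # relies on earlier context
]
print(replies[1])  # answering about New York
```

Because both steps share one context_id, the second, city-free question can still be resolved; a fresh context_id per step would lose that state.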
Next steps
- Learn about the internal data flow and evaluation architecture in Building Block View
- Explore cross-cutting concerns like tracing and observability in Cross-cutting Concepts