Testbench Reference
This reference covers the resources installed by the Testbench install.yaml manifest: the Experiment Custom Resource Definition, the five TestWorkflowTemplate resources in the testkube namespace, and the OTLP metrics contract used by the publish phase.
Installation artifact
Testbench is installed as a single bundled manifest:
kubectl apply -f https://github.com/agentic-layer/testbench/releases/latest/download/install.yaml
The manifest creates:
| Resource | Location |
|---|---|
| Operator controller deployment | Namespace |
| Experiment CRD | Cluster-scoped CRD; instances are namespaced |
| Five TestWorkflowTemplate resources | Namespace testkube |
| | Namespace |
Customisation is performed via Kustomize overlays against the released install.yaml. The container image and dashboard namespace are the two values commonly overridden — see the overlays in operator/config/samples/overlays/ for templates.
Experiment CRD
apiVersion: testbench.agentic-layer.ai/v1alpha1, kind: Experiment. Cluster-scoped CRD with namespaced instances. Short name: exp.
Each Experiment instance reconciles into one Testkube TestWorkflow, one dataset ConfigMap, and (optionally) one TestTrigger, all in the testkube namespace.
spec
| Field | Type | Required | Description |
|---|---|---|---|
| | object | yes | Reference to the |
| | object | no | Reference to the |
| dataset | object | yes | Source of the test dataset. Exactly one of its three sub-fields must be set. See DatasetSource. |
| env | | no | Environment variables injected into the generated |
| schedule | object | no | Cron-based execution. See ScheduleSpec. |
| trigger | object | no | Event-based re-execution when the referenced agent is redeployed. See TriggerSpec. |
AgentRef
| Field | Type | Required | Description |
|---|---|---|---|
| | string | yes | Name of the |
| | string | no | Namespace of the |
DatasetSource
Exactly one of the three sub-fields must be set. The CRD enforces this via a CEL validation rule.
| Field | Type | Description |
|---|---|---|
| inline | object | Inline dataset specified directly in the |
| url | string | HTTP/HTTPS URL to a |
| | object | S3/MinIO source. See S3Source. |
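The CEL rule itself is not reproduced in this reference. As a rough client-side mirror of the "exactly one of" constraint, the following Python sketch checks a dataset mapping before it is submitted. The keys inline and url appear elsewhere in this reference; the key s3 and the function name selected_source are assumptions made for illustration only.

```python
def selected_source(dataset: dict) -> str:
    """Return the name of the single dataset source that is set.

    Raises ValueError when zero or more than one source is present.
    """
    # "inline" and "url" are named in this reference; "s3" is an assumed key name.
    candidates = ("inline", "url", "s3")
    present = [key for key in candidates if dataset.get(key) is not None]
    if len(present) != 1:
        raise ValueError(
            f"exactly one of {candidates} must be set, got {present}"
        )
    return present[0]
```

A pre-flight check like this only surfaces the error earlier; the CRD validation remains the authoritative gate.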
InlineDataset
| Field | Type | Required | Description |
|---|---|---|---|
| | string | no | LLM model identifier used by the evaluate phase (e.g. |
| | float (0.0–1.0) | no | Default pass/fail threshold applied to metrics that do not specify their own. |
| | | yes | One or more scenarios. See Scenario. |
Scenario
| Field | Type | Required | Description |
|---|---|---|---|
| | string | yes | Scenario name. Surfaces as the |
| | | yes | One or more sequential steps. All steps in a scenario share the same A2A |
Step
| Field | Type | Required | Description |
|---|---|---|---|
| | string | yes | User input sent to the agent. |
| | object | no | Expected outputs used by reference-based metrics. See Reference. |
| | object (free-form) | no | Arbitrary key/value pairs forwarded to metric callables for adapter-specific inputs. Schemaless; preserved unchanged. (For RAGAS context metrics, supply retrieved passages via |
| | | no | Metrics to evaluate for this step. See Metric. |
Reference
| Field | Type | Description |
|---|---|---|
| | string | Expected response text. |
| | | Expected tool invocations. See ToolCall. |
| | | Expected topics the response should cover. Used by |
| | | Context passages retrieved by the agent’s RAG pipeline. Used by RAGAS context metrics such as |
ToolCall
| Field | Type | Description |
|---|---|---|
| | string | Tool name. |
| | object (free-form) | Tool arguments as a JSON object. Schemaless; preserved unchanged. |
Metric
| Field | Type | Description |
|---|---|---|
| | string | Metric identifier resolved by the |
| | float (0.0–1.0) | Per-metric pass/fail threshold. Overrides |
| | object (free-form) | Metric-specific parameters (e.g. |
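The interaction between the per-metric threshold and the dataset-level default can be sketched as follows. This is an illustrative Python reading of the tables above, not the operator's code. The function names are hypothetical, and both the "score at or above threshold passes" comparison and the "no threshold means no verdict" behaviour are assumptions.

```python
from typing import Optional


def effective_threshold(
    metric_threshold: Optional[float],
    default_threshold: Optional[float],
) -> Optional[float]:
    """Prefer the per-metric threshold; fall back to the dataset default."""
    # Compare against None explicitly so that a threshold of 0.0 is honoured.
    if metric_threshold is not None:
        return metric_threshold
    return default_threshold


def passes(score: float, threshold: Optional[float]) -> Optional[bool]:
    """Return a pass/fail verdict, or None when no threshold applies.

    Assumption: a score greater than or equal to the threshold passes.
    """
    if threshold is None:
        return None
    return score >= threshold
```

For example, a metric with no threshold of its own in a dataset whose default is 0.7 is judged against 0.7, while a metric that sets 0.9 is judged against 0.9.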
TriggerSpec
| Field | Type | Description |
|---|---|---|
| | boolean | When |
| | enum: | How concurrent executions are handled. Mirrors the Testkube |
ScheduleSpec
| Field | Type | Description |
|---|---|---|
| | string | Standard Kubernetes 5-field cron expression (e.g. |
| | string | IANA timezone name (e.g. |
schedule and trigger are independent and may be combined.
status
| Field | Type | Description |
|---|---|---|
| | | Standard Kubernetes status conditions. The |
| | | Resources created by the operator for this |
| | object | Metadata from the most recent execution: |
TestWorkflowTemplates
The installer creates five TestWorkflowTemplate resources in the testkube namespace. The Experiment-generated TestWorkflow references them via its use list; all templates share the calling workflow’s /app/data volume.
You normally do not invoke these templates directly — the operator wires them into the generated workflow. They are documented here for diagnostics, custom workflows, and overlay authoring.
setup-template
Downloads a dataset from an S3-compatible store and writes it to the shared volume as an Experiment JSON file.
| Config parameter | Type | Description |
|---|---|---|
| bucket | string | S3/MinIO bucket name containing the dataset. |
| key | string | S3/MinIO object key (path to dataset file). Supported formats: |
Input: S3/MinIO object referenced by bucket / key.
Output: data/datasets/experiment.json — an Experiment model serialized as JSON.
Use this template when your dataset lives in object storage. For dataset.inline or dataset.url Experiments, the operator skips this template and provides the dataset by other means (ConfigMap mount or HTTP fetch).
run-template
Sends every step in the experiment to the target agent via the A2A protocol and records responses.
| Config parameter | Type | Description |
|---|---|---|
| | string | HTTP URL of the agent’s A2A endpoint (e.g. |
| | string | Experiment name used for OpenTelemetry labelling on emitted spans and metrics. |
Input: data/datasets/experiment.json.
Output: data/experiments/executed_experiment.json — an ExecutedExperiment model.
Implicit inputs: workflow.name (injected automatically by Testkube).
evaluate-template
Scores agent responses using LLM-as-a-judge metrics via the configured metrics framework (RAGAS by default).
| Config parameter | Type | Description |
|---|---|---|
| | string | Base URL for the OpenAI-compatible API used by the LLM judge (e.g. the in-cluster AI Gateway). Maps to the |
| | string | API key for the OpenAI-compatible API. Maps to the |
Input: data/experiments/executed_experiment.json.
Output: data/experiments/evaluated_experiment.json — an EvaluatedExperiment model.
publish-template
Publishes per-step evaluation scores to the OTLP endpoint configured via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
Config parameters: none.
Input: data/experiments/evaluated_experiment.json plus OTEL_EXPORTER_OTLP_ENDPOINT environment variable.
Output: Gauge metrics emitted to the OTLP collector. Each metric is labeled with workflow_name, scenario, and step attributes.
Implicit inputs: workflow.name, execution.id, execution.number (injected automatically by Testkube).
visualize-template
Generates a self-contained HTML evaluation report and saves it as a workflow artifact.
Config parameters: none.
Input: data/experiments/evaluated_experiment.json.
Output: data/results/evaluation_report.html — a single-file HTML dashboard with Chart.js visualizations (summary cards, score bar charts, metric distribution histograms, sortable results table).
Implicit inputs: workflow.name, execution.id, execution.number (injected automatically by Testkube).
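The five templates above hand data to each other through files on the shared volume. The Python sketch below records the documented input and output paths and checks that every phase reads a file produced by an earlier one, which can help when assembling a custom workflow. The phase order follows this document and is an assumption about execution order; the check_chain helper is hypothetical.

```python
# (template name, input file, output file); None means "no file on the shared volume".
PHASES = [
    ("setup-template", None, "data/datasets/experiment.json"),
    ("run-template", "data/datasets/experiment.json",
     "data/experiments/executed_experiment.json"),
    ("evaluate-template", "data/experiments/executed_experiment.json",
     "data/experiments/evaluated_experiment.json"),
    ("publish-template", "data/experiments/evaluated_experiment.json", None),
    ("visualize-template", "data/experiments/evaluated_experiment.json",
     "data/results/evaluation_report.html"),
]


def check_chain(phases):
    """Return the names of phases whose input is not produced by an earlier phase."""
    produced = set()
    missing = []
    for name, input_path, output_path in phases:
        if input_path is not None and input_path not in produced:
            missing.append(name)
        if output_path is not None:
            produced.add(output_path)
    return missing
```

Dropping setup-template from the list makes check_chain report run-template. This mirrors the inline and url cases, where the dataset file has to be supplied by other means.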
OTLP metrics contract
The publish-template exports OpenTelemetry gauge metrics over HTTP/protobuf to port 4318. Each evaluation step produces one gauge observation per metric:
| Attribute | Value |
|---|---|
| Metric name | The |
| Gauge value | Float in |
| workflow_name | The Testkube workflow name. |
| scenario | The scenario name from the experiment. |
| step | Zero-based step index within the scenario. |
The OTLP endpoint is read from the OTEL_EXPORTER_OTLP_ENDPOINT environment variable at runtime. Inject it via spec.env on the Experiment (typically pointing at the in-cluster collector, e.g. http://lgtm.monitoring.svc.cluster.local:4318).
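To make the contract concrete, the following Python sketch builds the observations the table describes from a simplified input. It is illustrative only and uses no OpenTelemetry SDK. The input shape, the function names, and the choice of the metric identifier as the gauge name are assumptions. Appending /v1/metrics to the base endpoint follows the usual OTLP/HTTP exporter convention and is not confirmed by this reference.

```python
import os


def metrics_url(env=None):
    """Derive the OTLP/HTTP metrics URL from OTEL_EXPORTER_OTLP_ENDPOINT."""
    env = os.environ if env is None else env
    base = env["OTEL_EXPORTER_OTLP_ENDPOINT"]
    # Convention for OTLP over HTTP: signal path appended to the base endpoint.
    return base.rstrip("/") + "/v1/metrics"


def gauge_observations(workflow_name, scenarios):
    """Build one observation per (scenario, step, metric).

    `scenarios` uses a hypothetical shape:
    [{"name": str, "steps": [{metric_name: score, ...}, ...]}, ...]
    """
    observations = []
    for scenario in scenarios:
        # enumerate() starts at 0, matching the zero-based step attribute.
        for step_index, scores in enumerate(scenario["steps"]):
            for metric_name, value in scores.items():
                observations.append({
                    "name": metric_name,  # assumed gauge name
                    "value": float(value),
                    "attributes": {
                        "workflow_name": workflow_name,
                        "scenario": scenario["name"],
                        "step": step_index,
                    },
                })
    return observations
```

A scenario with two steps and three scored metrics in total yields three observations, with step values 0, 1 and 1.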