Testbench Reference

This reference covers the resources installed by the Testbench install.yaml manifest: the Experiment Custom Resource Definition, the five TestWorkflowTemplate resources in the testkube namespace, and the OTLP metrics contract used by the publish phase.

Installation artifact

Testbench is installed as a single bundled manifest:

kubectl apply -f https://github.com/agentic-layer/testbench/releases/latest/download/install.yaml

The manifest creates:

| Resource | Location |
| --- | --- |
| Operator controller deployment | Namespace testbench-operator-system |
| Experiment CRD (testbench.agentic-layer.ai/v1alpha1) | Cluster-scoped CRD; instances are namespaced |
| Five TestWorkflowTemplate resources | Namespace testkube |
| grafana-testkube-dashboard ConfigMap | Namespace monitoring |

Customisation is performed via Kustomize overlays against the released install.yaml. The container image and dashboard namespace are the two values commonly overridden — see the overlays in operator/config/samples/overlays/ for templates.

Experiment CRD

apiVersion: testbench.agentic-layer.ai/v1alpha1, kind: Experiment. Cluster-scoped CRD with namespaced instances. Short name: exp.

Each Experiment instance reconciles into one Testkube TestWorkflow, one dataset ConfigMap, and (optionally) one TestTrigger, all in the testkube namespace.

spec

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| agentRef | object | yes | Reference to the Agent resource to evaluate. See AgentRef. |
| aiGatewayRef | object | no | Reference to the AiGateway resource providing the LLM judge. Uses the standard corev1.ObjectReference schema; only name and namespace are honoured. |
| dataset | object | yes | Source of the test dataset. Exactly one of dataset.inline, dataset.url, or dataset.s3 must be set. See DatasetSource. |
| env | []corev1.EnvVar | no | Environment variables injected into the generated TestWorkflow pods. Supports value and valueFrom (secretKeyRef, configMapKeyRef, etc.). User-defined entries override operator-set defaults with the same name. |
| schedule | object | no | Cron-based execution. See ScheduleSpec. |
| trigger | object | no | Event-based re-execution when the referenced agent is redeployed. See TriggerSpec. |
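
As an illustration of the spec fields above, the following sketch builds a minimal Experiment manifest as a Python dictionary and serialises it to JSON. The resource name weather-eval, the agent weather-agent in namespace sample-agents, and the dataset URL are hypothetical placeholders, not values shipped with Testbench.

```python
import json

# All concrete names and URLs below are hypothetical placeholders.
experiment = {
    "apiVersion": "testbench.agentic-layer.ai/v1alpha1",
    "kind": "Experiment",
    "metadata": {"name": "weather-eval"},
    "spec": {
        # Required: the Agent resource to evaluate.
        "agentRef": {"name": "weather-agent", "namespace": "sample-agents"},
        # Required: exactly one of inline, url, or s3.
        "dataset": {"url": "https://example.com/datasets/weather.json"},
        # Optional: environment variables for the generated TestWorkflow pods.
        "env": [
            {
                "name": "OTEL_EXPORTER_OTLP_ENDPOINT",
                "value": "http://lgtm.monitoring.svc.cluster.local:4318",
            }
        ],
    },
}

manifest_json = json.dumps(experiment, indent=2)
print(manifest_json)
```

Because kubectl accepts JSON as well as YAML, the printed document could be saved to a file and passed to kubectl apply -f.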

AgentRef

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Name of the Agent resource. |
| namespace | string | no | Namespace of the Agent resource. Defaults to the Experiment’s namespace. |

DatasetSource

Exactly one of the three sub-fields must be set. The CRD enforces this via a CEL validation rule.

| Field | Type | Description |
| --- | --- | --- |
| inline | object | Inline dataset specified directly in the Experiment. See InlineDataset. |
| url | string | HTTP/HTTPS URL to a .csv, .json, or .parquet file. |
| s3 | object | S3/MinIO source. See S3Source. |
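
The exactly-one-of rule can be mirrored client-side before a manifest is submitted. The sketch below is a plain Python pre-check written for illustration; it is not the CEL rule from the CRD, and the function name validate_dataset_source is hypothetical.

```python
def validate_dataset_source(dataset: dict) -> str:
    """Return the name of the single configured source, or raise ValueError.

    Illustrative pre-check only; the authoritative validation is the
    CEL rule enforced by the API server.
    """
    allowed = ("inline", "url", "s3")
    present = [field for field in allowed if dataset.get(field) is not None]
    if len(present) != 1:
        raise ValueError(
            f"exactly one of {', '.join(allowed)} must be set, got: {present or 'none'}"
        )
    return present[0]


print(validate_dataset_source({"url": "https://example.com/datasets/weather.csv"}))
```

A dataset block that sets both url and s3, or none of the three fields, raises ValueError in this sketch.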

InlineDataset

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| llmAsAJudgeModel | string | no | LLM model identifier used by the evaluate phase (e.g. gemini-2.5-flash-lite). |
| defaultThreshold | float (0.0–1.0) | no | Default pass/fail threshold applied to metrics that do not specify their own. |
| scenarios | []Scenario | yes | One or more scenarios. See Scenario. |

S3Source

| Field | Type | Description |
| --- | --- | --- |
| bucket | string | S3/MinIO bucket name. |
| key | string | Object key (path within the bucket). Supported formats: .csv, .json, .parquet. |

Credentials are supplied through spec.env (typically MINIO_ENDPOINT, MINIO_ROOT_USER, MINIO_ROOT_PASSWORD).
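
A sketch of how an S3 dataset and its credentials might be combined in one spec fragment follows. The bucket, object key, Secret name minio-credentials, its keys, and the endpoint value are all hypothetical; the exact format expected for MINIO_ENDPOINT is also an assumption here.

```python
def secret_env(name: str, secret_name: str, key: str) -> dict:
    """Build a corev1.EnvVar entry that reads its value from a Secret key."""
    return {
        "name": name,
        "valueFrom": {"secretKeyRef": {"name": secret_name, "key": key}},
    }


# Hypothetical bucket, object key, Secret name, and endpoint.
spec_fragment = {
    "dataset": {"s3": {"bucket": "datasets", "key": "weather/experiment.json"}},
    "env": [
        {"name": "MINIO_ENDPOINT", "value": "http://minio.minio.svc.cluster.local:9000"},
        secret_env("MINIO_ROOT_USER", "minio-credentials", "root-user"),
        secret_env("MINIO_ROOT_PASSWORD", "minio-credentials", "root-password"),
    ],
}

print([entry["name"] for entry in spec_fragment["env"]])
```

Reading the user and password through secretKeyRef keeps the credentials out of the Experiment manifest itself.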

Scenario

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string | yes | Scenario name. Surfaces as the scenario attribute on published metrics. |
| steps | []Step | yes | One or more sequential steps. All steps in a scenario share the same A2A contextId, enabling multi-turn conversations. |

Step

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input | string | yes | User input sent to the agent. |
| reference | object | no | Expected outputs used by reference-based metrics. See Reference. |
| customValues | object (free-form) | no | Arbitrary key/value pairs forwarded to metric callables for adapter-specific inputs. Schemaless; preserved unchanged. (For RAGAS context metrics, supply retrieved passages via reference.retrievedContexts instead.) |
| metrics | []Metric | no | Metrics to evaluate for this step. See Metric. |

Reference

| Field | Type | Description |
| --- | --- | --- |
| response | string | Expected response text. |
| toolCalls | []ToolCall | Expected tool invocations. See ToolCall. |
| topics | []string | Expected topics the response should cover. Used by TopicAdherence. |
| retrievedContexts | []string | Context passages retrieved by the agent’s RAG pipeline. Used by RAGAS context metrics such as Faithfulness, ContextPrecision, and ContextRecall. |

ToolCall

| Field | Type | Description |
| --- | --- | --- |
| name | string | Tool name. |
| args | object (free-form) | Tool arguments as a JSON object. Schemaless; preserved unchanged. |

Metric

| Field | Type | Description |
| --- | --- | --- |
| metricName | string | Metric identifier resolved by the GenericMetricsRegistry (e.g. AgentGoalAccuracyWithoutReference, ToolCallAccuracy, TopicAdherence). |
| threshold | float (0.0–1.0) | Per-metric pass/fail threshold. Overrides dataset.inline.defaultThreshold. |
| parameters | object (free-form) | Metric-specific parameters (e.g. mode: precision for TopicAdherence). Schemaless; preserved unchanged. |
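
To show how Scenario, Step, Reference, ToolCall, and Metric nest, here is a sketch of an inline dataset with one two-step scenario, together with a helper that resolves the threshold applying to a metric. The scenario content, the tool name get_weather, and the helper effective_threshold are hypothetical; the behaviour when neither threshold is set is not covered by this reference, so the helper simply returns None in that case.

```python
from typing import Optional

# Hypothetical inline dataset; both steps belong to one multi-turn scenario.
inline_dataset = {
    "llmAsAJudgeModel": "gemini-2.5-flash-lite",
    "defaultThreshold": 0.7,
    "scenarios": [
        {
            "name": "weather-follow-up",
            "steps": [
                {
                    "input": "What is the weather in Berlin?",
                    "reference": {
                        "toolCalls": [{"name": "get_weather", "args": {"city": "Berlin"}}]
                    },
                    "metrics": [{"metricName": "ToolCallAccuracy", "threshold": 0.9}],
                },
                {
                    "input": "And tomorrow?",
                    "reference": {"topics": ["weather"]},
                    "metrics": [
                        {"metricName": "TopicAdherence", "parameters": {"mode": "precision"}}
                    ],
                },
            ],
        }
    ],
}


def effective_threshold(metric: dict, inline: dict) -> Optional[float]:
    """Per-metric threshold if set, otherwise the dataset's defaultThreshold."""
    if metric.get("threshold") is not None:
        return metric["threshold"]
    return inline.get("defaultThreshold")


for step in inline_dataset["scenarios"][0]["steps"]:
    for metric in step["metrics"]:
        print(metric["metricName"], effective_threshold(metric, inline_dataset))
```

In this sketch ToolCallAccuracy resolves to its own threshold of 0.9, while TopicAdherence falls back to the dataset default of 0.7.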

TriggerSpec

| Field | Type | Description |
| --- | --- | --- |
| enabled | boolean | When true, the operator creates a Testkube TestTrigger that re-runs the workflow whenever the referenced agent is redeployed. |
| concurrencyPolicy | enum: Allow, Forbid, or Replace | How concurrent executions are handled. Mirrors the Testkube TestTrigger semantics. |

ScheduleSpec

| Field | Type | Description |
| --- | --- | --- |
| cron | string | Standard Kubernetes 5-field cron expression (e.g. 0 3 * * *). Required. |
| timezone | string | IANA timezone name (e.g. Europe/Berlin). Defaults to cluster local time. |

schedule and trigger are independent and may be combined.
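
A sketch of a spec fragment that combines both mechanisms follows, with a small helper that only checks the field count of the cron expression. The helper has_five_cron_fields is hypothetical and is not a full cron parser; it does not validate the values inside each field.

```python
def has_five_cron_fields(expression: str) -> bool:
    """Check only that the expression has five whitespace-separated fields."""
    return len(expression.split()) == 5


# Nightly run at 03:00 Berlin time, plus a re-run whenever the agent is redeployed.
schedule_and_trigger = {
    "schedule": {"cron": "0 3 * * *", "timezone": "Europe/Berlin"},
    "trigger": {"enabled": True, "concurrencyPolicy": "Forbid"},
}

print(has_five_cron_fields(schedule_and_trigger["schedule"]["cron"]))
```

A six-field expression with a leading seconds field, as used by some schedulers, fails this check.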

status

| Field | Type | Description |
| --- | --- | --- |
| conditions | []metav1.Condition | Standard Kubernetes status conditions. The Ready condition reflects overall reconcile health. |
| generatedResources | []GeneratedResource | Resources created by the operator for this Experiment (kind, name, namespace). |
| lastExecution | object | Metadata from the most recent execution: executionId, executionNumber, startTime, endTime, status. |

Print columns

kubectl get experiments shows: Agent (from spec.agentRef.name), Status (from the Ready condition), and Age.
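
When consuming an Experiment programmatically, for example from kubectl get experiment -o json output, the Ready condition can be located as sketched below. The sample status values (reason, resource names, execution status) are hypothetical; only the field names come from this reference and from the standard metav1.Condition schema.

```python
from typing import Optional


def ready_condition(status: dict) -> Optional[dict]:
    """Return the condition whose type is Ready, if present."""
    for condition in status.get("conditions", []):
        if condition.get("type") == "Ready":
            return condition
    return None


def is_ready(status: dict) -> bool:
    """metav1.Condition.status is one of the strings True, False, or Unknown."""
    condition = ready_condition(status)
    return condition is not None and condition.get("status") == "True"


# Hypothetical status block for illustration.
sample_status = {
    "conditions": [{"type": "Ready", "status": "True", "reason": "Reconciled"}],
    "generatedResources": [
        {"kind": "TestWorkflow", "name": "weather-eval", "namespace": "testkube"}
    ],
    "lastExecution": {"executionNumber": 3, "status": "passed"},
}

print(is_ready(sample_status))
```

An Experiment that has not been reconciled yet may have no conditions at all, which this sketch treats as not ready.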

TestWorkflowTemplates

The installer creates five TestWorkflowTemplate resources in the testkube namespace. The Experiment-generated TestWorkflow references them via its use list; all templates share the calling workflow’s /app/data volume.

You normally do not invoke these templates directly — the operator wires them into the generated workflow. They are documented here for diagnostics, custom workflows, and overlay authoring.

setup-template

Downloads a dataset from an S3-compatible store and writes it to the shared volume as an Experiment JSON file.

| Config parameter | Type | Description |
| --- | --- | --- |
| bucket | string | S3/MinIO bucket name containing the dataset. |
| key | string | S3/MinIO object key (path to dataset file). Supported formats: .csv, .json, .parquet. |

Input: S3/MinIO object referenced by bucket / key.

Output: data/datasets/experiment.json — an Experiment model serialized as JSON.

Use this template when your dataset lives in object storage. For dataset.inline or dataset.url Experiments, the operator skips this template and provides the dataset by other means (ConfigMap mount or HTTP fetch).

run-template

Sends every step in the experiment to the target agent via the A2A protocol and records responses.

| Config parameter | Type | Description |
| --- | --- | --- |
| agentUrl | string | HTTP URL of the agent’s A2A endpoint (e.g. http://weather-agent.sample-agents:8000). |
| experimentName | string | Experiment name used for OpenTelemetry labelling on emitted spans and metrics. |

Input: data/datasets/experiment.json.

Output: data/experiments/executed_experiment.json — an ExecutedExperiment model.

Implicit inputs: workflow.name (injected automatically by Testkube).

evaluate-template

Scores agent responses using LLM-as-a-judge metrics via the configured metrics framework (RAGAS by default).

| Config parameter | Type | Description |
| --- | --- | --- |
| openAiBasePath | string | Base URL for the OpenAI-compatible API used by the LLM judge (e.g. the in-cluster AI Gateway). Maps to the OPENAI_BASE_URL environment variable in the phase container. Defaults to empty string. |
| openAiApiKey | string | API key for the OpenAI-compatible API. Maps to the OPENAI_API_KEY environment variable. Defaults to empty string. |

Input: data/experiments/executed_experiment.json.

Output: data/experiments/evaluated_experiment.json — an EvaluatedExperiment model.

publish-template

Publishes per-step evaluation scores to the OTLP endpoint configured via the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

Config parameters: none.

Input: data/experiments/evaluated_experiment.json plus OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

Output: Gauge metrics emitted to the OTLP collector. Each metric is labeled with workflow_name, scenario, and step attributes.

Implicit inputs: workflow.name, execution.id, execution.number (injected automatically by Testkube).

visualize-template

Generates a self-contained HTML evaluation report and saves it as a workflow artifact.

Config parameters: none.

Input: data/experiments/evaluated_experiment.json.

Output: data/results/evaluation_report.html — a single-file HTML dashboard with Chart.js visualizations (summary cards, score bar charts, metric distribution histograms, sortable results table).

Implicit inputs: workflow.name, execution.id, execution.number (injected automatically by Testkube).
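
The file hand-offs between the five templates can be summarised as a small data-flow check. The sketch below lists each template with the input and output paths documented above and reports any template whose input was not produced by an earlier one. The relative order of publish-template and visualize-template is an assumption, and the helper missing_inputs is hypothetical.

```python
# (template, input paths, output paths) as documented above.
PHASES = [
    ("setup-template", [], ["data/datasets/experiment.json"]),
    (
        "run-template",
        ["data/datasets/experiment.json"],
        ["data/experiments/executed_experiment.json"],
    ),
    (
        "evaluate-template",
        ["data/experiments/executed_experiment.json"],
        ["data/experiments/evaluated_experiment.json"],
    ),
    ("publish-template", ["data/experiments/evaluated_experiment.json"], []),
    (
        "visualize-template",
        ["data/experiments/evaluated_experiment.json"],
        ["data/results/evaluation_report.html"],
    ),
]


def missing_inputs(phases: list) -> dict:
    """Map each template to the inputs that no earlier template produced."""
    available = set()
    missing = {}
    for name, inputs, outputs in phases:
        absent = [path for path in inputs if path not in available]
        if absent:
            missing[name] = absent
        available.update(outputs)
    return missing


print(missing_inputs(PHASES))
```

Dropping setup-template from the list flags run-template as lacking data/datasets/experiment.json, which matches the note that inline and URL datasets must be provided by other means when that template is skipped.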

OTLP metrics contract

The publish-template exports OpenTelemetry gauge metrics over HTTP/protobuf to port 4318. Each evaluation step produces one gauge observation per metric:

| Attribute | Value |
| --- | --- |
| Metric name | The metricName string from the Experiment (e.g. AgentGoalAccuracyWithoutReference). |
| Gauge value | Float in [0.0, 1.0]. |
| workflow_name | The Testkube workflow name. |
| scenario | The scenario name from the experiment. |
| step | Zero-based step index within the scenario. |

The OTLP endpoint is read from the OTEL_EXPORTER_OTLP_ENDPOINT environment variable at runtime. Inject it via spec.env on the Experiment (typically pointing at the in-cluster collector, e.g. http://lgtm.monitoring.svc.cluster.local:4318).
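
The contract above can be illustrated by deriving the gauge observations from evaluated scores in plain Python. This sketch does not use the OpenTelemetry SDK and does not reflect the actual EvaluatedExperiment schema: the input shape, including the scores key, is a hypothetical stand-in, and representing step as an integer is an assumption.

```python
def gauge_observations(workflow_name: str, scenarios: list) -> list:
    """Produce one observation per (scenario, step, metric) combination."""
    observations = []
    for scenario in scenarios:
        # enumerate() starts at 0, matching the zero-based step attribute.
        for step_index, step in enumerate(scenario["steps"]):
            for metric_name, score in step["scores"].items():
                observations.append(
                    {
                        "name": metric_name,
                        "value": score,
                        "attributes": {
                            "workflow_name": workflow_name,
                            "scenario": scenario["name"],
                            "step": step_index,
                        },
                    }
                )
    return observations


# Hypothetical evaluated scores for one two-step scenario.
evaluated_scenarios = [
    {
        "name": "weather-follow-up",
        "steps": [
            {"scores": {"ToolCallAccuracy": 1.0}},
            {"scores": {"TopicAdherence": 0.8, "AgentGoalAccuracyWithoutReference": 1.0}},
        ],
    }
]

for observation in gauge_observations("weather-eval", evaluated_scenarios):
    print(observation)
```

Two steps with one and two metrics respectively yield three observations, each carrying the workflow_name, scenario, and step attributes.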