Run Testbench Standalone
This guide walks you through running the full evaluation pipeline as a single CLI invocation against an agent that is already reachable on your network, without deploying anything into Kubernetes.
Use this when you want to evaluate an agent from a developer machine, a CI job, or any environment where Testkube is not available.
Prerequisites
- uv installed
- An agent exposing an A2A protocol endpoint reachable from where you run the CLI
- An OpenAI-compatible LLM endpoint (e.g. AI Gateway) and the corresponding API key for the judge model
- (Optional) An OTLP collector if you want to publish evaluation scores
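The first two prerequisites can be checked before a run with a small preflight script. The following is a minimal sketch, not part of the testbench itself: the function names `endpoint_reachable` and `preflight` are hypothetical, and the check only confirms that a TCP connection to the agent's host and port succeeds. It does not verify that the endpoint actually speaks A2A.

```python
import shutil
import socket
from urllib.parse import urlparse


def endpoint_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    if not parsed.hostname:
        # Not a usable URL (no host component), so treat it as unreachable.
        return False
    # Fall back to the scheme's default port when the URL omits one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(agent_url: str) -> list[str]:
    """Collect human-readable problems; an empty list means both checks passed."""
    problems = []
    if shutil.which("uv") is None:
        problems.append("uv is not on PATH")
    if not endpoint_reachable(agent_url):
        problems.append(f"agent endpoint {agent_url} is not reachable")
    return problems
```

Calling `preflight("http://<your-agent-host>:<port>")` returns a list of problems that a CI job can print before deciding whether to start the evaluation.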
Step 1: Install the CLI
Install the testworkflow command into an isolated, automatically-managed environment:
uv tool install agentic-layer-testbench
This puts testworkflow on your PATH without polluting the system Python or any project virtualenv. Upgrade later with uv tool upgrade agentic-layer-testbench.
Alternatively, run it ad-hoc without installing — uv resolves and caches the package on first use:
uvx --from agentic-layer-testbench testworkflow config.yaml
The CLI runs the full Setup → Run → Evaluate → Publish → Visualize pipeline in a single process.
Step 2: Write a config.yaml
Create a config.yaml describing the agent to evaluate, the dataset to use, and the experiment metadata. The smallest useful configuration embeds the Experiment directly:
agent:
  url: "http://localhost:11010"  # (1)
experiment:
  name: "my-evaluation"
dataset:
  source: inline  # (2)
  inline:
    llm_as_a_judge_model: gemini-2.5-flash-lite
    default_threshold: 0.9
    scenarios:
      - name: "Weather in New York"
        steps:
          - input: "What is the weather like in New York right now?"
            reference:
              tool_calls:
                - name: get_weather
                  args:
                    city: "New York"
              topics:
                - weather
            metrics:
              - metric_name: AgentGoalAccuracyWithoutReference
              - metric_name: ToolCallAccuracy
              - metric_name: TopicAdherence
                parameters:
                  mode: precision
(1) A2A endpoint of the agent you want to evaluate
(2) inline embeds the Experiment in the same file, so no separate JSON is needed
The inline payload mirrors the dataset.inline field on the Experiment CRD, so the same Experiment can be moved between standalone runs and in-cluster workflows without rewriting it.
For other dataset sources (external file, HTTP URL, S3/MinIO), see config.example.yaml in the repository.
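When several similar scenarios are needed, the configuration can be generated rather than written by hand. The sketch below is illustrative only: `build_config` and `weather_scenario` are hypothetical helpers, the key names are copied from the example in this step, and the nesting of those keys is an assumption that should be checked against config.example.yaml. The file is written as JSON, which YAML 1.2 treats as valid YAML; whether the CLI's loader accepts it should be confirmed with a test run.

```python
import json
from pathlib import Path


def weather_scenario(city: str) -> dict:
    """Build one scenario in the shape of the example (nesting is assumed)."""
    return {
        "name": f"Weather in {city}",
        "steps": [
            {
                "input": f"What is the weather like in {city} right now?",
                "reference": {
                    "tool_calls": [{"name": "get_weather", "args": {"city": city}}],
                    "topics": ["weather"],
                },
                "metrics": [
                    {"metric_name": "ToolCallAccuracy"},
                    {"metric_name": "TopicAdherence", "parameters": {"mode": "precision"}},
                ],
            }
        ],
    }


def build_config(agent_url: str, experiment_name: str, scenarios: list[dict]) -> dict:
    """Assemble the top-level config; model and threshold reuse the example values."""
    return {
        "agent": {"url": agent_url},
        "experiment": {"name": experiment_name},
        "dataset": {
            "source": "inline",
            "inline": {
                "llm_as_a_judge_model": "gemini-2.5-flash-lite",
                "default_threshold": 0.9,
                "scenarios": scenarios,
            },
        },
    }


def write_config(config: dict, path: Path) -> None:
    """Serialize as JSON, which is also valid YAML 1.2."""
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
```

For example, `write_config(build_config("http://localhost:11010", "my-evaluation", [weather_scenario(c) for c in ("New York", "Berlin")]), Path("config.yaml"))` produces one scenario per city.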
Step 3: Run the pipeline
Point the CLI at the config file:
testworkflow config.yaml
The CLI runs all phases in order and writes intermediate JSON to data/ (overridable per phase). The HTML report is written to data/results/evaluation_report.html.
By default the run exits non-zero if any metric evaluation fails. To keep the run green regardless of metric pass/fail, set evaluate.fail_on_metric_failure: false in config.yaml.
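In a CI job it can help to wrap the invocation so the exit code and the report location are reported together. This is a minimal sketch under stated assumptions: `run_pipeline` and `summarize` are hypothetical helper names, the report path is the default mentioned in this step, and a non-zero exit code is treated as a generic failure because it may come from causes other than a failed metric.

```python
import subprocess
import sys
from pathlib import Path

# Default report location named in this guide; adjust if you override output paths.
REPORT_PATH = Path("data/results/evaluation_report.html")


def run_pipeline(command: list[str]) -> int:
    """Run the given command, let its output stream through, and return its exit code."""
    completed = subprocess.run(command, check=False)
    return completed.returncode


def summarize(exit_code: int, report: Path = REPORT_PATH) -> str:
    """Describe the outcome and where the HTML report is, if it exists."""
    status = "passed" if exit_code == 0 else f"failed (exit code {exit_code})"
    location = str(report) if report.exists() else "no report found"
    return f"evaluation {status}; report: {location}"
```

A job script could then call `code = run_pipeline(["testworkflow", "config.yaml"])`, print `summarize(code)`, and finish with `sys.exit(code)` so the CI status still reflects the evaluation result.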
Step 4: Publish metrics (optional)
To publish evaluation scores to an OTLP-compatible backend, point the standard OpenTelemetry environment variable at the collector before invoking the CLI:
OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" testworkflow config.yaml
Each evaluation produces one gauge observation per metric labeled with experiment_name, scenario, and step — same contract as the in-cluster publish-template.
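When the CLI is launched from a script rather than a shell, the same environment variable can be passed to the child process only, leaving the parent environment untouched. The helper name `with_otlp_endpoint` below is hypothetical; the variable name and the example endpoint are taken from the command above.

```python
import os
from typing import Mapping, Optional


def with_otlp_endpoint(endpoint: str, base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the OTLP endpoint variable set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    return env
```

The result can be handed to `subprocess.run(["testworkflow", "config.yaml"], env=with_otlp_endpoint("http://localhost:4318"))`, which scopes the setting to that single run.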
Next steps
- To move the same evaluation into Kubernetes, see Create Your First TestWorkflow. The inline Experiment shape carries over to the Experiment CRD.