Run Testbench Standalone

This guide walks you through running the full evaluation pipeline as a single CLI invocation against an agent that is already reachable on your network, without deploying anything into Kubernetes.

Use this when you want to evaluate an agent from a developer machine, a CI job, or any environment where Testkube is not available.

Prerequisites

  • uv installed

  • An agent exposing an A2A protocol endpoint reachable from where you run the CLI

  • An OpenAI-compatible LLM endpoint (e.g. AI Gateway) and the corresponding API key for the judge model

  • (Optional) An OTLP collector if you want to publish evaluation scores
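
The prerequisites above can be checked before the first run. The following is a minimal sketch, not part of the testbench: the helper names tcp_reachable and preflight are hypothetical, and the check only confirms that a TCP connection can be opened. It says nothing about whether the agent actually speaks A2A or whether the API key is valid.

```python
import shutil
import socket
from urllib.parse import urlsplit


def tcp_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(url)
    if not parts.hostname:
        return False
    # Fall back to the scheme's default port when the URL does not name one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def preflight(agent_url: str, llm_url: str) -> list[str]:
    """Collect human-readable problems; an empty list means every check passed."""
    problems = []
    if shutil.which("uv") is None:
        problems.append("uv not found on PATH")
    for label, url in (("agent", agent_url), ("LLM endpoint", llm_url)):
        if not tcp_reachable(url):
            problems.append(f"{label} not reachable at {url}")
    return problems
```

Calling preflight("http://localhost:11010", "<your LLM endpoint URL>") returns a list of problems to fix before running the CLI. The agent URL is the one used in the example configuration in this guide; the LLM URL is whatever your gateway exposes.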

Step 1: Install the CLI

Install the testworkflow command into an isolated, automatically managed environment:

uv tool install agentic-layer-testbench

This puts testworkflow on your PATH without polluting the system Python or any project virtualenv. Upgrade later with uv tool upgrade agentic-layer-testbench.

Alternatively, run it ad-hoc without installing — uv resolves and caches the package on first use:

uvx --from agentic-layer-testbench testworkflow config.yaml

The CLI runs the full Setup → Run → Evaluate → Publish → Visualize pipeline in a single process.

Step 2: Write a config.yaml

Create a config.yaml describing the agent to evaluate, the dataset to use, and the experiment metadata. The smallest useful configuration embeds the Experiment directly:

agent:
  url: "http://localhost:11010" (1)

experiment:
  name: "my-evaluation"

dataset:
  source: inline (2)
  inline:
    llm_as_a_judge_model: gemini-2.5-flash-lite
    default_threshold: 0.9
    scenarios:
      - name: "Weather in New York"
        steps:
          - input: "What is the weather like in New York right now?"
            reference:
              tool_calls:
                - name: get_weather
                  args:
                    city: "New York"
              topics:
                - weather
            metrics:
              - metric_name: AgentGoalAccuracyWithoutReference
              - metric_name: ToolCallAccuracy
              - metric_name: TopicAdherence
                parameters:
                  mode: precision
(1) A2A endpoint of the agent you want to evaluate
(2) inline embeds the Experiment in the same file, so no separate JSON is needed

The inline payload mirrors the dataset.inline field on the Experiment CRD, so the same Experiment can be moved between standalone runs and in-cluster workflows without rewriting it.

For other dataset sources (external file, HTTP URL, S3/MinIO), see config.example.yaml in the repository.
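
To make the shape of the inline configuration concrete, here is a rough sketch that checks an already-parsed config for the fields used in the example. It is an illustration only: check_config is a hypothetical helper, and the set of fields it treats as required is inferred from the example in this guide, not taken from the testbench's actual schema. Parsing config.yaml into a dict needs a YAML library (for instance PyYAML's yaml.safe_load), which is left out to keep the sketch standard-library only.

```python
def check_config(config: dict) -> list[str]:
    """Return a list of structural problems; an empty list means none were found."""
    errors = []
    if not (config.get("agent") or {}).get("url"):
        errors.append("agent.url is missing")
    if not (config.get("experiment") or {}).get("name"):
        errors.append("experiment.name is missing")

    dataset = config.get("dataset") or {}
    if dataset.get("source") != "inline":
        # Other dataset sources are outside the scope of this sketch.
        return errors

    scenarios = (dataset.get("inline") or {}).get("scenarios") or []
    if not scenarios:
        errors.append("dataset.inline.scenarios is empty")
    for s_idx, scenario in enumerate(scenarios):
        for t_idx, step in enumerate(scenario.get("steps") or []):
            where = f"scenarios[{s_idx}].steps[{t_idx}]"
            if not step.get("input"):
                errors.append(f"{where}.input is missing")
            for metric in step.get("metrics") or []:
                if not metric.get("metric_name"):
                    errors.append(f"{where}: metric without metric_name")
    return errors


# The example configuration from this guide, written as a Python dict
# (the reference block is omitted for brevity).
EXAMPLE = {
    "agent": {"url": "http://localhost:11010"},
    "experiment": {"name": "my-evaluation"},
    "dataset": {
        "source": "inline",
        "inline": {
            "llm_as_a_judge_model": "gemini-2.5-flash-lite",
            "default_threshold": 0.9,
            "scenarios": [
                {
                    "name": "Weather in New York",
                    "steps": [
                        {
                            "input": "What is the weather like in New York right now?",
                            "metrics": [
                                {"metric_name": "AgentGoalAccuracyWithoutReference"},
                                {"metric_name": "ToolCallAccuracy"},
                                {"metric_name": "TopicAdherence",
                                 "parameters": {"mode": "precision"}},
                            ],
                        }
                    ],
                }
            ],
        },
    },
}
```

A check like this catches indentation slips (for example a metrics list that ended up outside its step) before the pipeline starts calling the agent.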

Step 3: Run the pipeline

Point the CLI at the config file:

testworkflow config.yaml

The CLI runs all phases in order and writes intermediate JSON to data/ (overridable per phase). The HTML report is written to data/results/evaluation_report.html.

By default, the run exits non-zero if any metric evaluation fails. To keep the run green regardless of whether metrics pass or fail, set evaluate.fail_on_metric_failure: false in config.yaml.
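
Because the result is reported through the exit status, a CI job can gate on it directly. The sketch below assumes only what is stated above (exit status zero on success, non-zero on failure); run_pipeline is a hypothetical wrapper, not part of the CLI.

```python
import subprocess


def run_pipeline(command: list[str]) -> bool:
    """Run the given command and report whether it exited with status 0."""
    # check=False: a non-zero exit is an expected outcome here, not an exception.
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0


# Example use in a CI script (requires testworkflow on PATH):
#
#     import sys
#     ok = run_pipeline(["testworkflow", "config.yaml"])
#     sys.exit(0 if ok else 1)
```

With the default settings, a False result means either a metric evaluation failed or the pipeline itself errored; the exit status alone does not distinguish the two, so check the CLI output or the HTML report for the cause.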

Step 4: Publish metrics (optional)

To publish evaluation scores to an OTLP-compatible backend, point the standard OpenTelemetry environment variable at the collector before invoking the CLI:

OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318" testworkflow config.yaml

Each evaluation produces one gauge observation per metric, labeled with experiment_name, scenario, and step. This is the same contract as the in-cluster publish-template.
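
When the CLI is launched from a script instead of a shell, the same variable can be set on the child process only, leaving the parent environment untouched. This is a small sketch under that assumption; with_otlp_endpoint is a hypothetical helper name.

```python
import os


def with_otlp_endpoint(endpoint: str, base_env=None) -> dict:
    """Return a copy of the environment with the OTLP endpoint variable set."""
    # Copy so that neither os.environ nor the caller's mapping is modified.
    env = dict(os.environ if base_env is None else base_env)
    env["OTEL_EXPORTER_OTLP_ENDPOINT"] = endpoint
    return env


# Example use (requires testworkflow on PATH):
#
#     import subprocess
#     subprocess.run(
#         ["testworkflow", "config.yaml"],
#         env=with_otlp_endpoint("http://localhost:4318"),
#     )
```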

Next steps