Introduction and Goals

Purpose and Scope

The Testbench is a Kubernetes-native evaluation system that measures the quality of AI agents using pluggable evaluation framework adapters. It executes test datasets against agents via the A2A (Agent-to-Agent) protocol and publishes evaluation metrics through OTLP to an observability backend.

The system operates as a sequential 4-phase pipeline:

  1. Setup — Download a test dataset

  2. Run — Execute queries against an agent under test via A2A protocol

  3. Evaluate — Calculate metrics using an LLM-as-a-judge approach

  4. Publish — Send evaluation results to an OTLP-compatible observability backend

An optional Visualize phase generates a self-contained HTML dashboard for local viewing and sharing.

Key Features

  • 4-phase sequential pipeline with well-defined intermediate data formats between phases

  • A2A protocol integration — evaluates any agent that implements the A2A JSON-RPC specification

  • Single-turn and multi-turn conversation support — automatic detection based on dataset structure

  • Configurable evaluation metrics — JSON/YAML configuration files with pluggable framework adapters

  • OTLP observability — per-sample metric gauges published to any OpenTelemetry-compatible backend

  • OpenTelemetry tracing — spans for each agent query with trace context propagation

  • HTML visualization — self-contained dashboard with charts, statistics, and searchable results

  • Testkube orchestration — packaged as reusable TestWorkflowTemplate CRDs composed into workflows

Role in the Agentic Layer

The Testbench is a component of the Agentic Layer platform, providing automated testing and quality assurance for deployed agents. It integrates with:

  • Agent Runtime Operator — manages the lifecycle of agents under test

  • AI Gateway — routes LLM requests for evaluation to configured model providers

Quality Goals

Goal Description

Reproducibility

Deterministic pipeline execution with versioned datasets and metric configurations

Extensibility

New metrics from any registered framework adapter are automatically discovered without code changes

Observability

Full traceability from test execution to evaluation results via OpenTelemetry

Portability

Single Docker image for all phases, deployable via Helm chart to any Kubernetes cluster