Cross-cutting Concepts
Observability
The Testbench integrates OpenTelemetry for both tracing and metrics.
Tracing (Run Phase)
The Run phase creates OpenTelemetry spans for each agent query:
- Tracer provider with a service name resource and span processor
- Span exporter using OTLP over HTTP/protobuf
- HTTP auto-instrumentation for distributed trace context propagation (W3C Trace Context)
- Per-query spans with attributes for user input, reference, agent URL, and workflow name
- Multi-turn parent spans with child spans per conversation turn, tracking context ID and message count
The trace ID from each span is stored in the experiment output and propagated through evaluation to publishing, linking the full pipeline to the original agent interaction.
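The propagation described above can be sketched in plain Python. This is a hedged illustration, not the Testbench's actual code: `run_query` and the record fields are hypothetical, but the traceparent header follows the W3C Trace Context format the Run phase propagates.

```python
import secrets

def new_trace_id() -> str:
    """Generate a 128-bit trace ID as 32 lowercase hex characters."""
    return secrets.token_hex(16)

def traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def run_query(user_input: str) -> dict:
    """Hypothetical per-query step: the trace ID is stored in the experiment
    output so evaluation and publishing can link back to this interaction."""
    trace_id = new_trace_id()
    span_id = secrets.token_hex(8)
    return {
        "user_input": user_input,
        "trace_id": trace_id,
        # Propagated to the agent via HTTP auto-instrumentation.
        "headers": {"traceparent": traceparent(trace_id, span_id)},
    }

record = run_query("What is the refund policy?")
```

Because the trace ID travels inside the experiment output rather than only in process memory, later phases can attach it to evaluation results without sharing a runtime with the Run phase.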
Metrics (Publish Phase)
The Publish phase exports per-sample evaluation gauges:
- Gauge: testbench_evaluation_metric (one observation per sample per metric)
- Attributes: metric name, workflow name, execution ID, execution number, trace ID, sample hash, truncated user input
- Resource: service name and workflow name
- Exporter: OTLP over HTTP/protobuf
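A minimal sketch of what one such observation could look like, assuming illustrative attribute keys and truncation limit (the actual key names and limit are not specified here):

```python
def truncate(text: str, limit: int = 80) -> str:
    """Shorten long user input so it stays usable as a metric attribute."""
    return text if len(text) <= limit else text[: limit - 1] + "…"

def gauge_observation(score: float, *, metric: str, workflow: str,
                      execution_id: str, execution_number: int,
                      trace_id: str, sample_hash: str, user_input: str) -> dict:
    """One testbench_evaluation_metric data point with its attribute set.
    Attribute key names are assumptions for illustration."""
    return {
        "name": "testbench_evaluation_metric",
        "value": score,
        "attributes": {
            "metric": metric,
            "workflow": workflow,
            "execution_id": execution_id,
            "execution_number": execution_number,
            "trace_id": trace_id,           # links back to the Run-phase span
            "sample_hash": sample_hash,
            "user_input": truncate(user_input),
        },
    }

obs = gauge_observation(0.9, metric="faithfulness", workflow="demo",
                        execution_id="exec-1", execution_number=1,
                        trace_id="ab" * 16, sample_hash="c3d4",
                        user_input="x" * 200)
```

Carrying the trace ID as an attribute is what lets a metrics backend pivot from a low score straight to the originating trace.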
A2A Protocol Integration
The Testbench communicates with agents under test using the A2A (Agent-to-Agent) JSON-RPC protocol.
- Client initialization: creates a minimal agent card from the agent URL
- Message sending: sends messages asynchronously and collects responses
- Context management: the A2A context ID maintains conversation state across multiple turns in multi-turn evaluations
- Response extraction: agent responses are extracted from A2A task artifacts and stored in the experiment output
Evaluation Framework
The Testbench uses a framework-agnostic adapter pattern for LLM-as-a-judge evaluation.
Adapter Pattern
The evaluation system is built around three core abstractions:
- Metrics Registry: central registry that manages framework adapters and provides a unified interface for metric creation
- Framework Adapter: defines the contract for plugging in evaluation frameworks (discovery, instantiation, wrapping)
- Metric Callable: unified interface for metric execution; accepts a sample and returns a score with an optional reason
Each framework adapter wraps its native metric instances in a callable that handles parameter filtering, data format translation, and result extraction. This decouples the evaluation pipeline from any specific framework.