Building Block View

Pipeline Overview

The Testbench operates as a sequential 4-phase pipeline. Each phase reads input from the previous phase’s output via a shared data volume.
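
The hand-off between phases can be sketched as follows. This is a minimal illustration, not the Testbench's actual implementation: it assumes each phase writes one JSON file to the shared volume, and the file layout, phase names, and the `run_pipeline` helper are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Optional, Sequence, Tuple

# A phase receives the previous phase's output (None for the first phase)
# and returns its own JSON-serializable output.
Phase = Callable[[Any], Any]


def run_pipeline(volume: Path, phases: Sequence[Tuple[str, Phase]]) -> Optional[Path]:
    """Run phases in order; each one reads the file its predecessor wrote."""
    previous: Optional[Path] = None
    for name, phase in phases:
        data = json.loads(previous.read_text()) if previous else None
        output = volume / f"{name}.json"  # hypothetical file layout
        output.write_text(json.dumps(phase(data)))
        previous = output
    return previous


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        final = run_pipeline(
            Path(tmp),
            [
                # Hypothetical phases standing in for the real ones.
                ("setup", lambda _: [{"query": "hello"}]),
                ("run", lambda rows: [dict(r, response="ok") for r in rows]),
            ],
        )
        print(final.read_text())
```

Passing data through files on the volume rather than in memory means each phase can run as a separate process or container, and a failed phase can be re-run without repeating the earlier ones.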

Purpose: Downloads a test dataset from S3/MinIO.

Processing:

Purpose: Executes test queries against an agent via the A2A protocol and records responses.

Processing:

- Initializes OpenTelemetry tracing
- Loads the test dataset
- For each dataset entry:
  - Creates an OpenTelemetry span
  - Sends the query to the agent via A2A
  - Records the response and trace ID
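
The per-entry loop above might look roughly like this. To stay dependency-free, the sketch replaces the real OpenTelemetry span with a stand-in context manager that only generates a 32-hex-character trace ID, and it takes the A2A call as a plain callable. The `start_span` and `run_queries` helpers and the `query`, `response`, and `trace_id` field names are assumptions for illustration, not the Testbench's actual schema.

```python
import secrets
from contextlib import contextmanager
from typing import Callable, Dict, Iterator, List


@contextmanager
def start_span(name: str) -> Iterator[str]:
    """Stand-in for an OpenTelemetry span: yields a random 128-bit trace ID as hex."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    yield trace_id


def run_queries(
    dataset: List[Dict[str, str]],
    send_query: Callable[[str], str],  # placeholder for the A2A client call
) -> List[Dict[str, str]]:
    """Send each dataset query to the agent and record response plus trace ID."""
    results = []
    for entry in dataset:
        with start_span("agent-query") as trace_id:
            response = send_query(entry["query"])
        results.append({**entry, "response": response, "trace_id": trace_id})
    return results


if __name__ == "__main__":
    # An echo function stands in for the agent.
    for row in run_queries([{"query": "ping"}], lambda q: f"echo: {q}"):
        print(row)
```

Recording the trace ID next to each response is what allows later phases to link an evaluation result back to the trace of the agent call that produced it.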

Purpose: Calculates evaluation metrics using the LLM-as-a-judge approach.

Processing:

Purpose: Publishes per-sample evaluation metrics to an OTLP-compatible backend.

Processing:

Purpose: Generates a self-contained HTML dashboard from evaluation results.

Features:

- Summary cards: total samples, metrics count, token usage, cost
- Workflow metadata header: workflow name, execution ID, execution number
- Overall scores bar chart: horizontal bars showing mean score per metric
- Metric distribution histograms: per-metric score distributions with min/max/mean/median statistics
- Detailed results table: all samples with per-metric scores, searchable and color-coded
- Multi-turn conversation visualization: chat-bubble layout with color-coded message types
- Self-contained HTML: works offline as a single file
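
The per-metric statistics behind the bar chart and histograms (min, max, mean, median) can be computed along these lines. This is a sketch under assumptions: each sample is taken to be a mapping from metric name to a numeric score, and the `summarize` helper and the metric names in the example are hypothetical.

```python
from statistics import mean, median
from typing import Dict, Iterable, List, Mapping


def summarize(samples: Iterable[Mapping[str, float]]) -> Dict[str, Dict[str, float]]:
    """Group scores by metric and compute min/max/mean/median for each metric."""
    by_metric: Dict[str, List[float]] = {}
    for sample in samples:
        for metric, score in sample.items():
            by_metric.setdefault(metric, []).append(score)
    return {
        metric: {
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for metric, scores in by_metric.items()
    }


if __name__ == "__main__":
    # Hypothetical metric names; a sample may lack some metrics.
    samples = [
        {"faithfulness": 0.5, "relevance": 1.0},
        {"faithfulness": 1.0, "relevance": 0.0},
        {"faithfulness": 0.75},
    ]
    for metric, stats in summarize(samples).items():
        print(metric, stats)
```

Samples that lack a given metric simply do not contribute to that metric's statistics, so the per-metric sample counts may differ.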