Building Block View

Pipeline Overview

The Testbench operates as a sequential 4-phase pipeline. Each phase after the first reads its input from the previous phase’s output via a shared data volume.

[Figure: Pipeline overview]

Phase 1: Setup

Purpose: Downloads a test dataset from S3/MinIO.

Processing:

  1. Downloads the dataset file

  2. Writes the output to the shared data volume

Phase 2: Run

Purpose: Executes test queries against an agent via the A2A protocol and records responses.

Processing:

  1. Initializes OpenTelemetry tracing

  2. Loads the test dataset

  3. For each dataset entry:

    • Creates an OpenTelemetry span

    • Sends the query to the agent via A2A

    • Records the response and trace ID
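
The per-entry loop above can be sketched roughly as follows. This is a minimal illustration, not the Testbench's actual code: the `user_input` field name, the `RunRecord` type, and the `send_query` callable (standing in for the A2A client) are assumptions, and `new_trace_id` is a stand-in for the trace ID that a real OpenTelemetry span would provide.

```python
import secrets
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class RunRecord:
    """One recorded agent interaction (hypothetical shape)."""
    user_input: str
    response: str
    trace_id: str


def new_trace_id() -> str:
    # Stand-in for the ID carried by a real OpenTelemetry span:
    # 16 random bytes rendered as 32 lowercase hex characters.
    return secrets.token_hex(16)


def run_phase(entries: Iterable[dict],
              send_query: Callable[[str], str]) -> list[RunRecord]:
    """Send each dataset entry to the agent and record response and trace ID."""
    records = []
    for entry in entries:
        trace_id = new_trace_id()                    # one span per dataset entry
        response = send_query(entry["user_input"])   # hypothetical A2A call
        records.append(RunRecord(entry["user_input"], response, trace_id))
    return records
```

Passing the agent call in as a parameter keeps the loop testable without a running agent, for example `run_phase(dataset, lambda q: "stub answer")`.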

Phase 3: Evaluate

Purpose: Calculates evaluation metrics using the LLM-as-a-judge approach.

Processing:

  1. Loads the experiment file

  2. Connects to the AI Gateway for LLM access

  3. Evaluates each row asynchronously
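
As an illustration of step 3, asynchronous row evaluation might look like the sketch below. The `judge` callable stands in for the LLM call made through the AI Gateway, and the concurrency limit is an assumption added for the example; the source only states that rows are evaluated asynchronously.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical judge signature: takes one row, returns a score.
Judge = Callable[[dict], Awaitable[float]]


async def evaluate_rows(rows: list[dict], judge: Judge,
                        max_concurrency: int = 4) -> list[float]:
    """Score all rows concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def score(row: dict) -> float:
        async with semaphore:
            return await judge(row)

    # gather preserves input order, so scores[i] belongs to rows[i].
    return await asyncio.gather(*(score(row) for row in rows))


async def fake_judge(row: dict) -> float:
    # Placeholder for an LLM-as-a-judge call; scores non-empty responses as 1.0.
    await asyncio.sleep(0)
    return 1.0 if row["response"] else 0.0
```

A caller would run it with `asyncio.run(evaluate_rows(rows, fake_judge))`, swapping `fake_judge` for a real judge implementation.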

Phase 4: Publish

Purpose: Publishes per-sample evaluation metrics to an OTLP-compatible backend.

Processing:

  1. Loads the evaluation results

  2. Creates an OTLP metric exporter for HTTP transport

  3. For each sample and metric, creates a gauge observation with attributes:

    • name — metric type (e.g., "faithfulness")

    • workflow_name — test workflow identifier

    • execution_id — Testkube execution ID

    • execution_number — numeric execution counter

    • trace_id — links to the trace from the Run phase

    • sample_hash — unique sample identifier

    • user_input_truncated — first 50 characters of user input

  4. Flushes all metrics to ensure immediate export
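
To make the attribute set in step 3 concrete, the sketch below assembles the attributes for one gauge observation. The attribute keys come from the list above; the function name, the sample's field names, and the way `sample_hash` is derived (a truncated SHA-256 of the user input) are assumptions for illustration only.

```python
import hashlib


def build_attributes(metric_name: str, sample: dict, workflow_name: str,
                     execution_id: str, execution_number: int) -> dict:
    """Build the attribute dict attached to one gauge observation."""
    user_input = sample["user_input"]
    return {
        "name": metric_name,
        "workflow_name": workflow_name,
        "execution_id": execution_id,
        "execution_number": execution_number,
        "trace_id": sample["trace_id"],
        # Assumed derivation: the real hashing scheme is not given in the source.
        "sample_hash": hashlib.sha256(user_input.encode("utf-8")).hexdigest()[:16],
        # First 50 characters of the user input.
        "user_input_truncated": user_input[:50],
    }
```

Each resulting dict would then be passed as the attributes of the gauge observation for that sample and metric.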

Optional: Visualize

Purpose: Generates a self-contained HTML dashboard from evaluation results.

Features:

  • Summary cards — total samples, metrics count, token usage, cost

  • Workflow metadata header — workflow name, execution ID, execution number

  • Overall scores bar chart — horizontal bars showing mean score per metric

  • Metric distribution histograms — per-metric score distributions with min/max/mean/median statistics

  • Detailed results table — all samples with per-metric scores, searchable and color-coded

  • Multi-turn conversation visualization — chat-bubble layout with color-coded message types

  • Self-contained HTML — works offline as a single file
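
The per-metric statistics shown in the histograms (min, max, mean, median) could be computed along these lines. This is a sketch under assumptions: the `summarize` function and the input shape, a mapping from metric name to a list of scores, are hypothetical.

```python
from statistics import mean, median


def summarize(scores_by_metric: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    """Compute min/max/mean/median per metric, skipping metrics with no scores."""
    return {
        metric: {
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for metric, scores in scores_by_metric.items()
        if scores
    }
```

The `mean` values here also correspond to what the overall scores bar chart displays per metric.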