The Observability Dashboard
What it is
The Observability Dashboard is a lightweight, in-cluster service that ingests
OpenTelemetry traces from Agentic Layer workloads and broadcasts structured agent
events to WebSocket clients in real time. It exposes an OTLP/HTTP endpoint at
/v1/traces, processes each incoming span through a preprocessing pipeline, and
pushes the resulting events to all connected dashboard clients. A bundled
single-page application renders those events as a live activity feed, allowing
operators to see agent lifecycle transitions, LLM calls, tool invocations, and
agent-to-agent interactions as they occur.
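As a rough illustration of the consumption side, the sketch below connects a headless Python client to the event stream using the websockets package. The WebSocket path (/ws) and the event fields shown are assumptions for illustration, not the documented protocol.

```python
# Minimal sketch of a headless event consumer. The /ws path and the
# "event_type" / "agent_name" fields are assumptions for illustration.
import asyncio
import json

import websockets  # pip install websockets


async def watch_events(url: str = "ws://localhost:8000/ws") -> None:
    async with websockets.connect(url) as ws:
        async for message in ws:
            event = json.loads(message)
            # Print a compact one-line summary per agent event.
            print(event.get("event_type"), event.get("agent_name"))


if __name__ == "__main__":
    asyncio.run(watch_events())
```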
Why it exists
Classical APM tools — distributed tracing platforms and log aggregators — are built
around request/response spans and service-level metrics. They work well for
understanding which HTTP call was slow or which database query took too long. Agentic
systems have a different shape: a single user message may trigger a conversation that
spans multiple agents, dozens of LLM calls, and many tool invocations, all linked by
a shared conversation_id. The raw OTel spans that capture this activity are correct
and importable into any standards-compliant backend, but navigating a flamegraph to
understand whether a summarizer agent received the right context from a news-fetcher
tool is cumbersome.
The Observability Dashboard solves this by translating raw spans into a vocabulary
that matches the mental model of agentic systems: agent_start, llm_call_end,
invoke_agent_start, and so on. It does not replace a general-purpose observability
stack — the Agentic Layer showcase also deploys Grafana with LGTM for metrics and
logs — but it provides a focused, low-friction view that surfaces agent-level
communication patterns without requiring knowledge of OTel span attributes or
TraceQL.
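To give that vocabulary a concrete shape, the sketch below models it as a small enum. Only agent_start, llm_call_end, and invoke_agent_start are named in this text; the remaining members are assumed counterparts and may not match the dashboard's actual event set.

```python
# Sketch of the agent-event vocabulary. Members marked "assumed" are
# plausible counterparts, not confirmed event types.
from enum import Enum


class AgentEventType(str, Enum):
    AGENT_START = "agent_start"
    AGENT_END = "agent_end"                # assumed
    LLM_CALL_START = "llm_call_start"      # assumed
    LLM_CALL_END = "llm_call_end"
    TOOL_CALL_START = "tool_call_start"    # assumed
    TOOL_CALL_END = "tool_call_end"        # assumed
    INVOKE_AGENT_START = "invoke_agent_start"
    INVOKE_AGENT_END = "invoke_agent_end"  # assumed
```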
How it fits
The dashboard sits downstream of both the AI Gateway and the Agent Runtime in the
Agentic Layer. The Agent Runtime Operator manages the agent workloads; each agent
framework emits OTel spans bearing conversation_id and agent_name attributes,
which the agents export to the dashboard’s /v1/traces endpoint over the
cluster-internal network.
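A minimal sketch of that export configuration with the OpenTelemetry Python SDK might look as follows; the service hostname and port are hypothetical, and each agent framework typically wires this up through its own integration.

```python
# Sketch: pointing an agent's OTel SDK at the dashboard's OTLP/HTTP
# endpoint. The in-cluster hostname and port are hypothetical.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(
    resource=Resource.create({"service.name": "news-fetcher-agent"})
)
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://observability-dashboard:8000/v1/traces")
    )
)
trace.set_tracer_provider(provider)
```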
When a trace batch arrives, the span preprocessor (span_preprocessor.py) examines
each span’s name against a set of known prefixes (before_agent, before_model,
after_tool, etc.) to determine the event type, extracts required span attributes,
and constructs a typed event object. The connection manager then delivers the event
as a JSON-serialized WebSocket message to every connected client whose optional
filter criteria (conversation ID and/or workforce name) match the event.
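The sketch below illustrates that prefix-matching step. The prefixes and the conversation_id and agent_name attributes come straight from the description above; the mappings for before_model and after_tool are assumed counterparts, and the real span_preprocessor.py will differ in detail.

```python
# Sketch of span-name prefix matching. Mappings marked "assumed" are
# inferred counterparts, not confirmed behavior of span_preprocessor.py.
PREFIX_TO_EVENT = {
    "before_agent": "agent_start",
    "before_model": "llm_call_start",  # assumed counterpart
    "after_tool": "tool_call_end",     # assumed counterpart
}


def span_to_event(span_name: str, attributes: dict) -> dict | None:
    """Map a raw OTel span to a typed dashboard event, or drop it."""
    for prefix, event_type in PREFIX_TO_EVENT.items():
        if span_name.startswith(prefix):
            return {
                "event_type": event_type,
                "conversation_id": attributes.get("conversation_id"),
                "agent_name": attributes.get("agent_name"),
            }
    return None  # spans that match no known prefix are dropped
```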
The result is a push-based pipeline: OTel spans in → structured events out → WebSocket push to dashboard. No polling, no persistent storage, no secondary database.
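To make the final hop concrete, here is a sketch of the fan-out step, assuming a Starlette-style WebSocket object with a send_text coroutine; the real connection manager's interface is not documented here.

```python
# Sketch of per-client filtered fan-out. Class and field names are
# illustrative, not the actual connection manager API.
import json
from typing import Any


class ConnectionManager:
    def __init__(self) -> None:
        # Maps each connected WebSocket to its optional filter criteria.
        self.clients: dict[Any, dict] = {}

    async def broadcast(self, event: dict) -> None:
        """Push an event to every client whose filters match it."""
        for ws, filters in self.clients.items():
            wanted_conv = filters.get("conversation_id")
            wanted_wf = filters.get("workforce_name")
            if wanted_conv and event.get("conversation_id") != wanted_conv:
                continue
            if wanted_wf and event.get("workforce_name") != wanted_wf:
                continue
            await ws.send_text(json.dumps(event))
```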
Trade-offs and alternatives
WebSocket push vs. polling
The dashboard delivers events to browsers over a persistent WebSocket connection rather than having clients poll an HTTP endpoint. This reduces latency from multi-second poll intervals to near-zero and eliminates redundant requests when no events are occurring. The trade-off is that each connected client holds an open TCP connection for the lifetime of the session, which can become a resource concern at scale. For the observability use case — a small number of human operators watching a dashboard — this is the right choice.
Structured events vs. raw spans
The preprocessing step discards spans that do not match the known agent-event patterns and maps the ones that do into a small, stable set of event types. This makes the WebSocket protocol easy to consume in a browser without requiring clients to understand OTel semantics. The cost is that the dashboard silently drops spans it does not recognize; consumers that need the full OTel record should send traces to a general-purpose backend (such as the LGTM stack in the showcase) in parallel.
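Running both sinks in parallel is straightforward with the OpenTelemetry SDK: register one span processor per backend, as in the sketch below. Both endpoints are hypothetical.

```python
# Sketch: fan out the same spans to the dashboard and a general-purpose
# backend by registering two processors. Endpoints are hypothetical.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider()
for endpoint in (
    "http://observability-dashboard:8000/v1/traces",  # agent-event view
    "http://lgtm-stack:4318/v1/traces",               # full OTel record
):
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint))
    )
trace.set_tracer_provider(provider)
```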
In-cluster deployment vs. external SaaS APM
Deploying the dashboard in-cluster keeps agent trace data within the Kubernetes network boundary and avoids egress costs or data-residency concerns. It also means that the dashboard’s state is ephemeral: events are not persisted, and filter registry entries expire after 24 hours of inactivity. Teams that need historical trace search, alerting, or long-term retention should export spans to a persistent backend alongside the dashboard.
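For illustration, a TTL rule consistent with that 24-hour expiry could look like the sketch below; the registry structure and function names are invented for this example.

```python
# Sketch of a TTL-pruned filter registry. The structure is illustrative;
# only the 24-hour inactivity window comes from the description above.
import time

FILTER_TTL_SECONDS = 24 * 60 * 60

# client_id -> (filter criteria, last-activity timestamp)
_registry: dict[str, tuple[dict, float]] = {}


def touch(client_id: str, filters: dict) -> None:
    """Record activity for a client, refreshing its expiry window."""
    _registry[client_id] = (filters, time.time())


def prune(now: float | None = None) -> None:
    """Drop registry entries inactive for longer than the TTL."""
    now = time.time() if now is None else now
    for client_id, (_, last_seen) in list(_registry.items()):
        if now - last_seen > FILTER_TTL_SECONDS:
            del _registry[client_id]
```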