Inside the LLM Call: GenAI Observability with OpenTelemetry

Your AI agent just took 45 seconds to answer a simple question. Was it the model? A slow tool call? A retry loop? Every time an application calls an LLM, a chain of model calls, tool invocations, and token exchanges happens behind the scenes — and without observability, you are guessing.

The OpenTelemetry Semantic Conventions for Generative AI give you that visibility. They standardize how GenAI operations are recorded — the model being called, input and output token counts, and when opted in, the full content of prompts, completions, tool calls, and tool results.

This post walks through:

  • Exporting GenAI telemetry from an LLM-powered app.
  • Configuring an observability tool to receive and display that telemetry.
  • Exploring GenAI traces, metrics, and events with a GenAI visualizer.

Exporting GenAI telemetry

For this walkthrough, we use VS Code Copilot to generate telemetry, since most developers already have it installed. However, many coding assistants support monitoring with OpenTelemetry:

  • VS Code Copilot emits traces, metrics, and events for every agent interaction.
  • OpenAI Codex exports structured log events and OTel metrics for API requests, tool calls, and sessions.
  • Claude Code exports metrics and log events via OTel, with trace support in beta.

Beyond monitoring the tools you already use, you can add OpenTelemetry to your own GenAI-powered app to get insight into how it interacts with LLMs.

Configure telemetry export

Telemetry export requires a few settings. For VS Code Copilot, open Settings and search for copilot otel:

SettingDescriptionValue
github.copilot.chat.otel.enabledEnable OTel emissiontrue
github.copilot.chat.otel.captureContentCapture full prompt/response contenttrue
github.copilot.chat.otel.otlpEndpointOTLP collector endpoint"http://localhost:4318" (default, no change needed)

By default, no prompt content or tool arguments are captured with GenAI telemetry, as these can contain sensitive data. Only metadata like model names, token counts, and durations are included. Enabling content capture populates span attributes with full prompt messages, system prompts, tool schemas, tool arguments, and tool results.

Exploring GenAI telemetry

Any OTLP-compatible backend can receive GenAI telemetry. For this walkthrough, we use the Aspire Dashboard — a free, open source telemetry viewer that ships as a Docker container. It accepts OTLP data directly and provides a built-in trace viewer, metrics explorer, and structured logs page — no cloud account required. It is well suited for local development and debugging of GenAI workloads.

Run the following Docker command to get started:

docker run --rm -p 18888:18888 -p 4317:18889 -p 4318:18890 -d --name aspire-dashboard \
    -e ASPIRE_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS=true \
    mcr.microsoft.com/dotnet/aspire-dashboard:latest

The dashboard collects telemetry sent to http://localhost:4318, and you can view telemetry by visiting http://localhost:18888. The dashboard also requires authentication by default. Use -e ASPIRE_DASHBOARD_UNSECURED_ALLOW_ANONYMOUS=true to allow anonymous access during local development.

Explore traces

GenAI operations from VS Code Copilot are now recorded and observable. Ask Copilot a question in VS Code, then open the Traces page in the dashboard. You will see entries for each LLM interaction.

Selecting a trace reveals the span tree: the top-level invoke_agent span with child chat spans for each LLM call and execute_tool spans for each tool invocation.

Aspire Dashboard showing the span tree for a GenAI trace

The span details show GenAI semantic convention attributes:

  • gen_ai.request.model — the model used (for example, gpt-4o).
  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — token counts for each LLM call.
  • gen_ai.response.finish_reasons — why the model stopped generating (for example, stop or tool_calls).

When an app is configured to record content, messages and tool calls are captured as structured span attributes such as gen_ai.system_instructions, gen_ai.input.messages, and gen_ai.output.messages. This content is valuable for debugging, but these attributes can be large, and many observability platforms render them as raw JSON, making them difficult to read.

Observability tools can include specialized UI for viewing GenAI telemetry. We’ll use a GenAI telemetry visualizer that parses these attributes and renders a chat-style view of the conversation, showing system prompts, user messages, assistant responses, and tool call arguments and results.

Aspire Dashboard GenAI telemetry visualizer showing a chat-style view of prompts and responses

No more guessing about LLM usage or digging through raw JSON. With GenAI telemetry, every prompt, response, and tool call is visible at a glance.

Explore metrics

Navigate to the Metrics page and select the copilot-chat service. The GenAI metrics are prefixed with gen_ai:

  • gen_ai.client.operation.duration — histogram of LLM call latencies. Filter by gen_ai.request.model to compare models.
  • gen_ai.client.token.usage — histogram of token consumption. Filter by gen_ai.token.type to separate input from output tokens.

These metrics let you estimate per-request cost, catch token-hungry prompts before they hit production, detect latency regressions, and monitor usage patterns across models and agents.

Dashboard metrics page showing GenAI metrics

Beyond this demo

The GenAI semantic conventions are already in use today and under active development — your feedback on real-world usage directly shapes what gets standardized next.