An Introduction to Observability for LLM-based applications using OpenTelemetry

Large Language Models (LLMs) are extremely popular right now, powering a wide range of applications from simple chatbots to coding assistants like Copilot that help software engineers write code. As LLMs see growing use in production, it’s important to understand and monitor how these models behave.

In the following example, we’ll use Prometheus and Jaeger as the backends for the metrics and traces generated by OpenLIT, an auto-instrumentation LLM monitoring library, and Grafana to visualize the LLM monitoring data. That said, you can store OTel metrics and traces in any backend of your choice.

Why Observability Matters for LLM Applications

Monitoring LLM applications is crucial for several reasons:

  1. It’s vital to keep track of how often LLMs are being used, for both usage and cost tracking.
  2. Latency is important to track, since response times can vary widely depending on the inputs passed to the LLM.
  3. Rate limiting is a common challenge, particularly with externally hosted LLMs. As applications depend more on these external API calls, hitting a rate limit can prevent them from performing their essential functions.

By keeping a close eye on these aspects, you can not only save costs but also avoid hitting request limits, ensuring your LLM applications perform optimally.

What are the signals that you should be looking at?

Using Large Language Models (LLMs) in applications differs from working with traditional machine learning (ML) models, primarily because LLMs are often accessed through external API calls instead of being run locally or in-house. It is crucial to capture the sequence of events (using traces), especially in a RAG-based application where there can be events both before and after the LLM call. Analyzing aggregated data (through metrics) such as request counts, token usage, and cost also provides a quick overview that is important for optimizing performance and managing costs. Here are the key signals to monitor:

Traces

  • Request Metadata: This is important in the context of LLMs, given the variety of parameters (like temperature and top_p) that can drastically affect both the response quality and the cost. Specific aspects to monitor are:

    • Temperature: Indicates the level of creativity or randomness desired from the model’s outputs. Varying this parameter can significantly impact the nature of the generated content.

    • top_p: Controls how selective the model is through nucleus sampling: the model chooses from the smallest set of most likely tokens whose cumulative probability adds up to top_p. A higher top_p value means the model considers a wider range of words, making the text more varied.

    • Model Name or Version: Essential for tracking over time, as updates to the LLM might affect performance or response characteristics.

    • Prompt Details: The exact inputs sent to the LLM, which, unlike in-house ML models where inputs might be more controlled and homogeneous, can vary wildly and affect output complexity and cost implications.

  • Response Metadata: Given the API-based interaction with LLMs, tracking the specifics of the response is key for cost management and quality assessment:

    • Tokens: Token counts directly impact cost and serve as a measure of response length and complexity.

    • Cost: Critical for budgeting, as API-based costs can scale with the number of requests and the complexity of each request.

    • Response Details: Similar to the prompt details but from the response perspective, providing insights into the model’s output characteristics and potential areas of inefficiency or unexpected cost. A sketch of how this metadata might be recorded on a span follows this list.
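To make this concrete, here is a minimal sketch of recording such request and response metadata on a span with the OpenTelemetry Python API. The attribute names loosely follow the OpenTelemetry GenAI semantic conventions and the values are placeholders; in practice, an auto-instrumentation library such as OpenLIT sets these for you:

from opentelemetry import trace

tracer = trace.get_tracer("llm.app")

# Record request parameters before the LLM call and response metadata after it
with tracer.start_as_current_span("chat gpt-4o-mini") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")   # model name or version (placeholder)
    span.set_attribute("gen_ai.request.temperature", 0.7)       # randomness of the output
    span.set_attribute("gen_ai.request.top_p", 0.9)             # nucleus sampling threshold
    # ... make the LLM API call here ...
    span.set_attribute("gen_ai.usage.input_tokens", 42)         # prompt tokens (placeholder value)
    span.set_attribute("gen_ai.usage.output_tokens", 128)       # completion tokens (placeholder value)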

Metrics

  • Request Volume: The total number of requests made to the LLM service. This helps in understanding the demand patterns and identifying any anomaly in usage, such as sudden spikes or drops.
  • Request Duration: The time it takes for a request to be processed and a response to be received from the LLM. This includes network latency and the time the LLM takes to generate a response, providing insights into the performance and reliability of the LLM service.
  • Cost and Token Counters: Keeping track of the total cost accrued and tokens consumed over time is essential for budgeting and cost-optimization strategies. Monitoring these metrics can alert you to unexpected increases that may indicate inefficient use of the LLM or the need for optimization. A sketch of recording such metrics follows this list.
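As an illustration, here is a minimal sketch using the OpenTelemetry Python metrics API. The instrument names and recorded values are placeholders chosen for this example; OpenLIT emits comparable metrics automatically:

from opentelemetry import metrics

meter = metrics.get_meter("llm.app")

# Count every request made to the LLM service
request_counter = meter.create_counter(
    "llm.requests", unit="1", description="Total LLM requests"
)
# Record end-to-end request duration, including network latency
duration_histogram = meter.create_histogram(
    "llm.request.duration", unit="s", description="LLM request duration"
)
# Track tokens consumed and cost accrued over time
token_counter = meter.create_counter(
    "llm.usage.tokens", unit="1", description="Tokens consumed"
)
cost_counter = meter.create_counter(
    "llm.usage.cost", unit="USD", description="Cost accrued"
)

# Record values around an LLM call (placeholder numbers)
request_counter.add(1, {"model": "gpt-4o-mini"})
duration_histogram.record(1.8, {"model": "gpt-4o-mini"})
token_counter.add(170, {"model": "gpt-4o-mini"})
cost_counter.add(0.0021, {"model": "gpt-4o-mini"})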

An Example Setup

Prerequisites

Before we begin, make sure you have the following running in your environment:

  • Prometheus
  • Jaeger
  • Grafana

Setting Up the OpenTelemetry Collector

First, install the OpenTelemetry Collector. You can find instructions in the Collector installation guide.

Configuring the Collector

Next, you need to tell the Collector where to send the data. Here’s a simple configuration for sending metrics to Prometheus and traces to Jaeger:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
  memory_limiter:
    # 80% of maximum memory up to 2G
    limit_mib: 1500
    # 25% of limit up to 2G
    spike_limit_mib: 512
    check_interval: 5s

exporters:
  prometheusremotewrite:
    endpoint: 'YOUR_PROMETHEUS_REMOTE_WRITE_URL'
    add_metric_suffixes: false
  otlp:
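    # Jaeger accepts OTLP natively (gRPC on port 4317), so the standard OTLP exporter works here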
    endpoint: 'YOUR_JAEGER_URL'

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
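
Save this configuration to a file (the filename otel-collector-config.yaml below is just an example) and start the Collector with it. The exact binary name depends on your installation; the prometheusremotewrite exporter ships with the contrib distribution:

otelcol-contrib --config otel-collector-config.yaml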

Instrument your LLM Application with OpenLIT

OpenLIT is an OpenTelemetry-based library designed to streamline the monitoring of LLM-based applications by offering auto-instrumentation for a variety of Large Language Models and VectorDBs.

It aligns with the GenAI semantic conventions established by the OpenTelemetry community and does not rely on vendor-specific span or event attributes or environment variables for OTLP endpoint configuration, which makes for a smooth, standards-based integration.

Install the library

To install the OpenLIT Python Library, run this command:

pip install openlit

Then, add these lines to your LLM application:

import openlit

openlit.init(
  otlp_endpoint="YOUR_OTELCOL_URL:4318",
)

Alternatively, you can pass the OpenTelemetry Collector URL through the OTEL_EXPORTER_OTLP_ENDPOINT environment variable:

export OTEL_EXPORTER_OTLP_ENDPOINT="YOUR_OTELCOL_URL:4318"

import openlit

openlit.init()
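
Putting it together, here is a minimal sketch of an instrumented application that uses the OpenAI Python client. The model name and prompt are placeholders, and any provider that OpenLIT auto-instruments works the same way:

import openlit
from openai import OpenAI

# Initialize OpenLIT before making LLM calls so that they are auto-instrumented
openlit.init(otlp_endpoint="YOUR_OTELCOL_URL:4318")

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# This call is traced automatically; the resulting span carries request and
# response metadata such as model, parameters, token usage, and cost
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "What is OpenTelemetry?"}],
    temperature=0.7,
)
print(response.choices[0].message.content)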

Visualize the metrics and traces

After your OpenTelemetry Collector starts sending metrics to Prometheus and traces to Jaeger, follow these steps to visualize them in Grafana. You can use any tool of your choice to visualize this data:

Add Prometheus as a data source

  1. In Grafana, navigate to Connections > Data Sources.
  2. Click Add data source and select Prometheus.
  3. In the settings, enter your Prometheus URL, for example, http://<your_prometheus_host>, along with any other necessary details.
  4. Select Save & Test.

Add Jaeger as a data source

  1. In Grafana, navigate to Connections > Data Sources.
  2. Click Add data source and select Jaeger.
  3. In the settings, enter your Jaeger URL, for example, http://<your_jaeger_host>, along with any other necessary details.
  4. Select Save & Test.

Add the dashboard

To make things easy, you can use the dashboard the OpenLIT team made. Just grab the JSON from here.

This guide showed you how to use OpenTelemetry, Prometheus, Jaeger, and Grafana to monitor your LLM applications.

If you have any questions, reach out on my GitHub @ishanjainn or Twitter @ishan_jainn.