Resiliency

How to configure a resilient OTel Collector pipeline

The OpenTelemetry Collector is designed with components and configurations to minimize data loss during telemetry processing and exporting. However, understanding the potential scenarios where data loss can occur and how to mitigate them is crucial for a resilient observability pipeline.

Understanding Collector resilience

A resilient Collector maintains telemetry data flow and processing capabilities even when facing adverse conditions, ensuring the overall observability pipeline remains functional.

The Collector’s resilience primarily revolves around how it handles data when the configured endpoint (the destination for traces, metrics, or logs) becomes unavailable or when the Collector instance itself experiences issues like crashes.

Sending queue (in-memory buffering)

The most basic form of resilience built into the Collector’s exporters is the sending queue.

  • How it works: When an exporter is configured, it usually includes a sending queue that buffers data in memory before sending it to the downstream endpoint. If the endpoint is available, data passes through quickly.

  • Handling Endpoint Unavailability: If the endpoint becomes unavailable, for example due to network issues or a backend restart, the exporter cannot send data immediately. Instead of dropping the data, it adds it to the in-memory sending queue.

  • Retry Mechanism: The Collector employs a retry mechanism with exponential backoff and jitter, repeatedly attempting to send the buffered data at increasing intervals. By default, it retries a given batch for up to 5 minutes before giving up.

  • Data Loss Scenarios:

    • Queue Full: The in-memory queue has a configurable size (queue_size, default 1000 batches/requests). If the endpoint remains unavailable and new data keeps arriving, the queue fills up. Once it is full, incoming data is dropped to prevent the Collector from running out of memory.
    • Retry Timeout: If data cannot be delivered within the configured maximum retry duration (max_elapsed_time, default 5 minutes), the Collector stops retrying it and drops it.
  • Configuration: You can configure the queue size and retry behavior within the exporter settings:

    exporters:
      otlp:
        endpoint: otlp.example.com:4317
        sending_queue:
          queue_size: 5_000 # Increase queue size (default 1000)
        retry_on_failure:
          initial_interval: 5s
          max_interval: 30s
          max_elapsed_time: 10m # Increase max retry time (default 300s)
    
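
A rough way to size the queue is to think about how long an outage it must absorb. As a purely illustrative calculation: if each queued request holds about 2,000 spans and the pipeline receives about 5,000 spans per second, a queue_size of 5,000 buffers roughly 5,000 × 2,000 ÷ 5,000 = 2,000 seconds (about 33 minutes) of downstream unavailability, while the default of 1,000 covers only about 400 seconds. Base the actual numbers on your own batch sizes and telemetry volume, and keep the resulting memory footprint within the limits of the host.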

Persistent storage (write-ahead log - WAL)

To protect against data loss if the Collector instance itself crashes or restarts, you can enable persistent storage for the sending queue using the file_storage extension.

  • How it works: Instead of just buffering in memory, the sending queue writes data to a Write-Ahead Log (WAL) on disk before attempting to export it.

  • Handling Collector Crashes: If the Collector crashes while data is in its queue, the data is persisted on disk. When the Collector restarts, it reads the data from the WAL and resumes attempts to send it to the endpoint.

  • Data Loss Scenario: Data loss can still occur if the disk fails or runs out of space, or if the endpoint remains unavailable beyond the retry limits even after the Collector restarts. The durability guarantees are also not as strong as those of a dedicated message queue.

  • Configuration:

    1. Define the file_storage extension.
    2. Reference the storage ID in the exporter’s sending_queue configuration.
    extensions:
      file_storage: # Define the extension instance
        directory: /var/lib/otelcol/storage # Choose a persistent directory
    
    exporters:
      otlp:
        endpoint: otlp.example.com:4317
        sending_queue:
          storage: file_storage # Reference the storage extension instance
    
    service:
      extensions: [file_storage] # Enable the extension in the service section
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp]
    
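
If you run several exporters and want to keep their on-disk queues in separate locations, you can define multiple named instances of the extension using the Collector's standard type/name component syntax and point each exporter's sending_queue at its own instance. A minimal sketch; the exporter names, endpoints, and directories are illustrative:

  extensions:
    file_storage/traces_backend:
      directory: /var/lib/otelcol/storage/traces
    file_storage/logs_backend:
      directory: /var/lib/otelcol/storage/logs

  exporters:
    otlp/traces_backend:
      endpoint: traces.example.com:4317
      sending_queue:
        storage: file_storage/traces_backend
    otlp/logs_backend:
      endpoint: logs.example.com:4317
      sending_queue:
        storage: file_storage/logs_backend

  service:
    extensions: [file_storage/traces_backend, file_storage/logs_backend]

A single file_storage instance can also be shared by several exporters; separate instances are mainly useful when you want each destination's data on its own directory or volume.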

Message queues

For the highest level of resilience, especially between different Collector tiers (like Agent to Gateway) or between your infrastructure and a vendor backend, you can introduce a dedicated message queue like Kafka.

  • How it works: One Collector instance (Agent) exports data to a Kafka topic using the Kafka exporter. Another Collector instance (Gateway) consumes data from that Kafka topic using the Kafka receiver.

  • Handling Endpoint/Collector Unavailability:

    • If the consuming Collector (Gateway) is down, messages simply accumulate in the Kafka topic (up to Kafka’s retention limits). The producing Collector (Agent) is unaffected as long as Kafka is up.
    • If the producing Collector (Agent) is down, no new data enters the queue, but the consumer can continue processing existing messages.
    • If Kafka itself is down, the producing Collector needs its own resilience mechanism (like a sending queue, potentially with WAL) to buffer data destined for Kafka.
  • Data Loss Scenario: Data loss is primarily tied to Kafka itself (cluster failure, topic misconfiguration, data expiration) or failure of the producer to send to Kafka without adequate local buffering.

  • Configuration:

    • Agent Collector Config (Producer):

      exporters:
        kafka:
          brokers: ['kafka-broker1:9092', 'kafka-broker2:9092']
          topic: otlp_traces
      
      receivers:
        otlp:
          protocols:
            grpc:
      
      service:
        pipelines:
          traces:
            receivers: [otlp]
            exporters: [kafka]
      
    • Gateway Collector Config (Consumer):

      receivers:
        kafka:
          brokers: ['kafka-broker1:9092', 'kafka-broker2:9092']
          topic: otlp_traces
          initial_offset: earliest # Process backlog
      
      exporters:
        otlp:
          endpoint: otlp.example.com:4317
          # Consider queue/retry for exporting *from* Gateway to Backend (see the sketch below)
      
      service:
        pipelines:
          traces:
            receivers: [kafka]
            exporters: [otlp]
      
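
Acting on the comment above: the Gateway's export hop to the backend benefits from the same sending queue, retry, and (optionally) persistent storage settings described earlier, and the Agent's Kafka exporter can be protected the same way in case Kafka itself is temporarily unreachable. A sketch of the Gateway's exporter side, extending the consumer config above and reusing the file_storage extension from the previous section (the endpoint and sizes are illustrative):

  extensions:
    file_storage:
      directory: /var/lib/otelcol/storage

  exporters:
    otlp:
      endpoint: otlp.example.com:4317
      sending_queue:
        queue_size: 5_000
        storage: file_storage # Optional: survive Gateway restarts
      retry_on_failure:
        initial_interval: 5s
        max_interval: 30s
        max_elapsed_time: 10m

  service:
    extensions: [file_storage] # Remember to enable the extension
    pipelines:
      traces:
        receivers: [kafka]
        exporters: [otlp]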

Circumstances of data loss

Data loss can occur under these circumstances:

  1. Network Unavailability + Timeout: The downstream endpoint is unavailable for longer than the configured max_elapsed_time in the retry_on_failure settings.
  2. Network Unavailability + Queue Overflow: The downstream endpoint is unavailable, and the sending queue (in-memory or persistent) fills to capacity before the endpoint recovers. New data is dropped.
  3. Collector Crash (No Persistence): The Collector instance crashes or is terminated, and it was only using an in-memory sending queue. Data in memory is lost.
  4. Persistent Storage Failure: The disk used by the file_storage extension fails or runs out of space.
  5. Message Queue Failure: The external message queue (like Kafka) experiences an outage or data loss event, and the producing collector doesn’t have adequate local buffering.
  6. Misconfiguration: Exporters or receivers are incorrectly configured, preventing data flow.
  7. Disabled Resilience: Sending queues or retry mechanisms are explicitly disabled in the configuration (see the example after this list).
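
The last point refers to configurations like the following, which turn off both mechanisms on an exporter. This sketch reuses the OTLP exporter from the earlier examples; if you inherit a config that looks like this, re-enable the queue and retries unless data loss is explicitly acceptable:

  exporters:
    otlp:
      endpoint: otlp.example.com:4317
      sending_queue:
        enabled: false # No buffering when the endpoint is slow or briefly unavailable
      retry_on_failure:
        enabled: false # Failed exports are dropped immediately instead of being retried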

Recommendations for preventing data loss

Follow these recommendations to minimize data loss and ensure reliable telemetry data collection:

  1. Always Use Sending Queues: Enable sending_queue for exporters sending data over the network.
  2. Monitor Collector Metrics: Actively monitor otelcol_exporter_queue_size, otelcol_exporter_queue_capacity, and otelcol_exporter_send_failed_spans (and their equivalents for metrics and logs) to detect potential issues early; a minimal internal-telemetry config is sketched after this list.
  3. Tune Queue Size & Retries: Adjust queue_size and retry_on_failure parameters based on your expected load, memory/disk resources, and acceptable endpoint downtime.
  4. Use Persistent Storage (WAL): For Agents or Gateways where data loss during a Collector restart is unacceptable, configure the file_storage extension for the sending queue.
  5. Consider Message Queues: For maximum durability across network segments or to decouple Collector tiers, use a managed message queue like Kafka if the operational overhead is acceptable.
  6. Use Appropriate Deployment Patterns:
    • Employ an Agent + Gateway architecture. Agents handle local collection, Gateways handle processing, batching, and resilient export.
    • Focus resilience efforts (queues, WAL, Kafka) on network hops: Agent -> Gateway and Gateway -> Backend.
    • Resilience between the application (SDK) and a local Agent (Sidecar/DaemonSet) is often less critical because the network hop is local and reliable; aggressive buffering at this hop can even hurt the application (for example, through memory pressure) if the Agent is unavailable.
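
For recommendation 2, the Collector reports these metrics through its internal telemetry, configured under service.telemetry. A minimal sketch, assuming a recent Collector release that exposes internal metrics on a local Prometheus endpoint; the level, host, and port are illustrative, and the exact schema varies between versions (older releases use a simpler metrics.address field), so check the internal telemetry documentation for the version you run:

  service:
    telemetry:
      metrics:
        level: detailed # Verbosity of internal metrics (none, basic, normal, detailed)
        readers:
          - pull:
              exporter:
                prometheus:
                  host: 0.0.0.0
                  port: 8888

Scrape that endpoint with your metrics backend and alert when otelcol_exporter_queue_size approaches otelcol_exporter_queue_capacity or when otelcol_exporter_send_failed_spans grows.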

By understanding these mechanisms and applying the appropriate configurations, you can significantly enhance the resilience of your OpenTelemetry Collector deployment and minimize data loss.