Managed Telemetry Platforms for Kubernetes Workloads
Summary
This blueprint provides strategic guidance for organizations aiming to follow Platform Engineering practices to ease adoption of OpenTelemetry tooling and standards across their engineering teams. This includes usage of SDKs, instrumentation libraries, configuration patterns, and Collector architectures to provide centrally managed telemetry platforms paired with self-serve tooling designed to be consumed “as-a-service”.
It is aimed at organizations operating in cloud and Kubernetes environments, wishing to provide a consistent, scalable, and governed telemetry platform across workloads owned by highly autonomous product teams, achieving the following outcomes:
- Consistent SDK and instrumentation configuration, improving time-to-value by facilitating adoption of organization-specific standards across all workloads, reducing cognitive load for product teams.
- Cohesive semantic conventions that allow for telemetry correlation between signals, applications, and domains, from client-side to infrastructure, providing high-quality telemetry to be utilized by manual or automatic analysis.
- Elimination of Collector configuration sprawl, reducing operational toil through consolidation of telemetry pipelines.
- Resilient, scalable, and reliable ingest pipelines for all telemetry signals avoiding single points of failure.
- Centralized telemetry governance and data optimization to reduce operational costs and carbon emissions by minimizing storage, network transfer, and compute requirements of telemetry processing.
- Future-proof telemetry pipelines that shield product teams from changes in the underlying observability backend, enabling data migrations or multi-vendor strategies with minimal changes to application instrumentation or collection infrastructure.
Background
As organizations increase the rate of adoption of cloud native standards and modern software delivery practices, they often adopt federated models where teams, or business units, operate with high autonomy and are made responsible for the full Software Development Lifecycle (SDLC) of their systems, from designing to operating software in production.
This “you build it, you run it” model is designed to empower product delivery; however, it can inadvertently create fragmented service management practices and cluttered observability landscapes that fail to reap the benefits of OpenTelemetry and modern observability tooling. Product teams prioritize feature delivery over Non-Functional Requirements (NFRs), like telemetry instrumentation, and see these tasks as a burden on their delivery goals.
To address this, organizations are widely adopting cloud native Platform Engineering models to reduce cognitive load and abstract complexity. By treating observability as a curated internal platform product, organizations can offer a paved road, or a golden path, that ensures high-quality, contextual observability with minimal friction, while allowing teams to remain focused on instrumenting domain-specific concepts impossible to capture in out-of-the-box telemetry.
Common challenges
Organizations operating in these federated, distributed environments typically face a distinct set of challenges that hinder effective observability and cloud native maturity.
1. Inconsistent configuration and low adoption of organization standards
In environments where product teams operate with autonomy, distinct ways of configuring individual applications and services for observability may coexist, while still operating under a shared compute layer. This includes setting up OpenTelemetry SDKs for applications, configuring instrumentation packages and libraries, or deciding how to propagate observability context from/to their dependencies.
Organizations may have a set of documented engineering standards they wish all engineers to follow, but they often rely on manual implementation of these standards by each individual team, including configuration and code-level changes. This is often treated by teams as an afterthought, not part of the software design process, and focused on a particular application without considering the overall distributed system in a holistic way.
---
title: "Figure 1: Silos due to lack of consistent semantic conventions and context propagation."
config:
  flowchart:
    curve: basis
---
flowchart LR
subgraph K8sNode["Kubernetes Node"]
direction TB
AppA["📦 App A"]:::node
AppB["📦 App B"]:::node
Collector["🔀 Collector"]:::node
end
subgraph TracesDB["🧵️Traces Backend"]
direction LR
TraceX[("🧵 Trace X")]:::node
TraceY[("🧵 Trace Y")]:::node
end
subgraph MetricsDB["📈 Metrics Backend"]
Metrics[("📈 Container Metrics")]:::node
end
User["👤 User"]:::node
User L_User_AppA@-- Inbound request --> AppA
AppA L_AppA_AppB@-. "Dependency<br>(broken trace context)" .-x AppB
TracesDB L_TracesDB_MetricsDB@x-. "Broken correlation<br>(missing k8s.* attributes)" .-x MetricsDB
AppA L_AppA_TraceX@== Spans ==> TraceX
AppB L_AppB_TraceY@== Spans ==> TraceY
Collector L_Collector_MetricsDB@== "Metrics<br>(k8s.pod.name=app-...)" ==> MetricsDB
classDef node fill:#ffffff, stroke:#818cf8, stroke-width:2px, color:#6b7280
style K8sNode fill:#eef2ff, stroke:#818cf8, stroke-width:2px, color:#818cf8
style TracesDB fill:#eef2ff, stroke:#818cf8, stroke-width:2px, color:#818cf8
style MetricsDB fill:#eef2ff, stroke:#818cf8, stroke-width:2px, color:#818cf8
linkStyle 0 stroke:#7dd3fc, fill:none, stroke-width:3px
linkStyle 1 stroke:#fca5a5, fill:none, stroke-width:3px
linkStyle 2 stroke:#fca5a5, fill:none, stroke-width:3px
linkStyle 3,4,5 stroke:#a3e635, fill:none, stroke-width:3px
L_User_AppA@{ animation: slow }
L_AppA_AppB@{ animation: slow }
L_TracesDB_MetricsDB@{ animation: slow }
L_AppA_TraceX@{ animation: fast }
L_AppB_TraceY@{ animation: fast }
L_Collector_MetricsDB@{ animation: fast }
This leads to:
- Inconsistent semantic conventions: Telemetry lacks common Resource attributes (e.g., service.version, k8s.cluster.name, example.cost.center), breaking correlation across different signals, applications, and system layers, and limiting the usefulness of observability data for automatic analysis.
- Context silos: Without consistent context propagation (e.g. W3C Trace Context) baked into every SDK, distributed traces break at service boundaries, making it impossible to tie backend performance regressions to customer-facing business impact.
- SDK version fragmentation: Widely different versions of OpenTelemetry SDKs running in production, introducing maintenance and security concerns.
- High cognitive load: Developers must manually configure SDKs and instrumentation packages for every new service, increasing toil and the risk of misconfiguration.
- Reduced velocity: Any change in engineering standards related to telemetry instrumentation, or any change in the underlying observability backend such as data or protocol migrations, creates friction and reduces overall velocity in the organization as technology adoption is ultimately hampered by manual implementation.
2. Collector configuration sprawl across clusters
As OpenTelemetry adoption scales, and organizations deploy across tens or hundreds of Kubernetes clusters, managing individual OpenTelemetry Collector configurations manually across these environments creates a maintenance burden. This is especially challenging in organizations where different Collector deployments are handled by different teams.
This leads to:
- Configuration drift: Different clusters end up with varying parsing rules, filtering logic, and endpoint configurations, causing unpredictable telemetry behavior.
- Lack of separation of concerns: There is no clear distinction between the different types of telemetry processing done at different layers of Collectors (e.g. where to transform, where to sample) which can lead to inconsistent or incomplete data.
- Manual toil: Platform teams spend an excessive amount of time on repetitive configuration tasks and manual updates, rather than building scalable solutions.
- Unreliable rollouts: Without version-controlled, auditable deployments, applying a fix or a new configuration across the fleet becomes highly risky and error-prone.
3. Data pipelines not optimized for observability data requirements
In some legacy instrumentation models, applications or instrumentation agents often export telemetry directly to telemetry backends. This model lacks a way to process and transform telemetry between the application and the backend, reducing data sovereignty. It can also add extra complexity if the backend is a third-party vendor, or any endpoint requiring public traffic or authentication. Managing credentials across thousands of applications can be challenging, and sporadic network connectivity issues between a single exporter and a public endpoint can create service interruptions.
Conversely, in environments where data pipelines are centralized, data requirements for telemetry data are often mixed with those for other types of data. This can lead to solutions that are optimized for completeness (e.g. audit logging, financial data reporting) rather than context-aware transformations and low-latency processing. This increases the time between data emission and actionable insights, necessary to maintain reliable operations.
This leads to:
- Single points of failure: Direct egress from hundreds of individual applications to the internet strips the organization of central network governance and load balanced exports.
- Latency and operational value: Ultimately, stale observability data is little better than no observability data. Overly complex logging pipelines can introduce significant lag, rendering real-time operational alerts useless during a major incident.
- Lack of central control: Platform teams cannot easily reroute data, change vendors, or apply global network policies when configurations are deeply embedded within individual applications.
The scope of this blueprint is defined by common challenges faced by platform teams when providing pipelines optimized for low latency and efficient resource usage. In certain scenarios, like those requiring audit logging or business reporting, balancing completeness or durability guarantees is critical. These challenges are out of scope for this blueprint and may be targeted in a separate blueprint. See our guidance if you are interested in contributing.
4. Lack of telemetry governance and low ROI
Without centralized governance and measurable adoption of observability standards, autonomous teams may generate vast amounts of low-value data, reducing the signal-to-noise ratio. OpenTelemetry signals are often not used for their intended purpose, ultimately making their production harder to maintain for platform teams (e.g. having to ensure fast and accurate querying over days or weeks of individual logs simply to compute the number of requests for a given service). As traffic grows and telemetry volume increases, teams in charge of observability have no scalable way of ensuring data quality across their landscape.
This leads to:
- Unattributed data quality issues: As consistent semantic conventions are not enforced, platform teams cannot associate telemetry spend or data quality with specific business units or engineering teams.
- Inefficient data types: Organizations incur heavy storage and indexing costs for raw logs or other signals when not used for their intended purpose, while reducing the overall quality of the insights extracted from observability data.
- Unnecessary costs: Increasing costs associated with data storage, network egress, or ingest into a particular backend, incurred from data that does not always improve the insights one may require to operate systems reliably.
- Carbon emissions: Processing of low-value data can be detrimental to achieving green software targets, including scope 3 emissions from embedded carbon present in the devices necessary for fast retrieval of observability data, e.g. SSDs.
- High cognitive load: Large data volumes not only result in unnecessary costs, they may also increase noise, forcing users and agents to filter through low-quality data to find relevant telemetry.
Multi-tenant environments often deal with strict compliance requirements (GDPR, HIPAA, PCI) and security concerns such as authentication and encryption between pipeline layers. These challenges are out of scope for this blueprint and may be targeted in a separate blueprint. See our guidance if you are interested in contributing.
5. Low observability and operational efficiency of SDKs and data pipelines
One of the challenges of operating OpenTelemetry SDKs and Collectors in production is identifying if, and when, the default configuration for queuing, retrying, or batching of telemetry data is not optimal for a particular environment. OpenTelemetry’s sensible defaults may suit neither a leaner approach to resource utilization nor higher reliability guarantees. This may depend on the architectural patterns in use, e.g. exporting to a local cluster endpoint may require less buffering than exporting to a public internet endpoint.
This leads to:
- Silent data drops and export failures: Exports to backends or Collectors fail and data is ultimately dropped, without those errors being observed or alerted on.
- Unnecessary resource utilization: Operators overprovision resources for SDKs and Collectors, increasing performance overhead and cost.
General Guidelines
1. Centralize default, extensible configuration for SDKs and instrumentation packages
Challenges addressed: 1, 4 | Implementation actions: 1, 2
We recommend teams in charge of observability tooling maintain a set of resources (see Action 1) to provide basic, out-of-the-box configuration for SDKs and instrumentation libraries. The aim is for applications deployed in a Kubernetes cluster to emit a basic level of telemetry, and to propagate context from and to dependencies, with minimal input required from application owners, e.g. at most adding an annotation, or calling a shared internal library.
Platform teams should ensure that this base configuration remains extensible, allowing application owners to control different aspects of the SDK (e.g. buffer sizes, exporter retries) and instrumentation libraries, to meet the requirements specific to their applications.
By implementing this guideline, organizations can expect to achieve:
- Cohesive organization standards: Specific organization standards (e.g. resource attributes or exporter endpoints) are applied automatically across the stack.
- Consistent context propagation: Trace Context is propagated between services using compatible propagator configurations.
- Lower cognitive load: Application owners can abstract themselves from lower-level configuration, such as that related to setting up the OpenTelemetry SDK.
- Easier maintenance: Effort to adopt engineering standards and best practices in observability is minimized, as new standards can be rolled out via version bumps of internal tooling.
2. Establish shared ownership for telemetry production
Challenges addressed: 4, 5 | Implementation actions: 1, 2, 5
To balance governance and autonomy, platform teams operating in the environments described in this blueprint should aim to “shift left” on instrumentation, ensuring that application owners have full control and ownership of the telemetry emitted by their applications. Default configurations mentioned in Guideline 1 should ensure that provenance of data is guaranteed, including technical attributes (e.g. cluster, deployment, pod) and organizational information (e.g. team, business domain), with the aim of making it trivial to identify the source of telemetry, and the owning team.
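As a minimal sketch of the provenance attributes described above, a base pod template could derive technical attributes from the Kubernetes Downward API and set organizational ones statically, using the standard OTEL_SERVICE_NAME and OTEL_RESOURCE_ATTRIBUTES environment variables. The workload name, image, namespace, cluster name, and the example.team and example.business.domain attribute keys are hypothetical placeholders, not organization or OpenTelemetry standards.

```yaml
# Hypothetical Deployment; names, image, and attribute values are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: shop
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout:1.4.2
          env:
            # Technical provenance, resolved per pod through the Downward API.
            - name: K8S_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: K8S_NAMESPACE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: OTEL_SERVICE_NAME
              value: checkout
            # $(VAR) references expand because the variables are declared earlier in this list.
            # example.* keys stand in for organization-specific attributes.
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "k8s.pod.name=$(K8S_POD_NAME),k8s.namespace.name=$(K8S_NAMESPACE_NAME),k8s.cluster.name=example-cluster,service.version=1.4.2,example.team=payments,example.business.domain=checkout"
```

In practice these values would normally be injected by platform tooling rather than written by hand in each manifest, so that application owners only add their own domain-specific attributes.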
OpenTelemetry client design principles establish a clear separation between the API, which is a no-op implementation by default, and the SDK, which provides an implementation for that API when registered. This separation of responsibilities allows application owners to rely solely on the OpenTelemetry API, focusing their efforts on enriching telemetry with domain-specific context (e.g., business transactions, user IDs) that is impossible to capture generically, while relying on the provided default configuration to produce telemetry out of the box.
---
title: "Figure 2: Shared ownership model between platform teams and application owners."
config:
  flowchart:
    curve: basis
---
flowchart TD
subgraph User["Application Ownership"]
Application["Application"]:::node
Config["⚙️<br>Config"]:::node
end
subgraph Platform["Platform Ownership"]
Collector[("🔀️<br>Collector<br>Pipelines")]:::node
BaseConfig["⚙️<br>Base Config"]:::node
end
subgraph Application["Application"]
AppCode["💼<br>Biz Logic"]:::node
ThirdParty["👽<br>3rd-Party Libs"]:::node
subgraph OTel["OpenTelemetry"]
InstLibs["📦<br>Instrumentation"]:::node
OTelSDK["📦<br>OTel SDK"]:::node
OTelAPI["📦<br>OTel API"]:::node
end
end
Sink[("🗄️ Observability Backend")]:::node
AppCode L_AppCode_API@-- Uses --> OTelAPI
ThirdParty L_ThirdParty_API@-- Uses --> OTelAPI
InstLibs L_InstLibs_API@-- Uses --> OTelAPI
OTelAPI L_SDK_API@-. Implemented By .-> OTelSDK
Config L_Config_InstLibs@-.-> InstLibs
Config L_Config_SDK@-.-> OTelSDK
OTelSDK L_SDK_Collector@-- Exports --> Collector
Collector L_Collector_Sink@--> Sink
BaseConfig L_BaseConfig_Config@-- Extended By --> Config
classDef node fill:#ffffff, stroke:#818cf8, stroke-width:2px, color:#6b7280
style User fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style Platform fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style Application fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style OTel fill:#dde4ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
linkStyle 0,1,2 stroke:#7dd3fc, fill:none, stroke-width:3px
linkStyle 3,4,5,8 stroke:#fde68a, fill:none, stroke-width:3px
linkStyle 6,7 stroke:#a3e635, fill:none, stroke-width:3px
L_AppCode_API@{ animation: fast }
L_ThirdParty_API@{ animation: fast }
L_InstLibs_API@{ animation: fast }
L_SDK_API@{ animation: slow }
L_Config_InstLibs@{ animation: slow }
L_Config_SDK@{ animation: slow }
L_SDK_Collector@{ animation: fast }
L_Collector_Sink@{ animation: fast }
L_BaseConfig_Config@{ animation: slow }
This model relies on OpenTelemetry’s API design to abstract implementation details. We recommend considering direct usage of the different signal APIs and avoiding building further abstractions around them, unless these provide more value than simply hiding implementation details. When required, SDK features (e.g. Metric Views, or Span Processors) can be utilized to transform telemetry at the application level (see Guideline 4).
Weaver can help teams to manage organization-specific semantic convention registries, and to measure and validate adherence to those, ensuring instrumentation quality by design. Learn more about Weaver in this blogpost. Semantic conventions governance is out of scope for this blueprint and may be targeted in a future blueprint. See our guidance if you are interested in contributing.
Ultimately, application owners should remain owners of the telemetry data emitted by their applications (both manually and automatically instrumented), and be accountable for its quality and resiliency. This includes monitoring and alerting on SDK telemetry (automatically configured by the platform team in languages that support it) and optimizing SDK configuration according to their specific application needs. This involves tuning SDK components like the BatchSpanProcessor or the PeriodicMetricReader to change buffer sizes, retry queues, cardinality limits, or timeouts, as required by their telemetry volumes.
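The tuning described above could look like the following container env fragment, a sketch in which an application owner overrides platform defaults through environment variables defined by the OpenTelemetry specification. The values are illustrative rather than recommendations, and support for each variable varies by language SDK, so it is worth checking the SDK documentation before relying on any of them.

```yaml
# Fragment of a container spec: application-owner overrides on top of platform defaults.
# All durations are in milliseconds. Values are illustrative only.
env:
  # BatchSpanProcessor: larger queue and batches for a high-throughput service.
  - name: OTEL_BSP_MAX_QUEUE_SIZE
    value: "8192"    # specification default: 2048
  - name: OTEL_BSP_MAX_EXPORT_BATCH_SIZE
    value: "1024"    # specification default: 512; must not exceed the queue size
  - name: OTEL_BSP_SCHEDULE_DELAY
    value: "2000"    # specification default: 5000
  - name: OTEL_BSP_EXPORT_TIMEOUT
    value: "10000"   # specification default: 30000
  # PeriodicMetricReader: export metrics more frequently than the default.
  - name: OTEL_METRIC_EXPORT_INTERVAL
    value: "30000"   # specification default: 60000
  - name: OTEL_METRIC_EXPORT_TIMEOUT
    value: "10000"   # specification default: 30000
```

Larger queues trade application memory for fewer dropped spans during export slowdowns, which is why these settings are best owned by the team that knows the telemetry volume of the service.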
By implementing this guideline, organizations can expect to achieve:
- Correlation to business outcomes: Telemetry emitted by applications contains the necessary domain and business logic context to correlate user experience to technical components and infrastructure.
- Clear ownership and responsibilities: Provenance of data is ensured, allowing teams to measure telemetry quality and ensure standards are adopted at scale.
- Improved usage of telemetry signals: As application owners become more familiar with OpenTelemetry signals, guided by organization standards, their optimal usage of OpenTelemetry APIs will improve.
- Reliable telemetry production: Monitoring internal SDK metrics provides application or platform owners with the necessary information to optimize aspects regarding queuing, retrying, or batching of telemetry data.
3. Maintain a set of centrally managed Collector Gateways
Challenges addressed: 2, 3, 4 | Implementation actions: 1, 3, 5
We recommend that telemetry in this type of Kubernetes environment is automatically ingested into a centralized layer deployed as an OpenTelemetry Collector Gateway. The base configuration provided as part of Guideline 1 should ensure that telemetry is exported to this layer using OTLP.
---
title: "Figure 3: General behaviour of an OpenTelemetry Collector Gateway."
config:
  flowchart:
    curve: basis
---
flowchart LR
subgraph App["Application"]
SDK["📦 OTel SDK"]:::node
end
LB["⚖️Load Balancer"]:::node
subgraph OTelCol["Collector Gateway"]
direction TB
C1["🔀 Collector 1"]:::node
C2["🔀 Collector 2"]:::node
C3["🔀 Collector 3"]:::node
end
Backend[("🗄️ Backend")]:::node
SDK L_SDK_LB@-- "OTLP" --> LB
LB L_LB_C1@--> C1
LB L_LB_C2@--> C2
LB L_LB_C3@--> C3
C1 L_C1_Backend@--> Backend
C2 L_C2_Backend@--> Backend
C3 L_C3_Backend@--> Backend
classDef node fill:#ffffff, stroke:#818cf8, stroke-width:2px, color:#6b7280
style App fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style OTelCol fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
linkStyle 0,1,2,3,4,5,6 stroke:#a3e635, fill:none, stroke-width:3px
L_SDK_LB@{ animation: fast }
L_LB_C1@{ animation: fast }
L_LB_C2@{ animation: fast }
L_LB_C3@{ animation: fast }
L_C1_Backend@{ animation: fast }
L_C2_Backend@{ animation: fast }
L_C3_Backend@{ animation: fast }
In multi-tenant environments, multiple Collector Gateways may need to be chained to accommodate different scenarios. For instance, multi-cluster setups may use local Gateways per cluster and a global Gateway for tail-sampling (see Guideline 4), while heavily federated environments may use namespace-scoped Gateways managed by independent teams, feeding into a cluster-wide Gateway.
Ideally, base SDK configuration should automatically select the optimal Collector endpoint and any necessary credentials according to information available in the application environment (e.g. locality-based traffic routing, or conditionally changing the server address depending on environment name).
Finally, depending on organization-specific conditions, different OpenTelemetry signals may be given different non-functional requirements. For instance, due to their stable telemetry volumes and their use in critical alerts, metrics may be assigned higher reliability requirements than spans, favoring dropping data on the latter before affecting the former. To accommodate these conditions, platform teams may consider different options, including:
- Isolated Gateways per signal: Deploy separate Gateways for logs, metrics, and spans. Isolated deployments simplify compute resource allocation and capacity planning per signal, but shared processor configuration must be duplicated across Gateways. This can be managed through external templating tools, e.g. Kapitan, Kustomize, or using multiple config locations to override each other. However, it may increase maintenance toil.
- Multiple memory limiters on a single Gateway: Defining separate memory_limiter configurations per signal, with different thresholds. This relies on the OTLP receiver in front of a memory_limiter returning a retryable error code to OTLP clients (e.g. SDKs or other Collectors) when telemetry is refused, applying backpressure as required. Pipelines with lower priority can then be configured with lower memory limiter thresholds in order to apply backpressure earlier, leaving memory headroom for higher priority pipelines.
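The second option could be sketched as the following Collector configuration, where the metrics pipeline is treated as higher priority than the traces pipeline. The thresholds and the backend endpoint are assumptions chosen for illustration; appropriate values depend on the memory allocated to each Gateway replica and on observed telemetry volumes.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Higher-priority pipeline: data is refused once usage exceeds the soft limit
  # (limit minus spike limit), here 65% of available memory.
  memory_limiter/metrics:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 15
  # Lower-priority pipeline: soft limit of 45%, so backpressure starts earlier
  # and leaves headroom for the metrics pipeline.
  memory_limiter/traces:
    check_interval: 1s
    limit_percentage: 60
    spike_limit_percentage: 15
  batch: {}

exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical backend endpoint

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter/metrics, batch]   # memory limiter goes first
      exporters: [otlp]
    traces:
      receivers: [otlp]
      processors: [memory_limiter/traces, batch]
      exporters: [otlp]
```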
Platform engineers should make use of internal Collector telemetry to ensure the reliability of the data ingested, processed, and exported by their pipelines, and optimize their configuration accordingly. This includes configuring components like the memory_limiter, or OTLP exporter options like sending_queue or retry_on_failure. The same metrics should also inform autoscaling: rather than relying on default CPU-based autoscaling, Collector Gateway fleets should scale on pipeline queue depth or memory consumption, so that they can handle sudden telemetry spikes.
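As a rough illustration of these options, the fragment below sets sending_queue and retry_on_failure on an OTLP exporter and raises the verbosity of the Collector's internal metrics. It is a fragment rather than a complete configuration (receivers and pipelines are omitted), and the endpoint and all values are assumptions to be adjusted against observed queue usage and failure rates.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical backend endpoint
    sending_queue:
      enabled: true
      num_consumers: 10     # parallel senders draining the queue
      queue_size: 5000      # batches buffered while the backend is slow or unavailable
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # data is dropped once retries exceed this window

service:
  telemetry:
    metrics:
      level: detailed   # emit more granular internal Collector metrics
```

Internal metrics reporting exporter queue size, queue capacity, and failed sends are the natural inputs for alerts on silent data loss; exact metric names depend on the Collector version in use.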
By implementing this guideline, organizations can expect to achieve:
- Pipelines optimized for observability data requirements: By combining OTLP exporter and receiver configurations with load-balanced, reliable Collector pipelines, teams are able to fulfil their reliability requirements on a per-signal basis.
- Efficient use of compute resources: Horizontally scaled, centralized Gateways utilize compute resources more efficiently than per-node DaemonSets or per-pod Sidecars in heterogeneous, multi-tenant environments. DaemonSets normally have to be overprovisioned to handle variable node sizes (e.g. a single node may serve 4 or 40 application pods) and variable per-pod telemetry volume that fluctuates over time. Keeping per-node footprints small matters as teams often struggle to schedule workloads on smaller nodes. A central Gateway tier scales independently, sized to total telemetry volume.
- Consolidated Collector configuration: As described in Action 3, this model allows for a consolidated deployment of Collector configuration across multiple layers, minimizing maintenance toil and reducing risk of change failure.
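An independently scaled Gateway tier, as described in this guideline, could be sketched with a HorizontalPodAutoscaler that scales on memory rather than CPU. The Deployment name, namespace, replica bounds, and utilization target are hypothetical. Scaling on pipeline queue depth instead would additionally require exposing Collector metrics through a custom metrics adapter, which is not shown here.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-gateway            # hypothetical name
  namespace: observability      # hypothetical namespace
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-gateway
  minReplicas: 3                # keeps several replicas to avoid a single point of failure
  maxReplicas: 20
  metrics:
    # Memory utilization is measured relative to the container memory request.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # avoid flapping after short telemetry spikes
```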
4. Efficiently aggregate, process, and sample telemetry at different layers
Challenges addressed: 3, 4 | Implementation actions: 2, 4
At an application level, OpenTelemetry client design decouples instrumentation APIs and their SDK implementations. This allows instrumentation authors (including application or library owners) to use the API to record measurements, create spans, or emit log records, without having to define how those will be aggregated in memory, processed, and ultimately exported. This decision can be deferred to the moment when meter, tracer, and logger providers are created as part of the SDK setup. Configuration of these aspects should be shared, with platform teams providing a basic layer of configuration, and application owners extending that configuration for their particular use cases.
At a distributed system level, different trace sampling techniques may be used to efficiently store the most valuable traces in a consistent manner. See Appendix 1 for an introduction to these techniques.
When trace sampling is implemented, consistent use of semantic conventions becomes crucial. Metrics provide complete (yet aggregated) views of telemetry, using Exemplars to correlate to high granularity trace spans for a given operation, which can then link to logs and other telemetry signals (e.g. profiles). Using standard semantic conventions and consistent Resource attributes also empowers correlation between these signals, allowing operators to “zoom in” from long-term, aggregated metric streams to highly-granular, contextual traces.
The following diagram provides a summary of different layers where aggregation, processing, and sampling may be configured in a tail-sampling, multi-cluster scenario.
---
title: "Figure 4: Multi-tenant architecture with Trace ID based global load balancing and tail sampling."
config:
  flowchart:
    curve: basis
---
flowchart LR
subgraph LocalA["Local Gateway"]
direction LR
LA1["🔀 Collector"]:::node ~~~ LA2["🔀 Collector"]:::node
end
subgraph ClusterA["Cluster A"]
direction TB
AppA["📦 OTel SDK"]:::node
LocalA
end
subgraph LocalB["Local Gateway"]
direction LR
LB1["🔀 Collector"]:::node ~~~ LB2["🔀 Collector"]:::node
end
subgraph ClusterB["Cluster B"]
direction TB
AppB["📦 OTel SDK"]:::node
LocalB
end
subgraph LB_Layer["Load Balancing Layer"]
direction TD
GLB1["🔀 Collector"]:::node ~~~ GLB2["🔀 Collector"]:::node ~~~ GLB3["🔀 Collector"]:::node
end
subgraph SamplingLayer["Tail Sampling Layer"]
direction TD
TS1["🔀 Collector"]:::node ~~~ TS2["🔀 Collector"]:::node ~~~ TS3["🔀 Collector"]:::node
end
subgraph GlobalTier["Unified Global Gateway"]
direction LR
LB_Layer
SamplingLayer
end
ObsBackend[("🗄️ Observability Backend")]:::node
AppA L_AppA_LocalA@-- OTLP --> LocalA
AppB L_AppB_LocalB@-- OTLP --> LocalB
LocalA L_LocalA_LBLayer@-- "OTLP (spans)" --> LB_Layer
LocalB L_LocalB_LBLayer@-- "OTLP (spans)" --> LB_Layer
LB_Layer L_LBLayer_Sampling@-- "Route by<br>Trace ID" --> SamplingLayer
LocalA L_LocalA_Backend@-- "OTLP (metrics & logs)" --> ObsBackend
LocalB L_LocalB_Backend@-- "OTLP (metrics & logs)" --> ObsBackend
SamplingLayer L_Sampling_Backend@-- "OTLP (sampled spans)" --> ObsBackend
AppB -.- n1["Head sampling, aggregation, limits, etc."]:::note
LocalA -.- n2["Redaction, enrichment, OTTL, governance, etc."]:::note
SamplingLayer -.- n4["Sample traces, post-processing"]:::note
classDef node fill:#ffffff, stroke:#818cf8, stroke-width:2px, color:#6b7280
classDef note fill:#f9fafb, stroke:#c7d2fe, stroke-width:1px, color:#9ca3af
style ClusterA fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style ClusterB fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style LocalA fill:#dde4ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style LocalB fill:#dde4ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style GlobalTier fill:#eef2ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style LB_Layer fill:#dde4ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
style SamplingLayer fill:#dde4ff, stroke:#818cf8, stroke-width:1px, color:#818cf8
linkStyle 6,7 stroke:#7dd3fc, fill:none, stroke-width:3px
linkStyle 8,9,10,13 stroke:#a3e635, fill:none, stroke-width:3px
linkStyle 11,12 stroke:#fde68a, fill:none, stroke-width:3px
linkStyle 14,15,16 stroke:#c7d2fe, fill:none, stroke-width:1px
L_AppA_LocalA@{ animation: fast }
L_AppB_LocalB@{ animation: fast }
L_LocalA_LBLayer@{ animation: fast }
L_LocalB_LBLayer@{ animation: fast }
L_LBLayer_Sampling@{ animation: fast }
L_LocalA_Backend@{ animation: fast }
L_LocalB_Backend@{ animation: fast }
L_Sampling_Backend@{ animation: fast }
Generally, processing of telemetry should be done as close as possible to the application layer, avoiding compute and transfer costs. However, deferring processing decisions to different Collector layers may be desirable in certain situations, such as facilitating maintenance, enforcing standards, performing advanced filtering/transformations with OTTL, or securing pipelines with redaction rules to ensure sensitive information never reaches a particular backend.
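As an illustration of central redaction and filtering with OTTL, the processors fragment below uses the transform and filter processors from the Collector contrib distribution. The attribute keys, the token pattern, and the health check path are assumptions chosen for the example; real rules would follow each organization's own data policies.

```yaml
processors:
  # Redaction: remove or mask sensitive values before they leave the Gateway.
  transform/redact:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "http.request.header.authorization")
          - replace_pattern(attributes["url.full"], "token=[^&]+", "token=REDACTED")
  # Filtering: drop spans that match a condition, here health check requests.
  filter/healthchecks:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/healthz"'
```

Both processors would then be listed in the processors of the relevant traces pipeline on the Gateway, typically after the memory_limiter and before batching.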
By combining intelligent sampling, metric aggregation at different layers, and central transform/filter processors to reduce noisy telemetry, this architecture can reduce transfer and compute costs while preserving operational visibility for engineering teams.
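The two layers of the global Gateway in Figure 4 could be sketched as two separate Collector configurations, shown here as two YAML documents. The first uses the loadbalancing exporter to route spans by Trace ID; the second applies the tail_sampling processor. Both components come from the Collector contrib distribution. Service hostnames, the backend endpoint, and the policy thresholds are hypothetical, and the in-cluster connection is assumed to be plaintext for brevity.

```yaml
# Load balancing layer: sends all spans of a given trace to the same sampling Collector.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: plaintext in-cluster traffic; enable TLS where required
    resolver:
      dns:
        # Hypothetical headless Service that resolves to every sampling Collector pod.
        hostname: otel-tail-sampling-headless.observability.svc.cluster.local
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
---
# Tail sampling layer: decides per trace once its spans have been collected.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  tail_sampling:
    decision_wait: 10s      # how long to wait for the spans of a trace before deciding
    num_traces: 50000       # traces held in memory while waiting
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
exporters:
  otlp:
    endpoint: backend.example.com:4317   # hypothetical backend endpoint
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Routing by Trace ID matters because a sampling decision is only consistent when a single Collector sees every span of the trace it is evaluating.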
By implementing this guideline, organizations can expect to achieve:
- Efficient telemetry volumes: Optimal use of OpenTelemetry signals, sampling, and aggregation provides telemetry volumes that allow organizations to balance granularity, cost, and observability requirements.
- Efficient use of compute resources: Positioning data processing at different layers limits the data transfer and compute resources associated with data that can be aggregated or filtered at early stages.
- Central governance and guardrails: Platform teams have a central point to control data emissions, allowing them to filter, transform, redact, or completely block telemetry that does not follow organization standards or adhere to data volume limits, safeguarding the organization from emitting unwanted data to backends.
Implementation
1. Use OpenTelemetry Operator, or internal shared packages, for application-level configuration
Guidelines implemented: 1
If the environment in scope runs a supported Kubernetes version and uses languages supported by auto-instrumentation, we recommend prioritizing the use of the OpenTelemetry Operator for Kubernetes for auto-instrumentation. This involves:
- Installing the OpenTelemetry Operator.
- Creating the relevant Instrumentation CRs to configure SDKs and instrumentation.
- Adding annotations to individual pods or namespaces (to instrument all pods in a namespace).
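As a minimal sketch of the steps above, a platform team could provision one Instrumentation CR, and application owners could opt in with a pod annotation. The resource names, namespaces, Gateway endpoint, and container image are hypothetical, and the sampler settings are illustrative only. The right endpoint port depends on the OTLP protocol used by each language agent.

```yaml
# Hypothetical platform-wide Instrumentation CR managed by the platform team.
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: platform-default
  namespace: observability
spec:
  exporter:
    # Assumed cluster-local Gateway Service, OTLP/HTTP port.
    endpoint: http://otel-gateway.observability.svc.cluster.local:4318
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "1"
---
# Application Deployment opting in to Java auto-instrumentation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
  namespace: shop
spec:
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
      annotations:
        # "<namespace>/<name>" refers to the Instrumentation CR above.
        instrumentation.opentelemetry.io/inject-java: "observability/platform-default"
    spec:
      containers:
        - name: checkout
          image: registry.example.com/shop/checkout:1.4.2
```

Setting the same annotation on a Namespace instead of a pod template would instrument all pods in that namespace.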
If deploying the OpenTelemetry Operator is not possible/compatible, we recommend providing application owners with build-time resources to easily configure the OpenTelemetry SDK and instrumentation libraries. This can be implemented following two main models:
- For languages supported by zero-code instrumentation, we recommend providing base container images that download instrumentation agents/libraries, provide default configuration, and configure the base CMD on the resulting container image to use these settings.
- For languages not supported by zero-code instrumentation, we recommend providing shared language-specific libraries that take care of configuring the OpenTelemetry SDK and instrumentation libraries programmatically, providing hooks for users of said libraries to extend this configuration as required.
This non-operator model puts application owners in charge of using these base container images or shared libraries in their codebase. While it may initially require more effort than auto-attached instrumentation, it provides a mechanism for platform teams to manage phased upgrades or configuration changes with minor version bumps of their internal libraries, requiring no further code changes from application owners.
When managing centralized configuration in base container images or internal libraries, and when supported by the language, we recommend standardizing on the use of declarative configuration. Although currently not fully supported by all languages, this YAML-based configuration model provides consistency in SDK and instrumentation configuration.
2. Include organization standards into default, extensible application-level configuration
Guidelines implemented: 1, 2, 4
Regardless of how the configuration is delivered as part of Action 1, we recommend that the platform team include the following minimum base configuration as part of their offering:
- Exporters: OTLP HTTP/protobuf (default) or OTLP gRPC, configured to export to the most suitable Collector (e.g. a local Gateway in the same cluster). See Appendix 2 and Action 3 for more details on side effects of OTLP gRPC used with standard Kubernetes Services.
  - Note: Backend/SaaS endpoints or API keys should not be included in application-level configuration, as we recommend handling these at a Collector Gateway.
- Propagators: W3C Trace Context (tracecontext) to ensure distributed traces do not break across service boundaries. If necessary, include legacy formats as secondary options (the Propagators API will prioritize them in the order they are configured).
- Resource detectors: Auto-detectors for the underlying infrastructure (e.g., cloud provider, Kubernetes, OS, container) to achieve consistency with no manual input.
- Instrumentation libraries: Ensure a minimal set of instrumentation libraries is configured out of the box. If auto-instrumentation is being used, platform teams should not enable all instrumentation libraries by default. Instead, they should carefully select the ones most critical to their environment, prioritizing client and server instrumentation (e.g. gRPC, HTTP, messaging, database).
- Processors, readers and views: Settings specific to the backend in use (e.g. aggregation temporality, export intervals, attribute limits) or organization-wide standards (e.g. span/metric attributes).
  - Note: Depending on language implementations, OTLP exporters may retry when receiving retryable errors such as HTTP 429, 503, or gRPC UNAVAILABLE with optional RetryInfo. However, these exporters do not have the same capabilities as Collectors in terms of sending queues, and will drop batches of data if unsuccessful. Platform teams should manage sensible defaults for these buffer sizes and prioritize export to local Collectors (e.g. cluster-local Gateway) to move telemetry out of the application process as fast and reliably as possible. Application owners should monitor SDK telemetry, when available, and react accordingly.
- Organization-specific resource attributes: Standard conventions critical for routing, billing, and ownership. At a minimum, we recommend:
  - service.name, ideally extracted from existing environment variables or labels injected via CI/CD tooling.
  - service.version to identify the telemetry source during blue/green deployments or progressive rollouts.
  - service.namespace or service.owner for resource ownership.
  - deployment.environment.name (e.g. production, staging).
  - Other attributes injected as environment variables via the Kubernetes Downward API (i.e. valueFrom.fieldRef.fieldPath) standardized across application deployment templates.
Platform teams must provide ways for application owners to override and extend this default configuration. The mechanism to do so will depend on the methods established in Action 1 for providing OTel configuration. Possible options are documented in Appendix 3.
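One possible way to deliver part of this baseline is through standard OpenTelemetry environment variables in a shared deployment template. The fragment below is a sketch under assumptions: the service name, version, namespace, image, and Gateway endpoint are hypothetical, and the K8S_* helper variable names are arbitrary. Application owners could override any of these values in their own pod spec.

```yaml
# Fragment of a Deployment pod template (hypothetical names and values).
spec:
  template:
    spec:
      containers:
        - name: checkout
          image: registry.example.com/shop/checkout:1.4.2
          env:
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: http/protobuf
            # Assumed cluster-local Gateway; no backend endpoints or API keys here.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://otel-gateway.observability.svc.cluster.local:4318
            - name: OTEL_PROPAGATORS
              value: tracecontext,baggage
            - name: OTEL_SERVICE_NAME
              value: checkout
            # Downward API fields exposed as helper variables.
            - name: K8S_NAMESPACE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: K8S_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: K8S_POD_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
            # $(VAR) references are expanded by Kubernetes because the
            # helper variables are defined earlier in this list.
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: >-
                service.version=1.4.2,service.namespace=shop,deployment.environment.name=production,k8s.namespace.name=$(K8S_NAMESPACE_NAME),k8s.pod.name=$(K8S_POD_NAME),k8s.pod.uid=$(K8S_POD_UID)
```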
3. Use OpenTelemetry Operator or Helm Charts to deploy Collector Gateways
Guidelines implemented: 3
To deploy centralized Gateway tiers, platform teams should standardize on either the OpenTelemetry Operator or the official OpenTelemetry Helm Charts. Both support GitOps workflows, but they require specific architectural considerations for enterprise workloads:
- OpenTelemetry Operator: Ideal if the Operator is already used for application auto-instrumentation (Action 1). The Gateway can be deployed by creating an OpenTelemetryCollector CR and setting mode: deployment or mode: statefulset (depending on requirements). The Operator abstracts away much of the Kubernetes boilerplate. See the Operator documentation for further guidance on how to enable autoscaling.
- Official Helm Charts: A better option if infrastructure teams prefer granular control over native Kubernetes manifests (e.g., specific Ingress configurations, PodDisruptionBudgets, or complex affinity rules) without relying on CRDs.
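For the Operator option, a Gateway could be described with a CR along the following lines. This is a sketch, not a production-ready configuration: the names, namespace, backend endpoint, replica bounds, and thresholds are hypothetical, and the otlp_http exporter ID follows the naming used in this document (some Collector releases name this component otlphttp).

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway
  namespace: observability
spec:
  mode: deployment
  # Illustrative autoscaling bounds; tune them to the workload.
  autoscaler:
    minReplicas: 3
    maxReplicas: 10
    targetMemoryUtilization: 70
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 20
    exporters:
      otlp_http:
        # Hypothetical backend endpoint.
        endpoint: https://otlp.backend.example.com
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter]
          exporters: [otlp_http]
```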
Regardless of the deployment tool chosen, the Gateway tier is a critical point, and owners should ensure resiliency is configured from the start:
- Configure memory_limiter: When configured as the first processor in every Collector pipeline, this prevents Out-of-Memory (OOM) crashes during massive telemetry spikes by forcing the Collector to drop data and/or apply backpressure when memory usage hits a configured threshold. As mentioned in Guideline 3, different memory_limiter processors per signal may be required.
- Configure otlp or otlp_http exporter: Ensure queues and retries are aligned with expectations on reliability vs resource consumption, handling transient backend failures before dropping data. In particular, consider sending_queue options like batch, which allows for efficient network transfer and backpressure propagation, and block_on_overflow, which controls whether the Collector drops data or waits until space becomes available when the queue (persistent or in-memory) is full.
- Consider file_storage extension: If losing data during extended observability backend outages would be critical to the functioning of the business, consider configuring sending_queue.storage in your OTLP exporter with the file_storage extension. With this extension configured, if the backend is unavailable or rate-limits exports, the Collector will buffer data to disk and automatically retry, preventing data loss. See Appendix 4 for notes on deploying the file_storage extension.
- gRPC load balancing: OTLP/gRPC can be very efficient, but standard Kubernetes service routing can make it inefficient. See Appendix 2 to implement gRPC load balancing, or consider OTLP/HTTP (the default for most SDKs).
- Scale on memory and internal telemetry: Utilize the Kubernetes Horizontal Pod Autoscaler (HPA) combined with custom metrics (see Action 5). Configure the cluster to scale Gateway replicas based on memory utilization, active connections, or pipeline queue depth.
- Configuration as code: Store the Helm values or Operator CRs in a central Git repository and use tools like ArgoCD or Flux to deploy them. This provides an audit trail and allows for phased rollouts and instant rollbacks.
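The resiliency settings above could be combined roughly as follows. This is a sketch under assumptions: thresholds, intervals, queue sizes, and the backend endpoint are illustrative, and sending_queue.batch is left out because its fields depend on the Collector version in use.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # One limiter instance per signal, always first in its pipeline.
  memory_limiter/traces:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15
  memory_limiter/metrics:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 15

exporters:
  otlp_http:
    # Hypothetical backend endpoint.
    endpoint: https://otlp.backend.example.com
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
    sending_queue:
      enabled: true
      queue_size: 5000
      # Wait for space instead of dropping data when the queue is full.
      block_on_overflow: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter/traces]
      exporters: [otlp_http]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter/metrics]
      exporters: [otlp_http]
```

Setting block_on_overflow to true favors completeness over latency: senders see backpressure sooner, so SDK-side buffers need to be sized with that in mind.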
4. Configure Collector processors for efficient telemetry volumes
Guidelines implemented: 4
To enrich telemetry with infrastructure context, reduce transfer and ingestion costs from low-value telemetry data, and enforce compliance before data leaves the corporate network, the platform team should consider configuring pipelines in Collector Gateways to execute the following processing steps (in order):
- k8s_attributes processor: While some Kubernetes resource details (like pod ID or namespace name) can be appended at the application level (see Action 2), the platform team must ensure 100% compliance for unmanaged workloads. This includes fields not available via the Downward API. Configure this processor to extract and append attributes like k8s.deployment.name, k8s.statefulset.name, etc. based on the incoming connection’s pod IP. See Appendix 5 for details to consider when using the k8s_attributes processor.
- Processors to filter and transform data: As a fallback for applications that could not apply these settings at the SDK level before sending telemetry to the Collector, use processors like attributes, filter, redaction, resource, or transform to define rules to:
  - Drop single-span traces and access logs for routine endpoints (/health, /metrics, /ready) or non-actionable debug logs (level=DEBUG or level=TRACE).
  - Remove specific noisy attributes (e.g. process.command_line) that may be less useful in Kubernetes environments, where this information is already available in CI/CD pipelines.
  - Apply any other processing to remove low-value, noisy telemetry.
- tail_sampling processor: Define strict retention policies, for instance keeping 100% of traces containing errors or exceeding a latency threshold, and a small baseline (e.g., 5%) of successful, normal-duration requests. As documented in Guideline 4, this requires two layers of collectors, using the load_balancing exporter on the first layer to route traces to the second layer based on Trace ID. See more information about the load balancing exporter in our documentation.
This is not an exhaustive list, and OpenTelemetry Collectors have many processors and connectors that allow organizations to extract more value out of their telemetry data.
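The filtering and sampling steps above might translate into a Collector configuration fragment like the one below. It is a sketch under assumptions: the endpoint pattern, latency threshold, and decision wait are illustrative, the url.path attribute assumes current HTTP semantic conventions, and the otlp receiver and otlp_http exporter are assumed to be defined elsewhere in the same configuration. Note that the filter processor drops matching spans individually rather than evaluating whole traces.

```yaml
processors:
  filter/noise:
    error_mode: ignore
    traces:
      span:
        # Drop spans for routine endpoints.
        - 'IsMatch(attributes["url.path"], "^/(health|metrics|ready)$")'
    logs:
      log_record:
        # Drop TRACE and DEBUG records, keeping records without a severity.
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'

  resource/drop_noisy:
    attributes:
      - key: process.command_line
        action: delete

  # Only valid on the second Collector layer, after trace-ID load balancing.
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 2000
      - name: baseline
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [filter/noise, resource/drop_noisy, tail_sampling]
      exporters: [otlp_http]
```

A trace is kept when any of the tail sampling policies matches, so the probabilistic policy acts as the baseline for traces that are neither slow nor failed.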
5. Monitor SDKs and Collectors to ensure reliability requirements
OpenTelemetry SDKs and Collectors export standard telemetry describing the internal state of their components in operation. Application owners and platform teams should ensure these are produced reliably, monitored, and actioned as required.
To identify and monitor data loss occurring before telemetry even leaves the application process (e.g., if the SDK’s internal queue fills up), we recommend:
- Where supported by the language ecosystem (e.g. Java via the opentelemetry-sdk instrumentation library, or Go via the sdk/metric package), enable SDK self-metrics to expose internal queue capacities, dropped spans, and exporter latency. OpenTelemetry SDK Semantic Conventions define the telemetry to be produced by SDKs, but support varies depending on language.
- Languages lacking native SDK metric support for internal telemetry may still support internal diagnostics in different ways (e.g. .NET’s EventSource, Java’s java.util.logging, or Node.js’s diag). Users should refer to specific implementations to configure for their particular needs and verbosity.
To monitor the health of the aggregation and processing tiers, the platform team must actively capture and alert on internal Collector telemetry. We recommend following these steps:
- Export internal telemetry via OTLP: Configure the service.telemetry block to emit internal metrics via OTLP to the observability backend, following company-wide standards. In addition to monitoring, these metrics should also be used as the data source for autoscaling decisions (see Action 3). Please note, this OTLP configuration is separate from the OTLP exporter configured on Collector pipelines.
- Monitor and troubleshoot: Follow advice present in the monitor and troubleshoot sections of Collector documentation, and create the necessary high-priority alerts to detect resource exhaustion and failed receiving/exporting before data is dropped by individual replicas.
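The first step could look like the fragment below, which would be merged into the existing service section of the Collector configuration. The endpoint is hypothetical and the metrics level is illustrative.

```yaml
service:
  telemetry:
    metrics:
      # Illustrative verbosity; lower levels emit fewer internal metrics.
      level: detailed
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                # Hypothetical endpoint for the Collector's own telemetry.
                endpoint: https://otlp.backend.example.com
```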
Reference Implementations
- Adobe: An OpenTelemetry pipeline designed for simplicity at scale
- Mastodon: Running OpenTelemetry Collectors in production with a small team
- Skyscanner: Managing OpenTelemetry Collectors across 24 production clusters
Appendix
1. Distributed trace sampling techniques
At a very high level, sampling can be mainly configured at two distinct layers:
- SDK: Head sampling configured at the SDK level provides an efficient use of compute resources as unsampled traces are never recorded or exported by a given application. However, sampling decisions need to be made at span creation, normally resulting in probabilistic sampling, which could miss critical traces (e.g. those containing errors).
- Collector: Collectors enable two main sampling techniques:
  - Probabilistic sampling: Can be configured at any Collector layer and does not require coordination between Collectors as long as the same algorithm and seed are in use for the same trace.
  - Tail sampling: A single Collector replica must store all spans for a given trace in memory before making a decision. As single replica deployment is not recommended in production environments, this model normally requires one layer of Collectors to load balance spans according to trace ID and another to perform sampling.
Tail sampling requires more resources to operate and maintain. However, it provides a richer way of defining sampling policies that allow organizations to efficiently store only the traces that are critical for their services’ operations. For instance, traces with durations longer than a particular threshold, or those containing errors across any span in a given trace.
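The first Collector layer in a tail sampling setup could be sketched as follows. The component ID load_balancing follows the naming used in this document, the hostname is a hypothetical headless Service in front of the sampling layer, and TLS is disabled only to keep the example short.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  load_balancing:
    # All spans of a trace go to the same sampling replica.
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        # Hypothetical headless Service of the sampling layer.
        hostname: sampling-collector-headless.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [load_balancing]
```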
Distributed trace sampling is a complex topic in and of itself, particularly when designing a sampling architecture across all the different layers. These challenges are out of scope for this blueprint and may be targeted in a separate blueprint. See our guidance if you are interested in contributing.
2. gRPC load balancing
gRPC relies on HTTP/2, multiplexing many requests over a single, long-lived TCP
connection. Standard Kubernetes Services operate at Layer 4 (TCP) using
kube-proxy, so they balance connections, not individual requests. When an
SDK or a local Collector connects to a Gateway via a standard
Kubernetes Service, it establishes one TCP connection and holds it open
indefinitely. As a result, 100% of the telemetry from that agent will stream to
a single Gateway pod.
In high-throughput environments, this creates hot spots on specific Gateway replicas as newly scaled pods receive no traffic, undermining horizontal pod autoscaling and risking resource exhaustion on long-lived pods.
To distribute telemetry evenly, platform teams should consider one of the three following patterns:
Client-side load balancing
OTLP gRPC exporters can perform the load balancing on the client side, querying Kubernetes DNS to discover the IPs of all available Gateway pods and distributing requests across them in a round-robin fashion.
To achieve this, the Gateway tier should be deployed with a Headless Service so that DNS queries return a list of pod IPs rather than a single virtual IP.
- OpenTelemetry Operator: If you deploy an OpenTelemetryCollector CR in statefulset mode, the Operator automatically generates a headless service named {collector-name}-collector-headless.{namespace}.svc.cluster.local. If deployed as a deployment, you will have to manually create a headless Kubernetes Service with clusterIP: None.
- Helm Chart: Set service.clusterIP: None when deploying the Gateway.
The sending OTLP exporter should be configured to use the DNS resolver and the round-robin balancer. When configuring the OTLP exporter on the Collector (e.g. from local Collector to a Gateway):
- endpoint must start with dns:/// to instruct the gRPC client to perform continuous DNS resolution.
- balancer_name should be set to round_robin (default in the Collector since v0.105.0).
Specific client SDKs may configure gRPC clients in different ways. Refer to individual client implementations to configure client-side gRPC load balancing.
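On a local Collector, the two settings above could be applied as in this sketch. The Service name and namespace are hypothetical and follow the Operator naming pattern mentioned earlier, and TLS is disabled only to keep the example short.

```yaml
exporters:
  otlp:
    # dns:/// enables continuous DNS resolution against the headless Service.
    endpoint: dns:///gateway-collector-headless.observability.svc.cluster.local:4317
    balancer_name: round_robin
    tls:
      insecure: true
```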
Layer 7 Proxy / Service Mesh
In this approach, an HTTP/2-aware Layer 7 proxy is placed between the OTLP gRPC exporter and the Gateway tier. Because the proxy operates at Layer 7, it understands the HTTP/2 frames. It accepts the single long-lived TCP connection, inspects the individual gRPC requests, and distributes them evenly across all backend Gateway pods.
Implementation Methods:
- Service Mesh (e.g., Istio, Linkerd): If the cluster already runs a service mesh, gRPC load balancing is handled automatically. The mesh sidecar (or equivalent) intercepts the egress traffic from the edge agent and balances it across the Gateway pods.
- Standalone Proxy (e.g., Envoy, NGINX): Deploy an Envoy or NGINX proxy (configured for grpc_pass) directly in front of the Gateway tier. The edge agents point to the Proxy’s Kubernetes Service, and the Proxy balances the traffic to the Gateways.
- Ingress Controllers: If SDKs or local Collectors are sending telemetry from outside the cluster (or across clusters), ensure the Ingress Controller (e.g., NGINX Ingress, Traefik, AWS ALB) is explicitly configured to support gRPC and HTTP/2 backend routing.
Server-side connection recycling
Lastly, the Gateway OTLP gRPC receiver can be configured to close long-lived
connections after a set duration using
keepalive.server_parameters.max_connection_age. When a connection reaches this
age, the server sends a GoAway frame, forcing the client to reconnect. On
reconnection, standard Kubernetes Service routing redistributes the client
across available Gateway pods.
This option requires no client-side changes; only the Gateway receiver configuration needs updating:
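```yaml
receivers:
  otlp:
    protocols:
      grpc:
        keepalive:
          server_parameters:
            # Close connections after this age so clients reconnect
            # and are redistributed by the Kubernetes Service.
            max_connection_age: 60s
            # Extra time for in-flight requests to complete.
            max_connection_age_grace: 10s
```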
This approach is less precise than per-request balancing (traffic is only redistributed on reconnection, not per-request), and newly scaled pods will not receive traffic until existing connections expire. However, it is the simplest option as it requires no headless services, DNS resolvers, or L7 proxies.
Recommendation
If the organization already runs a Service Mesh, the L7 proxy option requires zero configuration on the OpenTelemetry side. If there is no Service Mesh and traffic remains within the same cluster, client-side load balancing provides the most precise distribution. Server-side connection recycling is the simplest starting point when neither is available. Alternatively, including cases where operators don’t have control over the receiving backend (e.g. connections routed via public internet), consider using OTLP/HTTP (see Action 2) which operates over HTTP/1.1 or short-lived HTTP/2 connections and does not suffer from the same pinning behavior.
3. SDK config overrides
Depending on the methods established in Action 1 for providing OTel configuration, the platform team must document exactly how developers inherit the baseline and how they can extend it:
- OpenTelemetry Operator: The platform team provisions a central Instrumentation CR in the cluster. Application owners can opt in or out via pod/namespace annotations.
  - Basic overrides: Application owners can override specific baseline properties by injecting standard environment variables directly into their Pod spec. The compliance matrix details support for different environment variables per language. Additionally, some language implementations (e.g. Java) support configuring instrumentation libraries via library-specific environment variables.
  - Complex overrides: If teams need to modify the Instrumentation CR itself (e.g., to add custom samplers or specific auto-instrumentation libraries), the platform team should manage the CR via Helm or Kustomize. This allows the platform to maintain a base template while application owners provide local overrides or value files that are merged before deployment into the cluster.
- Base container images: Similarly to above, teams may override specific aspects via environment variables overriding defaults set in the base image.
- Internal libraries: Internal shared libraries should provide the necessary hooks for users to pass in standard configuration blocks as required. For instance, in JavaScript a wrapper library to set up a Node SDK should allow the user to provide standard NodeSDKConfiguration options like resource or traceExporter.
- Declarative configuration: Platform teams may utilize the environment variable interpolation features of the file-based configuration and allow application owners to set local environment variables that the base YAML file reads, or, as the file-based configuration standard matures, use configuration merging to blend a developer-provided custom-otel.yaml with the platform’s base-otel.yaml.
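For the complex overrides case under the Operator option, a Kustomize overlay is one way to merge team-specific changes into a platform-owned base. The sketch below assumes a hypothetical directory layout (../../base) containing an Instrumentation CR named platform-default, and overrides only the sampler.

```yaml
# kustomization.yaml in a team-owned overlay (hypothetical layout).
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Platform-owned base containing the default Instrumentation CR.
  - ../../base
patches:
  - target:
      kind: Instrumentation
      name: platform-default
    # Merge patch: only the listed fields change, the rest of the base is kept.
    patch: |-
      apiVersion: opentelemetry.io/v1alpha1
      kind: Instrumentation
      metadata:
        name: platform-default
      spec:
        sampler:
          type: parentbased_traceidratio
          argument: "0.25"
```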
4. Deployment notes on file_storage extension
While using the file_storage extension and OTLP exporter’s
sending_queue.storage provides added completeness guarantees, it moves away
from stateless deployments, requiring the Gateway to be deployed as a
StatefulSet with PersistentVolumeClaims. As with OTLP exporter settings,
operators must consider the trade-offs between the criticality of dropping data
(in this case potentially stale data) and the cost of maintenance toil and
support (e.g., managing disk pressure, volume resizing, etc).
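With the Operator, a stateful Gateway using a persistent queue might be declared as in the following sketch. The names, volume size, mount path, and backend endpoint are hypothetical, and depending on the container image, additional security context settings may be needed so the Collector can write to the mounted volume.

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: gateway
  namespace: observability
spec:
  mode: statefulset
  replicas: 3
  # One PersistentVolumeClaim per replica for the on-disk queue.
  volumeClaimTemplates:
    - metadata:
        name: queue
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  volumeMounts:
    - name: queue
      mountPath: /var/lib/otelcol/queue
  config:
    extensions:
      file_storage:
        directory: /var/lib/otelcol/queue
    receivers:
      otlp:
        protocols:
          http:
            endpoint: 0.0.0.0:4318
    exporters:
      otlp_http:
        # Hypothetical backend endpoint.
        endpoint: https://otlp.backend.example.com
        sending_queue:
          enabled: true
          # Persist the queue through the extension declared above.
          storage: file_storage
    service:
      extensions: [file_storage]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [otlp_http]
```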
Additionally, enabling persistent queues delays backpressure propagation to
downstream clients. Data is buffered to disk before memory pressure builds,
meaning the memory_limiter will not trigger early backpressure. Once the
persistent queue is full, the pipeline will block and the receiver will return
retryable errors (e.g. 429) to clients, signaling backpressure.
Operators must size the persistent queue and monitor disk usage accordingly.
5. Deployment notes on k8s_attributes processor
If a proxy is placed between the pod emitting telemetry and the Collector
processing it, ensure pass-through mode is enabled on the proxy so the Gateway
sees the original Pod IP of the application, not the proxy’s IP. Alternatively,
inject fields available via Downward API (e.g. k8s.pod.uid) as Resource
attributes when configuring OTel SDKs, and configure k8s_attributes pod
association rules to match the incoming Resource attributes with a given Pod.
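The resource attribute alternative could be configured roughly as follows. This is a sketch: the list of extracted metadata is illustrative, and it assumes SDKs set k8s.pod.uid as a resource attribute, falling back to the connection IP when it is missing.

```yaml
processors:
  k8s_attributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.statefulset.name
    pod_association:
      # Prefer the pod UID sent by the SDK as a resource attribute.
      - sources:
          - from: resource_attribute
            name: k8s.pod.uid
      # Fall back to the IP of the incoming connection.
      - sources:
          - from: connection
```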
When using the k8s_attributes processor, the ServiceAccount used by the
Collector must be granted RBAC get, watch, and list permissions on the
Kubernetes resources corresponding to the attributes being extracted, e.g.
deployments for k8s.deployment.name and k8s.deployment.uid. Missing any of
these will result in the Collector silently skipping enrichment for the affected
attributes.
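A ClusterRole along these lines would cover the attributes mentioned in this blueprint. The role name and the ServiceAccount (gateway-collector in the observability namespace) are hypothetical, and the resource list should be trimmed or extended to match the attributes actually extracted.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-gateway-k8s-attributes
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["replicasets", "deployments", "statefulsets"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-gateway-k8s-attributes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-gateway-k8s-attributes
subjects:
  - kind: ServiceAccount
    # Hypothetical ServiceAccount used by the Gateway pods.
    name: gateway-collector
    namespace: observability
```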
Finally, running this processor in a Gateway will cause higher memory
utilization on each of the Collectors, scaling according to cluster size. The
k8s_attributes processor keeps metadata in memory related to Objects in the
cluster, and the more Objects there are to cache, the more memory the collector
will consume.
When run as a Gateway, each pod in the collector Deployment or StatefulSet has to remember ALL the metadata about the entire cluster (as opposed to running as a DaemonSet, where the pod only has to know about the metadata for its own node). The component documentation contains more details on deployment and scaling considerations.