Infrastructure and Processes in Non-K8s Environments
You are viewing the English version of this page because it has not yet been fully translated. Interested in helping out? See Contributing.
Summary
This blueprint outlines a strategic reference for Platform Engineering and SRE teams operating in traditional virtual machine (VM), bare metal, and on-premises environments, including scenarios where containers are run directly on an operating system without an orchestrator like Kubernetes.
It addresses the friction often found when attempting to establish consistent observability across heterogeneous infrastructure, legacy processes, and containerized workloads.
By implementing the patterns in this blueprint, organizations can expect to achieve the following outcomes:
- Out-of-the-box, high-quality telemetry for applications and services running in non-Kubernetes environments, including directly managed containers.
- Consistent lifecycle management for OpenTelemetry agents, together with standardized bootstrap and configuration patterns for SDK-based instrumentation.
- Unified observability across mixed infrastructure: VMs, bare metal, and containers without an orchestrator.
- Improved governance over telemetry signal quality, data enrichment, routing, and export pipelines.
- Reduced manual toil and cognitive load for developers and operators.
Background
Many organizations maintain a blend of legacy infrastructure, VMs, bare metal servers, and direct-to-runtime container deployments, in addition to or instead of Kubernetes. These environments can be complex and often lack the automation and standardization provided by orchestrators. Ensuring consistent, high-quality observability in these scenarios is critical, yet frequently hampered by fragmented tooling and manual processes.
Open Agent Management Protocol (OpAMP) provides a standardized way to remotely manage, configure, and monitor OpenTelemetry agents across diverse infrastructure where supported. At the time of writing, the OpAMP specification is in Beta, so organizations should evaluate implementation maturity and operational support before standardizing on a specific solution. Where OpAMP is not yet suitable, shared libraries, pre-baked images, centrally maintained configuration artifacts, and existing deployment or configuration management tooling can still provide consistent lifecycle management for SDK- and agent-based instrumentation.
Common challenges
Organizations operating in non-Kubernetes environments typically face a set of challenges that hinder effective observability. Without built-in automation, standardization, and centralized management, these environments often find it difficult to ensure consistent, high-quality telemetry across a diverse landscape of infrastructure and applications. Many of these observability challenges are not unique to non-Kubernetes environments. In non-Kubernetes environments, however, teams often have fewer built-in platform mechanisms for rollout management, service discovery, metadata enrichment, and centralized policy distribution.
1. Limited automation for telemetry deployment and management
Instrumentation and agent deployment on VMs, bare metal, and directly managed containers is often manual or script-based, and ongoing configuration is difficult to manage at scale. This decentralized, ad hoc approach typically requires operators or developers to install, configure, and update OpenTelemetry agents individually on each host or workload.
This leads to:
- High toil: New workloads or hosts require repeated, error-prone configuration steps.
- Slow rollout and update cycles: Updates to instrumentation or configuration are slow and difficult to propagate fleet-wide.
- Operational risk: Rollbacks, version control, and health monitoring are harder to perform consistently across the estate.
This automation gap also sets the stage for the fragmentation described in Challenge 2: when each host or workload is configured manually, teams tend to make independent choices about agents, SDKs, and exporters that are then difficult to reconcile.
2. Fragmented instrumentation approaches
Building on the lack of automation described in Challenge 1, the absence of standardized deployment and management patterns leads teams to adopt different OpenTelemetry agents, SDKs, or exporters for host-based and containerized workloads.
This leads to:
- Inconsistent semantic conventions: Telemetry signals may lack standard
resource attributes such as
service.name,host.id,host.name,container.id, anddeployment.environment, making cross-system correlation difficult. - Divergent instrumentation behavior: Different teams may apply different defaults for sampling, propagation, resource detection, or export, producing uneven telemetry quality.
- Manual configuration drift: Host- and container-based agents frequently require manual configuration, resulting in drift and an increased risk of errors.
3. Siloed data processing and export
The automation gap (Challenge 1) and the resulting fragmentation (Challenge 2) compound at the data pipeline layer: data collection and export pipelines are often set up per application, per host, or per team. In the absence of centralized management, individual teams may independently configure telemetry agents, exporters, and data processing logic for each workload or environment.
This leads to:
- Duplicated effort: Teams may duplicate data enrichment, filtering, and routing logic across environments.
- Inconsistent policy enforcement: Redaction, retry behavior, batching, and routing policies may vary between teams.
- Lack of visibility: Operations and governance teams lack unified control over what telemetry is collected and how it is processed or exported.
General guidelines
To address the challenges described above, organizations should adopt a set of strategic guidelines designed to streamline observability practices across diverse environments. These guidelines provide a foundation for standardizing instrumentation, automating agent management, and ensuring consistent data quality.
1. Centrally manage agent lifecycle while allowing controlled customization
Challenges addressed: 1, 2
Use OpAMP, where supported and operationally suitable, to centrally manage OpenTelemetry agents running as system services or service containers.
Because the OpAMP specification is currently in Beta, organizations should evaluate the maturity and supportability of the implementations available in their environment before standardizing on a specific solution. Where OpAMP is not supported or not yet suitable, organizations should use other centralized management mechanisms, such as configuration management tools, golden images, or standardized deployment artifacts, to maintain consistent agent deployment, configuration, and lifecycle management.
Regardless of the management mechanism, platform teams should own the baseline agent distribution, required processors and exporters, security settings, health reporting, and default resource detection behavior.
At the same time, organizations should explicitly define how environment-specific or workload-specific customization is allowed. A practical model is to use a layered configuration approach:
- A platform-owned baseline, typically owned by platform engineering, for mandatory defaults, security controls, and organization-wide processors/exporters.
- An environment overlay, typically owned by infrastructure or environment operators, for differences such as endpoints, tenancy, deployment environment, site-specific metadata, or network-specific settings.
- A workload overlay, typically owned by application teams within platform-defined guardrails, for approved variations such as opt-in receivers, additional resource attributes, or safe tuning parameters.
This creates a clear boundary between standardization and flexibility: teams can extend approved parts of the configuration without creating one-off, unmanaged deployments.
By implementing this guideline, organizations can expect to achieve:
- Automated, consistent telemetry configuration across all environments.
- Reduced manual errors and simplified onboarding for new workloads.
- Faster, safer upgrades and rollbacks of agent configurations.
- A controlled mechanism for local customization without sacrificing central governance.
2. Centralize telemetry collection and processing through an OpenTelemetry Collector gateway layer
Challenges addressed: 2, 3
Deploy one or more OpenTelemetry Collector gateways as aggregation points for telemetry data from hosts and directly managed containers. In this context, “centralized” does not necessarily mean a single global deployment. Depending on organizational structure, network boundaries, isolation requirements, and traffic patterns, the gateway layer may be implemented at different levels, such as per region, per site, per environment, or per cloud account, while still providing centralized policy enforcement within that scope.
In non-Kubernetes environments, these gateways can be deployed using several patterns, depending on scale and operational model, including:
- Dedicated gateway VMs or bare metal hosts.
- A service pool behind a load balancer.
- Containerized gateway services running on general-purpose compute.
- Regional or site-local gateways for distributed environments.
The diagram below shows the gateway tier with its alternative deployment patterns. Each box represents a separate way to implement the gateway role; an organization typically picks one pattern, or combines patterns across sites:
flowchart TD
S[Telemetry Sources<br/>Hosts and Containers]
B[(Observability<br/>Backends)]
subgraph G1["Pattern 1"]
P1[Dedicated VMs /<br/>Bare Metal Hosts]
end
subgraph G2["Pattern 2"]
P2[Service Pool Behind<br/>Load Balancer]
end
subgraph G3["Pattern 3"]
P3[Containerized Gateways<br/>on General Compute]
end
subgraph G4["Pattern 4"]
P4[Regional / Site-Local<br/>Gateways]
end
S --> G1 --> B
S --> G2 --> B
S --> G3 --> B
S --> G4 --> BBy implementing this guideline, organizations can expect to achieve:
- Unified control over data processing, enrichment, and export pipelines.
- Simplified governance and easier implementation of organization-wide policies.
- Fewer direct connections from hosts or applications to external observability backends, which can simplify firewall and network policy management.
- Better resilience and scalability than per-host or per-application export topologies.
- Clear separation between local collection and centralized policy enforcement.
3. Standardize resource attribution and distribute reusable instrumentation building blocks
Challenges addressed: 2
Define an organization-wide telemetry standard for resource attribution and ensure it is applied consistently across all workloads. This should not rely only on documentation; it should be delivered through reusable building blocks such as:
- Pre-baked agent images.
- Packaged language agents and auto-instrumentation artifacts where applicable.
- Standardized OpenTelemetry Collector binaries or distributions.
- Shared libraries or starter packages for SDK-based instrumentation.
- Standard startup wrappers or environment-variable conventions.
- Centrally maintained configuration snippets or templates.
The recommended resource model for non-Kubernetes environments should align with OpenTelemetry resource semantic conventions and rely on automatic resource detection wherever possible. In OpenTelemetry, a resource identifies the entity that produced the telemetry, such as a host, VM, process, container, or service instance. In practice, organizations should ensure that telemetry can be correlated across the following resource domains, using automatically detected attributes where supported:
- Host
- Device (where applicable)
- Process
- Process runtime
- Operating system
- Container (where applicable)
- Service identity
Rather than manually maintaining all of the corresponding attributes, organizations should prefer existing instrumentation and resource detection for host, process, runtime, OS, and container metadata, and use shared configuration or bootstrap artifacts to keep that detection enabled consistently.
The area that usually requires the most deliberate standardization is service
identity. Organizations should ensure that service telemetry uses the
appropriate
service semantic conventions,
with attributes such as service.name, and, where relevant, service.version,
service.namespace, service.instance.id, and deployment.environment.
Which service attributes should be present depends on how workloads are deployed
and identified in the environment. For example, service.namespace may be
useful to distinguish services across organizational or platform boundaries,
while service.instance.id may be needed to distinguish replicated instances of
the same service.
Application telemetry should include enough service and infrastructure context to support correlation with host- and infrastructure-level telemetry, using semantic conventions as the source of truth for which identifying attributes apply to each resource type.
By implementing this guideline, organizations can expect to achieve:
- Improved correlation and searchability of telemetry data across systems.
- Easier analysis and troubleshooting regardless of infrastructure type.
- Consistent metadata quality without requiring every team to reinvent instrumentation patterns.
- Faster adoption through reusable, supported building blocks.
Implementation
Translating these guidelines into practice requires a combination of automation, standardized tooling, and centralized management. The implementation steps below are written as roadmap items, with checklist-style actions that organizations can plan and execute in sequence.
1. Define a baseline telemetry standard and layered configuration model
Guidelines supported: 1, 3
Define the minimum required telemetry standard for the organization and document which parts of telemetry configuration are centrally owned versus locally customizable. This includes the supported configuration for host or service agents, language agents where applicable, SDK-based instrumentation, and OpenTelemetry Collectors used for local collection or forwarding. Where OpAMP is used, it should be aligned with this model so centrally managed agent configuration follows the same baseline and approved overlays. This is the foundation for consistency at scale.
Checklist:
- Define the mandatory resource attributes and signal conventions that all workloads must emit.
- Define the baseline configuration for supported telemetry components, including agents, language agents where applicable, SDKs, and OpenTelemetry Collectors.
- Standardize exporters, authentication, TLS, health reporting, and default processors according to the role of each component.
- Define the allowed extension points for environment-specific and workload-specific customization.
- Version all baseline and overlay configurations so they can be rolled out and rolled back safely.
- Publish ownership boundaries so teams know what they can and cannot modify.
Documentation:
2. Stand up an OpAMP management plane for agents
Guidelines supported: 1
Provide a central OpAMP management capability to manage agent configuration, status reporting, health monitoring, and controlled rollouts for supported agents. This blueprint recommends the management pattern, not any specific example server implementation; organizations should use a production-ready solution appropriate for their environment.
Checklist:
- Decide which agents or OpenTelemetry Collector deployments will be managed through OpAMP, based on the capabilities of the distributions used in your environment (for example, whether they are upstream or vendor-specific, whether they embed an OpAMP client, whether they use an OpAMP Supervisor for the OpenTelemetry Collector, and whether OpAMP is compiled in or packaged separately), and the components you want to manage centrally.
- Stand up or adopt a production-ready OpAMP server or management endpoint, following the Collector management documentation, with appropriate authentication, authorization, and transport security.
- Configure agents or supervisors to enroll with the management service and report their identity, capabilities, health, and effective configuration status, as defined by the OpAMP specification.
- Organize agents into logical groups such as development, staging, production, region, or environment so configuration changes can be rolled out in stages.
- Define how configuration updates are promoted between rollout groups and how failed changes are detected and reverted.
- Monitor management-plane health, agent connectivity, update status, and configuration drift to ensure the fleet remains under control.
OpAMP can be integrated with the OpenTelemetry Collector in two ways, with different scopes of capability:
Using an OpAMP Supervisor — a separate host-local process that manages the Collector lifecycle and applies remote configuration in addition to reporting status, health, and capabilities. This is the most capable integration and is suitable when OpAMP is the primary mechanism for configuration and lifecycle management.
Using the Collector’s built-in OpAMP extension — a lightweight integration that reports status, health, and capabilities to the management service, but does not handle remote configuration delivery or lifecycle management. This is suitable when configuration and lifecycle are already handled by other tooling (for example, configuration management, golden images, or packaged distributions) and OpAMP is used primarily for centralized observability of the agent fleet.
In a Supervisor-based deployment, the management service communicates with the host-local Supervisor, which then manages the lifecycle and configuration of the local agent or Collector. The diagram below illustrates the Supervisor pattern; the extension-only pattern looks similar but with the Supervisor box removed and the management service communicating directly with the Collector for status reporting only.
flowchart TB
M[Central OpAMP management service]
subgraph H1[Host or VM]
S1[OpAMP Supervisor]
A1[OpenTelemetry agent or Collector]
S1 --> A1
end
subgraph H2[Host or VM]
S2[OpAMP Supervisor]
A2[OpenTelemetry agent or Collector]
S2 --> A2
end
subgraph H3[Host or VM]
S3[OpAMP Supervisor]
A3[OpenTelemetry agent or Collector]
S3 --> A3
end
M -->|Configuration and lifecycle management| S1
M -->|Configuration and lifecycle management| S2
M -->|Configuration and lifecycle management| S3
S1 -->|Status, health, and capabilities| M
S2 -->|Status, health, and capabilities| M
S3 -->|Status, health, and capabilities| M
A1 --> O[Observability backend]
A2 --> O
A3 --> ODocumentation:
3. Package and deploy standardized agents and SDK bootstrap artifacts
Guidelines supported: 1, 3
Use configuration management and image packaging to deliver supported telemetry components consistently across hosts and containerized workloads. Prefer declarative configuration for SDK-based instrumentation where it is supported. Where it is not available, standardize environment variables, startup options, or SDK-specific configuration patterns so teams inherit consistent defaults with minimal manual setup.
Common packaging patterns include standard system-service packages for host-based agents, pre-baked container images for Collector or agent deployments, and shared SDK bootstrap artifacts such as starter packages or startup wrappers. For example:
- A host-based Collector can be installed as a standard system service with a centrally managed configuration file and environment file.
- A containerized deployment can use a pre-baked image that includes the approved Collector binary, extensions, and default configuration.
- SDK-based instrumentation can be distributed through shared startup wrappers, language agents, or starter packages that apply organization-approved defaults automatically.
Checklist:
- Package host-based agents as standard system services.
- Provide pre-baked images or service-container definitions for containerized deployments.
- Publish shared libraries, starter packages, or bootstrap wrappers for supported SDK languages.
- Prefer declarative SDK configuration where supported, and otherwise standardize environment variable, startup, and configuration file conventions across environments.
- Validate that new workloads inherit the baseline configuration by default.
Documentation:
- OpenTelemetry Collector configuration
- OpenTelemetry language SDKs and APIs
- OpenTelemetry zero-code instrumentation
- Java zero-code instrumentation
- Java agent configuration
4. Deploy an OpenTelemetry Collector gateway layer
Guidelines supported: 2
Deploy one or more OpenTelemetry Collector gateways as the central processing and export tier. Choose a topology appropriate for the environment, such as dedicated VMs, service pools behind a load balancer, or regional gateway nodes.
Checklist:
- Select the gateway deployment topology for each environment.
- Define how local agents discover and connect to gateways.
- Configure processors for batching, memory protection, enrichment, retry, and routing.
- Separate lightweight ingest from heavier centralized processing where scale requires it.
- Define high-availability and failover behavior for gateways.
- Validate end-to-end routing to observability backends.
A common pattern in non-Kubernetes environments is to run host-level Collectors on individual hosts and forward telemetry to a site-local or region-local Collector gateway. That gateway layer should be horizontally scalable and highly available, using appropriate load-balancing and failover mechanisms for the environment, such as cloud native scaling patterns or equivalent HA setups in non-cloud infrastructure.
flowchart TB
subgraph S1[Site or Business Unit A]
A1[VM or Bare Metal Host]
A2[VM or Bare Metal Host]
A3[Host running directly managed containers]
AC1[Host-level Collector]
AC2[Host-level Collector]
AC3[Host-level Collector]
A1 --> AC1
A2 --> AC2
A3 --> AC3
AC1 --> AG[Site-local Collector gateway]
AC2 --> AG
AC3 --> AG
AG --> P1[Site-local processing and policy enforcement]
end
subgraph S2[Site or Business Unit B]
B1[VM or Bare Metal Host]
B2[Host running directly managed containers]
BC1[Host-level Collector]
BC2[Host-level Collector]
B1 --> BC1
B2 --> BC2
BC1 --> BG[Site-local Collector gateway]
BC2 --> BG
BG --> P2[Site-local processing and policy enforcement]
end
P1 --> O[Observability backend]
P2 --> ODocumentation:
5. Enforce resource attribution and correlation standards
Guidelines supported: 1, 3
Ensure that all telemetry includes the required metadata for correlation across infrastructure and application layers.
Checklist:
- Define which host, process, runtime, operating system, and container resources should be detected and correlated, and enable automatic resource detection for them consistently through shared configuration, bootstrap artifacts, or centrally maintained templates.
- Ensure application telemetry includes sufficient infrastructure context for
correlation, such as
host.idorhost.namewhere appropriate, for example through SDK configuration, language agents, startup wrappers, or automatic resource detection where supported. - Enable and validate resource detectors wherever supported so host, OS, process, runtime, and container metadata is populated automatically and consistently, and supplement detector output with centrally defined attributes where needed.
- Validate emitted telemetry against the defined attribute standard by checking representative telemetry from hosts and applications before broad rollout.
- Add conformance checks to deployment pipelines or post-deployment validation steps so new workloads can be verified against the expected attribute set.
This blueprint focuses on the attribute requirements and workload-level practices needed to emit them consistently. More detailed centralized enforcement and normalization patterns are out of scope for this section and may be covered separately.
Documentation:
6. Centralize governance, policy enforcement, and change management
Guidelines supported: 2, 3
Use the Collector gateway layer and centrally owned configuration to enforce organization-wide rules for processing, routing, and exporting telemetry.
Checklist:
- Define approved exporters and backend destinations.
- Centralize redaction, filtering, enrichment, and routing policies.
- Define standard retry, batching, and sampling policies.
- Establish an exception process for workloads that need non-default behavior.
- Review telemetry quality and policy compliance regularly.
Documentation:
Reference architectures
The patterns described in this blueprint have been successfully implemented by the following end-users:
Coming soon!
Have you implemented an architecture for this blueprint? Share your experience or a link to your article by opening an issue in the End User SIG GitHub repository.
フィードバック
このページは役に立ちましたか?
Thank you. Your feedback is appreciated!
Please let us know how we can improve this page. Your feedback is appreciated!