Introducing OTel Blueprints and Reference Implementations
It’s not uncommon for end users adopting OpenTelemetry to, at some point in their journey, ask themselves: “Why is this stuff so complex?”. Full adoption normally requires understanding the different ways of configuring SDKs, multiple Collector deployments, data pipelines, instrumentation libraries, semantic convention registries, APIs for manual instrumentation across many different programming languages, and many other moving pieces.
These moving pieces don’t operate in isolation either. They need to work well together as part of a consolidated solution to describe an organization’s software systems using standard, high-quality telemetry. Failing to do so risks ending up with the very problem that OpenTelemetry was designed to solve: disjointed telemetry with disparate semantic conventions in use across the stack, lack of context propagated between services and signals, unnecessarily high data volumes; in short, poor-quality telemetry, the opposite of what we need.
As the project evolved and stabilized, and as more end users adopted OpenTelemetry in large-scale production environments, we kept hearing the same feedback: end users want a prescriptive, opinionated way of “deploying OpenTelemetry” (what this means is up to interpretation), as recommended by the project and its maintainers. They want to follow a set of steps to configure the components they need to solve their observability challenges in the simplest way, and not more.
You spoke, and we listened. I’m pleased to announce a new initiative driven by the End User SIG in collaboration with the Developer Experience SIG: Blueprints and Reference Implementations.
The source of complexity and the need for blueprints
Let’s go back to that first question we asked: “Why is this stuff so complex?”. Using the terms described by Fred Brooks in his paper No Silver Bullet—Essence and Accident in Software Engineering, written back in 1986, the complexity of adopting OTel is twofold: essential and, more often than not, accidental.
Essential complexity
The essential part of OTel’s complexity, the one inherent in its design, mostly comes down to its breadth and cross-cutting nature. OpenTelemetry touches nearly all parts of the stack, from client-side (i.e. browser and mobile), to applications, Kubernetes, infrastructure, databases, etc. Our documentation is great at explaining how each of these individual components works, and new developments like Declarative Configuration or the Injector, and the long-existing OpenTelemetry Operator, have made it easier to apply a consistent configuration across all these components. However, the fact remains that this is still a very large deployment surface in which one needs to achieve consistency and which, in most cases, cannot even be handled by a single team.
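To make this concrete, Declarative Configuration lets teams express SDK settings in a single shared file rather than scattering them across code and environment variables. The sketch below follows the declarative configuration schema as it exists today; the gateway endpoint, hostname, and version string are illustrative assumptions, not project recommendations:

```yaml
# A minimal, hypothetical SDK declarative configuration file.
# Field names follow the OpenTelemetry declarative configuration
# schema; values below are illustrative assumptions.
file_format: "0.3"
propagator:
  composite: [tracecontext, baggage]
tracer_provider:
  processors:
    - batch:
        exporter:
          otlp:
            # Assumed address of a shared Collector Gateway
            endpoint: http://collector-gateway:4318
```

A file like this, distributed to every team, is one way to keep SDKs, propagators, and Collector endpoints aligned across an organization.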
OpenTelemetry is also designed to work with any backend, not limited to a single solution. The old model of dropping a pre-built agent into your stack and seeing data flow may be appealing, but it lacks the flexibility needed in modern systems that need to remain data sovereign. OpenTelemetry’s flexibility puts end users in control of their own data, regardless of how that data is generated and ultimately stored, but this flexibility paired with its breadth can add further complexity.
In summary, OTel can be essentially complex when applied at scale, and this is normally for good reasons.
Accidental complexity
The accidental part of OTel adoption complexity, as in most tooling adoption, comes mostly from humans. When multiple teams start to organically adopt OpenTelemetry across different parts of an organization, without a shared strategy and vision, and with no communication between groups, standards suffer. One team may configure their SDKs in a way that’s incompatible with the Collector Gateway deployed by another team, or they may propagate context differently from the dependencies they call, breaking context propagation for both.
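One concrete way this mismatch shows up is through the standard `OTEL_PROPAGATORS` environment variable. The sketch below assumes two hypothetical services configured by different teams:

```shell
# Service A (team one) keeps the OTel default: W3C Trace Context plus Baggage.
export OTEL_PROPAGATORS="tracecontext,baggage"

# Service B (team two) was set up independently with B3 only, so it
# ignores the traceparent header sent by Service A and traces break
# at the boundary between the two services.
export OTEL_PROPAGATORS="b3"
```

Neither value is wrong on its own; the breakage comes from the two teams never agreeing on one.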
Unfortunately, AI won’t save us here, and it might even make matters worse. We have all heard stories of systems where entropy and complexity have accidentally grown uncontrolled as AI-assisted development adds a new file here, a duplicated method there, or, in the case of OTel, a new way of configuring and deploying a component. The result is a system that’s neither effective nor efficient at describing itself with high-quality telemetry across all its different layers and dependencies.
The role of blueprints in taming complexity
The reality is that, as Fred Brooks stated, there’s no “silver bullet”. We cannot simply eliminate the essential complexity of modern observability tooling and just say “this is the one and only way to deploy OTel”, as every environment and organizational structure is different. However, we can certainly aim to make sense of the breadth of the project to help those navigating OTel adoption, and together keep that accidental complexity at bay!
This is where OTel Blueprints come in. The structure of these blueprints is based upon best practices in strategic thinking, and their content is guided by end user experience, including that contained in reference implementations shared by adopters at some point along their OTel journeys.
The primary focus is on identifying the most critical challenges to solve in a particular environment, and scoping our solutions to those alone, removing any unnecessary complexity.
With OTel Blueprints, we aim to categorize the most common observability challenges that organizations face across different scenarios, and propose a set of general design patterns and best practices that have been proven to solve them. For instance, there are many common challenges that end users aim to solve by providing a consolidated SDK config and Collector Gateways in Kubernetes environments, instrumenting infrastructure and applications in non-Kubernetes estates, or monitoring Kubernetes clusters along with well-known control plane workloads.
For end users (AI-assisted or not), blueprints will provide a set of common scenarios and environments with which they can identify, and immediate, actionable guidance on how to deploy best practices across multiple components, all working together as part of a consolidated strategy.
OpenTelemetry maintainers will also be able to use blueprints and reference implementations as a way to identify any possible pockets of friction in adoption which could be further simplified via enhanced tooling.
What to expect from blueprints
OTel Blueprints will not rewrite existing documentation. You will not see a blueprint that tells you how to configure an SDK, or how to deploy a Collector in its different deployment patterns. That’s already well covered within our docs.
The goal of blueprints is to provide a holistic approach that readers can use to inform their observability strategies, tying together different components, solutions, and best practices, pointing to relevant documentation as necessary.
We will soon be publishing blueprints under the new Blueprints section of our website. In the meantime, we can use our standard blueprint template to illustrate what you can expect from the blueprints that will follow.
In essence a blueprint will have the following building blocks:
- Summary: As an end user, you will be able to quickly see if you may be the target audience for this blueprint, or if it applies to your environment.
- Common Challenges: This scopes the problems to solve in a particular environment. If something is not identified as a problem to solve, the blueprint will not propose a solution for it (other blueprints may do so).
- General Guidelines: The best practices and design patterns that will solve the challenges in scope. You can expect architecture diagrams here, and a clear vision of how it all fits together.
- Implementation: The list of actions to implement the prescribed guidelines, pointing to relevant existing documentation.
We don’t expect a single blueprint to solve everyone’s needs. Instead, we want to provide well-scoped, actionable strategies that deliver tangible value to end users, acknowledging that blueprints will connect with each other.
As illustrated in the figure below, some blueprints may overlap with each other, containing the same design patterns, e.g. deploying a Collector Daemonset using the OpenTelemetry Operator. A given blueprint may also clearly call a specific problem to solve as out of scope, e.g. audit logging for a centralized observability platform, expecting another blueprint to extend it. More commonly, blueprints may be related to each other, e.g. a blueprint for Kubernetes observability may assume a central Collector Gateway as proposed in another blueprint.
```mermaid
flowchart TD
A[Blueprint A]
B[Blueprint B]
C[Blueprint C]
D[Blueprint D]
A -.->|Extends| C
B -.->|Relates to| D
A <-->|Overlaps| B
```

Lastly, you can also expect blueprints to evolve over time. As tooling evolves, the way to approach a specific problem may change, and blueprints will continue to reflect the simplest and most efficient way of doing it.
Grounding blueprints in reference implementations
Blueprints do not come out of the blue (seriously, no pun intended). They are contributed by experts in the field, end users and solutions/observability architects who have experienced OTel adoption first hand and can share design patterns that work at scale.
The nature of blueprints is to be useful to as large a group of individuals and organizations as possible. As such, there needs to be a certain degree of generalization, grouping shared experience into a single narrative. However, we think it’s crucial that blueprints are grounded in fact, and not simply theoretical advice. From the start, we wanted to have blueprints backed by evidence in the form of reference implementations.
Reference implementations are snapshots in time that show how real-world organizations have approached OpenTelemetry adoption. They will naturally implement some (or all) of the advice in one (or many) blueprints.
```mermaid
flowchart BT
%% Define the nodes
BA[Blueprint A]
BB[Blueprint B]
BC[Blueprint C]
RA[Reference Imp A]
RB[Reference Imp B]
RC[Reference Imp C]
%% Define the relationships
RA -->|Implements| BA
RB -->|Implements| BA
RB -->|Implements| BB
RC -->|Implements| BB
RC -->|Implements| BC
```

Adobe, Mastodon, and Skyscanner have already shared how they’ve approached OpenTelemetry adoption across their environments. This work has been diligently driven by the Developer Experience SIG, supporting those end users in sharing their stories, and has paved much of the way for OTel Blueprints to be successful. I would like to personally thank the DevEx SIG for this effort!
These reference implementations have now been published in the new Reference implementations section of our website. We have also put together a standard template to facilitate end users sharing their stories in the future. The more, the merrier!
Now more than ever, we want your input!
All this work would not have been possible without end users giving us feedback, sharing their adoption journeys, contributing their expertise to the project, and ultimately helping to shape the future of observability.
However, end users, we are once again calling for your support! Firstly, to give feedback on the three blueprints in progress, which are the current focus of the End User SIG: instrumentation for infrastructure and processes in non-Kubernetes environments, Kubernetes observability, and centralized telemetry platform.
Secondly, and most importantly, to share your experience! We would like to have many other reference implementations across different industries and environments, and proposals for new blueprints helping other end users adopt best practices in observability. If you want to continue helping us to scale adoption of best practices in OpenTelemetry, you can see how to contribute to this effort in our documentation. Here’s a quick summary of the contribution process:
```mermaid
flowchart LR
B1([Want to share blueprint])
B2([Want to get blueprint])
B3[Open sig-end-user issue]
B4[Collaborate to scope common challenges]
B5[Collaborate to craft blueprint]
B6[Review blueprint]
B7([Blueprint published])
B1 --> B3
B2 --> B3
B3 --> B4
B4 --> B5
B5 --> B6
B6 --> B7
```

```mermaid
flowchart LR
R1([Want to share reference implementation])
R2[Open sig-end-user issue]
R3[Collaborate to craft reference implementation]
R4[Review reference implementation]
R5([Reference implementation published])
R1 --> R2
R2 --> R3
R3 --> R4
R4 --> R5
```

This is your chance to make your end user journey a part of the OpenTelemetry journey!