Supplementary Guidelines

Note: this document is NOT a spec. It is provided to support the Metrics API and SDK specifications, and it does NOT add any extra requirements to the existing specifications.

Guidelines for instrumentation library authors

Instrument selection

The Instruments are part of the Metrics API. They allow Measurements to be recorded synchronously or asynchronously.

Choosing the correct instrument is important, because:

  • It helps the library to achieve better efficiency. For example, if we want to report room temperature to Prometheus, we want to consider using an Asynchronous Gauge rather than periodically polling the sensor, so that we only access the sensor when a scrape happens.
  • It makes the consumption easier for the user of the library. For example, if we want to report HTTP server request latency, we want to consider a Histogram, so most of the users can get a reasonable experience (e.g. default buckets, min/max) by simply enabling the metrics stream, rather than doing extra configuration.
  • It clarifies the semantics of the metrics stream, so the consumers have a better understanding of the results. For example, if we want to report the process heap size, by using an Asynchronous UpDownCounter rather than an Asynchronous Gauge, we’ve made it explicit that the consumer can add up the numbers across all processes to get the “total heap size”.

Here is one way of choosing the correct instrument:

  • I want to count something (by recording a delta value):
    • If the value is monotonically increasing (the delta value is always non-negative) - use a Counter.
    • If the value is NOT monotonically increasing (the delta value can be positive, negative or zero) - use an UpDownCounter.
  • I want to record or time something, and the statistics about this thing are likely to be meaningful - use a Histogram.
  • I want to measure something (by reporting an absolute value):
    • If it makes NO sense to add up the values across different sets of attributes, use an Asynchronous Gauge.
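    • If it makes sense to add up the values across different sets of attributes:
      • If the value is monotonically increasing - use an Asynchronous Counter.
      • If the value is NOT monotonically increasing - use an Asynchronous UpDownCounter.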
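
The decision list above can be sketched as a small helper. This is only an illustration: `choose_instrument` and its parameters are hypothetical names and are not part of the Metrics API. The last branch assumes the usual split between the asynchronous instruments (Asynchronous Counter for monotonic values, Asynchronous UpDownCounter otherwise).

```python
def choose_instrument(goal: str, monotonic: bool = False, additive: bool = False) -> str:
    """Return an instrument name for a goal, following the decision list.

    goal is "count" (record a delta value), "record" (record or time something
    whose statistics are meaningful) or "measure" (report an absolute value).
    """
    if goal == "count":
        return "Counter" if monotonic else "UpDownCounter"
    if goal == "record":
        return "Histogram"
    if goal == "measure":
        if not additive:
            # Adding values across attribute sets makes no sense.
            return "Asynchronous Gauge"
        return "Asynchronous Counter" if monotonic else "Asynchronous UpDownCounter"
    raise ValueError(f"unknown goal: {goal!r}")


# Room temperature: an absolute value that is not additive.
room_temperature = choose_instrument("measure")
# Process heap size: an absolute value that is additive but not monotonic.
heap_size = choose_instrument("measure", additive=True)
```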

Additive property

Monotonicity property

In the OpenTelemetry Metrics Data Model and API specifications, the word monotonic has been used frequently.

It is important to understand that different Instruments handle monotonicity differently.

Let’s take an example with a network driver using a Counter to record the total number of bytes received:

  • During the time range (T0, T1]:
    • no network packet has been received
  • During the time range (T1, T2]:
    • received a packet with 30 bytes - Counter.Add(30)
    • received a packet with 200 bytes - Counter.Add(200)
    • received a packet with 50 bytes - Counter.Add(50)
  • During the time range (T2, T3]:
    • received a packet with 100 bytes - Counter.Add(100)

You can see that the total increment during (T0, T1] is 0, the total increment during (T1, T2] is 280 (30 + 200 + 50), the total increment during (T2, T3] is 100, and the total increment during (T0, T3] is 380 (0 + 280 + 100). All the increments are non-negative; in other words, the sum is monotonically increasing.

Note that it is inaccurate to say “the total bytes received by T3 is 380”, because there might be network packets received by the driver before we started to observe it (e.g. before the last operating system reboot). The accurate way is to say “the total bytes received during (T0, T3] is 380”. In a nutshell, the count represents a rate which is associated with a time range.
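
The arithmetic in the network driver example can be checked with a few lines of plain Python. This is a model of the increments only and does not use any SDK; the variable names are made up for the illustration.

```python
# Values passed to Counter.Add() in each time range of the example.
adds_by_range = {
    ("T0", "T1"): [],
    ("T1", "T2"): [30, 200, 50],
    ("T2", "T3"): [100],
}

# Total increment per time range.
range_totals = [sum(adds) for adds in adds_by_range.values()]

# Running sum over (T0, T1], (T0, T2], (T0, T3].
running_sum = []
accumulated = 0
for total in range_totals:
    accumulated += total
    running_sum.append(accumulated)

# Every increment is non-negative, so the running sum never decreases.
is_monotonic = all(a <= b for a, b in zip(running_sum, running_sum[1:]))
```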

This monotonicity property is important because it gives the downstream systems additional hints so they can handle the data in a better way. Imagine we report the total number of bytes received in a cumulative sum data stream:

  • At Tn, we reported 3,896,473,820.
  • At Tn+1, we reported 4,294,967,293.
  • At Tn+2, we reported 1,800,372.

The backend system could tell that there was an integer overflow or a system restart during (Tn+1, Tn+2], so it has a chance to “fix” the data.
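
As a rough sketch of how a backend might use this hint, the function below turns cumulative values of a monotonic sum into per-interval increments and treats any drop as a reset. The function name and the policy of counting the post-reset value as the increment are assumptions for illustration; real backends may handle resets differently.

```python
def increments_with_reset_detection(points):
    """Convert cumulative values of a monotonic sum into increments.

    A value lower than the previous one cannot happen for a monotonic sum,
    so it is treated as a reset (overflow or restart). The value after the
    reset is used as the increment, which is a lower bound.
    """
    increments = []
    previous = None
    for value in points:
        if previous is not None:
            if value >= previous:
                increments.append(value - previous)
            else:
                increments.append(value)  # reset detected
        previous = value
    return increments


# Values reported at Tn, Tn+1 and Tn+2 in the example.
reported = [3_896_473_820, 4_294_967_293, 1_800_372]
increments = increments_with_reset_detection(reported)
```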

Let’s take another example with a process using an Asynchronous Counter to report the total page faults of the process:

The page faults are managed by the operating system, and the process could retrieve the number of page faults via some system APIs.

  • At T0:
    • the process started
    • the process didn’t ask the operating system to report the page faults
  • At T1:
    • the operating system reported 1000 page faults for the process
  • At T2:
    • the process didn’t ask the operating system to report the page faults
  • At T3:
    • the operating system reported 1050 page faults for the process
  • At T4:
    • the operating system reported 1200 page faults for the process

You can see that the number being reported is the absolute value rather than increments, and the value is monotonically increasing.

If we need to calculate “how many page faults occurred during (T3, T4]”, we need to subtract the earlier value from the later one: 1200 - 1050 = 150.

Semantic convention

Once you have decided which instrument(s) to use, you will need to decide the names for the instruments and attributes.

It is highly recommended that you align with the OpenTelemetry Semantic Conventions, rather than inventing your own semantics.

Guidelines for SDK authors

Aggregation temporality

Synchronous example

The OpenTelemetry Metrics Data Model and SDK are designed to support both Cumulative and Delta Temporality. It is important to understand that temporality will impact how the SDK could manage memory usage. Let’s take the following HTTP requests example:

  • During the time range (T0, T1]:
    • verb = GET, status = 200, duration = 50 (ms)
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 500, duration = 1 (ms)
  • During the time range (T1, T2]:
    • no HTTP request has been received
  • During the time range (T2, T3]:
    • verb = GET, status = 500, duration = 5 (ms)
    • verb = GET, status = 500, duration = 2 (ms)
  • During the time range (T3, T4]:
    • verb = GET, status = 200, duration = 100 (ms)
  • During the time range (T4, T5]:
    • verb = GET, status = 200, duration = 100 (ms)
    • verb = GET, status = 200, duration = 30 (ms)
    • verb = GET, status = 200, duration = 50 (ms)

Note that in the following examples, Delta aggregation temporality is discussed before Cumulative aggregation temporality because synchronous Counter and UpDownCounter measurements are input to the API with specified Delta aggregation temporality.

Synchronous example: Delta aggregation temporality

Let’s imagine we export the metrics as a Histogram, and to simplify the story we will only have one histogram bucket (-Inf, +Inf):

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T1, T2]
    • nothing since we don’t have any Measurement received
  • (T2, T3]
    • attributes: {verb = GET, status = 500}, count: 2, min: 2 (ms), max: 5 (ms)
  • (T3, T4]
    • attributes: {verb = GET, status = 200}, count: 1, min: 100 (ms), max: 100 (ms)
  • (T4, T5]
    • attributes: {verb = GET, status = 200}, count: 3, min: 30 (ms), max: 100 (ms)

You can see that the SDK only needs to track what has happened after the latest collection/export cycle. For example, when the SDK started to process measurements in (T1, T2], it can completely forget about what has happened during (T0, T1].
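
The following sketch reproduces the Delta output above in plain Python. It is a model of the idea rather than SDK code, and `export_delta` is a hypothetical name. The point to notice is that the aggregation state is created fresh for every collection window.

```python
# (window index, verb, status, duration in ms) from the HTTP example;
# window 1 is (T0, T1], window 2 is (T1, T2], and so on.
requests = [
    (1, "GET", 200, 50), (1, "GET", 200, 100), (1, "GET", 500, 1),
    (3, "GET", 500, 5), (3, "GET", 500, 2),
    (4, "GET", 200, 100),
    (5, "GET", 200, 100), (5, "GET", 200, 30), (5, "GET", 200, 50),
]


def export_delta(requests, num_windows):
    exports = []
    for window in range(1, num_windows + 1):
        state = {}  # fresh state: nothing is carried over between windows
        for w, verb, status, duration in requests:
            if w != window:
                continue
            key = (verb, status)
            if key not in state:
                state[key] = {"count": 0, "min": duration, "max": duration}
            point = state[key]
            point["count"] += 1
            point["min"] = min(point["min"], duration)
            point["max"] = max(point["max"], duration)
        exports.append(state)
    return exports


delta_exports = export_delta(requests, 5)
```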

Synchronous example: Cumulative aggregation temporality

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T2]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 1, min: 1 (ms), max: 1 (ms)
  • (T0, T3]
    • attributes: {verb = GET, status = 200}, count: 2, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T4]
    • attributes: {verb = GET, status = 200}, count: 3, min: 50 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)
  • (T0, T5]
    • attributes: {verb = GET, status = 200}, count: 6, min: 30 (ms), max: 100 (ms)
    • attributes: {verb = GET, status = 500}, count: 3, min: 1 (ms), max: 5 (ms)

You can see that we are performing Delta->Cumulative conversion, and the SDK has to track what has happened prior to the latest collection/export cycle. In the worst case, the SDK will have to remember what has happened since the beginning of the process.
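
The Cumulative output above can be modeled in plain Python as follows. This is a sketch with hypothetical names, not SDK code. Unlike a Delta export, the aggregation state here lives across all collection windows, and every known attribute set is exported each time.

```python
# (window index, verb, status, duration in ms) from the HTTP example.
requests = [
    (1, "GET", 200, 50), (1, "GET", 200, 100), (1, "GET", 500, 1),
    (3, "GET", 500, 5), (3, "GET", 500, 2),
    (4, "GET", 200, 100),
    (5, "GET", 200, 100), (5, "GET", 200, 30), (5, "GET", 200, 50),
]


def export_cumulative(requests, num_windows):
    state = {}  # kept for the whole run: it only ever grows
    exports = []
    for window in range(1, num_windows + 1):
        for w, verb, status, duration in requests:
            if w != window:
                continue
            key = (verb, status)
            point = state.setdefault(
                key, {"count": 0, "min": duration, "max": duration}
            )
            point["count"] += 1
            point["min"] = min(point["min"], duration)
            point["max"] = max(point["max"], duration)
        # Snapshot every stream, including those with no recent updates.
        exports.append({key: dict(point) for key, point in state.items()})
    return exports


cumulative_exports = export_cumulative(requests, 5)
```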

Imagine if we have a long running service and we collect metrics with 7 attributes and each attribute can have 30 different values. We might eventually end up having to remember the complete set of all 21,870,000,000 permutations! This cardinality explosion is a well-known challenge in the metrics space.
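
The figure above is simply the number of values per attribute raised to the number of attributes, which is quick to confirm:

```python
num_attributes = 7
values_per_attribute = 30

# Each attribute is chosen independently, so the counts multiply.
attribute_sets = values_per_attribute ** num_attributes
formatted = f"{attribute_sets:,}"
```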

To make matters worse, if we export every permutation even when there are no recent updates, the export batch could become huge and very costly. For example, do we really need/want to export the same thing for (T0, T2] in the above case?

So here are some suggestions that we encourage SDK implementers to consider:

  • You want to control the memory usage rather than allow it to grow indefinitely / unbounded - regardless of what aggregation temporality is being used.
  • You want to improve the memory efficiency by being able to forget about things that are no longer needed.
  • You probably don’t want to keep exporting the same thing over and over again if there are no updates. You might want to consider Resets and Gaps. For example, if a Cumulative metrics stream hasn’t received any updates for a long period of time, would it be okay to reset the start time?

Asynchronous example

In the above case, we have Measurements reported by a Histogram Instrument. What if we collect measurements from an Asynchronous Counter?

The following example shows the number of page faults of each thread since the thread started:

  • During the time range (T0, T1]:
    • pid = 1001, tid = 1, #PF = 50
    • pid = 1001, tid = 2, #PF = 30
  • During the time range (T1, T2]:
    • pid = 1001, tid = 1, #PF = 53
    • pid = 1001, tid = 2, #PF = 38
  • During the time range (T2, T3]:
    • pid = 1001, tid = 1, #PF = 56
    • pid = 1001, tid = 2, #PF = 42
  • During the time range (T3, T4]:
    • pid = 1001, tid = 1, #PF = 60
    • pid = 1001, tid = 2, #PF = 47
  • During the time range (T4, T5]:
    • thread 1 died, thread 3 started
    • pid = 1001, tid = 2, #PF = 53
    • pid = 1001, tid = 3, #PF = 5

Note that in the following examples, Cumulative aggregation temporality is discussed before Delta aggregation temporality because asynchronous Counter and UpDownCounter measurements are input to the API with specified Cumulative aggregation temporality.

Asynchronous example: Cumulative temporality

If we export the metrics using Cumulative Temporality:

  • (T0, T1]
    • attributes: {pid = 1001, tid = 1}, sum: 50
    • attributes: {pid = 1001, tid = 2}, sum: 30
  • (T0, T2]
    • attributes: {pid = 1001, tid = 1}, sum: 53
    • attributes: {pid = 1001, tid = 2}, sum: 38
  • (T0, T3]
    • attributes: {pid = 1001, tid = 1}, sum: 56
    • attributes: {pid = 1001, tid = 2}, sum: 42
  • (T0, T4]
    • attributes: {pid = 1001, tid = 1}, sum: 60
    • attributes: {pid = 1001, tid = 2}, sum: 47
  • (T0, T5]
    • attributes: {pid = 1001, tid = 2}, sum: 53
    • attributes: {pid = 1001, tid = 3}, sum: 5

The behavior in the first four periods is quite straightforward - we just take the data being reported from the asynchronous instruments and send them.

The data model prescribes several valid behaviors at T5 in this case, where one stream dies and another starts. The Resets and Gaps section describes how start timestamps and staleness markers can be used to increase the receiver’s understanding of these events.

Consider whether the SDK maintains individual timestamps for the individual stream, or just one per process. In this example, where a thread can die and start counting page faults from zero, the valid behaviors at T5 are:

  1. If all streams in the process share a start time, and the SDK is not required to remember all past streams: the thread restarts with zero sum. Receivers with reset detection are able to calculate a correct rate (except for frequent restarts relative to the collection interval); however, the precise time of a reset will be unknown.
  2. If the SDK maintains per-stream start times, it signals to the receiver precisely when a stream started, making the first observation in a stream more useful for diagnostics. Receivers can perform overlap detection or duplicate suppression and do not require reset detection, in this case.
  3. Independent of above treatments, the SDK can add a staleness marker to indicate the start of a gap in the stream when one thread dies by remembering which streams have previously reported but are not currently reporting. If per-stream start timestamps are used, staleness markers can be issued to precisely start a gap in the stream and permit forgetting streams that have stopped reporting.

It’s OK to ignore the options to use per-stream start timestamps and staleness markers. The first course of action above requires no additional memory or code to achieve and is correct in terms of the data model.

Asynchronous example: Delta temporality

If we export the metrics using Delta Temporality:

  • (T0, T1]
    • attributes: {pid = 1001, tid = 1}, delta: 50
    • attributes: {pid = 1001, tid = 2}, delta: 30
  • (T1, T2]
    • attributes: {pid = 1001, tid = 1}, delta: 3
    • attributes: {pid = 1001, tid = 2}, delta: 8
  • (T2, T3]
    • attributes: {pid = 1001, tid = 1}, delta: 3
    • attributes: {pid = 1001, tid = 2}, delta: 4
  • (T3, T4]
    • attributes: {pid = 1001, tid = 1}, delta: 4
    • attributes: {pid = 1001, tid = 2}, delta: 5
  • (T4, T5]
    • attributes: {pid = 1001, tid = 2}, delta: 6
    • attributes: {pid = 1001, tid = 3}, delta: 5

You can see that we are performing Cumulative->Delta conversion, and it requires us to remember the last value of every single permutation we’ve encountered so far, because if we don’t, we won’t be able to calculate the delta value using current value - last value. And as you can tell, this is super expensive.
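
A minimal sketch of this conversion, using the page fault observations from the example, is shown below. The function name is hypothetical and the code is a model rather than SDK code. Note how `last_value` still holds an entry for thread 1 after it has died, which is the memory cost described above.

```python
# Values observed by the asynchronous instrument at T1..T5,
# keyed by (pid, tid).
observations = [
    {(1001, 1): 50, (1001, 2): 30},
    {(1001, 1): 53, (1001, 2): 38},
    {(1001, 1): 56, (1001, 2): 42},
    {(1001, 1): 60, (1001, 2): 47},
    {(1001, 2): 53, (1001, 3): 5},
]


def cumulative_to_delta(observations):
    last_value = {}  # one entry per attribute set ever seen
    exports = []
    for observed in observations:
        deltas = {}
        for key, value in observed.items():
            # delta = current value - last value (0 if never seen before)
            deltas[key] = value - last_value.get(key, 0)
            last_value[key] = value
        exports.append(deltas)
    return exports, last_value


delta_exports, remembered = cumulative_to_delta(observations)
```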

Making it more interesting, if we have min/max value, it is mathematically impossible to reliably deduce the Delta temporality from Cumulative temporality. For example:

  • If the maximum value is 10 during (T0, T2] and the maximum value is 20 during (T0, T3], we know that the maximum value during (T2, T3] must be 20.
  • If the maximum value is 20 during (T0, T2] and the maximum value is also 20 during (T0, T3], we wouldn’t know what the maximum value is during (T2, T3], unless we know that there is no value (count = 0).
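
The second case can be made concrete with two invented measurement histories that differ only in (T2, T3]. Both produce identical cumulative maximums, so no conversion could tell them apart. The numbers and names below are made up for the illustration.

```python
def cumulative_max(windows):
    """Maximum over all measurements in a list of consecutive windows."""
    return max(value for window in windows for value in window)


# Each history is [measurements in (T0, T2], measurements in (T2, T3]].
history_a = [[20, 5], [7]]
history_b = [[20, 5], [20]]

# What a cumulative stream would report for (T0, T2] and (T0, T3].
cumulative_a = (cumulative_max(history_a[:1]), cumulative_max(history_a))
cumulative_b = (cumulative_max(history_b[:1]), cumulative_max(history_b))

# The true maximum during (T2, T3] in each history.
delta_max_a = max(history_a[1])
delta_max_b = max(history_b[1])
```

Here the cumulative maximums are the same for both histories while the true maximums during (T2, T3] differ, so the delta maximum cannot be recovered from the cumulative values alone.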

So here are some suggestions that we encourage SDK implementers to consider:

  • If you have to do Cumulative->Delta conversion, and you encounter min/max, rather than dropping the data on the floor, you might want to convert them to something useful - e.g. Gauge.

Asynchronous example: attribute removal in a view

Suppose the metrics in the asynchronous example above are exported through a view configured to remove the tid attribute, leaving a single-dimensional count of page faults by pid. For each metric stream, two measurements are produced covering the same interval of time, which the SDK is expected to aggregate before producing the output.

The data model specifies to use the “natural merge” function, in this case meaning to add the current point values together because they are Sum data points. The expected output is, still in Cumulative Temporality:

  • (T0, T1]
    • dimensions: {pid = 1001}, sum: 80
  • (T0, T2]
    • dimensions: {pid = 1001}, sum: 91
  • (T0, T3]
    • dimensions: {pid = 1001}, sum: 98
  • (T0, T4]
    • dimensions: {pid = 1001}, sum: 107
  • (T0, T5]
    • dimensions: {pid = 1001}, sum: 58
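
The merge above can be sketched in plain Python. The function `drop_tid` is a hypothetical helper that models a view removing the tid attribute; it is not an SDK API. Because the points are Sum data points, merging means adding the values that share the remaining attributes.

```python
# Values observed at T1..T5, keyed by (pid, tid).
observations = [
    {(1001, 1): 50, (1001, 2): 30},
    {(1001, 1): 53, (1001, 2): 38},
    {(1001, 1): 56, (1001, 2): 42},
    {(1001, 1): 60, (1001, 2): 47},
    {(1001, 2): 53, (1001, 3): 5},
]


def drop_tid(observed):
    """Merge points that differ only by tid by adding their values."""
    merged = {}
    for (pid, _tid), value in observed.items():
        merged[pid] = merged.get(pid, 0) + value
    return merged


view_output = [drop_tid(observed) for observed in observations]
```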

As discussed in the asynchronous cumulative temporality example above, there are various treatments available for detecting resets. Even if the first course is taken, which means doing nothing, a receiver that follows the data model’s rules for unknown start time and inserting true start times will calculate a correct rate in this case. The “58” received at T5 resets the stream - the change from “107” to “58” will register as a gap and rate calculations will resume correctly at T6. The rules for reset handling are provided so that the unknown portion of “58” that was already reflected in the “107” at T4 is not double-counted at T5 in the reset.

If the option to use per-stream start timestamps is taken above, it lightens the duties of the receiver, making it possible to monitor gaps precisely and detect overlapping streams. When per-stream state is available, the SDK has several approaches available for calculating Views in the presence of attributes that stop reporting and then reset some time later:

  1. By remembering the cumulative value for all streams across the lifetime of the process, the cumulative sum will be correct despite attributes that come and go. The SDK has to detect per-stream resets itself in this case, otherwise the View will be calculated incorrectly.
  2. When the cost of remembering all stream attributes becomes too high, reset the View and all its state, give it a new start timestamp, and let the caller see a gap in the stream.

When considering this matter, note also that the metrics API has a recommendation for each asynchronous instrument: “User code is recommended not to provide more than one Measurement with the same attributes in a single callback.” Consider whether user error in this regard will impact the correctness of the view. When maintaining per-stream state for the purpose of View correctness, SDK authors may want to consider detecting when the user makes duplicate measurements. Without checking for duplicate measurements, Views may be calculated incorrectly.
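
One possible way to detect duplicate measurements within a single callback is sketched below. The function name and the policy of keeping the first value and reporting the rest are assumptions made for this illustration; the specification does not prescribe a particular handling.

```python
def collect_callback(measurements):
    """Collect one callback's measurements, flagging duplicate attribute sets.

    measurements is a list of (attributes dict, value) pairs. The first value
    for each attribute set is kept; later ones are reported as duplicates.
    """
    seen = {}
    duplicates = []
    for attributes, value in measurements:
        key = frozenset(attributes.items())  # order-independent identity
        if key in seen:
            duplicates.append(attributes)
            continue
        seen[key] = value
    return seen, duplicates


collected, duplicates = collect_callback([
    ({"pid": 1001, "tid": 1}, 50),
    ({"pid": 1001, "tid": 2}, 30),
    ({"tid": 1, "pid": 1001}, 7),  # same attributes as the first measurement
])
```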

Memory management

Memory management is a wide topic. Here we will only cover some of the most important things for an OpenTelemetry SDK.

Choose a better design so the SDK has fewer things to remember, and avoid keeping things in memory unless it is truly necessary. One good example is the aggregation temporality.

Design a better memory layout, so the storage is efficient and accessing the storage can be fast. This is normally specific to the target programming language and platform. Examples include aligning the memory to the CPU cache line, keeping hot data close together, and keeping the memory close to the hardware (e.g. non-paged pool, NUMA).

Pre-allocate and pool the memory, so the SDK doesn’t have to allocate memory on-the-fly. This is especially useful to language runtimes that have garbage collectors, as it ensures the hot path in the code won’t trigger garbage collection.

Limit the memory usage, and handle critical memory conditions. The general expectation is that a telemetry SDK should not fail the application. This can be done via a cardinality-capping algorithm - e.g. start to combine/drop some data points when the SDK hits the memory limit, and provide a mechanism to report the data loss.
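
As one hedged illustration of cardinality capping, the sketch below folds measurements for new attribute sets into a single overflow stream once a limit is reached, and counts how many measurements were folded so the loss can be reported. The class name, the overflow key and the counter are hypothetical and chosen for this example only.

```python
class CappedSumStorage:
    """Sum storage that stops creating new streams after max_streams."""

    # Hypothetical key for the stream that absorbs overflowing measurements.
    OVERFLOW = (("overflow", True),)

    def __init__(self, max_streams):
        self.max_streams = max_streams
        self.sums = {}
        self.overflowed_measurements = 0  # lets the SDK report data loss

    def add(self, attributes, value):
        key = tuple(sorted(attributes.items()))
        if key not in self.sums and len(self.sums) >= self.max_streams:
            # Limit reached: combine into the overflow stream instead of
            # allocating a new one. Detail is lost but totals are preserved.
            key = self.OVERFLOW
            self.overflowed_measurements += 1
        self.sums[key] = self.sums.get(key, 0) + value


storage = CappedSumStorage(max_streams=2)
storage.add({"status": 200}, 1)
storage.add({"status": 500}, 1)
storage.add({"status": 404}, 1)  # third attribute set: goes to overflow
storage.add({"status": 403}, 2)  # also goes to overflow
storage.add({"status": 200}, 1)  # existing stream: still tracked exactly
```

With this design the storage holds at most max_streams regular streams plus the one overflow stream, regardless of how many distinct attribute sets are recorded.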

Provide configurations to the application owner. The answer to “what is an efficient memory usage” ultimately depends on the goal of the application owner. For example, the application owners might want to spend more memory in order to keep more permutations of metrics attributes, or they might want to use memory aggressively for certain attributes that are important, and keep a conservative limit for attributes that are less important.