Recording errors

Status: Development.

This document provides recommendations to semantic convention and instrumentation authors on how to record errors on spans and metrics.

Individual semantic conventions are encouraged to provide additional guidance.

What constitutes an error

An operation SHOULD be considered failed if any of the following is true:

  • an exception is thrown by the instrumented operation (API, block of code, or another instrumented unit)

  • the instrumented operation returns an error in another way, for example, via an error code

    Semantic conventions that define domain-specific status codes SHOULD specify which status codes should be reported as errors by a general-purpose instrumentation.

Errors that were retried or handled (allowing an operation to complete gracefully) SHOULD NOT be recorded on spans or metrics that describe this operation.
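As a minimal sketch of the guidance on retried errors, assume a hypothetical RetryingCall helper (the class, its Outcome type, and the maxAttempts parameter are illustrative and not part of any OpenTelemetry API). Failed attempts that are followed by a successful retry produce no error.type; only a failure of the operation as a whole does.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of any OpenTelemetry API.
final class RetryingCall {
  /** Final outcome of the whole operation; errorType is null when it succeeded. */
  static final class Outcome {
    final int attempts;
    final String errorType;

    Outcome(int attempts, String errorType) {
      this.attempts = attempts;
      this.errorType = errorType;
    }
  }

  static Outcome run(Callable<Void> operation, int maxAttempts) {
    for (int attempt = 1; ; attempt++) {
      try {
        operation.call();
        // Earlier failures were handled by retrying, so no error is reported.
        return new Outcome(attempt, null);
      } catch (Exception e) {
        if (attempt >= maxAttempts) {
          // Retries are exhausted: the operation itself failed, so report it.
          return new Outcome(attempt, e.getClass().getName());
        }
        // Otherwise the failure is handled here and the loop tries again.
      }
    }
  }
}
```

An instrumentation built this way would set the span status and error.type from the returned Outcome once, rather than from each failed attempt.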

Recording errors on spans

Span Status Code MUST be left unset if the instrumented operation has ended without any errors.

When the operation ends with an error, instrumentation:

  • SHOULD set the span status code to Error

  • SHOULD set the error.type attribute

  • SHOULD set the span status description when it has additional information about the error that is not expected to contain sensitive details and that aligns with the Span Status Description definition.

    It’s NOT RECOMMENDED to duplicate the status code or error.type in the span status description.

    When the operation fails with an exception, the span status description SHOULD be set to the exception message.

Refer to the Recording exceptions section for guidance on capturing exception details.
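The rules in this section can be sketched as a small decision function. SpanErrorFields and its Code enum are hypothetical stand-ins used only for illustration; a real instrumentation would call its tracing API instead. The sketch also assumes that the exception message carries no sensitive details, which each instrumentation has to judge for itself.

```java
// Hypothetical value object for illustration; not an OpenTelemetry type.
final class SpanErrorFields {
  enum Code { UNSET, ERROR }

  final Code code;
  final String errorType;   // null when the operation succeeded
  final String description; // null when there is nothing useful to add

  private SpanErrorFields(Code code, String errorType, String description) {
    this.code = code;
    this.errorType = errorType;
    this.description = description;
  }

  /** Pass null when the operation ended without an error. */
  static SpanErrorFields from(Throwable failure) {
    if (failure == null) {
      // Success: the status code stays unset and error.type is omitted.
      return new SpanErrorFields(Code.UNSET, null, null);
    }
    String errorType = failure.getClass().getName();
    String message = failure.getMessage();
    // Skip a description that is missing or would only repeat error.type.
    String description =
        (message == null || message.equals(errorType)) ? null : message;
    return new SpanErrorFields(Code.ERROR, errorType, description);
  }
}
```

For instance, SpanErrorFields.from(new java.io.IOException("connection reset")) yields the ERROR code, the error type java.io.IOException, and the description "connection reset".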

Recording errors on metrics

Semantic conventions for operations usually define an operation duration histogram metric. This metric SHOULD include the error.type attribute. This enables users to derive throughput and error rates.

Measurements for operations that complete successfully SHOULD NOT include the error.type attribute, which allows users to filter out errors.

Semantic conventions SHOULD include error.type on other metrics where applicable. For example, the messaging.client.sent.messages metric measures message throughput (one messaging operation may involve sending multiple messages) and includes error.type.

It’s RECOMMENDED to report one metric that includes successes and failures as opposed to reporting two (or more) metrics depending on the operation status.

Instrumentation SHOULD ensure error.type is applied consistently across spans and metrics when both are reported. A span and its corresponding metric for a single operation SHOULD have the same error.type value if the operation failed and SHOULD NOT include it if the operation succeeded.
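To illustrate why a single metric with an optional error.type is enough, the following sketch uses a hypothetical in-memory stand-in for a duration histogram (OperationDurationMetric and its methods are assumptions made for this example, not a metrics API). Throughput is the count of all measurements, and the error rate is the share of measurements that carry error.type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a duration histogram; illustration only.
final class OperationDurationMetric {
  private static final class Point {
    final double seconds;
    final String errorType; // null for successful operations

    Point(double seconds, String errorType) {
      this.seconds = seconds;
      this.errorType = errorType;
    }
  }

  private final List<Point> points = new ArrayList<>();

  /** Records one operation; pass a null errorType when the operation succeeded. */
  void record(double seconds, String errorType) {
    points.add(new Point(seconds, errorType));
  }

  /** Throughput: every measurement, successful or not. */
  long total() {
    return points.size();
  }

  /** Failures: measurements that carry an error.type value. */
  long failures() {
    return points.stream().filter(p -> p.errorType != null).count();
  }

  double errorRate() {
    return points.isEmpty() ? 0.0 : (double) failures() / total();
  }
}
```

Splitting successes and failures into two separate metrics would force users to join them to obtain the same ratio.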

Recording exceptions

When the instrumented operation failed due to an exception:

It’s NOT RECOMMENDED to record the same exception more than once. It’s NOT RECOMMENDED to record exceptions that are handled by the instrumented library.

For example, in the following code snippet, ResourceAlreadyExistsException is handled and the corresponding native instrumentation should not record it. Exceptions that are propagated to the caller should be recorded (or logged) once.

public boolean createIfNotExists(String resourceId) throws IOException {
  Span span = startSpan();
  long startTime = System.nanoTime();
  try {
    create(resourceId);

    recordMetric("acme.resource.create.duration", System.nanoTime() - startTime);

    return true;
  } catch (ResourceAlreadyExistsException e) {
    // we do not set span status to error and the "error.type" attribute
    // as the exception is not an error,
    // but we still log and set attributes that capture additional details
    logger.withEventName("acme.resource.create.error")
      .withAttribute("acme.resource.create.status", "already_exists")
      .withException(e)
      .debug();

    span.setAttribute(AttributeKey.stringKey("acme.resource.create.status"), "already_exists");

    recordMetric("acme.resource.create.duration", System.nanoTime() - startTime);

    return false;
  } catch (IOException e) {
    // this exception is expected to be handled by the caller
    // and could be a transient error
    logger.withEventName("acme.resource.create.error")
      .withException(e)
      .warn();

    String errorType = e.getClass().getCanonicalName();

    span.setAttribute(AttributeKey.stringKey("error.type"), errorType);
    span.setStatus(StatusCode.ERROR, e.getMessage());

    recordMetric("acme.resource.create.duration", System.nanoTime() - startTime,
                 AttributeKey.stringKey("error.type"), errorType);
    throw e;
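  } finally {
    // end the span on every path: success, handled exception, and rethrown exception
    span.end();
  }
}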