Recording errors
Status: Development.
This document provides recommendations to semantic convention and instrumentation authors on how to record errors on spans and metrics.
Individual semantic conventions are encouraged to provide additional guidance.
What constitutes an error
An operation SHOULD be considered as failed if any of the following is true:
- an exception is thrown by the instrumented operation (API, block of code, or another instrumented unit)
- the instrumented operation returns an error in another way, for example, via an error code
Semantic conventions that define domain-specific status codes SHOULD specify which status codes should be reported as errors by a general-purpose instrumentation.
The classification of a status code as an error depends on the context. For example, an HTTP 404 “Not Found” status code indicates an error if the application expected the resource to be available. However, it is not an error when the application is simply checking whether the resource exists.
Instrumentations that have additional context about a specific request MAY use this context to set the span status more precisely.
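The context-dependent classification described above could be sketched as follows. This is only an illustration: `HttpErrorClassifier`, the `Role` enum, and the `notFoundExpected` flag are hypothetical names introduced for this example. The default rule used here (4xx counts as an error on the client side only, 5xx on both sides) is one plausible general-purpose choice, similar in spirit to the HTTP semantic conventions.

```java
final class HttpErrorClassifier {
    enum Role { CLIENT, SERVER }

    // General-purpose default, used when nothing is known about the specific request.
    static boolean isError(int statusCode, Role role) {
        if (statusCode >= 500) {
            return true;
        }
        if (statusCode >= 400) {
            // Assumption for this sketch: a 4xx response is a failure for the caller,
            // but not for the server that answered correctly.
            return role == Role.CLIENT;
        }
        return false;
    }

    // Refinement for instrumentation that knows more about the request,
    // for example that it is an existence check where 404 is an expected answer.
    static boolean isError(int statusCode, Role role, boolean notFoundExpected) {
        if (statusCode == 404 && notFoundExpected) {
            return false;
        }
        return isError(statusCode, role);
    }

    public static void main(String[] args) {
        System.out.println(isError(404, Role.CLIENT));       // true
        System.out.println(isError(404, Role.SERVER));       // false
        System.out.println(isError(404, Role.CLIENT, true)); // false
        System.out.println(isError(503, Role.SERVER));       // true
    }
}
```

The two overloads keep the general-purpose rule separate from the request-specific refinement, so instrumentation without extra context falls back to the default.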
Errors that were retried or handled (allowing an operation to complete gracefully) SHOULD NOT be recorded on spans or metrics that describe this operation.
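A minimal sketch of this rule, assuming a simple retry loop: failures of individual attempts are not reported on the logical operation, and only the final outcome carries an error type. `RetrySketch`, `Outcome`, and `callWithRetries` are hypothetical names used for illustration and are not part of any OpenTelemetry API.

```java
import java.util.concurrent.Callable;

final class RetrySketch {
    // What would be reported for the logical operation as a whole.
    static final class Outcome<T> {
        final T value;
        final String errorType; // null when the operation eventually succeeded
        final int attempts;

        Outcome(T value, String errorType, int attempts) {
            this.value = value;
            this.errorType = errorType;
            this.attempts = attempts;
        }
    }

    static <T> Outcome<T> callWithRetries(Callable<T> attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception lastFailure = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                // Success: earlier failures were retried, so no error is reported.
                return new Outcome<>(attempt.call(), null, i);
            } catch (Exception e) {
                // Remember the failure, but do not report it on the logical operation yet.
                lastFailure = e;
            }
        }
        // All attempts failed: only now does the logical operation count as failed.
        return new Outcome<>(null, lastFailure.getClass().getName(), maxAttempts);
    }

    public static void main(String[] args) {
        int[] calls = {0};
        Outcome<String> outcome = callWithRetries(() -> {
            if (++calls[0] < 3) {
                throw new java.io.IOException("transient failure");
            }
            return "done";
        }, 5);
        // prints "done after 3 attempts, error.type=null"
        System.out.println(outcome.value + " after " + outcome.attempts
            + " attempts, error.type=" + outcome.errorType);
    }
}
```

Individual attempts may still have their own spans or metrics that record their own failures. The sketch only covers the operation that wraps the retries.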
Recording errors on spans
Span Status Code MUST be left unset if the instrumented operation has ended without any errors.
When the operation ends with an error, instrumentation:
- SHOULD set the span status code to Error
- SHOULD set the error.type attribute
- SHOULD set the span status description when it has additional information about the error which is not expected to contain sensitive details and aligns with Span Status Description definition.
  It’s NOT RECOMMENDED to duplicate status code or error.type in span status description.
  When the operation fails with an exception, the span status description SHOULD be set to the exception message.
Refer to the Recording exceptions section for details on capturing exception information.
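The rules in this section can be sketched without a tracing SDK. `RecordedSpan` below is a hypothetical in-memory stand-in for a real span, not the OpenTelemetry API. It only shows which fields are written on success and on failure, and it assumes that the exception message contains no sensitive details.

```java
import java.util.HashMap;
import java.util.Map;

final class SpanErrorRecording {
    enum StatusCode { UNSET, OK, ERROR }

    // Hypothetical stand-in for a span; a real instrumentation would use its tracing API.
    static final class RecordedSpan {
        StatusCode statusCode = StatusCode.UNSET;
        String statusDescription = "";
        final Map<String, String> attributes = new HashMap<>();
    }

    // Runs the operation and records its outcome on the span.
    static void run(RecordedSpan span, Runnable operation) {
        try {
            operation.run();
            // Success: the status code stays unset and error.type is not added.
        } catch (RuntimeException e) {
            span.statusCode = StatusCode.ERROR;
            span.attributes.put("error.type", e.getClass().getName());
            if (e.getMessage() != null) {
                // The description carries the message only, not the status code or error.type.
                span.statusDescription = e.getMessage();
            }
            throw e; // the caller still sees the failure
        }
    }

    public static void main(String[] args) {
        RecordedSpan failed = new RecordedSpan();
        try {
            run(failed, () -> { throw new IllegalStateException("connection reset"); });
        } catch (IllegalStateException expected) {
            // the caller decides how to handle the failure
        }
        // prints "ERROR java.lang.IllegalStateException connection reset"
        System.out.println(failed.statusCode + " " + failed.attributes.get("error.type")
            + " " + failed.statusDescription);
    }
}
```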
Recording errors on metrics
Semantic conventions for operations usually define an operation duration histogram
metric. This metric SHOULD include the error.type attribute. This enables users to derive
throughput and error rates.
Operations that complete successfully SHOULD NOT include the error.type attribute,
allowing users to filter out errors.
Semantic conventions SHOULD include error.type on other metrics when it’s applicable.
For example, the messaging.client.sent.messages metric measures message throughput (one
messaging operation may involve sending multiple messages) and includes error.type.
It’s RECOMMENDED to report one metric that includes successes and failures as opposed to reporting two (or more) metrics depending on the operation status.
Instrumentation SHOULD ensure error.type is applied consistently across spans
and metrics when both are reported. A span and its corresponding metric for a single
operation SHOULD have the same error.type value if the operation failed and SHOULD NOT
include it if the operation succeeded.
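To illustrate why a single metric covering successes and failures is convenient, here is a rough sketch that derives throughput and error rate from one series of measurements. `DurationMetricSketch` is a hypothetical in-memory recorder, not a metrics SDK. It assumes that `error.type` is absent (modeled as `null`) for successful operations.

```java
import java.util.ArrayList;
import java.util.List;

final class DurationMetricSketch {
    // One recorded measurement of the operation duration metric.
    static final class Measurement {
        final double durationSeconds;
        final String errorType; // null means the error.type attribute is absent

        Measurement(double durationSeconds, String errorType) {
            this.durationSeconds = durationSeconds;
            this.errorType = errorType;
        }
    }

    private final List<Measurement> measurements = new ArrayList<>();

    void record(double durationSeconds, String errorType) {
        measurements.add(new Measurement(durationSeconds, errorType));
    }

    // Throughput: every measurement counts, regardless of outcome.
    long totalCount() {
        return measurements.size();
    }

    // Errors: measurements that carry the error.type attribute.
    long errorCount() {
        return measurements.stream().filter(m -> m.errorType != null).count();
    }

    double errorRate() {
        return measurements.isEmpty() ? 0.0 : (double) errorCount() / totalCount();
    }

    public static void main(String[] args) {
        DurationMetricSketch metric = new DurationMetricSketch();
        metric.record(0.012, null);
        metric.record(0.015, null);
        metric.record(0.011, null);
        metric.record(0.250, "java.io.IOException");
        System.out.println(metric.totalCount()); // 4
        System.out.println(metric.errorRate());  // 0.25
    }
}
```

With two separate metrics, the same error rate would require joining two series. With one metric it is a filter on a single attribute.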
Recording exceptions
When the instrumented operation failed due to an exception:
- instrumentation SHOULD record this exception as a log record,
- instrumentation SHOULD follow the Recording errors on spans and Recording errors on metrics sections when capturing exception details on these signals.
It’s NOT RECOMMENDED to record the same exception more than once. It’s NOT RECOMMENDED to record exceptions that are handled by the instrumented library.
For example, in the following code snippet, ResourceAlreadyExistsException is handled and the corresponding
native instrumentation should not record it. Exceptions that are propagated
to the caller should be recorded (or logged) once.
public boolean createIfNotExists(String resourceId) throws IOException {
    Span span = startSpan();
    long startTime = System.nanoTime();
    try {
        create(resourceId);
        recordMetric("acme.resource.create.duration", System.nanoTime() - startTime);
        return true;
    } catch (ResourceAlreadyExistsException e) {
        // we do not set span status to error and the "error.type" attribute
        // as the exception is not an error,
        // but we still log and set attributes that capture additional details
        logger.withEventName("acme.resource.create.error")
            .withAttribute("acme.resource.create.status", "already_exists")
            .withException(e)
            .debug();
        span.setAttribute(AttributeKey.stringKey("acme.resource.create.status"), "already_exists");
        recordMetric("acme.resource.create.duration", System.nanoTime() - startTime);
        return false;
    } catch (IOException e) {
        // this exception is expected to be handled by the caller
        // and could be a transient error
        logger.withEventName("acme.resource.create.error")
            .withException(e)
            .warn();
        String errorType = e.getClass().getCanonicalName();
        span.setAttribute(AttributeKey.stringKey("error.type"), errorType);
        span.setStatus(StatusCode.ERROR, e.getMessage());
        recordMetric("acme.resource.create.duration", System.nanoTime() - startTime,
            AttributeKey.stringKey("error.type"), errorType);
        throw e;
    } finally {
        // end the span on every path so that it is always completed
        span.end();
    }
}