End-User Q&A Series: Using OTel at Farfetch
On Thursday, May 25th, 2023, the OpenTelemetry (OTel) End User Working Group hosted its third End User Q&A session of 2023. We had a bit of a gap due to KubeCon Europe, but now we’re back! This series is a monthly casual discussion with a team using OpenTelemetry in production. The goal is to learn more about their environment, their successes, and the challenges that they face, and to share it with the community, so that together, we can help make OpenTelemetry awesome!
Iris is a huge fan of observability and OpenTelemetry, and her love of these two topics is incredibly infectious.
In this session, Iris shared:
- Farfetch’s journey to OpenTelemetry
- How metrics and traces are instrumented
- OpenTelemetry Collector deployment and configuration
Iris is part of a central team that provides tools for all the engineering teams across Farfetch to monitor their services, including traces, metrics, logs, and alerting. The team is responsible for maintaining Observability tooling, managing deployments related to Observability tooling, and educating teams on instrumenting code using OpenTelemetry.
Iris first started her career as a software engineer, focusing on backend development. She eventually moved to a DevOps Engineering role, and it was in this role that she was introduced to cloud monitoring through products such as Amazon CloudWatch and Azure App Insights. The more she learned about monitoring, the more it became a passion for her.
She then moved into another role where she was introduced to OpenTelemetry, Prometheus, and Grafana, and got to dabble a little more in the world of Observability. This role became an excellent stepping stone for her current role at Farfetch, which she has been doing for a little over a year now.
Iris first heard about OpenTelemetry on LinkedIn. The company she was working at at the time, which was not using traces, had started exploring the possibility of using them and was looking into tracing solutions. After reading about OpenTelemetry, Iris created a small Proof-of-Concept (POC) for her manager. While nothing had moved past the POC at that role, when Iris joined Farfetch and OpenTelemetry came up again, she jumped at the chance to work with it.
Farfetch currently has 2000 engineers, with a complex and varied architecture which includes cloud-native, Kubernetes, and virtual machines running on three different cloud providers. There is a lot of information coming from everywhere, with a lack of standardization on how to collect this information. For example, Prometheus is used mostly as a standard for collecting metrics; however, in some cases, engineers found that Prometheus did not suit their needs. With the introduction of OpenTelemetry, Farfetch was able to standardize the collection of both metrics and traces, and enabled them to collect telemetry signals from services where signal collection had not previously been possible.
Farfetch uses Jenkins for CI/CD, and there is a separate team that manages this.
Iris’ team uses mostly open source tooling, alongside some in-house tooling created by her team. On the open source tooling front:
- Grafana is used for dashboards
- OpenTelemetry is used for emitting traces, and Grafana Tempo is used as a tracing backend
- Jaeger is still used in some cases for emitting traces and as a tracing backend, because some teams have not yet completely moved to OpenTelemetry for instrumenting traces (via Jaeger’s implementation of the OpenTracing API).
- Prometheus Thanos (highly-available Prometheus) is used for metrics collection and storage
- OpenTelemetry is also being used to collect metrics
Farfetch is a very Observability-driven organization, so when senior leadership floated the idea of bringing OpenTelemetry into the organization, it got overwhelming support across the organization. The biggest challenge faced around OpenTelemetry was around timing for its implementation; however, once work on OpenTelemetry started, everyone embraced it.
By the time Iris joined Farfetch, most of the big struggles and challenges around Observability had passed. When Observability was first introduced within the organization, it was very new and unknown to many engineers there, and as with all new things, there is a learning curve.
When Iris and her team took on the task of enabling OpenTelemetry across the organization, Observability as a concept had already been embraced. Their biggest challenge in bringing OpenTelemetry to Farfetch was making sure that engineers did not experience major disruptions to their work, while still benefiting from having OpenTelemetry in place. It helped that OpenTelemetry is compatible with many of the tools in their existing Observability stack, including Jaeger and Prometheus.
Due to the enthusiasm, drive, and push that Iris and one of her co-workers, an architect at Farfetch, made for OpenTelemetry, Iris was proud to share that they are now using OpenTelemetry in production.
Iris and her team planned to start using OpenTelemetry in January 2023. This included initial investigation and information-gathering. By mid-March, they had their first pieces in production.
They are not fully there yet:
- There is still a lot of reliance on Prometheus and Jaeger for generating metrics and traces, respectively
- Not all applications have been instrumented with OpenTelemetry
In spite of that, Iris and her team are leveraging the power of the OpenTelemetry Collector to gather and send metrics and traces to various Observability backends. Since she and her team started using OpenTelemetry, they started instrumenting more traces. In fact, with their current setup, Iris has happily reported that they went from processing 1,000 spans per second, to processing 40,000 spans per second!
Traces are being collected through a combination of manual and auto instrumentation.
Some applications are being manually instrumented through OpenTelemetry, and others are still instrumented using the [legacy OpenTracing using shims.
The OpenTelemetry Operator is being implemented to auto-instrument Java and .NET code. Among other things, the OTel Operator supports injecting and configuring auto-instrumentation in .NET, Java, Python, and Node.js. Iris hopes that Go auto-instrumentation will be available in the near-future. To track progress of auto-instrumentation in Go, see OpenTelemetry Go Automatic Instrumentation.
Although this will be a lengthy and time-consuming process, the team’s goal is to have all applications instrumented using OpenTelemetry.
By design, Iris and her team don’t instrument other teams’ code. Instead, they provide documentation and guidelines on manual instrumentation, and refer teams to the OpenTelemetry docs where applicable. They also have sessions with engineers to show them best practices around instrumenting their own code. It’s a team sport!
The OTel Operator is only partially used in production, and is currently not available for everyone. Iris and her team really love the OTel Operator; however, it did take a bit of getting used to. Iris and her team found that there is a tight coupling between cert-manager and the OTel Operator. They were not able to use our own custom certificates, and they did not support cert-manager in their clusters, so they found it hard to use the Operator in our clusters. They solved this by submitting a PR – opentelemetry-helm-charts PR #760!
One of the things she loves about OpenTelemetry was that, when she was trying to troubleshoot an issue whereby Prometheus was not sending metrics to the Collector, and was therefore not able to create alerts from it. Then a colleague suggested using OpenTelemetry to troubleshoot OpenTelemetry.
Iris has played around a bit with OTel logging, mostly consuming logs from a Kafka topic. This experiment has not included log correlation, but it is something that Iris would like to explore further.
Since logs are not yet stable, Iris doesn’t expect logging to go into production at Farfetch just yet. Farfetch has a huge volume of logs (more than traces), so they don’t want to start converting to OTel logging until things are more stable.
Note: some parts of OTel logs are stable. For details, see Specification Status Summary.
Auto-instrumentation emits some OTLP metrics; however, the majority of metrics still come from Prometheus.
The team currently uses the Prometheus Receiver to scrape metrics from Consul. Specifically, they use Consul to get the targets and the ports where to scrape them. The Receiver’s scrape configs are the same as in Prometheus, so it was relatively easy to move from Prometheus to the Prometheus Receiver (lift and shift).
They also plan to collect OTLP metrics from Kubernetes. This is facilitated by the Prometheus Receiver’s support for the OTel Operator’s Target Allocator.
Prometheus is also still currently used for metrics collection in other areas, and will probably remain this way, especially when collecting metrics from virtual machines.
There are 100 Kubernetes clusters being observed, and thousands of virtual machines. Iris and her team are responsible for managing the OTel Operator across all of these clusters, and are therefore also trained in Kubernetes, so that they can maintain their stack on the clusters.
This question is referring to the ability for Kubernetes components to emit OTLP traces which can then be consumed by the OTel Collector. For more info, see Traces For Kubernetes System Components. This feature is currently in beta, and was first introduced in Kubernetes 1.25.
Iris and team have not played around with this beta feature.
Because there are so many Kubernetes clusters, having a single OTel Collector would be a bottleneck in terms of load and single point of failure. The team currently has one OpenTelemetry Collector agent per Kubernetes cluster. The end goal is to replace those agents with the OTel Operator instead, which allows you to deploy and configure the OTel Collector and inject and configure auto-instrumentation.
Everything is then sent to a central OTel Collector (i.e. an OTel Collector gateway) per data center, where data masking (using the transform processor, or redaction processor), data sampling (e.g. tail sampling processor or probabilistic sample processor), and other things happen. It then sends traces to Grafana Tempo.
The central OTel Collector resides on another Kubernetes cluster that belongs solely to the Farfetch Observability team, which runs the Collector and other applications that belong to the team.
The team has fallback clusters, so that if a central Collector fails, the fallback cluster will be used in its place. The satellite clusters are configured to send data to the central Collector on the fallback cluster, so if the central cluster fails, the fallback cluster can be brought up without disruption to OTel data flow.
Having autoscaling policies in place to ensure that the Collectors have enough memory and CPU to handle data loads also helps to keep the system highly available.
The biggest challenge was getting to know the Collector and how to use it effectively. Farfetch relies heavily on auto-scaling, so one of the first things that the team did was to enable auto-scaling for the Collectors, and tweak settings to make sure that it could handle large amounts of data.
The team also leaned heavily on OTel Helm charts, and on the OTel Community for additional support.
Are you currently using any processors on the OTel Collector?
The team is currently experimenting with processors, namely for data masking (transform processor, or redaction processor), especially as they move to using OTel Logs, which will contain sensitive data that they won’t want to transmit to their Observability backend. They currently, however, are only using the batch processor.
A span event provides additional point-in-time information in a trace. It’s basically a structured log within a span.
Not at the moment, but it is something that they would like to explore. When the Observability team first started, there was little interest in tracing. As they started implementing OpenTelemetry and tracing, they have moved to make traces first-class citizens, and now it is piquing the interest of engineers, as they begin to see the relevance of traces.
Farfetch is a very Observability-driven culture, and the Observability team hasn’t really encountered anyone who is against Observability or OpenTelemetry. Some engineers might not care either way, but they are not opposed to it, either.
The team, led by the architect, has made a contribution recently to the OTel Operator around certificates. The OTel Operator relied on cert-manager for certificates, rather than custom certificates. They initially put in a feature request, but then decided to develop the feature themselves, and filed a pull request.
When their Collector was processing around 30,000 spans per second, there were 4 instances of the Collector, using around 8GB memory.
This is something that is currently being explored. The team is exploring traces/metrics correlation (exemplars) through OpenTelemetry; however, they found that this correlation is accomplished more easily through their tracing backend, Tempo.
Are you concerned about the amount of data that you end up producing, transporting, and collecting? How do you ensure data quality?
This is not a concern, since the volume of data never changed, and the team knows that they can handle these large volumes. The team is simply changing how the data is being produced, transported, and collected. Iris also recognizes that the amount of trace data is gradually increasing; however, the data increase is gradual, so that the team can prepare itself to handle larger data volumes.
The team is working very hard to ensure that they are getting quality data. This is especially true for metrics, where they are cleaning up metrics data to make sure that they are processing meaningful data. If a team decided to drastically increase the volume of metrics it emits, the Observability team is consulted beforehand, to ensure that the increase makes sense.
Since trace volumes were initially a low lower, they did not need to concern themselves with trace sampling. Now that trace volume is increasing, they are keeping a close eye on things.
The team is also focusing its attention on data quality and volume of logs, which means researching log processors to see which ones suit their needs. Ultimately, they will publish a set of guidelines for development teams to follow, and evangelize practices within the company.
Iris and her team have had a very positive experience with OpenTelemetry and the OpenTelemetry community.
Iris shared that the docs at times are not as clear as they could be, requiring some extra digging on the part of the engineer, to understand how a certain component works or is supposed to be configured. For example, she had a hard time finding documentation on Consul SD configuration for OpenTelemetry. That being said, Iris is hoping to contribute back to docs to help improve them.
Iris and her team were pleasantly surprised by the quick turnaround time on getting their OTel Operator PR approved and merged.
My conversation with Iris, in full, is available on YouTube.
If anyone would like to continue the conversation with Iris, reach out to her in the #otel-user-research Slack channel!
She will also be presenting at OTel in Practice on June 8th.
OpenTelemetry is all about community, and we wouldn’t be where we are without our contributors, maintainers, and users. Hearing stories of how OpenTelemetry is being implemented in real life is only part of the picture. We value user feedback, and encourage all of our users to share your experiences with us, so that we can continue to improve OpenTelemetry. ❣️
If you have a story to share about how you use OpenTelemetry at your organization, we’d love to hear from you! Ways to share:
- Join the #otel-endusers channel on the CNCF Community Slack
- Join our monthly End Users Discussion Group calls
- Join our OTel in Practice sessions
- Sign up for one of our monthly interview/feedback sessions
- Join the OpenTelemetry group on LinkedIn
- Share your stories on the OpenTelemetry blog