OpenTelemetry: Observability for Mobiles with Hanson Ho from Embrace | #39
You all know what observability is. But what do you know about the complexities of collecting telemetry data from mobile apps and devices, and how to do it with OpenTelemetry? That's the topic of Episode 39 of our DevOps Accents podcast with Hanson Ho, Android Architect at Embrace, the only user-focused, mobile-first observability solution built on OpenTelemetry. By providing actionable insights across DevOps and mobile teams, Embrace highlights the true impact of issues on user experience, not just on server performance. We discuss whether and how this can be achieved with OpenTelemetry, and what the main obstacles are to comprehensive observability for mobile applications.
Also in this episode:
- Importance of observability in complex IT environments
- Challenges and differences in mobile observability
- Impact of complexity on Mean Time to Recovery (MTTR)
- The importance of Production Data in Mobile Observability
- Optimizing data collection and standardization for cost-effective Observability
- Impact of integration on Mobile Development
- The future of Observability
You can listen to episode 39 of DevOps Accents on Spotify, or right now:
Understanding the Need for Observability
In today's IT landscape, observability is more crucial than ever. As systems grow in complexity, the ability to monitor and understand the intricate workings of these environments becomes essential. Observability is not just about keeping an eye on things; it's about gaining deep insights into system behavior by collecting and analyzing a wide array of data. This enables organizations to detect and address issues before they escalate, ensuring smooth operations and rapid problem resolution.
Challenges in Mobile Observability
Mobile observability presents unique challenges compared to traditional desktop or server environments. The variability in device performance, network conditions, and user interactions adds layers of complexity. Traditional observability tools, designed with back-end systems in mind, often fall short in capturing the nuances of mobile environments. This results in a significant gap between front-end and back-end observability, leading to missed insights and prolonged issue resolution times.
For example, Hanson Ho, an Android architect from Embrace, highlights the unpredictability of mobile environments. He explains that mobile apps run in a single process with limited threads, unlike the distributed systems typical of back-end environments. Moreover, mobile devices operate in varying conditions, from poor network connectivity to devices with limited resources, making it challenging to collect consistent and reliable data.
Traditionally, mobile observability is about what they call RUM, real user monitoring. It's about knowing your crashes; it's about knowing, you know, various things that folks are familiar with in the mobile space to fix problems. Those don't tend to marry well with observability practices, with traces and things like that. So, in the recent past, folks have been more focused on performance, because logging and crash reporting are simply not enough. They want to bring the idea of observability and its tooling into the client. But what we face is a different set of challenges, given that on the client, an app is fairly simple. It's one process generally, with a handful of threads. There isn't a lot of variance like a distributed trace you would have in the backend. So some of the tooling designed for context propagation in distributed tracing gets a little bit complex when applied to the client model. And frankly, client devs, or mobile devs I should say, come from a variety of backgrounds. So not everybody is familiar with what a thread is, not everybody is familiar with tracing context, what a span is, things like that. So when you give them an API and say, "Oh yeah, you gotta do this and this and this," they throw their hands up and say, "Oh, I don't know. Google gives me these very simple-to-use APIs, and I know how to do that, but anything harder is difficult." – Hanson Ho
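For readers less familiar with the tracing concepts Hanson mentions, here is a minimal sketch of what a span looks like with the OpenTelemetry Java/Kotlin API. The tracer and span names are made up for illustration; this is not Embrace's SDK.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry

// A span is a named, timed unit of work with attributes attached.
// "checkout-flow" and "cart.load" are illustrative names only.
val tracer = GlobalOpenTelemetry.getTracer("checkout-flow")

fun loadCart() {
    val span = tracer.spanBuilder("cart.load").startSpan()
    try {
        // Making the span "current" is the tracing context Hanson refers to:
        // spans started on this thread inside the block pick it up as their parent.
        span.makeCurrent().use {
            // ... do the actual work here ...
        }
    } finally {
        span.end()
    }
}
```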
So just educating the community is not as straightforward as with backend devs. There are a lot of assumptions that were made for backend observability that simply do not apply to mobile. It's an unstable environment. When you collect the data, it doesn't mean it gets transmitted. If the user goes offline, the data is gone; data could be delayed. Data could be captured wrongly because an OEM from a different part of the world decided to override a system function in a way that makes things a little different. So the data you get is dirty, and you don't have a lot of expectation in terms of "I expect p99 to run this way," because you don't have a fixed set of hardware you're running on; you could be running on really old devices, using really old networking equipment, or the network is just slow, and you have a lot of unpredictability. In general, the data you get is just not as solid and stable, and you kind of have to prepare for that in terms of your data collection. – Hanson Ho
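To make "prepare for that in your data collection" more concrete, below is a rough sketch, not Embrace's actual implementation, of how a mobile SDK might wrap an OpenTelemetry span exporter so that telemetry collected while offline is held instead of dropped. The class name and the `isOnline` check are hypothetical; a real SDK would also persist to disk and respect battery and network constraints.

```kotlin
import io.opentelemetry.sdk.common.CompletableResultCode
import io.opentelemetry.sdk.trace.data.SpanData
import io.opentelemetry.sdk.trace.export.SpanExporter
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical wrapper: buffers spans in memory when the device is offline
// and retries them on the next export call.
class OfflineAwareExporter(
    private val delegate: SpanExporter,
    private val isOnline: () -> Boolean,
) : SpanExporter {
    private val pending = ConcurrentLinkedQueue<SpanData>()

    override fun export(spans: Collection<SpanData>): CompletableResultCode {
        pending.addAll(spans)
        if (!isOnline()) {
            // Hold on to the data instead of dropping it; try again later.
            return CompletableResultCode.ofSuccess()
        }
        val batch = mutableListOf<SpanData>()
        while (true) {
            batch.add(pending.poll() ?: break)
        }
        return delegate.export(batch)
    }

    override fun flush(): CompletableResultCode = delegate.flush()
    override fun shutdown(): CompletableResultCode = delegate.shutdown()
}
```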
Impact of Complexity on Mean Time to Recovery (MTTR)
One of the critical issues in modern observability is the increasing Mean Time to Recovery (MTTR) for production incidents. Despite advancements in observability tools, the complexity of IT environments makes it difficult to quickly pinpoint and resolve issues. This paradox arises because more data does not always translate into faster fixes. The sheer volume of data generated by various components, including cloud services and microservices, can overwhelm traditional monitoring systems, making it hard to correlate and analyze information efficiently.
In mobile observability, this complexity is even more pronounced. Mobile applications must function across diverse devices and network conditions, making it essential to capture detailed contextual data from production environments. Ho emphasizes that obtaining production data is crucial because it provides insights into real-world user experiences, which are often difficult to replicate in testing environments.
We can discuss this in two segments: one is if the issue actually happens in the mobile app, and the other is if the issue is something on the backend. Let's talk about the mobile app first. Traditionally, observability tools in mobile are designed to find the needles in haystacks. We don't look at p99.9; we look at specific instances of crashes, because finding needles in haystacks is generally what production data helps us with. Strategies to prevent those from becoming a big issue actually don't come from observability; they come from your rollout practices, because typically bugs occur when we upgrade versions. Obviously, a new version has tons of changes, and those changes could have unintended consequences once released. Typically, what folks do is a staged rollout: they go to 1% for 24 hours, see that there aren't major problems, and then roll out the rest. And if there are particularly challenging features that may be buggy, they're put behind a feature flag so they can be disabled in production. Doing those two things already gets you a long way, though I wouldn't say it's good enough. – Hanson Ho
When you bring in the mobile data and connect it to the backend, you can start seeing things that perhaps you're not capturing on the mobile or on the backend side. Maybe the details of a request payload are too large to put into a log. But on mobile, we do record that, because it's done outside of the request to the backend. If you have a way of linking your client data with your backend data, you can see the specific request that generated a particularly slow trace, and the user behind it; you have the context of what led to that. Ultimately, what is affected on the backend is what's in the request. You can know the request payload, the request parameters you send in the URL, or headers that you may not be capturing entirely at your frontend, your Nginx, which might be stripping away that data and just passing on what you think is relevant to your user billing systems. You mentioned volume being a problem before; well, you can't capture everything, it's too expensive. But if your mobile observability captures it and puts it in a different system, you can then link the two sets of data and get additional context. So instead of trying to find the needle in the haystack with your eyes, you're looking at it from a totally different perspective, with a metal detector; you're using another technology that gives you more context. In terms of MTTR on mobile, you have to wait for the app store to release your new update, so generally it's not a fantastic metric for us. But on the server side, this can give you more data and more context, and help you fix faster the things that can be fixed on the server side. – Hanson Ho
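One concrete way to link client and backend data, which the OpenTelemetry API supports out of the box, is to inject the current trace context into outgoing HTTP requests so the backend's spans end up in the same trace. A rough sketch, assuming an OkHttp client and a W3C TraceContext propagator configured on the global OpenTelemetry instance:

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.context.Context
import io.opentelemetry.context.propagation.TextMapSetter
import okhttp3.Request

// Writes propagation headers (e.g. "traceparent") onto the request builder.
val setter = TextMapSetter<Request.Builder> { builder, key, value ->
    builder?.header(key, value)
}

fun tracedRequest(url: String): Request {
    val builder = Request.Builder().url(url)
    GlobalOpenTelemetry.getPropagators()
        .textMapPropagator
        .inject(Context.current(), builder, setter)
    return builder.build()
}
```

On the server side, the same propagator extracts that header, so a slow backend trace can be tied back to the mobile session and request that produced it.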
The Importance of Production Data in Mobile Observability
Production data plays a vital role in mobile observability. Unlike controlled testing environments, production data reflects actual user interactions and conditions, providing invaluable insights into how applications perform in the real world. Ho notes that traditional mobile observability tools have focused on crash reporting and basic performance metrics. However, these tools often lack the depth required to diagnose complex issues.
Embrace, for instance, uses production data to understand the sequence of events leading to crashes or performance slowdowns. This includes monitoring user sessions and capturing detailed context about the device's state, such as whether it was running on low power or experiencing memory constraints. By linking this data with back-end observability systems through open standards like OpenTelemetry, organizations can gain a comprehensive view of their application's performance and quickly identify and resolve issues.
Optimizing Data Collection and Standardization for Cost-Effective Observability
As data volumes increase, so do the costs associated with storing and analyzing this data. Organizations must adopt strategies to optimize data collection and manage costs effectively while maintaining comprehensive coverage. One approach is to be selective about the data collected, focusing on high-value telemetry that provides actionable insights. This involves using structured logs and wide events with rich context, enabling more efficient data analysis without storing excessive amounts of information.
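As an illustration of the "wide events" idea, here is a minimal sketch: one span per screen load carrying many dimensions as attributes, rather than a stream of unstructured log lines. The attribute names are made up for this example, not an official semantic convention.

```kotlin
import io.opentelemetry.api.GlobalOpenTelemetry

val tracer = GlobalOpenTelemetry.getTracer("mobile-demo")

// One "wide" event per screen load: a single span with many dimensions,
// so the backend can filter and group without text-searching raw logs.
fun recordScreenLoad(durationMs: Long, networkType: String, lowPower: Boolean) {
    val span = tracer.spanBuilder("screen.load").startSpan()
    try {
        span.setAttribute("screen.name", "checkout")       // illustrative value
        span.setAttribute("screen.load_time_ms", durationMs)
        span.setAttribute("net.type", networkType)          // e.g. "wifi", "3g"
        span.setAttribute("device.low_power", lowPower)
    } finally {
        span.end()
    }
}
```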
OpenTelemetry offers a promising solution by providing a standardized protocol for collecting and processing observability data. This open standard allows organizations to switch between different vendors and tools without losing data compatibility, ensuring cost-effective observability. Ho points out that by using OpenTelemetry, mobile and back-end teams can speak the same "language," making it easier to integrate and analyze data from different parts of the system.
Folks are complaining about large data bills because they're storing data in large volumes from collecting tons of logs. The first strategy you might imagine is, well, we'll just delete some; we'll delete the old ones. But what you might be doing, as you said, is deleting important context, because the data you collect is what you need to debug. So if you're deleting it, how are you going to do that? Well, I guess you have a shorter time period in which to debug, and that's one way of doing it. But I think you only get incremental gains when you start doing that type of nipping and tucking at the edges. What's important is to be more efficient about the data you collect, how you collect it, and how you store it. What are you collecting? Are you collecting really verbose logs that you need to process a lot just to get answers out of with text searches in Splunk? Are you collecting data in such a way that it's easy to imbue context for transformation? Are you collecting structured logs? Are you collecting wide events with lots of context, lots of dimensions, so you can actually do filtering and grouping on a handful of events by changing the context? There are probably smarter ways of collecting and storing data. I think that's one aspect of it. Another is vendor lock-in: if you're locked into a particular vendor and their proprietary platform, your only alternative is to move off of them and migrate to a different system, and migrations are very, very expensive and very problematic. Maybe that's good for mkdev, consulting and helping folks with migrations, but as a holder of a budget, you don't want to do migrations as a backend developer unless you have to. So having a standard makes it easy to migrate from one vendor to another. If your data is collected in an open standard and processed by an open standard, it's much more attractive to say, well, if this vendor is charging me too much, can I move to another vendor? You still do a migration, but the migration is a lot smaller. The data is the same in terms of shape; I'm just paying a different person to process it. So OpenTelemetry gives folks a choice, to say, you know what, I don't have to go with Datadog or any one of these big guys with a full-suite solution; I'm on OpenTelemetry. So I could move to other vendors that understand OpenTelemetry, process OpenTelemetry, and collect OpenTelemetry. Not only is shifting what you collect important, but how you collect it and what you store it as is also really important for managing costs going forward. You're not beholden to one vendor that would be very expensive to get off of. – Hanson Ho
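The vendor-neutrality argument is easiest to see in the exporter configuration: if the data leaves the app as OTLP, changing vendors is largely a matter of pointing the exporter, or an OpenTelemetry Collector in front of it, at a different endpoint. A minimal sketch with the OpenTelemetry Java SDK; the endpoint URL is a placeholder:

```kotlin
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter
import io.opentelemetry.sdk.OpenTelemetrySdk
import io.opentelemetry.sdk.trace.SdkTracerProvider
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor

// Export spans as OTLP over HTTP. Swapping vendors means changing this
// endpoint (or the collector behind it), not re-instrumenting the app.
val exporter = OtlpHttpSpanExporter.builder()
    .setEndpoint("https://my-collector.example.com:4318/v1/traces")
    .build()

val sdk = OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
            .build()
    )
    .build()
```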
The Future of Observability
Looking ahead, the future of observability lies in the shift from traditional methods to more intelligent, context-aware data collection and analysis. This evolution, often referred to as Observability 2.0, focuses on collecting fewer but richer data events that provide deeper insights into system behavior. By moving towards this approach, organizations can reduce data storage costs while improving their ability to diagnose and resolve issues.
Ho envisions a future where observability tools are more accessible and tailored to different use cases, including mobile development. By embracing open standards and fostering collaboration between mobile and back-end teams, the industry can achieve a more integrated and effective observability ecosystem. This shift will enable faster recovery times, better performance insights, and ultimately, a more resilient IT environment.
My ideal future is a lot easier to achieve than that, I think. I really just want us to talk in the same language, in a rich way that isn't rooted in a particular perspective. OpenTelemetry, great as it is, is still built for and by, and predominantly used by, backend distributed tracing folks. And there are certain things about it that I think don't fit super nicely when we shift to the mobile context. So my ideal future, which I'm hoping to be part of and help push forward, is that the API and the data models become a bit more friendly to mobile use cases, so that developers who are on Android or iOS or React Native or Flutter or Kotlin Multiplatform have easy access and an easy gateway into it. – Hanson Ho
Show Notes:
- Previous episode of DA on Observability with Dotan Horovitz from logz.io
- Their observability trends we touched on in this episode
- Open Telemetry we've been talking about
- Our guest, Hanson Ho
- The company he is working for, Embrace
- Their open source SDKs you can play with for Android and iOS/Apple SDK
- Become a part of the Open Telemetry community and join CNCF Slack for Open Telemetry channels here
Podcast editing: Mila Jones / milajonesproduction@gmail.com