Charity is an ops engineer and accidental startup founder at Honeycomb. Before this she worked at Parse, Facebook, and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of O’Reilly’s Database Reliability Engineering, and loves free speech, free software, and single malt scotch.
You were the Production Engineering Manager at Facebook (now Meta) for over two years. What were some of your highlights from this era, and what are some of your key takeaways from the experience?
I worked on Parse, which was a backend for mobile apps, kind of like Heroku for mobile. I had never been interested in working at a giant company, but we were acquired by Facebook. One of my key takeaways was that acquisitions are really, really hard, even under the best of circumstances. The advice I always give other founders now is this: if you’re going to be acquired, make sure you have an executive sponsor, and think really hard about whether you have strategic alignment. Facebook acquired Instagram not long before acquiring Parse, and the Instagram acquisition was hardly all bells and roses, but it was ultimately very successful because they did have strategic alignment and a strong sponsor.
I didn’t have an easy time at Facebook, but I’m very grateful for the time I spent there; I don’t know that I could have started a company without the lessons I learned there about organizational structure, management, strategy, etc. It also lent me a pedigree that made me attractive to VCs, none of whom had given me the time of day until that point. I’m somewhat cranky about this, but I’ll still take it.
Could you share the genesis story behind launching Honeycomb?
Definitely. From an architectural perspective, Parse was ahead of its time — we were using microservices before there were microservices, we had a massively sharded data layer, and as a platform serving over a million mobile apps, we had a lot of really complicated multi-tenancy problems. Our customers were developers, and they were constantly writing and uploading arbitrary code snippets and new queries of, shall we say, “varying quality” — and we just had to take it all in and make it work, somehow.
We were on the vanguard of a bunch of changes that have since gone mainstream. It used to be that most architectures were pretty simple, and they would fail over and over in predictable ways. You typically had a web layer, an application, and a database, and most of the complexity was bound up in your application code. So you’d write monitoring checks to watch for those failures, and build static dashboards for your metrics and monitoring data.
This industry has seen an explosion in architectural complexity over the past 10 years. We blew up the monolith, so now you have anywhere from several services to thousands of application microservices. Polyglot persistence is the norm; instead of “the database” it’s normal to have many different storage types, as well as horizontal sharding, layers of caching, db-per-microservice, queueing, and more. On top of that you’ve got server-side hosted containers, third-party services and platforms, serverless code, block storage, and more.
The hard part used to be debugging your code; now, the hard part is figuring out where in the system the code is that you need to debug. Instead of failing over and over in predictable ways, it’s more likely the case that every single time you get paged, it’s about something you’ve never seen before and may never see again.
That’s the state we were in at Parse, at Facebook. Every day the entire platform was going down, and every time it was something different and new: a different app hitting the top 10 on iTunes, a different developer uploading a bad query.
Debugging these problems from scratch is insanely hard. With logs and metrics, you basically have to know what you’re looking for before you can find it. But we started feeding some data sets into an FB tool called Scuba, which let us slice and dice on arbitrary dimensions and high cardinality data in real time, and the amount of time it took us to identify and resolve these problems from scratch dropped like a rock, like from hours to…minutes? seconds? It wasn’t even an engineering problem anymore, it was a support problem. You could just follow the trail of breadcrumbs to the answer every time, clicky click click.
It was mind-blowing. This massive source of uncertainty and toil and unhappy customers and 2 am pages just … went away. It wasn’t until Christine and I left Facebook that it dawned on us just how much it had transformed the way we interacted with software. The thought of going back to the bad old days of monitoring checks and dashboards was just unthinkable.
But at the time, we honestly thought this was going to be a niche solution — that it solved a problem other massive multitenant platforms might have. It wasn’t until we had been building for almost a year that we started to realize that, oh wow, this is actually becoming an everyone problem.
For readers who are unfamiliar, what specifically is an observability platform, and how does it differ from traditional monitoring and metrics?
Traditional monitoring famously has three pillars: metrics, logs and traces. You usually have to buy many tools to get your needs met: logging, tracing, APM, RUM, dashboarding, visualization, etc. Each of those is optimized for a different use case in a different format. As an engineer, you sit in the middle of these, trying to make sense of all of them. You skim through dashboards looking for visual patterns, you copy-paste IDs around from logs to traces and back. It’s very reactive and piecemeal, and typically you refer to these tools when you have a problem — they’re designed to help you operate your code and find bugs and errors.
Modern observability has a single source of truth: arbitrarily wide structured log events. From these events you can derive your metrics, dashboards, and logs. You can visualize them over time as a trace, you can slice and dice, you can zoom in to individual requests and out to the long view. Because everything’s connected, you don’t have to jump around from tool to tool, guessing or relying on intuition. Modern observability isn’t just about how you operate your systems, it’s about how you develop your code. It’s the substrate that allows you to hook up powerful, tight feedback loops that help you ship lots of value to users swiftly, with confidence, and find problems before your users do.
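To make “arbitrarily wide structured events” concrete, here is a minimal, hypothetical sketch of what a single event might look like. Every field name is illustrative rather than a prescribed schema, and a real event would typically carry many more dimensions.

```python
# A hypothetical "arbitrarily wide" structured event: one record per unit of work,
# carrying system, request, and business context side by side.
# All field names are illustrative, not a required schema.
event = {
    "timestamp": "2024-05-01T12:03:41.512Z",
    "service.name": "checkout-api",
    "trace.trace_id": "7f6a1c0e3b9d4f21",   # ties the event into its trace
    "trace.span_id": "b41c9a77d2e05c18",
    "duration_ms": 312.4,                    # latency metrics can be derived from this
    "http.route": "/cart/checkout",
    "http.status_code": 200,
    "db.shard": "shard-42",
    "build.id": "2024.05.01-3",              # high-cardinality fields are welcome
    "user.id": "u_981273",
    "cart.value_usd": 84.99,                 # business context lives alongside system context
    "feature_flag.new_pricing": True,
    "error": None,
}
```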
You’re known for believing that observability offers a single source of truth in engineering environments. How does AI integrate into this vision, and what are its advantages and challenges in this context?
Observability is like putting your glasses on before you go hurtling down the freeway. Test-driven development (TDD) revolutionized software in the early 2000s, but TDD has been losing efficacy the more complexity lives in our systems rather than just our software. Increasingly, if you want to get the benefits associated with TDD, you really need to instrument your code and perform something akin to observability-driven development, or ODD, where you instrument as you go, deploy fast, then look at your code in production through the lens of the instrumentation you just wrote and ask yourself: “is it doing what I expected it to do, and does anything look … weird?”
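As a rough sketch of what “instrument as you go” can look like in practice, here is a minimal example using the OpenTelemetry Python API as a stand-in for whatever instrumentation library you already use; the function, span name, and attributes are all hypothetical.

```python
# A minimal sketch of observability-driven development, assuming the
# OpenTelemetry Python API (pip install opentelemetry-api).
from opentelemetry import trace

tracer = trace.get_tracer("checkout")  # hypothetical service name

def apply_discount(cart_total: float, coupon_code: str) -> float:
    # Wrap the unit of work in a span and attach the context you will want
    # to query in production: who, what, which code path, how big.
    with tracer.start_as_current_span("apply_discount") as span:
        span.set_attribute("cart.total", cart_total)
        span.set_attribute("coupon.code", coupon_code)   # high cardinality is fine
        discount = 0.10 * cart_total if coupon_code.startswith("SAVE") else 0.0
        span.set_attribute("discount.amount", discount)
        if discount == 0.0:
            span.set_attribute("discount.miss", True)    # something to ask "does this look weird?" about
        return cart_total - discount
```

The point is that the questions you will ask in production get decided while you are writing the code, not after it breaks.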
Tests alone aren’t enough to confirm that your code is doing what it’s supposed to do. You don’t know that until you’ve watched it bake in production, with real users on real infrastructure.
This sort of development — one that includes production in fast feedback loops — is (somewhat counterintuitively) much faster, easier and more effective than relying on tests and slower deploy cycles. Once developers have tried working that way, they’re famously unwilling to go back to the slow, old way of doing things.
What excites me about AI is that when you’re developing with LLMs, you have to develop in production. The only way you can derive a set of tests is by first validating your code in production and working backwards. I believe that writing software backed by LLMs will be as common a skill as writing software backed by MySQL or Postgres in a few years, and my hope is that this drags engineers kicking and screaming into a better way of life.
You have raised concerns about mounting technical debt due to the AI revolution. Could you elaborate on the types of technical debt AI can introduce, and how Honeycomb helps in managing or mitigating these debts?
I’m concerned about both technical debt and, perhaps more importantly, organizational debt. One of the worst kinds of tech debt is when you have software that isn’t well understood by anyone. Which means that any time you have to extend or change that code, or debug or fix it, someone has to do the labor of learning it.
And if you put code into production that nobody understands, there’s a very good chance that it wasn’t written to be comprehensible. Good code is written to be easy to read, understand, and extend. It uses conventions and patterns, it uses consistent naming and modularization, it strikes a balance between DRY and other considerations. The quality of code is inseparable from how easy it is for people to interact with it. If we just start tossing code into production because it compiles or passes tests, we’re creating an enormous iceberg of future technical problems for ourselves.
If you’ve decided to ship code that nobody understands, Honeycomb can’t help with that. But if you do care about shipping clean, iterable software, instrumentation and observability are absolutely essential to that effort. Instrumentation is like documentation plus real-time state reporting. Instrumentation is the only way you can truly confirm that your software is doing what you expect it to do, and behaving the way your users expect it to behave.
How does Honeycomb utilize AI to enhance the efficiency and effectiveness of engineering teams?
Our engineers use AI a lot internally, especially CoPilot. Our more junior engineers report using ChatGPT every day to answer questions and help them understand the software they’re building. Our more senior engineers say it’s great for generating software that would be very tedious or annoying to write, like when you have a giant YAML file to fill out. It’s also useful for generating snippets of code in languages you don’t normally use, or from API documentation. Like, you can generate some really great, usable examples of stuff using the AWS SDKs and APIs, because it was trained on repos that have real usage of that code.
However, any time you let AI generate your code, you have to step through it line by line to ensure it’s doing the right thing, because it absolutely will hallucinate garbage on the regular.
Could you provide examples of how AI-powered features like your query assistant or Slack integration enhance team collaboration?
Yeah, of course. Our query assistant is a great example. Using query builders is complicated and hard, even for power users. If you have hundreds or thousands of dimensions in your telemetry, you can’t always remember offhand what the most valuable ones are called. And even power users forget the details of how to generate certain kinds of graphs.
So our query assistant lets you ask questions using natural language. Like, “what are the slowest endpoints?”, or “what happened after my last deploy?” — and it generates a query and drops you into it. Most people find it difficult to compose a new query from scratch and easy to tweak an existing one, so it gives you a leg up.
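To give a sense of what “generates a query and drops you into it” means, here is a hypothetical sketch of the kind of structured query a natural-language question might translate into. The field names echo common query-builder concepts (calculations, filters, breakdowns) and are not an exact Honeycomb API spec.

```python
# Hypothetical sketch only: a natural-language question mapped to a structured query.
question = "what are the slowest endpoints in the last two hours?"

generated_query = {
    "time_range_seconds": 7200,
    "calculations": [{"op": "P99", "column": "duration_ms"}],
    "breakdowns": ["http.route"],                 # group results by endpoint
    "filters": [{"column": "service.name", "op": "=", "value": "checkout-api"}],
    "orders": [{"op": "P99", "column": "duration_ms", "order": "descending"}],
    "limit": 10,
}
```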
Honeycomb promises faster resolution of incidents. Can you describe how the integration of logs, metrics, and traces into a unified data type aids in quicker debugging and problem resolution?
Everything is connected. You don’t have to guess. Instead of eyeballing that this dashboard looks like it’s the same shape as that dashboard, or guessing that this spike in your metrics must be the same as this spike in your logs based on timestamps… instead, the data is all connected. You don’t have to guess, you can just ask.
Data is made useful by context. The last generation of tooling worked by stripping away all of the context at write time; once you’ve discarded the context, you can never get it back again.
Also: with logs and metrics, you have to know what you’re looking for before you can find it. That’s not true of modern observability. You don’t have to know anything, or search for anything.
When you’re storing this rich contextual data, you can do things with it that feel like magic. We have a tool called BubbleUp, where you can draw a bubble around anything you think is weird or might be interesting, and we compute all the dimensions inside the bubble vs outside the bubble, the baseline, and sort and diff them. So you’re like “this bubble is weird” and we immediately tell you, “it’s different in xyz ways”. SO much of debugging boils down to “here’s a thing I care about, but why do I care about it?” When you can immediately identify that it’s different because these requests are coming from Android devices, with this particular build ID, using this language pack, in this region, with this app id, with a large payload … by now you probably know exactly what’s wrong and why.
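For intuition, here is a rough sketch of that inside-versus-outside comparison, assuming events are plain dictionaries. This is not Honeycomb’s actual BubbleUp implementation, just the shape of the computation: for each dimension, compare how its values are distributed inside the selection versus in the baseline, then rank the dimensions that differ most.

```python
# Rough sketch of the BubbleUp idea (illustrative, not the real implementation):
# for every dimension, diff the value distribution inside the "bubble" against
# the baseline, then rank dimensions by how different they look.
from collections import Counter

def dimension_diffs(inside: list[dict], baseline: list[dict]) -> list[tuple[str, float]]:
    dims = {key for event in inside + baseline for key in event}
    scores = []
    for dim in dims:
        inside_freq = Counter(e.get(dim) for e in inside)
        base_freq = Counter(e.get(dim) for e in baseline)
        values = set(inside_freq) | set(base_freq)
        # Total variation distance between the two value distributions.
        diff = 0.5 * sum(
            abs(inside_freq[v] / max(len(inside), 1) - base_freq[v] / max(len(baseline), 1))
            for v in values
        )
        scores.append((dim, diff))
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Usage: the top-ranked dimensions are the "it's different in xyz ways" answer.
# dimension_diffs(weird_events, everything_else)[:5]
```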
It’s not just about the unified data, either — although that is a huge part of it. It’s also about how effortlessly we handle high cardinality data, like unique IDs, shopping cart IDs, app IDs, first/last names, etc. The last generation of tooling cannot handle rich data like that, which is kind of unbelievable when you think about it, because rich, high cardinality data is the most valuable and identifying data of all.
How does improving observability translate into better business outcomes?
This is one of the other big shifts from the past generation to the new generation of observability tooling. In the past, systems, application, and business data were all siloed away from one another into different tools. This is absurd — every interesting question you want to ask about modern systems has elements of all three.
Observability isn’t just about bugs, or downtime, or outages. It’s about ensuring that we’re working on the right things, that our users are having a great experience, that we’re achieving the business outcomes we’re aiming for. It’s about building value, not just operating. If you can’t see where you’re going, you’re not able to move very swiftly and you can’t course correct very fast. The more visibility you have into what your users are doing with your code, the better and stronger an engineer you can be.
Where do you see the future of observability heading, especially concerning AI developments?
Observability is increasingly about enabling teams to hook up tight, fast feedback loops, so that they can develop swiftly, with confidence, in production, and waste less time and energy.
It’s about connecting the dots between business outcomes and technological methods.
And it’s about ensuring that we understand the software we’re putting out into the world. As software and systems get ever more complex, and especially as AI is increasingly in the mix, it’s more important than ever that we hold ourselves accountable to a human standard of understanding and manageability.
From an observability perspective, we’re going to see increasing levels of sophistication in the data pipeline — using machine learning and sophisticated sampling techniques to balance value vs cost, keeping as much detail as possible about outlier and important events and storing summaries of the rest as cheaply as possible.
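As a hedged illustration of that value-versus-cost trade-off, here is a tiny sketch of the idea: keep every interesting event, sample the routine ones, and record the sample rate so counts can be re-weighted later. Real pipelines (tail samplers, dynamic samplers, and so on) are considerably more sophisticated, and the field names and thresholds below are assumptions.

```python
# Tiny, illustrative sketch of value-vs-cost sampling: keep every outlier or
# error event, keep 1-in-N of the routine ones, and record the sample rate
# so aggregate counts can be re-weighted at query time.
import random

def should_keep(event: dict, routine_rate: int = 100) -> tuple[bool, int]:
    interesting = (
        event.get("error") is not None
        or event.get("duration_ms", 0) > 1000          # slow outlier (threshold is arbitrary)
        or event.get("http.status_code", 200) >= 500
    )
    if interesting:
        return True, 1                                 # kept with sample rate 1
    return random.random() < 1.0 / routine_rate, routine_rate
```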
AI vendors are making a lot of overheated claims about how they’ll understand your software better than you can, or how they’ll process the data and tell your humans what actions to take. From everything I have seen, this is an expensive pipe dream. False positives are incredibly costly. There is no substitute for understanding your systems and your data. AI can help your engineers with this! But it cannot replace your engineers.