shcallaway's comments | Hacker News

I completely agree w/ your points about why observability sucks:

- Too much setup
- Too much maintenance
- Too steep of a learning curve

This isn't the whole picture, but it's a huge part of the picture. IMO, observability shouldn't be so complex that it warrants specialized experience; it should be something that any junior product engineer can do on their own.

> I can definitely imagine having Claude debug an issue faster than I can type and click around dashboards and query UIs. That sounds fun.

Working on it :)


Vibe-coders don't comprehend how the code works, yet they can create impressive apps that are completely functional.

I don't see why the same can't be true for "vibe-fixers" and their data (telemetry).


The distinction is between originality and replicating something that already exists.

I believe the author is in the former camp.


You're so right.

> We'll have solved the problem when AI detects the problem and submits the bug fix before the engineers wake up.

Working on it :)


You're not the first person I've met who has articulated an idea like this. It sounds amazing. Do you have a sense of why this approach isn't more broadly popular?


Cost and compliance are non-trivial for non-trivial applications. Universal instrumentation and recording creates a meaningful fixed cost for every transaction, and you must record ~every transaction; you can't sample and retain post-hoc. If you're processing many thousands of TPS on many thousands of nodes, that quickly adds up to a very significant aggregate cost, even if the individual cost is small.
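
To make that concrete, here's a rough back-of-envelope sketch in TypeScript (every constant is a made-up assumption, not a measurement):

    // Hypothetical always-on tracing cost; all constants are assumptions.
    const tps = 50_000;             // fleet-wide transactions per second (assumed)
    const spansPerTxn = 20;         // spans emitted per transaction (assumed)
    const bytesPerSpan = 500;       // average encoded span size in bytes (assumed)
    const secondsPerMonth = 30 * 24 * 3600;

    const bytesPerMonth = tps * spansPerTxn * bytesPerSpan * secondsPerMonth;
    const dollarsPerMonth = (bytesPerMonth / 1e9) * 0.10; // hypothetical $0.10/GB

    // Prints roughly "1296 TB/month, ~$129,600/month"
    console.log(
      `${(bytesPerMonth / 1e12).toFixed(0)} TB/month, ` +
      `~$${Math.round(dollarsPerMonth).toLocaleString()}/month`
    );

The per-transaction cost there is a tiny fraction of a cent, but the aggregate is six figures a month.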

For compliance (or contractual agreements), there are limitations on data collection, retention, transfer, and access. I certainly don't want private keys, credentials, or payment instruments inadvertently retained. I don't want confidential material distributed out of band or in an uncontrolled manner (like on your dev laptop). I probably don't even want employees to be able to _see_ "customer data." Which runs headlong into a bunch of challenges, because low-level trace/sampling/profiling tools have more or less open access to record and retain arbitrary bytes.
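
For illustration, this is the kind of scrubbing pass "you" end up needing before anything is retained. A minimal sketch; the key names and patterns are made-up assumptions, and a real policy needs far more than a deny-list:

    // Minimal sketch of a pre-retention scrubber for span/log attributes.
    // Key names and regexes are illustrative assumptions, not a real policy.
    const DENY_KEYS = new Set(["authorization", "password", "api_key", "card_number"]);
    const SECRET_PATTERNS = [
      /-----BEGIN [A-Z ]*PRIVATE KEY-----/, // PEM-encoded private keys
      /\b(?:\d[ -]?){13,16}\b/,             // naive PAN-like digit runs
    ];

    function scrub(attrs: Record<string, string>): Record<string, string> {
      const out: Record<string, string> = {};
      for (const [key, value] of Object.entries(attrs)) {
        const denied = DENY_KEYS.has(key.toLowerCase());
        const secret = SECRET_PATTERNS.some((p) => p.test(value));
        out[key] = denied || secret ? "[REDACTED]" : value;
      }
      return out;
    }

And even this only works at a structured choke point; the low-level tools above capture arbitrary bytes with no such seam.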

Edit: I'm a big fan of continuous and pervasive observability and tracing data. Enable and retain that at ~debug level and filter + join post-hoc as needed. My skepticism above is about continuous profiling and recording (a la vtune/perf/eBPF), which is where "you" need to be cognizant of risks and costs.


These are great. I should have included them in my timeline!

Huge fan of historical artifacts like Cantrill's ACM paper.


This is a super insightful comment & there's a bunch I want to respond to, but I can't do it all neatly in one reply. Hahaha

I'll choose this point:

> reliability is still ultimately an incentive problem

This is a fascinating argument and it feels true.

Think about it. Why do companies give a shit about reliability at all? They only care b/c it impacts the bottom line. If the app is "reliable enough" that customers aren't complaining and churning, it makes sense that the company would not invest further in reliability.

This same logic holds at every level of the organization, but the signal gets weaker as you go down the chain. A department cares about reliability b/c it impacts the bottom line of the org, but that signal (revenue) is not directly attributable to the department. This is even more true for a team, or for an individual.

I think SLOs are, to some extent, a mechanism that is designed to mitigate this problem; they serve as stronger incentive signals for departments and teams.
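
To illustrate with made-up numbers, the error-budget math behind an SLO turns that weak revenue signal into something a single team can own:

    // Hypothetical SLO + error-budget math; the target and counts are made up.
    const slo = { name: "checkout availability", target: 0.999, windowDays: 30 };

    const totalRequests = 10_000_000; // requests in the window (assumed)
    const failedRequests = 6_500;     // failures in the window (assumed)

    const errorBudget = totalRequests * (1 - slo.target); // ~10,000 allowed failures
    const budgetBurned = failedRequests / errorBudget;    // ~0.65

    // A team can act on "65% of budget burned" without anyone having to
    // trace the impact all the way up to company revenue.
    console.log(`${(budgetBurned * 100).toFixed(0)}% of error budget burned`);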


I'd +1 incentives (primarily P&L, revenue, customer acquisition, and retention), with a small carve-out for "culture." I've worked places, and for people, where the culture was to "do the right thing" or to treat user experience as the objective, which influenced decisions like paying more (time and money) for better support. For the SDEs and line teams it wasn't about revenue or someone yelling at them; they just emulated the behavior they saw around them, which led to better observability, introspection, reliability, and support. Which, of course, we'd like to believe leads to long-term success and $$$$.

I also like the call-out of SLOs (or OKRs or SMART goals or whatever) as a mechanism to broadcast your priorities and improve visibility. BUT I've also worked places where they didn't work, because the ultimate owner with the VP title didn't care about or understand them enough to buy in.

And of course there's the hazard of principal-agent problems: those selling, buying, building, and running are probably different teams, with no meaningful overlap in directly responsible individuals.


Hello! Yes, you are right - observability and APM have both been around for many decades, but the incarnations that most people are familiar with are the ones that emerged in the 2010s.

My intention wasn't for this post to be a comprehensive historical record. That would have taken many more words & would have put everyone to sleep. My goal was to unpack and analyze _modern observability_ - the version that we are all dealing w/ today.

Good point though!


Wow, this is great. I have been using Inngest for over a year now and really like the APIs you guys have created for defining step functions / event handlers. I'm very glad to see that this API is now open-source so that it can be adopted more broadly!


I've been programming for 15 years now, 10 years professionally, and have worked at 5 startups. These are my reflections on the first decade of my career.

Also, an announcement: I’m starting something new!


Congrats LC team!

I don't have much experience w/ "vanilla" LangChain or the LC Python tools, but I've been an avid user of Deployments and the LangGraph TypeScript SDK for like a year now (started using Deployments back when it was still called "LangGraph Platform"). I think I might be one of the oldest users of both...

To be honest, my first impressions were mixed. The Deployments product was very early/new and rough around the edges (bad monorepo support, for example). And the LangGraph TypeScript SDK felt... not super TypeScript-y. (I get the sense they ported over a bunch of abstractions from the Python package).

But the benefits outweighed the costs. In September 2024, Deployments + the LangGraph TypeScript SDK was one of the only ways that you could say FOR SURE that your agent was going to (a) work and (b) run smoothly in production.

(Also, their team worked hard as hell to ramp me and my teammates up on agents. I was really won over by this and will be a LC stan forever as a result.)

Over the past year, I've seen both products evolve and mature significantly - to the point where all of those initial problems seem to have been addressed.

I still feel that the overhead associated with building and hosting your own agent from scratch is too much for most teams to take on, even at large companies. (In particular, first-class support for streaming is massive; this would be a huge PITA to build in-house.) I'm happy to let LangChain take care of all of this for me.
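
To give a flavor of what "build it in-house" means for streaming: even the barest server-side relay of token chunks is a pile of plumbing. A minimal SSE sketch, with a made-up generateTokens standing in for the agent run; a real version also needs auth, reconnection, heartbeats, and cancellation:

    import { createServer } from "node:http";

    // Hypothetical token source standing in for an agent/LLM run.
    async function* generateTokens(): AsyncGenerator<string> {
      for (const t of ["Hello", ", ", "world", "!"]) yield t;
    }

    // Minimal Server-Sent Events relay of streamed tokens.
    createServer(async (_req, res) => {
      res.writeHead(200, {
        "Content-Type": "text/event-stream",
        "Cache-Control": "no-cache",
        Connection: "keep-alive",
      });
      for await (const token of generateTokens()) {
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
      }
      res.write("event: done\ndata: {}\n\n");
      res.end();
    }).listen(3000);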

Overall, I've been very impressed at LC's ability to iterate and absorb feedback - especially when some of it is... not delivered in the kindest way. They're a very humble and hardworking team. Makes me happy to see them winning.

