
This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today. Yes, you can turn a stream of events into a table of the present state. However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

Kafka is a message broker. It’s not a database and it’s not close to being a database. This idea of stream-table duality is not nearly as profound or important as it seems at first.



Recently I watched a 50-engineer startup spend more than 50% of its engineering time over about two years coping with the consequences of using Kafka as their database, and eventually trying to migrate off of it. The whole time I was wondering "but how could anyone have started down this path?!?"

Apparently the primary reason they went out of business was sales-related, not purely technical, but if they hadn't used Kafka, they could have had 2x the feature velocity, or better yet 2x the runway, which might have let them survive and eventually thrive.

Imagine, thinking you want a message bus as your primary database.


I worked with a short-lived startup which made this exact mistake. I suggested persisting important events to a postgres db, but I was shot down over and over. They were positive it would be fine to hold everything in kafka. There was this notion that kafka functioned fine as a long term data store with the correct configuration (I'm not arguing that), but there was no solution to the lack of... Well, everything postgres could offer. As you mentioned, the lack of velocity really dragged them down. Customers were constantly upset that x broke or y wasn't delivered yet, and each time it was clear that kafka wasn't helping them meet those needs.

Kafka is awesome for what it does, but there are a ton of people out there using it for weird stuff.


Do you happen to know what sort of thinking leads a team down this path? It seems a fundamental mistake. Resume driven development?


First, you decide to build your data platform with Apache Kafka. Makes sense, as it's a very widely used, durable message bus that does the job well.

Then... you decide to attend a Kafka developer conference like Kafka Summit to sharpen your skills and learn best practices.

At that conference, you see the CEO of Confluent give a keynote speech about how Kafka is a transaction store; you 'turn your database inside out', run SQL queries on it, even run your BI dashboards on top of it! They make it seem like this is a best practice that all the smartest engineers are doing, so you better do it as well.

At that point you're invested in a path that will make Confluent millions in revenue consulting you on implementing their experimental, up-sell architecture on top of Kafka. Maybe their sales rep will suggest you buy a $100k/month subscription to Confluent Cloud to really patch up any holes and make it easier to maintain.

Moral of the story: do your best to separate the business side and the tech side of open source. Behind many open source projects, there's a company trying to make money on an 'enterprise' version.

Ethically, I prefer data companies like Snowflake that are at least transparent about trying to make money off their product. Rather than companies that use open source as a ladder to drive adoption, then try to bait-and-switch you into the same 7-figure deals with software that is more brittle and still requires you to develop your own solutions with it.


In this case the application was commissioned by the founder, and the people developing it were given errant criteria (and not given other valid criteria) which admittedly made Kafka sound like it would make sense.

The developers who adopted the commissioned application were way too inexperienced to know one way or another, and by the time I was asked to help out, no one had really learned anything about Kafka.

My suggestions to replace it with a basic queue were rejected, and my suggestions to store events in a database for easier retrieval were rejected. They were sure that learning Kafka better would fix everything instead.

Ironically there's not even that much to learn about it, in the scheme of things. It's like learning more about your car so you can make it fly one day. It's just not the right tool for the job.


Insufficient asking "why?"

1. Why do we want to use Kafka? [insert thinking] So that we can do x yet avoid y.

2. Why is the customer asking for these features? [insert listening] So that they can do z and think of themselves as w.

And then failure to remember and revisit that "why?".


In addition to what others have answered, I think many devs consistently underestimate the amount of heavy lifting that relational databases do.

And even worse a lot of this heavy lifting isn't obvious until you are fairly far into a project and suddenly realize that to solve a bug you need to implement transactions or referential integrity.


Not just Kafka, either. Mongo, Cassandra, Couchbase, Redis, Solr, and Elasticsearch are all mistaken for replacements for an RDBMS.


Cassandra can replace an RDBMS in a lot of cases, if eventual consistency works for your data.

It's not a drop in replacement though, that's for sure.


Better than the other suggestions, but my experience is that you run into Cassandra's limitations really quickly: you can't query on non-primary-key columns, and you can't join tables without pulling all the data down to the client and merging manually.


> can't query on non-primary columns

That's incorrect, unless I'm misunderstanding what you mean by primary columns. It's just not as efficient.

And missing joins are by design, and one of the reasons Cassandra is as fast as it is. And as I said before: it's not a drop-in replacement. You need to architect your application around its strengths to leverage its performance.

It is, however, usable as an RDBMS replacement if you know what you're doing and your data is fine with eventual consistency.

And knowing what Cassandra does with your data is important as well. It's actually a key-value store on steroids. Once you get that, its limitations and strengths become obvious.


> That's incorrect unless I'm misunderstanding what you mean with primary columns

I'm referring to "Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING"

> It's just not as efficient.

That's what I thought, too, so I "allowed filtering". And crashed the database. (Apparently the correct solution here is to use Spark).


Haha, yes, that's a real possibility.

Cassandra is at its heart a key-value store. For every queryable field, you need to create a new copy of the data. So you're basically saving the data n times, one copy for each way you wish to query it.

If you instead try to query on a field which hasn't been marked as queryable, the cluster will basically have to select everything in the table and do a for loop to filter for the 'where' clause :-)

But I haven't used it in production yet, so you've got more experience than I do.
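To make the "saving the data n times" point concrete, here's a minimal Python sketch with plain dicts standing in for Cassandra query tables (all names here are made up for illustration): one denormalized copy per access pattern, and anything outside those patterns degenerates into a full scan, which is roughly what ALLOW FILTERING does.

```python
# One "table" per query pattern: n query patterns => n copies of the data.
events = [
    {"user": "alice", "day": "2024-01-01", "action": "login"},
    {"user": "bob",   "day": "2024-01-01", "action": "login"},
    {"user": "alice", "day": "2024-01-02", "action": "purchase"},
]

by_user, by_day = {}, {}
for e in events:
    by_user.setdefault(e["user"], []).append(e)  # events_by_user
    by_day.setdefault(e["day"], []).append(e)    # events_by_day

# Fast: lookup by the key the table was built for.
print(len(by_user["alice"]))  # 2

# The ALLOW FILTERING equivalent: a full scan plus a for-loop filter.
logins = [e for e in events if e["action"] == "login"]
print(len(logins))  # 2
```

In real Cassandra you'd model this as one table per query (e.g. events_by_user and events_by_day) and write to all of them on ingest.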


My hypothesis: it’s hard to have a decent grasp of a technology without having actually used it. Tie that with “let’s not include too many moving parts” and it’s easy to end up in a situation where the edict is “Kafka”. Say you have never used an RDBMS, you've only used RethinkDB, and that turned out to be problematic for whatever reason; next project, the founder hires you on the premise that the system you build needs to scale to billions of requests per minute ASAP (even though currently there is zero traffic).


"Even though currently there is zero traffic" is exactly right. Haha. When this company finally did get customers, the thing they thought would help them manage thousands of high-volume customers ended up making it so they could barely retain a few very low-volume customers.


Yea, I agree. Having the endgame solution in place at the beginning is often a mistake and can actually foil your ability to get to the end game.


I'll admit, I only recognized that mistake because I've made it myself, over and over. It's hard to push code knowing it'll need improvements later, or knowing how scopes will change. I find myself repeating "perfect is the enemy of good" because I struggle to just let a solution be good enough.

It's tough to be consulted and watch people go against your advice like that, though.


Something I find annoying is that some cloud services (Firebase, I imagine) only provide the endgame solution for a datastore.


I think this is worth mentioning again: https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb


> this exact mistake

https://kafka-summit.org/sessions/source-truth-new-york-time...

Both your comment and the GP are using examples of failed startups as basis of your criticism. One mentions "50 engineers" and I'd bet not a single one had an "engineer" bone or training in their body. So I read that as 50 blog-driven programmers fail to execute a sophisticated system design they read about on some blog. Surprise?

Using Kafka as the distributed log of a deconstructed database is technically an order of magnitude more demanding than wiring MVC widgets, or using the latest paint by numbers framework, etc., because that is in fact developing a custom distributed database.

Technical leadership and experience are forgotten things but they do matter. Let's not blame the stack.


I said Kafka is great at what it's for - I'm not blaming it at all. I actually enjoyed working with it a lot, despite that it couldn't help my client very much. Fine tuning it to work better for them was probably one of the more fun projects I've worked on in the last year. It still wasn't the right tool for the job they were trying to do, though.

They could have replaced their brokers with simple worker queues and fed the raw data and results into a db. That was all they needed. The fear was that when they got MASSIVE, that architecture wouldn't scale for them. In reality it absolutely would, but like you said, the blogs told them otherwise.

Kafka is fine, using it for the wrong thing isn't.


If I had a dollar for every project that couldn't make a 1000 users happy because they were too worried about making a million happy.


No, you said it was a mistake to use Kafka as database. It's been pointed out that a key member of the Kafka trio, Jay Kreps, is on record that you can definitely use it as a source of truth. NYTimes example is also of a competent team that uses Kafka as a source of truth.


Fair enough, I should have been more explicit in saying that the team was using it in a weird way for their own purposes. Not only was the use case wrong for them, but the technology wasn't well understood by anyone either. The implementation was partially nonsense.

I'm not criticizing Kafka so much as the way I've seen it used in software. It's largely down to user error, and like I said, Kafka is good software. I had to learn a lot about it for that project and it was a lot of fun. It was my first deep dive into message brokers and I really loved it, so I mean no disrespect to the software or its maintainers.


For the record, I don't think any piece of software is beyond criticism, including Kafka. That said, I frankly have no idea why this sub-thread has meandered into the topic of Kafka the software! :)

You could replace Kafka with Pulsar (or DistributedLog for that matter) in my comments above and it would stand.

Is using a distributed log as the WAL component of a event-sourcing or distributed database entirely a bad idea? I am of the opinion that it is a viable but sophisticated approach.

Peace out.


They're not criticising Kafka, they're criticising misuse of it in use cases it's not designed for.


> Imagine, thinking you want a message bus as your primary database.

I've built (and sold) an entire company around this architecture, and am working on the next. It can be incredibly powerful, and is what you _have_ to do if you want to mitigate the significant downsides of a Lambda architecture. We successfully used it to power graph building, reporting [1], and data exports, all in real-time and at scale.

But, we didn't use Kafka for two key reasons (as evaluated back in 2014):

* We wanted vanilla files in cloud storage to be the source-of-truth for historical log segments.

This was really important, because it makes the data _accessible_. Applications that don't require streaming (Spark, Snowflake, etc) can trivially consume the log as a hierarchy of JSON/protobuf/etc files in cloud storage. Plus, who the heck wants to manage disk allocations these days?

* We wanted read/write IO isolation of logs.

This architecture is only useful if new applications can fearlessly back-fill at scale without putting your incoming writes at risk. Cloud storage is a great fit here: brokers keep an index and re-direct streaming readers to the next chunks of data, without moving any bytes themselves. Cloud read IOPs also scale elastically with stored data.

Operational concerns were certainly a consideration: we were building in Go, targeting CoreOS and then Kubernetes, had no interest in Zookeeper, wanted broker instances to be ephemeral & disposable (cattle, not pets), etc, but those two requirements are pretty tough to bolt on after the fact.

The result, Gazette, is now OSS (https://gazette.dev) and has blossomed to provide a bunch of other benefits, but serves those core needs really well.

[1] https://github.com/liveramp/factable


I worked on a project recently which had Kafka in it. Our first order of business was to completely remove it. I would bet good money that many projects on Kafka would benefit tremendously from simply removing it. It's wildly unnecessary in 7/10 situations I see it employed.


What was it replaced with?


Not the guy who posted above, but I think I have relevant experience. After going through different options for message queuing, we decided that using S3 and object storage would be our best option, for many reasons. Some of those reasons:

- durable

- cheap

- scalable

- we do not need to read messages in a strict order

- using S3's API is super easy

- no operations overhead

- no compute resources (for the queue part)

After doing it for some time, I think I was drinking too much Kool-Aid with streaming services like Kinesis and Kafka, and failed to see how easy it is to implement a simple analytics event service using purely S3.
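For what it's worth, here's a rough Python sketch of the pattern as I understand it. An in-memory dict stands in for the bucket; with real S3 you'd swap in boto3's put_object and list_objects_v2. Key layout and function names are illustrative, not anything standard.

```python
import json
from datetime import datetime, timezone

bucket = {}   # stands in for the S3 bucket
_seq = 0      # tiebreaker so keys stay unique and sorted

def put_event(payload: dict) -> str:
    """Write one event as its own object, keyed so a prefix listing
    returns events in rough time order (strict ordering not required)."""
    global _seq
    _seq += 1
    key = f"events/{datetime.now(timezone.utc).isoformat()}-{_seq:06d}.json"
    bucket[key] = json.dumps(payload)  # real S3: put_object(Bucket=..., Key=key, Body=...)
    return key

def read_events(prefix="events/"):
    """List-and-read; with real S3 this is list_objects_v2 + get_object."""
    for key in sorted(bucket):
        if key.startswith(prefix):
            yield json.loads(bucket[key])

put_event({"type": "signup", "user": "alice"})
put_event({"type": "click", "user": "bob"})
print([e["type"] for e in read_events()])  # ['signup', 'click']
```

The durability, scaling, and zero-ops points in the list above all come from S3 itself; the only application code is key naming and listing.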


I’m curious what makes Kafka get picked in the first place. Perhaps a vast majority of developers grow up using Kafka for everything in their learning projects, as it makes rapid development convenient? Maybe they learn about it through blogs, video tutorials and so on, at the cost of other technologies?

For instance, my nephew was heaping praises on Firebase and would use it for all his projects. It took me a while to understand that the convenience of using it masked some of the issues that would only get exposed at scale.

For whatever reason a vast number of developers are entering the workforce believing in Kafka’s capability to solve all problems. And going against that majority will be very tiring and even costly.


Similar story here. One startup's engineer explained to me how they used Kafka (0.8-ish) as their transactional database for financial transactions. When I explained to him that Kafka is (was) not transactional, he did not believe me.


I worked for a company that made the same mistake but with AWS Kinesis instead of Kafka.

So, much more costly mistake.


A quibble I have with the term "stream-table duality" is that it's not true duality.

You can construct a state (table) from a stream, but you cannot do the reverse. You cannot deconstruct a table into its original stream because the WAL / CDC information is lost -- you can only create a new stream from the table. This means you lose all ability to retrieve previous states by replaying the stream. Information is lost.

Duality in math is an overloaded term but it generally means you can go in either direction. This is not true here.
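A tiny Python illustration of the asymmetry: folding a stream into a table (last-write-wins, like log compaction) is a many-to-one projection, so the table alone cannot tell you which stream produced it.

```python
def fold(stream):
    """Project a stream of (key, value) events into a table:
    last write wins, like log compaction."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

# Two different histories...
stream_a = [("x", 1), ("x", 2), ("x", 3)]
stream_b = [("x", 3)]

# ...fold to the same table, so the intermediate states are unrecoverable.
print(fold(stream_a) == fold(stream_b))  # True
print(fold(stream_a))                    # {'x': 3}
```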


It depends on your data schema. If you store the data in a temporal form (every row has a transaction period), you can reconstruct a stream with ORDER BY LOWER(transaction_period).

One benefit of using the table as a source of truth this way is that you can run migrations on the entire “stream” if you will, and republish the events in order, in the new schema definition.

Another benefit is that you can run SQL queries against your entire transaction history. Some (imo sketchy) tools exist that try to give you a SQL interface to Kafka, but that’s just bad. It’s much easier to have your cake and eat it in postgres than it is in Kafka, imo. Unless your data volume is absurd, in which case Kafka is your only realistic choice.
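As a sketch of that idea (using SQLite for brevity rather than postgres, and with an illustrative single transaction-time column instead of a full period): with the time of each change kept on the row, republishing the stream is just an ordered scan over the history table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE account_history (
    account TEXT, balance INTEGER, tx_time TEXT)""")

# Every state change is kept as its own row with its transaction time.
db.executemany("INSERT INTO account_history VALUES (?, ?, ?)", [
    ("acct-1", 100, "2024-01-01T09:00:00"),
    ("acct-1", 250, "2024-01-02T09:00:00"),
    ("acct-1", 175, "2024-01-03T09:00:00"),
])

# The "stream": an ordered scan over the temporal table. This is also
# where you'd re-publish events in a new schema after a migration.
stream = db.execute(
    "SELECT account, balance FROM account_history ORDER BY tx_time"
).fetchall()
print(stream)  # [('acct-1', 100), ('acct-1', 250), ('acct-1', 175)]
```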

Of course, in a “literal” sense you’re correct, it’s not possible. But from a practical point of view, if your bitemporal data implementation is stable, you can totally depend on your ability to produce streams from a database. We do this at my current place of work all the time.


Just came here to post this comment. I feel like they must've known this when creating the terminology, so I can only imagine they called it "duality" because they wanted it to sound cool. But it's most definitely not duality and I would wager calling it that is actively harmful. Last thing you want is someone implementing a table under some misled impression that it can do everything a stream can!


This Confluent blog is a rehash of an earlier conceptual position statement by Jay Kreps, who is now the boss of the Confluent blogger.

https://engineering.linkedin.com/distributed-systems/log-wha...

Pointing out (GP and yourself) that this is an 'asymmetric duality' is informative and valid.


Thank you -- it's food for thought.

A symmetric dual is one where there is equivalence: the dual of the dual is the primal.

But it is possible to have a "weak" duality that is asymmetric, where the dual encompasses the primal -- where there are regions of equivalence instead of a 1-to-1 equivalence.

I'm not sure if stream-table duality fits this conceptual mold exactly, but thank you for providing a counterpoint to my assertion. I'm going to think on it.


If one computes in a bounded time domain, then it is symmetric: a table is a ‘time slice’ of a stream.


> so I can only imagine they called it "duality" because they wanted it to sound cool

Also, you believing it is the most profitable mindset, for them. Company blogs are marketing, first and foremost.


I read it as “there exists an event schema and some database table for which they can be co-constructed”, not “every database table can be turned into a stream of events.”

I don’t have a proof, but for things like state transitions, I’ve often solved this with a <new_state>_at column. Seeing that field populated would indicate a second event needs to be emitted. (Assumes no cycles in the state machine.)

Often that table is not the one you have, even if (perhaps especially if) you emitted events for every update — seems like some entropy-like force pushes the universe toward emitting details in events that aren’t reflected in the table records, over a long enough timeline.


Couldn’t they just make the stored data bitemporal?


SQL temporal tables?

Difference algebra over the deltas could find some interesting relevance.


I've heard Haskell programmers talk about Data and Codata, and came away with the loose sense that Data might be like state and Codata might be like a stream.

That one would so confidently assert that "Codata is the categorical dual of Data" has me thinking there might be something here?

Anyone want to tell me how far off I am?


Data and codata are dual, but that doesn't mean they are the inverse of each other. It means we can formalize them both in category theory and they have the same formalization with reversed arrows.

More usefully, data is a finite data type and codata is an infinite data type.

Some examples:

    sum = foldr (+) 0
    sum [0,1,2,3]
    > 6
    sum [0..]
    ...loops forever...

    sum' = scanl (+) 0
    sum' [0,1,2,3]
    > [0, 0, 1, 3, 6]
    sum' [0..]
    > [0, 0, 1, 3, 6, 10...
Haskell doesn't actually distinguish between finite and infinite data. Some functions work fine on infinite data, some functions loop forever.

Induction describes how to process data without diverging - process each input element in a finite time.

Co-induction describes how to process codata without diverging - produce each output element in a finite time.

This is useful if you want to prove things about infinite processes like operating systems.


> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

No. The notion of a "stream-table duality" is a powerful concept, that I've found can change the way any engineer thinks about how they are storing / retrieving data for the better (it's an old idea, rooted in Domain Driven Design, but for some reason a lot of engineers, myself included, still need to learn, or relearn, it).

The notion that a stream can serve as the primary data persistence abstraction or mechanism in production is the misleading part, at least for now. I'd argue Kafka pushes us in a direction that makes progress along that dimension, and you can apply it successfully with a lot of effort. But to match what you can get from a more traditional DBMS? The tech just isn't there (yet).


We've been relying on streams since the advent of write-ahead logs, but the complexity of streaming state transformations was re-formed into something that is easier to reason about. The most common form is a relational DB, which is generally users asking questions against snapshots of the WAL. An RDB is a highly opinionated layer of abstraction on top of data streams (in the form of inserts and updates).


Thanks, this is the take I was looking for.

I've definitely seen table-focused projects that really struggle because they need to be thinking in stream-focused terms. That's especially true in the age of mobile hardware. E.g., if you want to take your web-focused app and have it run on mobile devices with imperfect connectivity so people in the field can use it, table-only thinking is a pain. It's much easier to reconcile event streams than tables.

Sure you can't expect a stream-oriented tool to do everything an SQL database does. But the reverse is true as well. If we threw out every tool because people unthinkingly used it for the wrong job, we wouldn't have many tools left.


I always found it ironic that you get most of this for free if you design your sql updates and save/query the transaction log and/or history. A lot of relational dbs have functionality for that.

And if you don't want to use that, there's also products for this specifically, such as event store.


That’s true, but you have to interpret the state changes from your table/row schema. Why not just update the DB and then publish an event?


Because things can and will fail in between updating the db and publishing the event, leading to inconsistency between your database and Kafka. A better approach would be to update the db and then use something like Debezium or Maxwell to pull the updates from the db's binlog and publish them to Kafka.


> update the DB and then publish an event

It can make sense to do both, so as to decouple internal table structures from externally exposed events. Debezium supports you safely doing this via the transactional outbox pattern [1]. Events are inserts into an outbox table in the same database as the actual business data, both updated in one transaction. Debezium can then be used to capture the outbox events from the logs and send them to Kafka. To me this hits the sweet spot for many use cases, as you avoid the potential inconsistencies of the dual writes you mention, while still not exposing internal structures to external consumers (for other use cases, like updating a cache or search index, CDC-ing the actual tables may on the contrary be preferable).

Disclaimer: I work on Debezium

[1] https://debezium.io/documentation/reference/configuration/ou...
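A minimal sketch of the outbox pattern, using SQLite in place of a production database (table and column names are illustrative, not Debezium's): the business write and the outbox event commit in one transaction, and a relay such as Debezium would then ship the outbox rows to Kafka in commit order.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
db.execute("CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
           "aggregate TEXT, payload TEXT)")

with db:  # one transaction: both rows commit, or neither does
    db.execute("INSERT INTO orders (id, status) VALUES (1, 'placed')")
    db.execute("INSERT INTO outbox (aggregate, payload) VALUES (?, ?)",
               ("order", json.dumps({"order_id": 1, "status": "placed"})))

# The relay's view: events to publish, in commit order.
pending = db.execute("SELECT payload FROM outbox ORDER BY id").fetchall()
print(json.loads(pending[0][0])["status"])  # placed
```

Since both writes share one transaction, there is no window where the event exists without the business row or vice versa, which is the failure mode of writing to the DB and Kafka separately.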


Thanks for your work on Debezium! CDC to Kafka has been a game changer for our data pipelines; it's so much easier to make generalized tooling at this level. Our old pipeline requires either immutable rows with autoincrement primary keys, or last-modified dates with a secondary index on every table. It doesn't capture all the changes: for example, if a row is modified twice before the pipeline sees it, the first update won't be recorded in the history. It also doesn't propagate deletes. Debezium solves both of those issues, plus it doesn't require the application developers to do anything special for their data to get into the analytics systems. Now, instead of sounding like a broken record telling people to add last-modified dates to all their tables, I can sound like a broken record telling people to stop violating single source of truth by dual writing (which is a much better problem to have, as before I had given up on single source of truth even being possible in a large organization).


That's awesome, very happy to hear! Drop by any time on the mailing list or chat if you got any issues or questions to discuss around Debezium.


This is a neat pattern, but it honestly just seems like you’re trading where you want the complexity. The outbox pattern would force devs to not use SQL directly for mutations - in so much as they can’t directly modify records like they normally would. I can see this being ok if you had someone with strong DBA skills implementing the abstraction.


Devs modify records just as they normally would. Only, in addition, they produce a record in the outbox table, representing the change to the outside world. I.e. instead of writing to the DB and Kafka, both changes are done to the database, and then events from the outbox table are CDC-ed to Kafka.


We do this now and make publishing to Kafka a requirement of transaction success, rolling back the DB transaction if publishing fails. The dual write is much more straightforward in my opinion and the pattern works as long as the primary database supports transactions or some other form of consistency guarantee.

I still can’t get behind using CDC via a DB watching tool unless I have little control of the system writing to the DB. There is so much context lost interpreting the change at the DB (who was the user that made this change, etc.) — unless you are sending this data to the DB (then yikes your DB schema is much more complex).


> rolling back the DB transaction if publishing fails.

If your DB transaction fails after the Kafka message has been sent, you'll end up with inconsistencies between your local database state and what has been published externally via Kafka. That's exactly the kind of dual write issues the outbox pattern suggested in my other comment avoids.

> unless you are sending this data to the DB

You actually can do this, and still keep this metadata separately from your domain tables. The idea is to add this metadata into a separate table, keyed by transaction id, and then use downstream stream processing to enrich the actual change events with that metadata. I've blogged about an approach for this a while ago here: https://debezium.io/blog/2019/10/01/audit-logs-with-change-d....


We don’t use auto commits with Kafka so it’s not a problem for us. Of course, this does reduce our publishing throughput - though that’s not a problem for us at our current scale.

Either way, I’m intrigued by the outbox pattern. As long as it is transparent to developers, it does seem like an ideal CDC mechanism.


> I always found it ironic that you get most of this for free if you design your sql updates and save/query the transaction log and/or history

Being able to elegantly and efficiently do "AS OF" queries is still a hard problem, I don't see that KTables really solve it. Hate to say it but Oracle's solution to this is still unmatched. https://oracle-base.com/articles/10g/flashback-query-10g#fla...
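For illustration, here's roughly what an AS OF lookup looks like when you roll it yourself over a history table with validity intervals (the schema is made up; Oracle's Flashback gives you this without the manual bookkeeping, and doing it efficiently at scale is the hard part):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE price_history (
    item TEXT, price INTEGER, valid_from TEXT, valid_to TEXT)""")
db.executemany("INSERT INTO price_history VALUES (?, ?, ?, ?)", [
    ("widget", 10, "2024-01-01", "2024-02-01"),
    ("widget", 12, "2024-02-01", "9999-12-31"),  # current row, open-ended
])

def price_as_of(item, ts):
    """AS OF query: the row whose validity interval contains ts."""
    row = db.execute(
        "SELECT price FROM price_history "
        "WHERE item = ? AND valid_from <= ? AND ? < valid_to",
        (item, ts, ts)).fetchone()
    return row[0]

print(price_as_of("widget", "2024-01-15"))  # 10
print(price_as_of("widget", "2024-03-01"))  # 12
```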


I have recently analyzed Kafka as a message bus solution for our team. My take is that Kafka is a very mature and robust message system with clearly defined complexity, and thus, when used correctly, very dependable. The recent versions even have online rebalancing (but don't tell that to the Pulsar people). However, "Kafka Streams" just makes me shiver: I don't understand why someone would masquerade a project directly competing with other processing frameworks (e.g. Apache Flink or Spark) as being part of Kafka itself, when apart from utilizing Kafka's Producer and Consumer interfaces it really has nothing else in common with it. Streams is a giant, immature, complexity-hiding mess which, as a responsible systems designer, you have to run away from really, really fast.


The point of that is for people trying to figure out why streams are a useful abstraction at all: what's needed to make them useful is some sort of aggregation, and of course tabular state is a common end point.

The article does not recommend writing this code yourself, it shows how to aggregate data into usable forms.

So I think your concerns may be a bit overblown. If ksqlDB or Kafka Streams, the tools shown in this blog post, were at risk for what you warn about, this comment would be a valid criticism. But it's clear that the article isn't advocating for people to write their own versions of that...


Yes, they're definitely at risk. ksqlDB does not appear to have transactions at all.


The transaction boundary is the message/event itself (in the case of an event sourced system, which is what this would primarily be used with). Everything required to understand some event is contained within the payload. It is the atomic unit in this type of system.


I worked on a project that used CQRS and Event Sourcing years ago. It was an unmitigated disaster, never made it out of prototype phase.

By using Kafka (or anything else) as a "commit log" you've just resigned yourself to building the rest of the database around it. In a real RDBMS the commit log is just a small piece of maintaining consistency.

Every time I've worked on a project where we handle our own "projections" (database tables) we ended up mired deep in the consistency concerns normally handled by our database.

What's so hard? Compaction is an evil problem. We figured we could "handle it later". Well, it turns out maintaining an un-compactable commit log of every event, even for testing data, consumes many terabytes. This and "replays" sunk the project indirectly.

Replays. The ES crowd pretends this is easy. It's not. If you have 3 months of events to replay every time you make certain schema changes, you are screwed. Even if your application can handle 100X normal load, it's going to take you over a day to restore using a replay. This also happened to us. Every time an instance of anything goes down, it has to replay all the events it cares about. If you have an outage of more than a few instances, the thundering herd will take down your entire system no matter how much capacity you have. With a normal system, the DB always has the latest version of all the "projections" unless you're restoring from a backup, and then it only has to do a partial commit log replay, and only once.

Data duplication. Turns out "projecting" everything everywhere is way worse for memory and performance than having the DB that's handling your commit log also store it all in one place. Who knew.

Data corruption. If you get one bad event somewhere in your commit log you are cooked. This happens all the time from various bugs. We resorted to manually deleting these events, negating all the supposed advantages of an immutable commit log. We would have been fine if we had let the db handle the commit log events.

Questionable value of replays. You go through a ton of BS to get a system that can rewind to any point in time. When did we use this functionality? Never. We never found a use case. When we abandoned ES we added Hibernate auditing to some tables. It's relatively painless, transparent, and handled all of our use cases for replays.


> This notion of “stream-table duality” might be the most misleading, damaging idea floating around in software engineering today.

I disagree. Kafka is a log. That's half of a database. The other half is something that can checkpoint state. In analytics that's often a data warehouse. They are connected by the notion that the data warehouse is a view of the state of the log. This is one of a small number of fundamental concepts that come up again and again in data management.

Since the beginning of the year I've talked to something like a dozen customers of my company who use exactly this pattern to implement large and very successful analytic systems. Practical example: you screw up aggregation in the data warehouse due to a system failure. What to do? Wipe aggregates from the last day, reset topic offsets, and replay from Kafka. This works because the app was designed to use Kafka as a replayable, external log. In analytic applications, the "database" is bigger than any single component.

I agree there's a problem with the stream-table duality idea, but it's more that people don't understand the concept that's behind it. The same applies to transactions, another deceptively simple idea that underpins scalable systems. "What is a transaction and why are they useful?" has been my go-to interview question to test developers' technical knowledge for over 25 years. I'm still struck by how many people cannot answer it.


> However, during that process you will eventually confront every single hard problem that relational database management systems have faced for decades. You will more or less have to write a full-fledged DBMS in your application code. And you will probably not do a great job, and will end up with dirty reads, phantoms, and all the other symptoms of a buggy database.

It's the same as moving from a framework to a library. At first the framework seems great: it solves all your problems for you. But as soon as you want to do something slightly different from the way your framework wanted you to do it, you're stuffed. Much better to use libraries that offer the same functionality the framework had, but under your own control.

If you use the sledgehammer of a traditional RDBMS you are practically guaranteed to get either lost updates, deadlocks, or both. If you take the time to understand your dataflow and write the stream -> table transformation explicitly, you front-load more of the work, but you get to a database that works properly more quickly.

99% of webapps don't use database transactions properly. Indeed during the LAMP golden age the most popular database didn't even have them. The most well-designed systems I've seen built on traditional databases used them like Kafka - they didn't read-modify-write, they had append-only table structures and transformed one table to compute the next one. At which point what is the database infrastructure buying you?

If you're using a traditional database it's overwhelmingly likely that you're paying a significant performance and complexity cost for a bunch of functionality that you're not using. Given the limited state of true master-master HA in traditional databases, you're probably giving up a bunch of resilience for it as well. If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.


> If Kafka had come first, everyone would see relational SQL databases as overengineered, overly-prescriptive, overhyped. There are niches where they're appropriate, but they're not a good general-purpose solution.

Oh ho, there's some deep irony in calling a database overengineered when we're looking at kafka.

I'm running a copy of sqlite right here, and it's pretty nice; it does queries and it's portable, it even comes as a single .c file that I can drop directly into a project. I use the same API for my mobile app.

Where's the equivalent for kafka?

No? I have to install java, and run several independent processes (forget embedded), and configure a cluster to... um... do what? send some messages? If I want to do more than that I have to add my own self implemented logic to handle everything.

Different things you argue; but alas, that's the point; the point you made was that kafka would 'be it' if it had come first, but...since it cannot serve the purpose of a db in most contexts; you're just wrong.

It's a different thing, and it might be suitable for some things, but I have deployed and managed (and watched others manage) kafka clusters, and I'll tell you a secret:

Confluent sells managed kafka. They have a compelling product.

You know why? ...because maintaining a kafka cluster is hard, hard work, and since that's the product confluent sells, they're not about to fix it. Ever.

Now, if we're talking about event streaming vs. tables, there are some interesting thoughts about what a simple single-node embedded event-streaming system might look like (hint: absolutely nothing like kafka), and what an alternate history where that became popular might have been...

...but Kafka, this kafka, would never have filled that role, and even now it's highly dubious that it fills that role in most cases, even if we acknowledge that event streaming could fill it, were it implemented in a different system.


> Different things you argue; but alas, that's the point; the point you made was that kafka would 'be it' if it had come first, but...since it cannot serve the purpose of a db in most contexts; you're just wrong.

SQLite can't serve the purpose of a db in most contexts either; not only can it not be accessed concurrently from multiple hosts, it can't safely be accessed from different hosts at all. It can't even concurrently be accessed from multiple processes on the same host. If we both took a list of general-purpose database use cases, and raced to implement them via Kafka or via SQLite, I guarantee the Kafka person would finish a lot quicker.

> You know why? ...because maintaining a kafka cluster is hard, hard work, and since that's the product confluent sells, they're not about to fix it. Ever.

Maintaining a Kafka cluster sucks. So does maintaining a Galera cluster (I've done both), and I'll bet the same is true for any other proper HA (master-master) datastore.

Comparing Kafka against "SQL databases" is a category error, sure. But in the big use case that everyone talks about, that all the big-name SQL databases focus on - as a backend for a network service that runs continuously where you don't want a single point of failure - those big-name SQL databases are overengineered and underwhelming compared to Kafka, and in most other use cases an SQL database is overengineered compared to a (notional) similarly-designed event-sourcing system for use with the same sort of problems.


SQLite is quite literally the most popular piece of software in the world. I find it a bit ironic to say that it's not fit for most purposes.


JQuery is similarly popular. Does that mean it's fit for most purposes?


I have run several production web apps on sqlite, it’s fine unless you have enough traffic where you really need concurrent requests. If you do happen to get 2 requests exactly the same time one has to wait, this happens in “real” rdbms systems under lock conditions anyway. I think you’d be surprised how much traffic it can handle.

A huge advantage is backing up and restoring from backup with sqlite is trivially implemented with scp in a bash script. Ever tried to setup database restore from backup (that works) in AWS?

There are some real wins from using sqlite in prod.


> I have run several production web apps on sqlite, it’s fine unless you have enough traffic where you really need concurrent requests. If you do happen to get 2 requests exactly the same time one has to wait, this happens in “real” rdbms systems under lock conditions anyway. I think you’d be surprised how much traffic it can handle.

I've run apps that were written like that as well (though not by me). What you say is true, as far as it goes, but those apps have no business using an SQL database: they don't need ad-hoc queries (and usually don't offer the possibility of doing them), they don't need the SQL table model (and usually have to contort themselves to fit it), they don't need any but the crudest transactions... frankly in my experience all the developer needed was some way to blat structured data onto disk, and they reached for SQLite because it was familiar rather than because they'd carefully considered their options.

> A huge advantage is backing up and restoring from backup with sqlite is trivially implemented with scp in a bash script. Ever tried to setup database restore from backup (that works) in AWS?

You're not comparing like with like though: if you're willing to stop all your server processes and shut down your database to do the backup and restore (which is what you have to do with sqlite, you just don't notice because it's something you do all the time anyway) then it's easy to back up practically any datastore.


It seems funny to compare Kafka to SQL for an OLTP use case. The main benefits of SQL are indexes and transactions.

If you don't need them you might as well use a filesystem directly, but if you do need them, there is not much alternative :-)


Filesystems are a pain in various ways. The only decent HA option is AFS and it has performance issues and general awkwardness. In a lot of programming languages the filesystem APIs are surprisingly bad. And it's still really easy to have lost updates etc.


If you care about lost updates you should use transactions, or at least an API that supports optimistic concurrency control.

POSIX filesystem APIs don't offer much for concurrency control other than locks.

In my line of work we use HDFS or AWS S3 as the "filesystem", so we are not using the OS file access API at all, and most files are append-only, written by a single server.


I don't understand - you suggested using the filesystem, but filesystems don't have good concurrency control. S3's concurrency handling is even worse than filesystems. Something like Kafka is a much better way to do a stream of commands that need to be transformed to eventually produce some kind of result (which is really almost all workloads).


No, this was a misunderstanding. I said that if you need concurrency control/transactions and indexes, you should use a real database, not a filesystem, since that's the whole reason databases exist.

But if you don't need them, there are plenty of good distributed filesystems that will not eat your data.


> No, this was a misunderstanding. I said that if you need concurrency control/transactions and indexes, you should use a real database, not a filesystem, since that's the whole reason databases exist.

Just because something is designed for x doesn't mean it's any good at x. Kafka is also designed for handling concurrent events, and it's significantly better at it.

> But if you don’t need it there are plenty of good distributed file system that will not eat your data.

There really aren't any distributed filesystems that are nice to work with. S3 isn't a full filesystem and your writes take a while before they're visible in your reads, HDFS isn't a full filesystem and can be slow to failover, AFS has generally weird performance and locks up every so often. If your needs are simple then anything will work, including distributed filesystems, but it's very rare that a distributed filesystem is an actively good choice, IME.


I disagree pretty strongly with this. Things like indexes, uniqueness constraints, foreign keys, basically data schemas, are hugely useful. There was a big movement for a while in web dev to use Mongo instead of "the sledgehammer of an RDBMS"; people learned over time that actually they wanted a database, because you run into so many issues with querying and data consistency, or maybe you want to do a join. All of these things become huge problems without a database. An eventually consistent log of data is not a replacement for a database. Can you imagine designing a simple CRUD app with a couple of tables and different products, but now doing it with Kafka as your backing store? You've added so much more complexity to your application because you didn't want to use a database.


> Things like indexes, uniqueness constraints, foreign keys, basically data schemas, are hugely useful. There was a big movement for a while in web dev to use Mongo instead of "the sledgehammer of an RDBMS"; people learned over time that actually they wanted a database, because you run into so many issues with querying and data consistency, or maybe you want to do a join.

Schemas are a good idea, but SQL is bad at expressing them. Uniqueness constraints and foreign keys seem like a good idea until you think about how you want to handle them; often a stream transformation gives you better options. Unindexed joins and indexed joins are very different, and it's much easier to understand the difference in a Kafka-like environment than in an SQL database.

> Can you imagine designing a simple CRUD app with a couple of tables and different products, but now doing it with Kafka as your backing store?

Yes, that's exactly what I'm advocating. It's clearer and simpler and avoids a bunch of pitfalls - no lost updates, no deadlocks, your datastore naturally has decent failover.

> You’ve added so much more complexity to your application because you didn’t want to use a database.

On the contrary, if you use a database you add a huge amount of complexity that you don't want or need. You have to manage transactions even though you can't actually gain any of their benefits (if two users try to edit the same item they're going to race each other and one's update is going to get lost). You have to flatten your values into 2D tables and define mappings with join tables etc., rather than just serialising and deserialising in a standard way. Bringing in new indices in parallel is flaky, as is any change to data format. And getting proper failover set up is an absolute nightmare, to the point that a lot of people will set up failover for the webserver but leave the database server as a single point of failure because it's too hard to do anything else.


When you want to support random writes and random reads you don't need any "query" language. You need a dumb, performant key-value store which is optimized for these and is easy to distribute across multiple machines.

Querying can be ad hoc, i.e. you don't know what your query will be, and in that case you want to offload your data to an OLAP system, which will perform about 1 or 2 orders of magnitude better than your typical row-oriented MySQL database. Moving the data out of your primary database will prevent deadlocks and shooting yourself in the foot.

If you know the query beforehand, you simply compute and update a materialized index with the results of your query.

Building a system in such a decoupled way is conceptually and implementationally much simpler than your typical overcomplicated SQL database with unpredictable execution plans. These databases are good only if 1. you don't know what you are doing, or 2. you need some quick and dirty system to work without thinking, and you are fine with it never scaling up. Many small CRUD ecommerce apps are like that, and these databases get used because that's what your typical medium.com article recommends as best practice.
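A minimal sketch of that "known query, maintained result" idea, with made-up event fields: because the query (revenue per product) is known up front, each event updates the materialized answer directly instead of being scanned later.

```python
from collections import defaultdict

# Materialized result for a query known in advance: revenue per product.
revenue_by_product = defaultdict(float)

def apply(event):
    """Fold one purchase event into the precomputed answer."""
    revenue_by_product[event["product"]] += event["amount"]

for e in [
    {"product": "widget", "amount": 10.0},
    {"product": "gadget", "amount": 5.0},
    {"product": "widget", "amount": 2.5},
]:
    apply(e)
```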


While I agree with all you said, materialized indexes are not enough; you also need transactions that span more than one entity.

You can't simply put every entity you might need to update atomically in the same key/document.


As long as you have a clear dataflow you don't need transactions, you just need to make sure there are no loops in your dependencies. And you have to do that with transactions as well, otherwise they'll deadlock - the difference is that if you make a mistake you get to find out in production rather than when designing.


I would be very curious to know how someone could design an airline computer reservation system this way.

I agree that when using a SQL database that uses locks instead of snapshot isolation, you need to be careful about the order in which you scan rows and join tables to reduce the frequency of deadlocks.

My experience using PostgreSQL is that deadlocks are extremely rare, and when they happen they are automatically detected by the database; the client simply aborts and retries the transaction.


> I would be very curious to know how someone could design an airline computer reservation system this way.

You make each operation an event, you make all your logic into stream transformations, and you make sure the data only ever flows through in one direction (and you have to either avoid "diamonds" in your dataflow, or make sure that the consequences of a single event always commute with each other).

So at its simplest, someone trying to make a reservation would commit a "reservation request" event and then listen for a "reservation confirmed" event (technically async, but it's fast enough that you can handle e.g. web requests this way - indeed even with a traditional database people recommend using async drivers), and on the other end there's a stream transformation that turns reservation requests (and plane scheduling events and so on) into reservation confirmations/denials. Since events in (the same partition of) a single stream are guaranteed to have a well-defined order, if there are two reservation requests for the last seat on a given plane then one of them will be confirmed and the other will be rejected because the plane is full.
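A toy sketch of that processor (hypothetical event shapes, not a real Kafka Streams API): within one partition the requests have a total order, so awarding the last seat is a plain fold, with no locks or transactions.

```python
def process(requests, capacity):
    """Turn an ordered stream of reservation requests into
    confirmation/denial events; the partition order makes this race-free."""
    taken = 0
    out = []
    for req in requests:
        if taken < capacity:
            taken += 1
            out.append({"type": "ReservationConfirmed", "id": req["id"]})
        else:
            out.append({"type": "ReservationDenied", "id": req["id"]})
    return out

# Two requests race for the last seat; partition order decides the winner.
events = process([{"id": "r1"}, {"id": "r2"}], capacity=1)
```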

> My experience using PostgreSQL is that deadlocks are extremely rare, and when they happen they are automatically detected by the database; the client simply aborts and retries the transaction.

That's no good. If you do that then you haven't solved the problem and you're guaranteed to get a full lockup as your traffic scales up. You have to actually understand which operations have caused the deadlock and rearrange your dataflow to avoid them. (Not to mention that if it's possible for the client to "simply abort and retry" then you've already proven that you didn't really need traditional database-style synchronous transactions).


100% agree, and that's why Google Spanner, CockroachDB and all the NewSQL systems exist.

People found that they missed the abstraction that SQL and transactions offer, as it simplifies your code a lot. They simply needed something that scales better than doing manual sharding on top of MySQL :-)


It's worth noting that most managed Kafka solutions offer a service which will transform a stream into a database table so you can query it later. We've also built our own tooling to do it. We read from the Kafka stream directly when we're building a pipeline between microservices; for almost everything else we query the database. It works pretty well and we haven't found ourselves wishing we could write pseudo-queries on streams yet.


At the moment you more or less need to write a DBMS in your app code, but I don't think that's the end state. I think we're seeing the beginnings of something big - it just might not seem like it yet because it's the v1, nowhere-near-complete version. I think having all your data in a single system (Kafka, ksqlDB, ...) that allows you to work with it in cross-paradigm ways will turn out to be very compelling.


> At the moment you more or less need to write a DBMS in your app code

Since we're discussing misunderstandings in the community, it should be pointed out that a Database Management System (DBMS) is not merely a database, to say nothing of a data store. Oracle, Postgres, et al. are genuine "DBMS". ATM you are very likely putting together a data store in your app code.


You're conflating DBMS with RDBMS.

I interpreted the OP as DBMS=database, which absolutely includes application code that stores and retrieves data in proprietary formats.

Even a linked list mmap'd to disk can be a database, just maybe not a very good one.


An RDBMS is simply a Relational Database Management System. A quick tour of CS history reveals ancient curiosities such as Hierarchical Database Management Systems. And there is more:

https://www.studytonight.com/dbms/database-model.php


There are many kinds of databases: graph, key/value, relational, hierarchical, etc. That doesn't change the fact that any app that writes code to store and retrieve data is creating its own database or using someone else's.


There are many kinds of data models. A DBMS is a DBMS, and typically a specific DBMS supports a specific data model. A log-structured file is at most a data store.

To be fair, this is a somewhat fuzzy categorization (DBMS, DB, data store) and it can cause confusion. A DBMS is a system, just like it says right on the acronym tin: a system that provides capabilities such as DDL, DML, auxiliary processes, etc. to manage a database. There are various data models, e.g. a triple store, for databases, but a data model X is orthogonal to an XDBMS.


Meteor did some pretty interesting work with this by tailing MongoDB's oplog. Definitely still room for innovation.


>You will more or less have to write a full-fledged DBMS in your application code.

Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.). Depending on your domain, this could be a large amount of your data, or not. Just like anything, Kafka and Kafka Streams are tools that can be used and misused.

There is perhaps an exciting future where something like ksqlDB makes even highly relational data sit well in Kafka, but we're definitely not there yet, and I'm not terribly confident about getting there.


> Yes, but only if your data is highly relational, and needs to be operated on in that way (e.g. transactionally across records, needing joins to do anything useful, querying by secondary indices, etc.).

Ah, you mean that if the data is in a normalized form. Relational data means, basically, data in tabular form. Tabular data can be normalized or not.

> Depending on your domain, this could be a large amount of your data, or not.

Assuming you are talking about data in normalized form, that's orthogonal to the domain. That's in fact the major breakthrough of relational databases–that data across domains can be modelled as relational and normalized.

Now, in the relatively rarer cases that you don't care about normalizing data, a stream can be useful. But in the vast majority of cases IMHO, we should be looking at a proper relational database like Postgres before thinking about streams.


I agree that _normalized_ is a more precise term for what I'm describing, I only used relational because contemporary discourse seems to favor that term in these kinds of discussions and seems to imply a high degree of normalization.

>Assuming you are talking about data in normalized form, that's orthogonal to the domain.

I don't agree with this though. Nothing about a software architecture is _ever_ orthogonal to the domain.


I work on a legacy CouchDB application and the same problem exists in that area. CouchDB is actually really amazing and is wonderful for some types of applications. It really sucks if you want a relational DB.

Keep in mind that I don't know anything about Kafka, but it seems like the same kind of thinking. You can think of your data as a log of events. Since you can always rebuild the current state from the log of events, you might as well just save the log. This works incredibly well when you treat your data as immutable and need to be able to see all of your history, or you want to be able to revert data to previous states. It also solves all sorts of problems when you want to do things in a distributed fashion.
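That "rebuild the current state from the log" idea is just a fold over the event history; a toy sketch (made-up event shapes):

```python
from functools import reduce

# Toy append-only log: the balance is derived from history, never stored.
log = [
    {"op": "deposit", "amount": 100},
    {"op": "withdraw", "amount": 30},
    {"op": "deposit", "amount": 5},
]

def apply(balance, event):
    delta = event["amount"] if event["op"] == "deposit" else -event["amount"]
    return balance + delta

balance = reduce(apply, log, 0)           # full replay -> current state
as_of_second = reduce(apply, log[:2], 0)  # replay a prefix -> past state
```

Replaying a prefix of the log is exactly the "revert to a previous state" property the comment describes.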

The application which fits this model well, and that almost every developer is familiar with, is source control (and specifically git). If that is the kind of thing you want to do with your data, then it can be truly wonderful. If it's not, then it will be the lodestone that drags you underwater to your death.

The legacy application I work on is a sales system and actually it's not a bad area for this kind of data. You enter data. Modifications are essentially diffs. You can go back and forth through the DB to see how things have changed, when they were changed and often why they were changed. In many ways it is ideal for accounting data.

But as you say, if you want a full-fledged DBMS, this should not be your primary data store. It can act as the frontend for the data store - with an eventually consistent DB that you use for querying (and you really have to be OK with that "eventually consistent" thing). But you don't want to build your entire app around it.


> Kafka is a message broker. It’s not a database and it’s not close to being a database.

Kafka is, in some sense, a database, even if one specialized for logs of events. For a similar discussion, I would refer to Jim Gray's "Queues Are Databases": https://arxiv.org/pdf/cs/0701158.pdf.


I fully agree that this "stream-table duality" is misleading or at least quite vague.

The diagram juxtaposing a chess board (the table) and a sequence of moves (the stream) seems enlightening but actually hides several issues.

* An initial state is required to turn a stream into a table.

* A set of rules is required to reject any move which would break the integrity constraints.

* In the reverse direction, a table value cannot be turned into a stream. We need a history of table values for that.

Hence, the duality, or more accurately the correspondence, is not between stream and table but between constrained-stream and table history.

Even if these issues are not made explicit, the Confluent articles and the Kafka semantics are correct.

* "A table provides mutable data". This highlights that stream-table-duality is relative to a sequence of table states.

* "Another event with key bob arrives" is translated into an update ... to enforce an implicit uniqueness constraint on the key for the tables.
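The update semantics described above amount to last-write-wins per key; a sketch of compacting a keyed stream into a table:

```python
def to_table(stream):
    """Compact a keyed stream into a table: a later event for the same
    key overwrites the earlier one (the implicit uniqueness constraint)."""
    table = {}
    for key, value in stream:
        table[key] = value
    return table

table = to_table([("alice", 1), ("bob", 2), ("bob", 5)])
```

Note the asymmetry the comment points out: the table alone cannot recover the overwritten ("bob", 2); recovering the stream requires the history of table values.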


Reminds me of that time when database vendors were overreaching to be message queues/brokers - OracleAQ, MSMQ, etc.


That's why ksqlDB exists and handles all that for you, turning streams into tables that you can query.


ksqlDB does not solve any of the hard consistency or contention problems you will face if you attempt to use Kafka as a datastore. Consider the simplest possible example: you write an "update event" to a topic and then read a ksqlDB view of that topic. The view may or may not yet reflect the update. Read-after-write consistency is not guaranteed here, and you will need to build it yourself in your application code.
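A sketch of the kind of shim applications end up writing (the view API here is hypothetical, not real ksqlDB calls): after producing the update, poll the view until it has consumed your write's offset.

```python
import time

def wait_for_offset(view, produced_offset, timeout=5.0, interval=0.01):
    """Block until the view has consumed our own write -- hand-rolled
    read-after-write consistency on top of an async pipeline."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if view.consumed_offset() >= produced_offset:
            return True
        time.sleep(interval)
    return False

class FakeView:
    """Stand-in for a materialized view that lags behind the topic."""
    def __init__(self):
        self._offset = -1
    def consumed_offset(self):
        self._offset += 1  # pretend the view catches up over time
        return self._offset

ok = wait_for_offset(FakeView(), produced_offset=3)
```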


It's not a real database and doesn't promise ACID though. I think it's fine with the understanding that it's an incremental eventually-consistent materialized view engine that works seamlessly with Kafka, especially if you're already a heavy user.

Otherwise, loading data into a real database/data warehouse and using ETL with triggers/views/aggregations is better if you need advanced querying rather than simple stream-to-table transforms.


The same thing happens if you write to an RDBMS master and then read from a read-only replica.


Allow me to submit that this obsession is... Kafka-esque? :D



Not super familiar with Kafka, but behind the scenes some RDBMSes are log-replay engines. Are there technical reasons not to slap an RDBMS frontend on top of a Kafka log and avoid dealing with raw streams in application code?

Edit: From the sibling comments, the answer is https://ksqldb.io.


The technical reason is that the inherently distributed and shared nature of Kafka setups makes writing an ACID-compliant frontend (basically?) impossible.

If your application assumes ACID compliance, it will fail. Also, I haven’t looked into this one as directly, but I imagine that query latency compared to an RDBMS would be quite poor.

IMO, most people reach for Kafka too soon, when an RDBMS with a proper schema, including transaction-time and valid-time history, would better suit the domain. It's possible to write very efficient and ergonomic "append-only", temporal data schemas in Postgres.


But there are a lot of use cases where ACID is not needed and we can do with eventually consistent state. Btw, you do get C and D.



