Maybe this is actually obvious, but it's still a common mistake in startups so I'll say it - you don't have to believe your product is good to start selling; you just have to be better than not having the product. You don't even have to believe it's the best, or believe it's complete, or even like it. People will happily give you money for anything that makes their pain point slightly less painful.
Confluent Cloud is pay-per-usage (per GB in, per GB out, per GB stored), so it can get much pricier depending on your org's usage. However, it is definitely feature-rich.
- Doesn't expose metrics via JMX, but does provide a nice tool called Confluent Control Center for monitoring and managing Kafka clusters
- Built-in, managed Schema Registry
but
- AFAIK they won't reveal cluster size or cluster version (beyond client compatibility), so they do scaling and upgrades automatically. It's a double-edged sword but should work for a lot of orgs.
- Overall, definitely a better product than AWS MSK.
We considered Confluent at my last job, where our throughput was going to be minuscule to start, but we required VPC peering...that bumped us into an Enterprise plan for something like $80k/year! Went with Aiven instead, which was more like $20k/yr IIRC.
We built a complete drop-in replacement for Confluent's metrics/management tooling (https://kpow.io) that fills a very large gap in the engineering experience for Kafka. I say replacement, but tbh we offer a lot more in terms of features beyond stopping/starting clusters.
It feels obvious to me that AWS will roll out S3-backed storage and managed Kafka Connect at some point; their recent IAM / ACL integration points to a pretty active MSK team.
I completely agree that in the past there were some pretty big gaps - that's why we built kPow - but my feeling is those gaps are narrowing pretty significantly.
If you are on GCP I think the choice is simple, use Cloud Pub/Sub. Extremely simple, extremely reliable, extremely performant, fairly inexpensive, multi-region (global). No maintenance, no scaling, almost no tunables, it just works.
Google provides a Pub/Sub emulator for local development.
I don't really buy the vendor lock-in thing for Pub/Sub-like systems. The Cloud Pub/Sub usage pattern is basically the same as Kafka, you can have a library that abstracts away the differences. There are open source libraries that do that[1]. If you ever need to switch cloud providers, or want a messaging system to span cloud providers, you can switch without changing lots of code.
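To make the abstraction point concrete, here's a minimal sketch of what such a layer might look like. All names here are hypothetical (not from any specific library linked above); the point is just that publish/subscribe semantics are narrow enough to hide behind a tiny interface, with real backends wrapping the Pub/Sub or Kafka client:

```python
from abc import ABC, abstractmethod
from collections import defaultdict
from typing import Callable

class MessageBus(ABC):
    """Hypothetical provider-agnostic interface. Real implementations
    would wrap google-cloud-pubsub or a Kafka client behind these calls."""
    @abstractmethod
    def publish(self, topic: str, payload: bytes) -> None: ...
    @abstractmethod
    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None: ...

class InMemoryBus(MessageBus):
    """Toy in-memory backend, handy for unit tests and local development."""
    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def publish(self, topic: str, payload: bytes) -> None:
        # Deliver synchronously to every subscriber of the topic.
        for handler in self._handlers[topic]:
            handler(payload)

    def subscribe(self, topic: str, handler: Callable[[bytes], None]) -> None:
        self._handlers[topic].append(handler)

bus = InMemoryBus()
received = []
bus.subscribe("orders", received.append)
bus.publish("orders", b'{"id": 1}')
```

Application code only ever sees `MessageBus`, so swapping providers (or pointing at the emulator locally) is a one-line change at wiring time.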
The main driver for Pulsar is that we have a number of different messaging use cases, some more "pub/sub" like and some that are more "log" like. Pulsar really does unify those two worlds while also being a ton more flexible than any hosted options.
For example, Kinesis is really limiting: retention is short, and the really tiny size of each shard makes it very difficult to do any real ordering at scale.
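The "tiny shard" complaint is easy to quantify from the published per-shard limits for Kinesis Data Streams (check current AWS docs, but these have long been 1 MiB/s or 1,000 records/s of writes per shard):

```python
import math

# Long-standing per-shard write limits for Kinesis Data Streams
# (verify against current AWS documentation).
WRITE_MIB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000

def shards_needed(mib_per_sec: float, records_per_sec: int) -> int:
    """Shard count is driven by whichever limit you hit first."""
    return max(
        math.ceil(mib_per_sec / WRITE_MIB_PER_SHARD),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD),
    )

# A modest 20 MiB/s, 50k msg/s workload already needs 50 shards,
# and strict ordering only holds within a single shard.
needed = shards_needed(20, 50_000)
print(needed)  # -> 50
```

So any ordering guarantee you want across that stream has to be reconstructed on top of 50 independently ordered shards, which is where it gets painful.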
Similarly, SQS does pub/sub well, but we keep finding that we need to use the data beyond the first delivery. Instead of having multiple systems where we store that data, we have one.
As for why we didn't go with Kafka, the biggest single reason is that Pulsar is easier operationally: no need to rebalance, plus the awesome feature that is tiered storage via offloading, which lets us actually run topics with unlimited retention.

Perhaps more important for adoption, though, is that pub/sub is much easier with Pulsar, and the API is just much easier for developers to reason about than all the complexity of consumer groups, etc. There are a ton of other nice things - topics so cheap that we can have hundreds of thousands of them, the built-in multi-tenancy features, geo-replication, a flexible ACL system, Pulsar Functions and Pulsar IO, and many other things that really have us excited about all the capabilities.
> Please could you try and explain how you came to this conclusion?
1. Stateless brokers
With Kafka, any time a broker goes down you need to be aware of the Kafka broker id. Yes, this can be fixed by creating your entire infrastructure as code and keeping track of state.
This carries significant OpEx. I've seen few people successfully automate this; Netflix is one of the few. The rest just use a manual process with tooling to get around it: a pager, Kafka tooling to spawn a replacement node with the looked-up broker id, etc.
2. Kafka MirrorMaker
Granted, I have not used v2, which recently came out in ~2.6, but dear gosh v1 was so bad that Uber wrote their own replacement from the ground up called uReplicator. The amount of time wasted on replication breaking across regions is disgusting.
3. Optimization & Scaling
Kafka bundles compute & storage. There's no way that I know of to split them (maybe an upcoming KIP will). This means you'll waste time on the Ops side deciding on tradeoffs between your broker throughput and your broker storage.
Worse yet, time & money will be wasted here. I'd rather hire more people than waste time on silly things like this. This is where I justify taking on the expense of client libs.
4. Segments vs Partitions
The major time wasters are when you end up in a situation where the cluster is utterly getting destroyed. It will happen; it isn't a question of if but when - unless the company goes belly up first and nobody cares.
It's 3 AM, the producer is getting backpressure, you get a page, and now you have to add write capacity to avoid a hot spot. Don't forget you can't just do a rebalance in Kafka, or you'll break the contract with every developer who has built on the golden rule: "Your partition order will always be the same."
You'll pay the cost of upgrading the entire cluster and then spend 3 days coming up with a way to rebalance without making all your devs riot against you when you break that golden contract.
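For anyone who hasn't hit this: the golden contract exists because Kafka's default partitioner maps a key to `hash(key) mod partition_count`. Change the partition count and existing keys land on different partitions, so per-key ordering breaks. A quick sketch (crc32 stands in for Kafka's actual murmur2 hash; the mod-N structure is the same):

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    # crc32 is a stand-in for Kafka's murmur2; what matters is the
    # deterministic hash(key) % num_partitions mapping.
    return zlib.crc32(key) % num_partitions

keys = [f"user-{i}".encode() for i in range(1000)]
before = {k: partition_for(k, 6) for k in keys}   # cluster with 6 partitions
after = {k: partition_for(k, 8) for k in keys}    # scaled to 8 partitions

moved = sum(1 for k in keys if before[k] != after[k])
# Most keys map to a different partition after scaling 6 -> 8, which is
# exactly the ordering contract you can't break without a dev riot.
print(f"{moved}/1000 keys changed partition")
```

That's why "just add partitions" is never just adding partitions once consumers depend on key-to-partition stability.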
RIP Kafka
Having spent a couple of years dealing with Kafka, I'm sorry to burst people's bubbles, but it is dead. Even Confluent doesn't have a good enough story these days to not switch to Pulsar; they're going to sell you on the same consulting bs: "We're more mature", "We've got better tooling", "Better support"...
Yes, of course, it has been in the open source community 5 years longer, and the company has also been around for that much longer. Kafka is dead, long live Pulsar.
If you're in AWS, you can use Kinesis, which is similar to Kafka. It also ties into a lot of their other offerings, such as:
* S3 - use Kinesis Firehose to take the contents of your Kinesis stream and time-partition it into files, either for ingestion into Redshift, Elasticsearch, etc., for later batch analysis for ML, or just to treat as cold searchable storage with something like Athena
* DynamoDB - have DynamoDB spit data into Kinesis as it changes, creating a change stream used elsewhere in your platform (DynamoDB Streams)
* real-time analysis - perform real-time SQL analysis (Kinesis Analytics) on what's in your stream over a given window of time or data, and react as events/situations occur.
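The windowed-analysis idea - aggregating events per fixed time bucket, the kind of thing a Kinesis Analytics tumbling-window `GROUP BY` expresses in SQL - can be sketched in a few lines of plain Python (this is an illustration of the computation, not the Kinesis API):

```python
from collections import Counter

def tumbling_window_counts(events, window_secs):
    """Group (timestamp, key) events into fixed, non-overlapping windows
    and count events per key in each window."""
    counts = Counter()
    for ts, key in events:
        # Every event falls into exactly one window, keyed by its start time.
        window_start = (ts // window_secs) * window_secs
        counts[(window_start, key)] += 1
    return counts

events = [(0, "click"), (3, "click"), (7, "view"), (12, "click")]
counts = tumbling_window_counts(events, 10)
print(counts)
# window [0,10): 2 clicks, 1 view; window [10,20): 1 click
```

The managed version runs this continuously over the stream and emits a result row as each window closes, so downstream systems can react to spikes in near real time.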
Looking at all the services that Amazon has built around Kinesis might help you understand some of the differences between Kafka and something like RMQ.