We will eventually add the SERIALIZABLE isolation level to OrioleDB, but right now that's not our highest priority. Let me explain why. First, SSI (serializable snapshot isolation) in PostgreSQL comes with significant shortcomings, including:
1) Overhead. SSI takes a heavyweight lock on every involved index page or heap tuple (even for reads). The overhead of SSI was initially measured at ~10%, but PostgreSQL's scalability has come a long way since then, and that many heavyweight locks can slow a typical workload on a multicore machine down several times over.
2) SSI requires the user to be able to retry any transaction after a serialization failure. Even a read-only transaction must be DEFERRABLE, or it might be aborted with a serialization failure (it might have "seen impossible things" and need a retry).
In contrast, it's typically not hard to resolve the concurrency problems of writing transactions using explicit row-level and advisory locks, while REPEATABLE READ is enough for reporting. Frankly speaking, during my whole career I haven't seen a single case where the SERIALIZABLE isolation level was justified.
I had a case just recently where we needed to enforce a no-overlap constraint too complicated to express in an index (recurring time ranges).
Any time you have to check a constraint manually, you can't simply do the check before or after the write, because two concurrent REPEATABLE READ write transactions will not see each other's INSERTs.
You need something like a lock, a two-phase commit, or SERIALIZABLE isolation for writes. Advisory locks have sharp edges, and 2PC is not simple either; there is a lot that can go wrong.
In the case of SERIALIZABLE you do need to retry on conflict, but serialization failures can usually be kept reasonably rare. And an explicit retry feels safer than risking a livelock situation when there is contention.
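To illustrate the retry pattern, here is a minimal sketch in Python with psycopg2 (the function name and retry policy are mine, not anything from this thread):

    import psycopg2
    from psycopg2 import errors

    # Minimal sketch: run `work` in a SERIALIZABLE transaction,
    # retrying from scratch on serialization failure (SQLSTATE 40001).
    def run_serializable(conn, work, max_retries=5):
        for _ in range(max_retries):
            try:
                with conn:  # commits on success, rolls back on exception
                    with conn.cursor() as cur:
                        cur.execute(
                            "SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                        work(cur)
                return
            except errors.SerializationFailure:
                continue  # conflict detected: safe to retry
        raise RuntimeError("transaction failed after retries")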
This seems like a fascinating problem domain! I'm incredibly curious what product you're working on - if you're taking this much attention to detail on recurring events, it seems like it's a step beyond most scheduling systems I've seen.
(I've had to implement the naive approach here, which nested-loops over dates and rules and finds ones that might conflict. Definitely not meant to scale, and would be a nightmare if not at a date-level granularity!)
No pressure to share, but would love to learn more!
It's a scheduling system in the medical field where recurring events can happen in a particular room, this might be every N days at a particular time, or every three weeks on Tuesday and Friday between some start date and some end date.
I'm not entirely familiar with the literature, so we did some simplifications to keep it manageable, and it's very possible there might be a better approach!
We model the events generated by our recurring rules as intervals [SA+n*A, SA+n*A+DA), where A is a constant integer number of days (e.g. A=3*7 for an event on Tuesday every 3 weeks), SA is the start date, and DA is the duration of an event (... which can be >24h to make things more interesting). It might continue recurring forever, or it might stop by some end date.
Now if you have two of these recurring events with periods A and B, you can think of each of them as a periodic function, and the combination of them will have a period equal to LCM(A, B). So we could check everything modulo LCM(A, B) once, and that would hold infinitely far forward.
But actually, we can do even better by taking the difference Δ = (SA-SB) mod GCD(A, B). You can think of it as the closest approach between the two recurring events. If Δ is less than DA or greater than GCD(A, B) - DB, these events are eventually going to overlap. There are some extra details to check whether the overlap happens before either end date (if any), but that's the idea in a nutshell.
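Here's that check as a small Python sketch (my own transcription of the formula above; dates are integer day numbers, and end dates are ignored):

    from math import gcd

    # Event A occupies [sa + n*a, sa + n*a + da) for integer n,
    # event B occupies [sb + m*b, sb + m*b + db); both recur forever.
    def eventually_overlap(sa, a, da, sb, b, db):
        g = gcd(a, b)
        # n*a - m*b ranges over all multiples of g, so the possible
        # start-to-start differences are exactly delta + k*g.
        delta = (sa - sb) % g
        return delta < da or delta > g - db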
---
Where it gets really interesting is when you introduce timezones and daylight savings time (DST). Now a day is not always 24h, so there isn't a nice [SA+n*A, SA+n*A+DA) at regular intervals, and we can't naively work modulo the LCM or GCD anymore.
But we can notice that the days of the year on which a DST transition falls generally follow simple rules and would repeat every 28 years (actually every 400 years, since the Gregorian century rules break the 28-year cycle). In practice, for a given timezone, for example CEST, there are only 14 unique days of the year on which the DST transitions can fall, and the pattern repeats after a full period.
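That's easy to check empirically; here's a quick Python sketch using zoneinfo (the zone name and year range are my arbitrary choices):

    from datetime import datetime, timedelta
    from zoneinfo import ZoneInfo

    tz = ZoneInfo("Europe/Paris")  # observes CET/CEST

    # Collect the (month, day) pairs on which the UTC offset changes.
    def transition_days(start_year, end_year):
        days = set()
        d = datetime(start_year, 1, 1, tzinfo=tz)
        end = datetime(end_year, 1, 1, tzinfo=tz)
        while d < end:
            nxt = d + timedelta(days=1)
            if d.utcoffset() != nxt.utcoffset():
                days.add((d.month, d.day))  # transition fell during this day
            d = nxt
        return days

    print(sorted(transition_days(1996, 2096)))  # only a handful of dates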
So if we want to check for overlap, we can treat those DST transition days as special cases, and all the time intervals in between DST transition days look locally "UTC-like" (time doesn't jump forward or backwards, if we ignore leap seconds), so the previous formula continues to work on all of these.
But at some point things become a little messy, so we did some simplifications =)
My napkin math suggests that the supercycle for two recurring events A and B with periods < 365 days, combined with the 400-year cycle for DST transitions, can be about a hundred thousand years in the worst case.
But considering that timezones and daylight savings time are fundamentally political decisions, and that the IANA tzdata file is updated regularly, there is not much sense in applying the current rules thousands of years into the future! So we check at most 400 years ahead, and when timezones are involved we treat the overlap check as best effort and prepare for the fact that it could miss a DST transition: no matter what math we do, a country could change the rules at any time, and a day that used to be 24h could become 25h, 22:30, or even be skipped entirely, invalidating all past assumptions.
Thank you so much for sharing this! That sounds like an equally fun and headache-inducing problem indeed!
The idea that someone could book a room for 400 years of consistency guarantees is somewhat hilarious, yet absolutely the kind of thing that would check a box in a regulated environment like healthcare!
It does speak to a larger issue I see quite often: capturing structured intent as of the time an event is created, along with the context that led to that decision, is incredibly hard to model. No matter what, something about the system will change in an unpredictable way, like daylight savings time or assumptions around resource/room availability, and the data that drove a confident evaluation of lack-of-conflict at insertion time is no longer sufficient to automatically resolve conflicts.
There's no single way to solve this problem - in a way, "reinterpreting intent" is why programming is often equal parts archaeology and anthropology. Certainly gives us very interesting problems to solve!
To circle it back to the original conversation: the better our databases are at handling large volumes of written data, the more context we can capture around insertion-time intent in an easy-to-query way, and the better we'll be at adapting to unforeseen circumstances!
I use SERIALIZABLE on a database for which I have very low writer concurrency (I can count the total writer connections on 1-2 hands) and where I really, really, really don’t want to deal with the fallout if a transaction commits and results in an impossible state.
I use MySQL, not Postgres, for this application (for better or for worse), and I can absolutely generate a bad state if I drop MySQL to a level below SERIALIZABLE — I’ve tried it. (Yes, I could probably avoid this with SELECT FOR UPDATE, but I don’t trust MySQL enough and I get adequate performance with SERIALIZABLE.)
To make SERIALIZABLE work well, I wrap all the transactions in retry loops, and I deal with MySQL’s obnoxious reporting of which errors are worthy of retries.
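For what it's worth, such a retry loop boils down to something like this Python sketch (mysql-connector error codes; which errors count as retryable is my own judgment call, not official guidance):

    import mysql.connector
    from mysql.connector import errorcode

    # Errors treated as retryable: ER_LOCK_DEADLOCK (1213) is the classic
    # serialization conflict; ER_LOCK_WAIT_TIMEOUT (1205) is debatable.
    RETRYABLE = {errorcode.ER_LOCK_DEADLOCK, errorcode.ER_LOCK_WAIT_TIMEOUT}

    def run_with_retries(conn, work, max_retries=5):
        for _ in range(max_retries):
            try:
                cur = conn.cursor()
                cur.execute("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE")
                work(cur)
                conn.commit()
                return
            except mysql.connector.Error as e:
                conn.rollback()
                if e.errno not in RETRYABLE:
                    raise
        raise RuntimeError("gave up after retries")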
(Aside from bad committed states, I’ve also seen MySQL return results from a single straightforward query that cannot correspond to any state of the table. It was something like selecting MIN, MAX and COUNT(*) from a table and getting min and max differing and count = 1. It’s been a while — I could be remembering wrong.)
This is generally true, but there are some additional aspects.
1. With the PostgreSQL heap, a secondary-index lookup still has to access the heap page itself. And that's not free: it goes through the buffer manager and other related components.
Given all of the above, I believe OrioleDB still wins in the case of secondary key lookups. I think this is indirectly confirmed by the results of the TPC-C benchmark, which contains quite a lot of secondary key lookups. However, this subject deserves dedicated benchmarking in the future.
It would be interesting to see how OrioleDB does with more OLAP-like loads. Back when I spent a lot of time benchmarking this, the indirect index design was _the_ main reason why MySQL+InnoDB was losing significantly to Postgres on TPC-H (well, DBT-3).[1] There was a lot of working around it with covering indexes etc.
Of course, the flip side of the coin is that if you do an UPDATE of a row in the presence of a secondary index, and the UPDATE doesn't touch the key, then you don't need to update the index(es) at all. So it really depends on how much you update rows versus how often you index-scan them IME.
[1] TPC-H doesn't have difficult enough queries to really stress the planner, so it mattered comparatively less there than in other OLAP work.
Hey HN, I'm the creator of OrioleDB, an extension for PostgreSQL that serves as a drop-in replacement for the default Heap storage engine. It is designed to address scalability bottlenecks in PostgreSQL's buffer manager and to reduce WAL volume, enabling better utilization of modern multi-core CPUs and high-performance storage systems. We are getting closer to GA.
This blog post describes a new OrioleDB optimization: the ordered insert optimization. When multiple processes are trying to insert into the same page, the first one to grab the lock does the job for the others. That dramatically cuts lock waiting time and thus improves overall throughput. This is critical for IoT, time-series, and other insert-heavy workloads.
We would love for more people to test and benchmark OrioleDB. You can check how the ordered insert optimization helps your workload. The fastest way to do that is to use the provided Docker image (please note you need to use a specific tag):
docker run -d --name orioledb -p 5432:5432 orioledb/orioledb:ordered-inserts-pg17
Hey HN,
I'm the creator of OrioleDB, an extension for PostgreSQL that serves as a drop-in replacement for the default Heap storage engine. It is designed to address scalability bottlenecks in PostgreSQL's buffer manager and to reduce WAL volume, enabling better utilization of modern multi-core CPUs and high-performance storage systems.
We are getting closer to GA.
This blog post describes a new OrioleDB optimization: fastpath search, an inter-page search that bypasses page copying and tuple deformation for certain fixed-length datatypes.
We would love for more people to test and benchmark OrioleDB. You can check how fastpath search would help your workload. The fastest way to do that is to use the Docker image provided:
docker run -d --name orioledb -p 5432:5432 orioledb/orioledb
Additionally, OrioleDB beta12 features a new fastpath tree search, which can accelerate workloads with intensive key-value lookups by up to 20%. Stay tuned for a new blog post about this later this week.
Neon is indeed unrelated to OrioleDB, but Neon does also provide the separation of storage and compute in Postgres that the GP asked about ("Postgres needs options for open-source separation of storage and compute"). A mention of Neon (which is Apache 2 licensed) therefore isn't totally unwarranted.
I understand Neon is open source and I think it's an awesome product. But apart from the risks associated with the longevity of open source once a company gets acquired, there's also the fact that although the storage engine is open source, the control plane isn't, and it is non-trivial to implement oneself. OrioleDB is positioned as a Postgres extension, which is much easier to set up (even on an existing, operating database) than migrating to the completely different architecture that Neon provides.