> If that's the case, you will have an overhead anyway, the only question being whether it's at the DB level or at the application level.
Inserts and updates do not require referential integrity checking if you know that the reference in question is valid in advance. Common cases are references to rows you create in the same transaction or rows you know will not be deleted.
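For example, a minimal sketch assuming PostgreSQL-flavored SQL and made-up table names:

    -- The child row's reference is known to be valid because the parent
    -- is created in the same transaction, so an application-level
    -- existence check buys you nothing here.
    BEGIN;

    INSERT INTO orders (id, customer_id, created_at)
    VALUES (1001, 42, now());

    INSERT INTO order_lines (order_id, line_no, sku, qty)
    VALUES (1001, 1, 'WIDGET-7', 3);

    COMMIT;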
If you actually want to delete something that may be referred to elsewhere then checking is appropriate of course, and in many applications such checking is necessary in advance so you have some idea whether something can be deleted (and if not why not). That type of check may not be race free of course, hence "some idea".
Referential integrity problems usually happen due to missing deletes, improper deletes, or references that should be cleared.
The overhead of checking for the existence of referred-to records in ordinary inserts and updates in application code is unnecessary in most cases, and that is where the problem is. Either you have to check anyway to have any idea what is going on, because your key values are being supplied from an outside source, or you should be able to write your application so that it does not insert random references into your database.
If you actually need to delete a row that might be referred to, the best thing to do is not to do that, because you will need application level checks to make the reason why you cannot delete something visible in any case. 'Delete failed because the record is referred to somewhere' is usually an inadequate explanation. The application should probably check so that delete isn't even presented as an option in cases like that.
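A minimal sketch of that kind of pre-check, with hypothetical table names -- the point being that the application can hide the delete option or explain the situation instead of surfacing a bare constraint violation:

    -- Does anything still refer to this customer?
    SELECT EXISTS (
        SELECT 1
        FROM invoices
        WHERE customer_id = :customer_id
    ) AS has_invoices;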
> If you actually need to delete a row that might be referred to, the best thing to do is not to do that, because you will need application level checks to make the reason why you cannot delete something visible in any case. 'Delete failed because the record is referred to somewhere' is usually an inadequate explanation. The application should probably check so that delete isn't even presented as an option in cases like that.
I feel like this belongs to the same strategy as duplicating form validation on the frontend and backend. The frontend validations can't be trusted (they can be skipped over with e.g. a curl POST), so backend validation must be done. But you choose to duplicate it on the frontend for user convenience / better reporting / a faster feedback loop. The backend remains the source of truth on validations.
The same goes for the database and the application; the database is much more likely to be correct when enforcing basic data constraints and referential integrity. The application can do it, it's just a lot more awkward because it is also juggling other things and has a higher-level view of the data (and the only real way to check you didn't screw up is to make your test case do exactly the same thing... but be correct about it -- no one else is going to tell you your dataset got fucked. Also true in an RDBMS, but there it's trivial to verify by eye, and there's only one place to check per relationship). Thus in my world view, the database must validate, and the application can choose to duplicate validation for user convenience / better reporting. The database remains the source of truth on validations. As an optimization, you can remove the database validations, but at your own risk.
And then in a multi-app, single-db world, you really can't trust the application (validations can be skipped), so even that optimization is likely illegal. Or you do many-apps *-> single-api -> db and maintain the optimization, at the cost of pretty much completely dropping the flexibility of having an RDBMS in the first place.
Large ERP systems do that sort of thing as a matter of course and have for decades now. It does require careful planning and design. I mean AR / AP (accounts receivable / payable) / scheduling / manufacturing / inventory and so on.
The main downside of splitting everything into isolated databases is that it makes it approximately impossible to generate reports that require joining across databases. Not without writing new and relatively complex application code to do what used to require only a simple SQL query.
Of course if you have the sort of business with scalability problems that require abandoning or restructuring your database on a regular basis, then placing that kind of data in a shared database is probably not such a great idea.
It should also be said that common web APIs as a programming technique are much harder to use and implement reliably than just about any system of queries or stored procedures against a conventional database, because of the data marshalling and extra error handling code required. The need to page is perverse, for example.
That does not mean that sort of tight coupling is appropriate in many cases, but it is (typically) much easier to implement. Web APIs could use standard support for two phase commit and internally paged queries that preserve some semblance of consistency. The problem is that stateless architecture makes that sort of thing virtually impossible. Who knows which rows will disappear when you query for page two because the positions of all of your records have just shifted? Or which parts of a distributed transaction will still be there if anything goes wrong?
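Keyset (cursor) pagination at the query level is only a partial answer, but it at least stops pages from shifting underneath you the way OFFSET-based paging does; a sketch with made-up names:

    -- Page 1: first 50 rows in a stable order.
    SELECT id, name, updated_at
    FROM customers
    ORDER BY id
    LIMIT 50;

    -- Page 2: resume after the last id seen, instead of OFFSET 50.
    -- Inserts and deletes elsewhere no longer shift the page boundary.
    SELECT id, name, updated_at
    FROM customers
    WHERE id > :last_seen_id
    ORDER BY id
    LIMIT 50;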
If you do not particularly care about performance, or if you have a great deal of headroom, then database enforcement of referential integrity is great. Alternatively you could just write test cases to check for it and not pay the severe performance penalty.
The other major downside of database enforcement of referential integrity is the common need to drop and re-create foreign keys during database schema upgrades and data conversions.
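For what it's worth, a PostgreSQL-flavored sketch of what that dance usually looks like, using NOT VALID / VALIDATE CONSTRAINT so the re-check does not hold an exclusive lock for the whole scan (table and constraint names are made up):

    -- Drop the constraint for the duration of the conversion.
    ALTER TABLE order_lines DROP CONSTRAINT order_lines_order_id_fkey;

    -- ... bulk load / data conversion here ...

    -- Re-add it without immediately scanning existing rows...
    ALTER TABLE order_lines
        ADD CONSTRAINT order_lines_order_id_fkey
        FOREIGN KEY (order_id) REFERENCES orders (id) NOT VALID;

    -- ...then validate separately under a weaker lock.
    ALTER TABLE order_lines VALIDATE CONSTRAINT order_lines_order_id_fkey;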
You’re still going to pay the cost of maintaining referential integrity — you’re just doing it on the app side. You can do it faster by being not-correct — e.g. you don’t need a lock if you ignore race conditions — but it’s not like the database is arbitrarily slow at doing one of its basic, fundamental jobs.
Of course, you can just skip the validation altogether and cross your fingers and hope you’re correct, but it’s the same reasoning as removing array bounds checking from your app code; you’ve eked out some more performance and it’s great until it’s catastrophically not so great.
Your reasoning should really be inverted. Be correct first, and maintain excessive validation as you can, and rip it out where performance matters. With OLTP workloads, your data’s correctness is generally much more valuable than the additional hardware you might have to throw at it.
I’m also not sure why dropping/creating foreign keys is a big deal for migrations, other than the time spent.
It is quite common for modern databases to have multiversion concurrency control so that writers do not block readers. If yours does not, your transactions should be awfully short, you should be prepared to wait, or you should implement dirty reads (which are quite common in any case).
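On engines that honor it (SQL Server and MySQL/InnoDB do; PostgreSQL quietly upgrades it to READ COMMITTED), dirty reads are just an isolation level away:

    -- Opt into dirty reads for the next transaction / session.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT id, status, total
    FROM orders
    WHERE customer_id = 42;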
You will either end up reinventing foreign keys, your support volume will scale faster than your data, or your user experience will suffer.
There may be situations where foreign keys become too much overhead, but it's worth fighting to keep them as long as possible. Data integrity only becomes more important at scale. Every orphaned record is a support ticket, lost sale, etc.
Orphaned detail records are usually inconsequential, like uncollected garbage. References to anything with an optional relationship should use outer joins as a matter of course. If you delete something that really needs to be there you have a problem, which is one of the reasons not to delete rows like that, ever, but rather to mark them as inactive or deleted instead.
Typically you look for orphan rows - the sort of thing ON DELETE CASCADE was invented to prevent. Another thing to check for is records that need to exist but should have their references cleared when something else is deleted, e.g. ON DELETE SET NULL. And the third thing to check for is anything subject to ON DELETE RESTRICT.
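A minimal sketch of all three, with hypothetical tables:

    -- Detail rows that should disappear with their parent.
    CREATE TABLE order_lines (
        order_id  integer NOT NULL REFERENCES orders (id) ON DELETE CASCADE,
        line_no   integer NOT NULL,
        PRIMARY KEY (order_id, line_no)
    );

    -- Rows that must survive, with the reference cleared instead.
    CREATE TABLE tickets (
        id          integer PRIMARY KEY,
        assignee_id integer REFERENCES users (id) ON DELETE SET NULL
    );

    -- Rows that should block the delete outright.
    CREATE TABLE invoices (
        id          integer PRIMARY KEY,
        customer_id integer NOT NULL REFERENCES customers (id) ON DELETE RESTRICT
    );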
You can check for the first two of those things after the fact, and they are relatively benign. In many cases it will make no difference to application queries, especially with the judicious use of outer joins, which should be used for all optional relationships anyway.
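Checking for the first two after the fact is a couple of outer join queries (same hypothetical tables as above):

    -- Orphaned detail rows, i.e. what ON DELETE CASCADE would have removed.
    SELECT ol.order_id, ol.line_no
    FROM order_lines ol
    LEFT JOIN orders o ON o.id = ol.order_id
    WHERE o.id IS NULL;

    -- Dangling references that should have been cleared (ON DELETE SET NULL).
    SELECT t.id, t.assignee_id
    FROM tickets t
    LEFT JOIN users u ON u.id = t.assignee_id
    WHERE t.assignee_id IS NOT NULL
      AND u.id IS NULL;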
If you need ON DELETE RESTRICT, application code should probably check anyway, because otherwise you have unexpected delete failures with no application-level visibility into what went wrong. That can be tested for, and pretty much has to be, before code that deletes rows subject to delete restrictions is released into production.
As far as race conditions go, they should be eliminated through the use of database transactions. Another alternative is never to delete rows that are referred to elsewhere and just set a deleted flag or something. That is mildly annoying to check for, however. Clearing an active flag is simpler because you usually want rows like that to stay around anyway, just not be used in new transactions.
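A sketch of the flag approach, again with made-up names and assuming a boolean column:

    -- Instead of DELETE FROM products WHERE id = :product_id;
    UPDATE products SET active = FALSE WHERE id = :product_id;

    -- New transactions only pick from active rows...
    SELECT id, name, price
    FROM products
    WHERE active;

    -- ...while old order lines, reports, etc. keep joining to the row cleanly.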
This effectively means you are building an embedded database in your application and using the networked database for storage. There are a few reasons to do this and a million reasons not to.
Perhaps someone should define a new C-compatible threading API to allow C libraries (including glibc, or a wrapper around glibc) to work with something other than native pthreads, such as goroutines, Java threads, and so on.
Many general M:N threading solutions have been tried over the years. As far as I know current thinking is still that you need substantial cooperation from a language runtime to make it worthwhile. (And even then it's hard - Java's first attempt failed and they went 1:1 essentially between 1998-2022.)
It is basically impossible to write general purpose software like compilers, word processors, and layout engines without doing heap allocations. That means either pointers or references, which are difficult to distinguish if you do not engage in pointer arithmetic.
Any C++ program that does not do heap allocations either uses arrays as a substitute for the same thing or isn't a general purpose application.
It is generally speaking difficult to make an efficient implementation of the compiler and/or the virtual machine for many memory safe languages without writing it in a more efficient, statically compiled language like C, C++, or Rust. And that is to say nothing of software like operating system kernels and browser engines. So perhaps Rust will gradually take over the world there.
Simplified, this means it's generally difficult to safely make things go fast. And there's nothing wrong with that. The sooner we realize it the better.
That would be convenient, and some programming environments have support for that kind of thing already. A hierarchical object valued expression would also be convenient in a different way.
The main application for this is where you have detail data for parent records in a snowflake pattern. In that case SQL tends to require a ridiculous number of queries, whereas common formats like JSON and XML are capable of transferring hierarchical data like that in a single response. That is a major weakness of SQL, and of the common inability to return a hierarchical set of relations in one response in particular.
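Some databases have grown workarounds; a PostgreSQL-flavored sketch that returns each parent with its detail rows nested in a single response (hypothetical schema):

    -- One round trip instead of one query per parent row.
    SELECT o.id,
           o.customer_id,
           jsonb_agg(
               jsonb_build_object('line_no', ol.line_no,
                                  'sku',     ol.sku,
                                  'qty',     ol.qty)
               ORDER BY ol.line_no
           ) AS lines
    FROM orders o
    JOIN order_lines ol ON ol.order_id = o.id
    GROUP BY o.id, o.customer_id;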
Also, running a ridiculous number of queries in parallel is not practical on many databases due to per-connection overhead, a problem so severe that many databases already have internal many-to-one connection multiplexers, i.e. N execution engines for M connections. That should sound familiar to anyone who knows their threading models.
It is possible for telcos to provide point-to-point or point-to-multipoint layer 2 permanent virtual circuits (PVCs) from customers to providers or from branch offices to home offices, and it used to be common. Frame relay and ATM (asynchronous transfer mode) were quite popular technologies for that.
In many if not most areas of the United States, DSL (digital subscriber line) based Internet access was originally delivered over PVCs established through a layer 2 ATM network. There were interesting problems with that, so PPP or PPP over Ethernet is more common these days, even where the telco no longer really lets anyone compete with it in the provision of broadband Internet access services at layer 3 over the network it maintains, thanks to a rather convenient federal court decision.
Layer 2 access, mostly Ethernet over VLANs (virtual local area networks), to a chosen provider does live on in certain, mostly municipally owned, multi-provider networks though, and in some countries that is normal, although usually with the incumbent telco or ILEC (incumbent local exchange carrier) installing and maintaining the last mile to homes and businesses, rather than a municipal operator as in some parts of the United States. Either way, more than one provider can offer layer 3 Internet service over the same physical facilities, with the layer 2 (e.g. switched Ethernet) VLANs or virtual circuits operated by one company or municipality.