In the end, SQL is what people are familiar with (for better or worse) and what is entrenched in some form in most databases, and new versions of the standard are unlikely to change anything fundamental.
I don't believe that adding property graph support to SQL makes much of a difference for SQL and relational databases...
However, the PG query additions (I believe they are close to neo4j's Cypher language) being part of the SQL standard will give a boost to the marketing and adoption of PG databases. Vendors can now say "our language follows the SQL standard" instead of referencing Cypher and, indirectly, neo4j. Even if neo4j has the best intentions, marketing is very important in databases, and it is unlikely that vendors would have supported Cypher outright and thus admitted that they are following neo4j rather than leading.
Standards can thus be valuable regardless of whether they are widely adopted, and a lot of discussion and collaboration was likely necessary to get to this point re: the PG additions.
I think SQL/PGQ has the potential to outright kill the whole field of PG databases (if/once existing databases actually start implementing it).
The big marketing claim of Neo4J & friends has always been that they make graph queries possible, with a big portion of that claim being that those queries would be intractable for RDBMSs performance-wise.
With SQL/PGQ it becomes quite apparent that there is next to no magic sauce in PG databases and that it's all just syntactic sugar on top of node/edge tables. All the rest of their query plans looks 1:1 like what you'll find in an RDBMS. Now that they are on a level playing field query-syntax-wise, PG databases will have to compete against RDBMSs with far more advanced query optimizers, better administrative tooling, etc.
This is my expectation as well. At CIDR this year there was a paper that presented an implementation of SQL/PGQ in DuckDB, and it outperforms Neo4J in a few selected benchmarks [1].
Cypher has built-in support for traversing relationships between nodes in a graph, which can be more intuitive and efficient than using SQL to join tables.
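For comparison, the SQL/PGQ additions adopt essentially the same arrow-style patterns. A rough sketch of what a traversal looks like under SQL:2023 (the graph, labels, and property names here are made up, and the exact details vary by implementation):

```sql
-- Find the names of everyone Alice knows, via a declared property graph.
SELECT gt.friend_name
FROM GRAPH_TABLE (social_graph
       MATCH (p IS person WHERE p.name = 'Alice') -[IS knows]-> (f IS person)
       COLUMNS (f.name AS friend_name)
     ) AS gt;
```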
Don't get me wrong - I love the feature and use it a lot, but I find the syntax awkward and honestly just... strange (though I've not researched it and have no idea of the reasoning behind it).
Will Postgres adopt the SQL standard for JSON "simplified accessor" syntax? Have there been any statements or communications from the development team?
> Don't get me wrong - I love the feature and use it a lot, but I find the syntax awkward and honestly just... strange (though I've not researched it and have no idea of the reasoning behind it).
Not sure if Postgres has it, but the last time I was using TimescaleDB I wrote my own functions in Rust and it wasn't hard at all. I've heard it became even easier recently with PL/Rust: https://news.ycombinator.com/item?id=35501065
I really like what PRQL[0] is doing. It fixes a lot of the unfortunate SQL syntax. My impossible dream is that, after years of language stability, Postgres could make it a first-party query language.
From first principles, it seems reasonable to expect transformations to go in a linear order. From there you look for ways to minimize special cases (e.g. why is `where` different from `having`?). Then you can layer on some quality-of-life things which would occur to a designer after five minutes of real-world use (dear god, please let me have trailing commas).
I would expect there to be a lot of convergent evolution as how to go from a pile of data to narrowing on what you want.
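For readers not steeped in SQL, this is the `where`/`having` special case mentioned above: the same filtering idea needs two keywords depending on whether it runs before or after aggregation (table and column names are made up):

```sql
SELECT dept, COUNT(*) AS n
FROM employees
WHERE hired_at >= DATE '2020-01-01'  -- filters rows before grouping
GROUP BY dept
HAVING COUNT(*) > 5;                 -- filters groups after aggregation
```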
Or force the big RDBMSs to use the same syntax, as SQL was intended to have, so I don't have to reach for an ORM or maintain different flavors of the same query.
As an aside, if someone knows a tool that reliably converts PostgreSQL queries to something like SQLite, please share!
Wow, this looks like it fixes a lot of the minor gripes I have with SQL, thanks for mentioning it. It feels way less kludgy and more modern compared to SQL.
I've read about EdgeDB in the past (https://www.edgedb.com/blog/we-can-do-better-than-sql); they mainly changed the query language, while under the hood it still uses Postgres. Not sure how successful they've been, though.
I really like "modern SQL". Snowflake has some great functions and ideas, I really hope other databases take some of their ideas. It also has its pain points (ex Oracle.. so some notions they have about databases like NULLs is a bit meh).
It's a semantic layer with an integrated query language that makes a ton of improvements on SQL. Full disclosure, I actually joined the team a couple of weeks ago :)
One of the biggest damages SQL has caused is that it is so bad as a programming language, not good enough as a DSL, and yet weirdly the only game in town, that many can't imagine that the relational model allows programming like every other model, just richer in "intrinsic" operators.
This is the deal with RDBMSs. It's like: "Everyone implements a wild flavor of a VM (sometimes the JVM, the .NET VM, WebAssembly, eBPF, etc.), but all use a slight variation of Java for their syntax."
ORMs generally solve a different problem than the query language. In SQL, everything is a table (that's not 100% true, but it's close enough). So, joins still result in a flat list of records.
ORMs solve the problem of translating a flat list of records into objects that have arrays of other objects. Some people also use them to avoid having to write SQL, but that's honestly a secondary feature.
You can have a more modern query language that still results in tables.
I don't have any special knowledge, but I would speculate that it's for approximately the same reasons C allows you to define a byte as 7 bits: Backwards compatibility that let everyone agree on a standard
As 7, no (CHAR_BIT is required to be at least 8, see 5.2.4.2.1p1), but as 9, yes. Ancient machines aside, there are apparently DSPs with 16- or 32-bit machine bytes (addressable units), and people would prefer not to kludge 8-bit C bytes on top of that even if they could by fattening the pointers a bit.
Of course, the question of how the hell you are going to reconcile pervasive 8-bit storage and the C standard’s requirement that fputc|fgetc roundtrip every possible byte is still open—see the current committee chair’s grump about that[1].
In which case this'll hurt them no more than the current widespread incompatibilities between basic data types, e.g. Oracle not making a difference between NULL and an empty string, some databases defining storage limits in bytes where others use "characters" (whatever they mean by that), databases ignoring length limits versus truncating versus erroring (possibly the same database engine depending on settings), …, and on top of that the database-specific limitations on indexing.
The ORDER BY on a grouped table (F868) is kind of weird. I guess it's a special case with narrow usage.
In the example, product_id is unique (as the PK) and product_code is unique as well, so there's one product_code for each product_id. A GROUP BY on product_id will produce a list of groups where each group has one product_id and one product_code, so a subsequent ORDER BY on the groups by product_code works. I assume that if there were multiple product_codes per product_id, the query wouldn't work.
CREATE TABLE product (
product_id int PRIMARY KEY,
product_name varchar,
product_code varchar UNIQUE
);
SELECT product.product_id, sum(product_part.num) ...
GROUP BY product.product_id
ORDER BY product.product_code;
> it turned out that this was technically not allowed.
And I thought “no, it turned out this query made no sense and had ill-defined results”.
But the rest of the description is odd:
> In particular, it was not allowed to have a grouped table ordered by a column that is not exposed by the SELECT list of the grouped table.
And I’m wondering if the example query was wrong and they meant:
SELECT product.product_id, sum(product_part.num)
FROM product JOIN product_part ON product.product_id = product_part.product_id
GROUP BY product.product_id, product.product_code
ORDER BY product.product_code;
In which case the change makes sense and is inoffensive.
I assume that query optimizers are able to detect a GROUP BY of a primary key plus other columns from the same table and optimize accordingly. But using this trick to change the set of valid queries seems odd. I think I'd rather see special syntax for it.
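Something like this hypothetical sketch, say (the `GROUP BY TABLE` form is entirely made up; no database supports it):

```sql
SELECT product.product_id, product.product_code, sum(product_part.num)
FROM product JOIN product_part ON product.product_id = product_part.product_id
GROUP BY TABLE product  -- made-up syntax: group by the PK, expose all of product's columns
ORDER BY product.product_code;
```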
Who comes up with an argument for considering nulls distinct??
Imagine a programming language where `if (variable == null)` always returned `false` because nulls were considered distinct. What kind of reasoning goes into thinking that's a great idea?
> If nulls are considered distinct, then having more than one of them won’t cause a unique constraint violation.
Gosh, I would've thought the spec was fine as it was; having to spell out that the DB should consider two null values equivalent for UNIQUE constraints is like spelling out that "two integers that have the same mathematical value should be considered equivalent". I guess they need to add that to the spec if it's missing as well.
NULL is more a declaration that 'there is no value' than a mathematical value of its own. If you compare NULL with something, you don't get false, you get NULL (which is often treated the same as false).
Some implementations carry the same "uncomparable" logic over to unique-constraint checks, while others drop the ball here.
Programming-language-wise, it's kind of like a null pointer or a float NaN, both of which are also occasionally complained about; e.g., indexing maps by float values can do bogus things in the presence of NaNs.
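A minimal illustration, in any database that allows a FROM-less SELECT (e.g. PostgreSQL):

```sql
SELECT NULL = NULL;          -- NULL, not TRUE: comparing unknowns yields unknown
SELECT NULL <> NULL;         -- also NULL
SELECT NULL IS NULL;         -- TRUE: IS NULL is the only reliable test
SELECT 1 WHERE NULL = NULL;  -- zero rows: a NULL filter result is not TRUE
```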
Imparting any meaning at all to NULL, including non-uniqueness or distinction, seems fraught with problems. NULL is NULL, that's it. It means nothing else.
If other columns in the row index are unique regardless of the value of the nullable column, then the row is logically unique. So if in that case the NULL value is ignored, then OK. But two NULLs neither match each other nor are they distinct from each other. You just cannot say either way.
First, it should be noted that the reason this toggle was added is because the spec was ambiguous and different databases have different defaults, specifically:
- BigQuery and Db2 forbid nullable columns in indexes, solving the issue
- most databases default to NULLS DISTINCT
- SQL Server defaults to NULLS NOT DISTINCT
- as usual, Oracle huffs glue in the corner (it generally uses NULLS NOT DISTINCT, except in the case where every column in the constraint is null, in which case it uses NULLS DISTINCT; I assume because it just doesn't apply the constraint at all)
And of course either behaviour can make sense: because SQL's NULL covers multiple use cases (UNKNOWN, missing, intentionally left out), a lot is left to the details of the modelling.
For instance let’s say you’re modelling an FS-like structure, you have records which have a parent and a name, and the name needs to be unique within a parent.
That constraint still needs to apply to the top-level items, and while you could model top-level-ness through a special-case hard-coded record, you could just as well model it by leaving the parent field NULL, in which case you'd need `UNIQUE (parent, name)` where "parent" is NULLS NOT DISTINCT. This is especially common under organic growth.
Adding a separate boolean flag is redundant and distinctly ugly; I'm not sure expression indexes are supported by every DB (and they're not free), and while sometimes a sentinel value can be used, that's not the case in general.
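For the FS-like example above, the SQL:2023 (and PostgreSQL 15+) spelling would be roughly this; the table and column names are mine:

```sql
CREATE TABLE entry (
    id     int PRIMARY KEY,
    parent int REFERENCES entry (id),  -- NULL marks a top-level entry
    name   varchar NOT NULL,
    UNIQUE NULLS NOT DISTINCT (parent, name)
);
```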
SQL only hangs around because it was the best thing available at the time when DBs first exploded into common use. It’s pretty declarative, which is a point in its favor, but it’s not very amenable to query construction since it has AppleScript levels of “being like English.” Have fun writing something that can generate it while handling highly-adaptive and/or customizable requirements.
I’m not going to call Datalog the definitive answer, but it’s a step in the right direction.
Honestly, this really confuses me. On the surface, it seems what you said is true. How can a programming language that emulates natural language be good?
However, in practice, SQL is surprisingly easy to write and expressive. I do prefer using ORMs for simpler queries because of intellisense and typing, but for complex queries, SQL is just better. I really don't understand why.
When you're dealing with SQL you're dealing with items in sets, and really all you're doing is joining, slicing and filtering sets of data to obtain the results you want, perhaps with some extra sugar like aggregation, but in the main it's groups of objects and that's quite intuitive.
On the opposite side is object-orientation where you're dealing with instances of objects. Very intuitive when dealing with properties of an instance (perhaps columns in a table for a particular instance) but not when dealing with multiple instances. Instances of objects don't really gel with sets of objects for various reasons - this class of problems is known as object-relational impedance mismatch (Ireland et al, 2009; Colley et al, 2020). This is why ORMs are often used for development, as a smooth translation layer between object-orientation and set-orientation. Each paradigm is perfectly good within its own context, but they are not particularly compatible approaches.
References if you're interested:
Ireland, C., Bowers, D., Newton, M. and Waugh, K., 2009, March. A classification of object-relational impedance mismatch. In 2009 First International Conference on Advances in Databases, Knowledge, and Data Applications (pp. 36-43). IEEE. https://ieeexplore.ieee.org/abstract/document/5071809/
Colley, D., Stanier, C. and Asaduzzaman, M., 2020. Investigating the effects of object-relational impedance mismatch on the efficiency of object-relational mapping frameworks. Journal of Database Management (JDM), 31(4), pp.1-23. https://www.igi-global.com/article/investigating-the-effects...
Being English-like is just appearances. SQL is really just relational algebra (the traditional querying bit of it, at least) with some technical limitations. But in essence, it's relational algebra, and it works well because the topic was studied a lot by many smart people and the language reflects those developments.
You sound like you mostly write ORM-like OLTP data retrieval. As somebody who's been around analytical dbs for the last 6-7 years I can tell you that SQL falls apart almost immediately on anything longer than 20-30 lines.
I wonder what the problem is, i.e. how exactly it is falling apart: (a) you need to write a lot of "clean-up" sub-queries and fragments, or (b) the question you ask is more complicated and does not fit into the algebra of SQL?
If it's more the first: if your dataset were perfectly cleaned up (say, 3NF, with all data values cleaned up), how much longer could your SQL get before falling apart?
As I write this comment I realize that it could be worded as "is the problem with the logic or the data"?
Thanks. I've had to help some of our data scientists debug their queries after we migrated to Hive 3, and it's mostly the way large queries are structured. You often want to add a single column to the table that is then referenced by additional columns (like: convert date of birth to age, then create age buckets and an is_minor flag, then calculate the age when they opened their first account).
CTEs are the best way to do this but get clunky and repetitive, as you have to give each one a name. CROSS JOIN LATERAL is completely illegible. Something like LET column = expression would have improved that:
FROM customers
LET age = EXTRACT(YEAR FROM current_date() - dob)
LET age_bucket = CASE ... END
LET date_1st_ac = (SELECT MIN ...)
...
SELECT ...
The other problem lies in the fact that an SQL query is one huge query. You have to take it apart to sanity-check the intermediate result: okay, let me quickly turn this CTE into the main query and run it, no, let me try again with another CTE, okay, found it, now I have to convert everything back to run the whole query again.
In comparison, Spark dataframes are all made equal, I can name any one of them, examine their schema, run them to look at the intermediate result, all without introducing optimization barriers.
1. All the little things, inconsistencies and annoyances. You know, NULL behaviour, identifier usage, things that you just have to memorise (everything is non-orthogonal).
2. Language-level. No real abstraction, no code reuse, weird sublanguages for trivial stuff, no libraries/modules/packages.
3. Standard-level. I mean, this is not a standard anybody can use. It is unreadable! And I can go a very long way if necessary - I'm used to reading the original C++ and C standards. But ANSI SQL... a bottomless pit. Databases do implement subsets of something ANSI-ish, but these vary a lot.
All of this results in templated queries, thousands of lines of queries that are only slightly different, and no chance of a truly db-agnostic language. I mean... have you seen write-once 3000-LOC queries? This is what I call "fall apart"!
I like Codd's original relational algebra, even love it. It is both simple and practical. Most databases use RelAlg-like internal representations that optimisers work on, for a reason!
But we're stuck with SQL and it will never go away.
SQL is just too old and too powerful to be replaced by something else. There might be implementations that improve on some subset of its functionality, but nothing can possibly cover everything it does without itself becoming kludgy-feeling.
> SQL is surprisingly easy to write and expressive
Counter-example: using only majority-support ISO SQL string functions and operators (SUBSTRING, CHARINDEX, LIKE, NOT LIKE, LEN, etc - but not any regular-expressions), for a string column that contains URI query-string style name+value pairs, e.g. "foo=bar&baz=qux", try to extract out the value for the key "baz" - note that the source column value could also be "&baz=qux&foo=bar&" or "baz=qux&foo=bar&".
Doing that is trivial in most conventional programming languages: just use a couple of local-variables to store CHARINDEX values then pass those into SUBSTRING to get the desired "qux" value.
...but in SQL (at least to my knowledge) there's no concept of local variables in a SELECT query; instead you need to pass values along via derived tables or CTEs - or move the logic to a scalar UDF - or give up trying to be succinct and do it with dozens of identical CHARINDEX call-sites and pray that the RDBMS is smart enough to memoize UDFs in a query step; none of those options is appealing even in its own right - so we're kinda stuck (and if you want to be succinct you'll need not only UDF support, but also inlined and concurrent execution of UDFs, otherwise your query performance will tank).
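To make that concrete, here's a sketch of the derived-table workaround using T-SQL-style CHARINDEX/SUBSTRING (the `src`/`QueryString` names are made up; a '&' is prepended so 'baz=' only matches at a pair boundary):

```sql
SELECT SUBSTRING(q.padded, q.keyIdx + 5,
                 CHARINDEX('&', q.padded + '&', q.keyIdx + 5) - (q.keyIdx + 5)) AS bazValue
FROM (
    SELECT '&' + src.QueryString AS padded,
           CHARINDEX('&baz=', '&' + src.QueryString) AS keyIdx
    FROM src
) AS q
WHERE q.keyIdx > 0;  -- key not present -> row filtered out
```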
Instead, what if ISO SQL had the ability to define in-query "mini" pure-functions, which represent a (monadic?) sequence of operations, including mutable local variables, over its input parameters - this would solve the verbosity/DRY problem entirely - and without needing you to split a query's logic into two (i.e. the main SELECT query but also many CREATE FUNCTION statements). I note that the existing CTE syntax could be adapted for defining those functions, for example a query to extract querystring values could look like this:
```
WITH FUNCTION getQueryStringValue( input varchar, key varchar ) RETURNS varchar NULL AS (
SET key = CONCAT( key, '=' );
LET keyIdx = CHARINDEX( haystack: input, needle: key );
IF keyIdx < 0 THEN RETURN NULL;
LET valueIdx = keyIdx + LEN( key );
LET nextSeparatorIdx = CHARINDEX( haystack: input, needle: '&', startIndex: keyIdx );
IF nextSeparatorIdx < 0 THEN SET nextSeparatorIdx = LEN( input );
LET valueLen = nextSeparatorIdx - valueIdx;
LET value = SUBSTRING( text: input, startIndex: valueIdx, length: valueLen );
RETURN value;
),
cte AS (
SELECT
getQueryStringValue( src.QueryStringA, 'foo' ) AS fooValue1,
getQueryStringValue( src.QueryStringB, 'foo' ) AS fooValue2,
getQueryStringValue( src.QueryStringC, 'foo' ) AS fooValue3
FROM
src
)
SELECT
CONCAT( fooValue1, fooValue2, fooValue3 ) AS whatevs
FROM
cte
ORDER BY
someCol;
```
I don't disagree things like this are pain points, but one of the key things about unlocking the power of relational algebra is having normalized data, and putting multiple things (eg a bunch of key value pairs) in a single value sort of flies in the face of that. You can often get away with it, but it starts breaking down in more complex cases like this.
In the real world of course, sometimes it happens anyway. Maybe you were storing URIs and only later figured out you needed to extract keys from them. But thinking about this sort of thing beforehand as much as possible can save you a lot of heartache later on.
That said, this fails the ISO SQL requirement (I believe), but I'd use the table-valued function STRING_SPLIT in MS SQL; then I can just filter on LIKE 'baz=%' and SELECT the substring after 'baz='. Table-valued functions are great. The nice thing about this solution is it's very easy to generalize.
I believe MySQL doesn't have a similar function, but you could do something with SUBSTRING and LOCATE - search for the first occurrence of baz=, then you can use that to find the first location of & after baz=, which will allow you to select the substring between baz= and &.
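Roughly like this for the STRING_SPLIT route (SQL Server 2016+; 8000 is used as a "rest of the string" length):

```sql
SELECT SUBSTRING(value, LEN('baz=') + 1, 8000) AS baz_value
FROM STRING_SPLIT('foo=bar&baz=qux', '&')  -- yields one row per name=value pair
WHERE value LIKE 'baz=%';
```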
Let's not erase the work of Stonebraker et al whose efforts through the 70s were a lot closer to Codd's relational algebra than SQL. Ingres's query language, Quel, springs to mind: being directly based on the relational calculus it was far more readily composable than SQL (which might be charitably described as "borrowing ideas from the relational model") was or is.
As with the videotape format war, the eventual industry standard was decided commercially: Oracle was by far the most aggressive, and promoted SQL.
Yes, we can have relational databases without SQL, there are better ways for composability. Some might argue that SQL itself isn't a mess, but it sure does help people make a mess of themselves.
I had a period of preferring MongoDB for this reason. Because the query is the bare AST in Mongo, dynamic query construction is simply manipulating some JSON.
I’ve long felt that SQL was somehow backwards, but I don’t know what would be better. ‘Go and get two eggs’ is better than ‘go to the fridge and get eggs, only two’.
Strictly following the semantic order of the query, whether forward or reverse.
The ordering issue of SQL is that it uses a mixed ordering, the physical layout of the query in text form doesn’t follow the logical execution of the query, which makes it very confusing.
Comparing to short cutesy sentences is not helpful as a non-trivial query has dozens or hundreds of steps, it’s closer to a build book or checklist.
Furthermore, English is far from a paragon of clarity or regularity, and languages which try to follow it (SQL, AppleScript) tend to scale a lot worse with instruction size and complexity.
That's not standard SQL. This was standardized pretty late, so different RDBMSs have various syntaxes, but the standardized way is to add a clause like this:
FETCH FIRST n ROWS ONLY
In PostgreSQL and MySQL the usual syntax is a LIMIT clause, but PostgreSQL also supports the standard. MSSQL and some others use TOP n directly after SELECT. I think in some others you need to use a window function.
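Side by side, with `t` and `c` as placeholders:

```sql
-- Standard SQL (SQL:2008 onwards), also accepted by PostgreSQL:
SELECT * FROM t ORDER BY c FETCH FIRST 10 ROWS ONLY;

-- PostgreSQL / MySQL / SQLite:
SELECT * FROM t ORDER BY c LIMIT 10;

-- SQL Server:
SELECT TOP 10 * FROM t ORDER BY c;
```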
The SQL Standard appears to me to be some sort of political tool, likely developed and used by Oracle. It is the most significant standard I know of that isn't publicly available.
Seriously, if you want a legal copy, it costs something like 200 Swiss francs. This isn't how serious comp-sci standards work (the TCP people just put out RFCs). Especially since the SQL standards committee probably doesn't know how to design a language, given what most SQL implementations look like. I can't be sure, though; I'll probably never be in a position to read the official standard.
Reflect on the madness. Postgres? Free. SQL Standard? 200 francs. Postgres makes a better standard than the SQL standard. The ISO standard thing isn't here to support people using or building SQL databases.
I sometimes fail to understand the disconnect in some of us here: you want six figure salaries, AND have standardization committees work for free (and all software and AI models should be provided for free as well).
This would all make sense if you were in favor of fat governments, but that does not seem to be the case either.
The entitlement baffles me. Or am I accidentally combining opinions of different people and making a misjudgment?
I'm not asking for anything; the Postgres documentation is more important to me than the standard and already free. I'm pointing out that the standard isn't intended for implementers or users. By ruling out options, my guess is it is intended for governments to select Oracle DBs rather than alternate (OSS?) DBs - or just to put petty impediments in the way of new DBs trying to get started. Otherwise, why is the standard the only component of the entire chain that is not free on the internet? It would be bizarre if ONLY the standards committee needed $$$ to operate. They have the easiest job, and there are natural incentives for the different SQL implementers to cooperate. Why else would the core Postgres maintainers be expected to pay 200 francs for the privilege of getting a copy of a standard that they are implementing for free? That way lies madness.
And we know the standards committee would be willing to work for free, or people are available who would do the work (indeed, we can probably recover the standard from the Postgres docs, since they mention the things that aren't standards-compliant: https://www.postgresql.org/docs/current/features.html). The Postgres maintainers do notably more work than the standards committee for free. How hard would it be for them and Dr Hipp to throw out a document saying where the common ground is? It is far less work than they've already put into the free docs. The useful parts of the SQL standard are an appendix compared to https://www.sqlite.org/docs.html
At least, I assume. As previously mentioned, I've never read the standard and as far as I can tell nobody is allowed to link it for me.
EDIT I feel like I should link some standard here, so in lieu of the SQL standard which I can't link, here is the TCP standard: https://www.ietf.org/rfc/rfc9293.html
It is entirely possible for people to be paid to produce something that is made available at no cost to certain other people. For example, that hand sanitizer dispenser at the pet store. Yes, somebody is paying for it.
In the case of a standards body, the people that do the work can get paid in various ways. The finished product can be published "for free".
In some research contexts this could very much be expected. If researchers were paid through tax-funded grants, for instance, the results of the research should be available to those taxpayers. If they are also available to non-taxpayers, perhaps there is some altruism at play.
It's not an absolute given. But it's not an unreasonable expectation.
True, although for both C and C++, standards drafts are prepared on GitHub. At least for C (but I presume also for C++) you can check out a revision which is functionally equivalent to the ISO standard (but minus the branding).
It’s the space after the colon in the submission title (was that the infamous HN title normalizer?). The original has SQL:2023, which makes sense if you know that the ISO naming scheme introduces the year with a colon (e.g. “C99” is officially ISO 9899:1999, “C95”—that is ANSI C plus wchar.h and wctype.h—is ISO 9899:1990/Amd 1:1995).
It made perfect sense to me at first but now that you mention it, the title is quite funny. I guess if "SQL" says 2023 is over, who are we to argue - happy new year everyone!
Postgres has support for specifying NULL behavior in UNIQUE constraints since version 15.0 (released in 2022), with the same syntax as described in the article.
Where is the doc for this one? I can find Postgres’ own JSON(B) operators, as well as jsonpath (sql/json path language) support, but not SQL-level json paths, specifically the dot syntax.
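For reference, the dot syntax being asked about looks roughly like this (my reading of the standard's "simplified accessor"; `orders`/`payload` are made-up names), contrasted with Postgres' own operators:

```sql
-- SQL:2023 simplified accessor (not supported by PostgreSQL at the time of writing):
SELECT o.payload.customer.name FROM orders AS o;

-- PostgreSQL's native operators for the same extraction:
SELECT o.payload -> 'customer' ->> 'name' FROM orders AS o;
```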
CREATE TABLE person (
id INT PRIMARY KEY,
name VARCHAR NOT NULL
);
CREATE TABLE pet (
id INT PRIMARY KEY,
name VARCHAR NOT NULL,
owner_id INT NOT NULL
);
It is common to want to join them, like so:
SELECT owner.id, owner.name, COUNT(pet.id) AS numberOfPets
FROM person AS owner
LEFT OUTER JOIN pet ON owner.id = pet.owner_id
GROUP BY owner.id
This doesn't work in standard SQL, because all columns in the SELECT list have to either be aggregated or included in the GROUP BY. owner.name is neither. That is a bit silly though because we know each result row will only have one unambiguous value for the owner name, since the GROUP BY is on a unique column from the same table as the owner name.
We can solve this with ANY_VALUE:
SELECT owner.id, ANY_VALUE(owner.name) AS name, COUNT(pet.id) AS numberOfPets
FROM person AS owner
LEFT OUTER JOIN pet ON owner.id = pet.owner_id
GROUP BY owner.id
I run into the equivalent problem in Kotlin sometimes.
I want to group things by a key, but that results in type `Map<K, List<V>>` because the compiler can't be sure that there won't be more than one `V` for the field being used for grouping... which is why there's also `associateBy`, which does return `Map<K, V>`, but it silently keeps only the last value if there happens to be more than one per `K`.
Not relevant, but I thought it may help other readers understand the problem.
Yes, but all these solutions require the optimizer to be smart enough to optimize them away: "oh, owner.id is the primary key, so any value from owner will appear at most once per group; I don't have to group by the other columns or track a maximum value, I can pick and store any value".
I think all optimizers are smart enough already, as this is a super common pattern. I honestly would have preferred this logic to be made more explicit: if you group by the primary key (or by the table name, as someone suggested), you can use all columns from the table without aggregate functions.
This kinda breaks down when you introduce subqueries and CTEs, which is where myopic optimizers can screw the pooch as well, so any_value can be of use there.
Any circumstance where it would be needed should have given a syntax error irrespective of the data present (all matching or with differences): everything in the projection list needs to be in the grouping clause or an aggregate.
For a very simple but slightly odd (why would you ask exactly that?) example:
> SELECT FirstName, WorkEmail FROM Person GROUP BY WorkEmail
should give an error in current SQL implementations. Despite the fact that in the data in question there is no ambiguity, as the email addresses are never the same between two people, the database has no way of knowing that. The error message in SQL Server in this case is “Column 'Person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”. In this case you could declare WorkEmail to be unique, but that would still error, as the standard doesn't take this into account. The new ANY_VALUE is counted as an aggregate, so it “fixes” the syntax error like so:
> SELECT ANY_VALUE(FirstName), WorkEmail FROM Person GROUP BY WorkEmail
Basically it is a way of telling the query planner that you know your data better than it does, and it shouldn't worry its little head and take the first thing it sees to project for that column.
MySQL actually does this by default, or at least it used to: using the first thing it sees for columns that are neither aggregates nor part of the grouping clause. You'll find any number of people asking why this doesn't work in other databases (SQL Server, Postgres, …) when they try porting things over.
--
ANY_VALUE() feels icky to me, similar to how the related mysql behaviour always did. It is explicitly giving an opening for the output to change arbitrarily for a given query and that makes me think it is going to be the source of difficult to find bugs, for instance when some code relies on a particular output but changes in data balance hit a tipping point and make the query planner take a different route meaning ANY_VALUE() returns something other than it has in that place for months/years.
I can think of examples where it could be useful or at least makes things no worse (you need to get a little more complex than my single table example above), i.e. where trying to disambiguate with another aggregate like MAX still gives effectively arbitrary results, trying with ROW_NUMBER() tricks adds complexity (both for the human reading the code and the query planner), and adding the column to the GROUP BY would cause duplicate rows otherwise, but I'd rather not have it because of the issues that it could easily create in the wrong hands.
It's actually super useful if you've got a table with two columns like "sku" and "product_name", where each value in one column is paired with exactly one value in the other. You can group by sku and then grab any value for product_name, since they're all the same. Using min/max/etc. involves extra computation that you don't need to do.
You could group by the combination of sku+product_name, but it's kind of unwieldy. You could also normalize it to not have this issue, but when you're getting this data from someone else, there's not much you can do.
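As a sketch of that case (`order_lines` is a made-up table):

```sql
SELECT sku,
       ANY_VALUE(product_name) AS product_name,  -- all values agree within a sku
       SUM(quantity) AS total_quantity
FROM order_lines
GROUP BY sku;
```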
I guess SQL is so widely used that there are always some niche cases. It is useful if you just want to show one example value where there is no numerical order, and you want fetching this value to be super fast.
Like a movie database where you want to show an example title for every genre.
But in a real-world scenario, I would want a popular movie, not literally any. If you have an Academy Award winners table, I guess, then everyone is popular.
Still no explicit support for enums... The best thing about Postgres in the brief period I used it was that. Looking at tables full of integer values and trying to remember what each means does everyone's head in, IME.
This whole NULLS DISTINCT vs. NOT DISTINCT business is a side-effect of the standard not being more explicit about what nulls really mean - so to me, adding this distinction to the unique constraint looks like a code smell, if one could apply that term to standards.
There is no such thing as “ANSI Standards,” as ANSI does not develop standards. Instead, there’s American National Standards and other documents written by ANSI-approved committees of standards developing organizations. Nonetheless, we get a lot of inquiries for the “ANSI standard” for SQL. It’s worth noting that, while this phrase is misleading and inaccurate for numerous reasons, it is referring to existing standard documents. SQL, just like many great things that outlasted the 70s, has a powerful history auspiciously intertwined with standards. At the conception of SQL’s specifications was ANSI (just another great feat throughout our 100-year history).
The Current Edition of ISO/IEC 9075 for SQL
If you don’t want to read all the history on SQL that we’ve detailed below, to make things short, SQL was standardized in ANSI X3.135 in 1986, and, within a few months, it was adopted by ISO as ISO 9075-1987. While most vendors modify SQL to meet their needs, they generally base their programs off the current version of this standard. The international standard (now ISO/IEC 9075) has been revised periodically ever since, most recently in 2016. It exists in 9 parts: [...]
Thank you for enlightening me with that detail. I've taken for granted that there is a generally standard version of SQL and that the token term for it is ANSI SQL. I worked in the T-SQL (Microsoft), PL/SQL (Oracle), and DB2 SQL (IBM) dialects, and now I feel at home with PostgreSQL (close to ANSI SQL), which generally works across other DB platforms.
I wonder what the revenue per developer is for the SQL server team. It must be enormous at this point. Obviously declining as Postgres is so good but lots of companies / DBAs still demand SQL server (and it is not cheap).
Funny story: at Microsoft it is common to call anything having to do with "SQL Server" simply "SQL". It was even on internal tchotchkes, like a "Linux <3 SQL" fridge magnet, presumably alluding to Linux support for Microsoft SQL Server. Since I came to them by way of the Citus acquisition and the Postgres world generally, this was first confusing (where SQL, to me, always meant the family of implementations, and the standard) and then amusing... but not quite endearing, in its imperiousness.
On the other hand, they could have done worse than to name their database product the both bland and imperious "SQL Server": they could have repeated their performance in naming "WinCE", that is "Win", the customary abbreviation of "Windows", plus "Compact Edition", which, if you simply type it out, spells "wince."
It saddens me to see how misguided some groups' attempts at language creation are. Contrast these two quotes:
(greatest/least) "To be clear, in most programming languages, these functions would merely be called max and min. But in SQL, those names are already used for aggregate functions. Since there is no syntactic difference between normal functions and aggregate functions, you have to pick two different names."
"After 2-3 years the language had grown by accretion....
Eventually I found it was shrinking, once you look at enough different applications you come to general notions,
these generalizations allowed me to take out whole chunks of special things
and furthermore the general ideas are usually much simpler to understand than the special cases."
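To make the first quote concrete, with `t` and `x` as placeholders:

```sql
SELECT GREATEST(1, 2, 3);  -- scalar function over its arguments: 3
SELECT MAX(x) FROM t;      -- aggregate over rows; the name MAX was already taken
```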
The easiest path for someone creating a language or API is to add a new layer/function for everything. It solves the immediate problem, it allows you to create something new, and it can't break existing code. As a Java programmer I often find myself creating an abstraction layer without thinking. What's made this much worse, actually, is Google and Stack Overflow: people can't google general solutions, but they can google "convert PHP string to camel case". However, when I look at beautiful implementations of languages, they are beautiful in their exposed simplicity.
Also worth mentioning in the context of the property graph additions is the GQL standard, which the neo4j folks are part of: https://en.m.wikipedia.org/wiki/Graph_Query_Language