SQL:2023 is finished: Here is what's new (eisentraut.org)
278 points by petalmind on April 13, 2023 | 153 comments


In the end, SQL is what people are familiar with (for better or worse) and what is entrenched in some form in most databases, and new versions of the standard are unlikely to change anything fundamental.

I don't believe that adding property graph support to SQL makes much of a difference for SQL and relational databases...

However, the PG query additions (I believe they are close to neo4j's Cypher language) being part of the SQL standard will give a boost to the marketing and adoption of PG databases. Vendors can now say "our language follows the SQL standard" instead of referencing Cypher and, indirectly, neo4j. Even if neo4j may have the best intentions, marketing is very important in databases, and it is unlikely that vendors would have supported Cypher and thus admitted that they are following neo4j rather than leading.

Standards can thus be valuable regardless of whether they are adopted or not and a lot of discussion and collaboration was likely necessary to get to this point re: the PG additions.

In this context, it is also worth mentioning the GQL standard here, which neo4j folks are part of https://en.m.wikipedia.org/wiki/Graph_Query_Language


Huh, how views can differ...

I think SQL/PGQ has the potential to outright kill the whole field of PG databases (if/once existing databases actually start implementing it).

The big marketing claim of Neo4J & friends has always been that they make graph queries possible, with a big portion of that claim being that those queries would be intractable for RDBMSs performance-wise.

With SQL/PGQ it becomes quite apparent that there is next to no magic sauce in PG databases and that it's all just syntactic sugar on top of node/edge tables. All the rest of their query plans look 1:1 like what you'll find in an RDBMS. Now that they are on a level playing field query-syntax-wise, PG databases will have to compete against RDBMSs with far more advanced query optimizers, better administrative tooling, etc.


This is my expectation as well. At CIDR this year there was a paper that presented an implementation of SQL/PGQ in DuckDB, and it outperforms Neo4J in a few selected benchmarks [1].

[1] https://www.cidrdb.org/cidr2023/papers/p66-wolde.pdf


In relational databases, property graph queries can be really effective for writing authorization queries.

Whether user 'x' has access to a resource 'z' related via 'y' becomes a much simpler-looking query.
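
For illustration, a rough sketch of what such a check could look like with SQL/PGQ's GRAPH_TABLE. The graph name, labels and properties (access_graph, app_user, app_group, member_of, can_access, resource) are made up, and exact syntax details still vary between early implementations:

    -- Does user 'x' reach resource 'z' through a group membership?
    SELECT 1
    FROM GRAPH_TABLE (access_graph
      MATCH (u IS app_user WHERE u.id = 'x')
            -[IS member_of]-> (g IS app_group)
            -[IS can_access]-> (r IS resource WHERE r.id = 'z')
      COLUMNS (r.id AS resource_id)
    );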


Cypher has built-in support for traversing relationships between nodes in a graph, which can be more intuitive and efficient than using SQL to join tables.


Lots of nice stuff in here.

    SELECT t.j.foo.bar[2], ... FROM tbl t ...
I'm going to miss having this for years, probably.


Or.....you could just use PostgreSQL 14+ and have it today :)


I don’t think postgres has that tho? It has support for SQL2016’s jsonpath, but those are string queries, not sql-level traversal.


It has it, but the syntax is different:

    SELECT t.j->'foo'->'bar'->2, ... FROM tbl t ...
I think SQLite now also supports this syntax.


I would not call “has similar features with a completely incompatible syntax” “having it”.

After all if that’s your benchmark every database with a TEXT type already has T081.


Sort of.

The syntax is different, but close.

SQL 2023: SELECT a.key

PostgreSQL 9.3: SELECT a->'key'

PostgreSQL 14: SELECT a['key']


I've always disliked the postgres JSON syntax.

Don't get me wrong - I love the feature and use it a lot, but I find the syntax awkward and honestly just... strange (though I've not researched it and have no idea of the reasoning behind it).

Will postgres adopt the SQL standard for JSON "simplified accessor" syntax? Have there been any statements or communications from the development team?


> Don't get me wrong - I love the feature and use it a lot, but I find the syntax awkward and honestly just... strange (though I've not researched it and have no idea of the reasoning behind it).

I'm pretty sure the syntax was just reused from hstore (https://www.postgresql.org/docs/current/hstore.html), for familiarity and simplicity. hstore predates json by 4-5 years.

It might also have made expressions easier to parse, as it doesn't conflict with normal identifier access.
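
For anyone who hasn't used hstore, the family resemblance is easy to see (PostgreSQL-specific; hstore needs the extension):

    CREATE EXTENSION IF NOT EXISTS hstore;

    SELECT 'a=>1, b=>2'::hstore -> 'a';        -- returns the text '1'
    SELECT '{"a": 1, "b": 2}'::jsonb -> 'a';   -- returns the jsonb value 1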


Not sure if Postgres has it, but last time I was using TimescaleDB I've written my own functions in Rust and it wasn't hard at all. I've heard it became even easier recently with PL/Rust https://news.ycombinator.com/item?id=35501065


"Not sure if Postures has it, but for a totally unrelated database I wrote my own implementation in a famously finicky, but powerful language."


Made worse by “it” specifically not being a function, that’s the entire point. It’s already available behind a function and a stringly-typed path.


Spark has something similar but you need to use a colon as the first accessor. So something like:

  t.j:foo.bar
Hopefully all the different dialects can adopt this syntax.

https://docs.databricks.com/optimizations/semi-structured.ht...


Is there a query strategy, technology, or approach that competes with SQL? Yes, it gets the job done, but it's always felt kludgy.


I really like what PRQL[0] is doing. Fixes a lot of the unfortunate SQL syntax. My impossible dream is that after years of language stability, Postgres could make it a first party query language.

[0] https://prql-lang.org/


The syntax looks very similar to Microsoft's Kusto Query Language (KQL): https://learn.microsoft.com/en-us/azure/data-explorer/kusto/...

It's interesting to see that everyone's attempt at fixing SQL seems to converge on surprisingly similar solutions...


Microsoft did LINQ first, which is probably the basis for the rest.


From first principles, it seems reasonable to expect transformations to go in a linear order. From there you look for ways to minimize special cases (eg why is `where` different from `having`?). Then you can just layer on some quality of life things which would appear to a designer after five minutes of real-world use (dear god, please let me have trailing commas).

I would expect there to be a lot of convergent evolution as how to go from a pile of data to narrowing on what you want.


Maybe as a baby step we could just get legal trailing commas


Or force the big RDBMSs to use the same syntax, as SQL was intended, so I don't have to reach for an ORM or maintain different flavors of the same query.

As an aside, if someone knows a tool that reliably converts PostgreSQL queries to something like SQLite, please share!


Not sure how such a tool could possibly exist, given all the functionality of Postgres that SQLite simply lacks.


You could ask ChatGPT to do this sort of conversion


Still not going to have a way around things like not having transactions or proper trigger functions.


SQLite does have transactions. But I do agree with your general point that SQLite is worlds apart from Postgres.


sqlglot might be what you are looking for.

https://github.com/tobymao/sqlglot


haven't really tried for that transform but have a look at https://www.jooq.org/translate/


Wow, this looks like it fixes a lot of the minor gripes I have with SQL, thanks for mentioning it. It feels way less kludgy and more "modernized" compared to SQL.


That reminds me of the Ecto DSL.


There are some ideas based on Datalog, like this one:

https://opensource.googleblog.com/2021/04/logica-organizing-...


KQL https://learn.microsoft.com/en-us/azure/data-explorer/kusto/...

Here's a simple example:

  StormEvents
  | where StartTime between (datetime(2007-11-01) .. datetime(2007-12-01))
  | where State == "FLORIDA"
  | count


I've read about edgedb in the past https://www.edgedb.com/blog/we-can-do-better-than-sql where they mainly changed the query language; underneath, it still uses postgres. Though not sure how successful they've been.


I really like "modern SQL". Snowflake has some great functions and ideas, I really hope other databases take some of their ideas. It also has its pain points (ex Oracle.. so some notions they have about databases like NULLs is a bit meh).

Paired with dbt makes data quite fun


I've written quite a bit about a new query language called Malloy: https://carlineng.com/?postid=malloy-intro#blog

It's a semantic layer with an integrated query language that makes a ton of improvements on SQL. Full disclosure, I actually joined the team a couple of weeks ago :)


Spark dataframes in Scala seem to combine the best parts of relational queries with at least some typing.

https://spark.apache.org/docs/1.6.1/api/java/org/apache/spar...


There are many, but most never go further than SQL (i.e. they are just a "query" language).

I'm trying to do it for a "general purpose" language where array + relational + functional + imperative can work together:

https://tablam.org

---

One of the biggest damages SQL has caused is that it is so bad as a programming language, not a good enough DSL, and yet weirdly the only game in town, that many can't imagine that the relational model allows programming like every other model, just richer in "intrinsic" operators.

This is the deal with RDBMSs. It's like: "Everyone implements a wildly different flavor of a VM - sometimes the JVM, the .NET VM, WebAssembly, eBPF, etc. - but all use a slight variation of Java for their syntax."

It's the reverse of how things are usually done!


Datalag is great! I use it every day in production.

https://www.learndatalogtoday.org/


An unfortunate typo.


R's dbplyr package lets you use the amazing dplyr syntax to query SQL databases.


You mean, the ORM?

Edgedb also is nice if you want a more object like model for your db.


ORMs generally solve a different problem than the query language. In SQL, everything is a table (that's not 100% true, but it's close enough). So, joins still result in a flat list of records.

ORMs solve the problem of translating a flat list of records into objects that have arrays of other objects. Some people also use them to avoid having to write SQL, but that's honestly a secondary feature.

You can have a more modern query language that still results in tables


Can someone explain why the standard endorses this ambiguity:

CREATE TABLE t1 (a VARCHAR, b VARCHAR)

The length of `a` and `b` is left up to the implementers: "an implementation-defined default limit will be applied."

Why even allow ambiguity in the first place? This is just asking for problems.


IIRC SQL already allows default limits for `NUMERIC`, `DECIMAL`, `FLOAT`, and `TIMESTAMP`.

This is simply allowing it for `VARCHAR` as well. In a sense, making it more consistent :)
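
A small sketch of what that consistency looks like in practice (the actual defaults are implementation-defined):

    CREATE TABLE t2 (
        n  NUMERIC,     -- default precision and scale
        f  FLOAT,       -- default precision
        ts TIMESTAMP,   -- default fractional-seconds precision
        s  VARCHAR      -- with T081: implementation-defined maximum length
    );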


A wonderful example of illogical logic – officially making things more consistently inconsistent!


Likely as a way to ask for arbitrary text up to something smaller than TEXT.

The VARCHAR(n) form is already limited in its max by implementations.


I don't have any special knowledge, but I would speculate that it's for approximately the same reasons C allows you to define a byte as 7 bits: Backwards compatibility that let everyone agree on a standard


As 7, no (CHAR_BIT is required to be at least 8, see 5.2.4.2.1p1), but as 9, yes. Ancient machines aside, there are apparently DSPs with 16- or 32-bit machine bytes (addressable units), and people would prefer not to kludge 8-bit C bytes on top of that even if they could by fattening the pointers a bit.

Of course, the question of how the hell you are going to reconcile pervasive 8-bit storage and the C standard’s requirement that fputc|fgetc roundtrip every possible byte is still open—see the current committee chair’s grump about that[1].

[1] https://thephd.dev/conformance-should-mean-something-fputc-a...


People don’t interchange databases that often.


Some people (unfortunately) have to support multiple database brands.


In which case this‘ll hurt them no more than the current wide-spread incompatibilities between basic data types e.g. oracle not making a difference between NULL and an empty string, some databases defining storage limits in bytes where others use “characters” (whatever they mean by that), databases ignoring length limit versus truncating versus erroring (possibly the same database engine based on settings), …, and on top of that the database-specific limitations on indexing.


Why not directly allow each column to be specified as “NULLS DISTINCT”?

  unique (a null distinct, b)


> ORDER BY in grouped table

So that means you’re allowed to use values you haven’t grouped by and haven’t aggregated? Sounds like MySQL to me


For the ORDER BY in grouped table (F868), it's kind of weird. I guess it's a special case with narrow usage.

In the example, product_id is unique (as PK) along with product_code being unique. There's one product_id for each product_code. A Group By on product_id will produce a list of groups where each group has one product_id and one product_code. A subsequent Order By on the groups by the product_code would work. I assume if there're multiple product_code's per product_id, the query won't work.

    CREATE TABLE product (
        product_id int PRIMARY KEY,
        product_name varchar,
        product_code varchar UNIQUE
    );

    SELECT product.product_id, sum(product_part.num) ...
    GROUP BY product.product_id
    ORDER BY product.product_code;


Agreed. I read:

> it turned out that this was technically not allowed.

And thought “no, it turned out this query made no sense and had ill-defined results”.

But the rest of the description is odd:

> In particular, it was not allowed to have a grouped table ordered by a column that is not exposed by the SELECT list of the grouped table.

And I’m wondering if the example query was wrong and they meant:

    SELECT product.product_id, sum(product_part.num)
    FROM product JOIN product_part ON product.product_id = product_part.product_id
    GROUP BY product.product_id, product.product_code
    ORDER BY product.product_code;
In which case the change makes sense and is inoffensive.


It's already grouping by the primary key of `product`. Why would you need to add another column in `product` to the group by?


If that's the very specific case, then they should be clear about it. But my interpretation was that you can order by an ungrouped column.

If it's about ordering by non-selected columns, that'd be fine.

Something is wrong somewhere imo


I assume that query optimizers are able to detect a GROUP BY of a primary key and other columns in the same table and optimize accordingly. But using this trick to change the set of valid queries seems odd. I think I'd rather see special syntax for it:

    GROUP BY table.*


> you’re allowed to use values you haven’t grouped by and haven’t aggregated? Sounds like MySQL to me

The default settings of MySQL have forbidden this since MySQL 5.7, released in October of 2015: https://dev.mysql.com/doc/refman/5.7/en/group-by-handling.ht...
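
For example (table and columns made up), with the default sql_mode the second statement is rejected rather than silently picking a value:

    SELECT @@sql_mode;   -- includes ONLY_FULL_GROUP_BY on MySQL 5.7+

    SELECT customer_id, city    -- error 1055: 'city' is not in GROUP BY and is
    FROM orders                 -- not functionally dependent on the grouped columns
    GROUP BY customer_id;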


Who comes up with an argument for considering nulls distinct??

Imagine a programming language where `if (variable == null)` always returned `false` because nulls were considered distinct. What kind of reasoning goes into thinking that's a great idea?

> If nulls are considered distinct, then having more than one of them won’t cause a unique constraint violation.

Gosh, I would've thought the spec was fine as it was, having to spell out that the DB should consider two null values equivalent for UNIQUENESS constraints is like spelling out that "two integers that have the same mathematical value should be considered equivalent". I guess they need to add that in the spec if it's missing as well.


NULL is more of a declaration that 'there is no value' than a mathematical value of its own. If you compare NULL with something you don't get false, you get NULL (which is often treated the same as false).
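
Concretely, in standard SQL (and Postgres):

    SELECT NULL = NULL;            -- NULL, not true or false
    SELECT NULL IS NULL;           -- true
    SELECT 1 WHERE NULL = NULL;    -- no rows: NULL counts as "not true" in a filter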

Some implementations carry the same "uncomparable" logic when checking unique constraints, while others drop the ball here.

Programming-language-wise, it's kinda like a null pointer or float NaN, both of which are also occasionally complained about, e.g. indexing maps by float values can do bogus things in case of NaNs.

edit: clarified some things


Imparting any meaning at all to NULL including non uniqueness or distinction seems fraught with problems. NULL is NULL that’s it. It means nothing else.


If it can't be either behavior, then should nullable columns not be allowed in unique indexes at all?


No ideally not. If they are they must be ignored.

If other columns in the row index are unique regardless of the value of the nullable column, then the row is logically unique. So if in that case the NULL value is ignored, then OK. But two NULLs neither match each other nor are they distinct from each other. You just cannot say either way.


A few DBs actually do that.

It can be rather limiting, though also less surprising (at runtime) than either other behaviour.


The index only should apply over non-null values. I'm not sure how anything else makes any practical sense.


That is equivalent to the DISTINCT semantics.

First, it should be noted that the reason this toggle was added is because the spec was ambiguous and different databases have different defaults, specifically:

- BigQuery and Db2 forbid nullable columns in indexes, solving the issue

- most databases default to NULLS DISTINCT

- SQL Server defaults to NULLS NOT DISTINCT

- as usual Oracle huffs glue in the corner (it generally uses NULLS NOT DISTINCT, except in the case where every column in the constraint is null, in which case it uses NULLS DISTINCT; I assume because it just doesn't apply the constraint at all)

And of course something else makes sense: because SQL's NULL covers multiple use cases (UNKNOWN, missing, intentionally left out), it leaves a lot to the details of the modelling.

For instance let’s say you’re modelling an FS-like structure, you have records which have a parent and a name, and the name needs to be unique within a parent.

That constraint needs to apply to the top-level items still, and while you could model top-level-ness through a special-case hard-coded record, you could just as well model it by leaving the parent field NULL, in which case you'd need `UNIQUE (parent, name)` where "parent" is NULLS NOT DISTINCT. This is especially common under organic growth.

Adding a separate boolean flag is redundant and distinctly ugly, I’m not sure expressions are supported by every db (and they’re not free) and while sometimes a sentinel value can be used, that’s not the case in general.
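
A minimal sketch of that FS-like example with the new syntax (PostgreSQL 15+ already accepts it; the table and column names are made up):

    CREATE TABLE entry (
        id        int PRIMARY KEY,
        parent_id int REFERENCES entry (id),   -- NULL means "top level"
        name      varchar NOT NULL,
        UNIQUE NULLS NOT DISTINCT (parent_id, name)
    );

    INSERT INTO entry VALUES (1, NULL, 'docs');
    INSERT INTO entry VALUES (2, NULL, 'docs');  -- now a unique violation, as intended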


2024 :)


SQL only hangs around because it was the best thing available at the time when DBs first exploded into common use. It’s pretty declarative, which is a point in its favor, but it’s not very amenable to query construction since it has AppleScript levels of “being like English.” Have fun writing something that can generate it while handling highly-adaptive and/or customizable requirements.

I’m not going to call Datalog the definitive answer, but it’s a step in the right direction.


Honestly, this really confuses me. On the surface, it seems what you said is true. How can a programming language that emulates natural language be good?

However, in practice, SQL is surprisingly easy to write and expressive. I do prefer using ORMs for simpler queries because of intellisense and typing, but for complex queries, SQL is just better. I really don't understand why.


When you're dealing with SQL you're dealing with items in sets, and really all you're doing is joining, slicing and filtering sets of data to obtain the results you want, perhaps with some extra sugar like aggregation, but in the main it's groups of objects and that's quite intuitive.

On the opposite side is object-orientation where you're dealing with instances of objects. Very intuitive when dealing with properties of an instance (perhaps columns in a table for a particular instance) but not when dealing with multiple instances. Instances of objects don't really gel with sets of objects for various reasons - this class of problems is known as object-relational impedance mismatch (Ireland et al, 2009; Colley et al, 2020). This is why ORMs are often used for development, as a smooth translation layer between object-orientation and set-orientation. Each paradigm is perfectly good within its own context, but they are not particularly compatible approaches.

References if you're interested:

Ireland, C., Bowers, D., Newton, M. and Waugh, K., 2009, March. A classification of object-relational impedance mismatch. In 2009 First International Conference on Advances in Databases, Knowledge, and Data Applications (pp. 36-43). IEEE. https://ieeexplore.ieee.org/abstract/document/5071809/

Colley, D., Stanier, C. and Asaduzzaman, M., 2020. Investigating the effects of object-relational impedance mismatch on the efficiency of object-relational mapping frameworks. Journal of Database Management (JDM), 31(4), pp.1-23. https://www.igi-global.com/article/investigating-the-effects...


Being english-like is just appearances. SQL is really just relational algebra (the traditional querying bit of it, at least) with some technical limitations. But in essence, it's relational algebra and it works well because the topic was studied a lot by many smart people and the language reflects these developments.


You sound like you mostly write ORM-like OLTP data retrieval. As somebody who's been around analytical dbs for the last 6-7 years I can tell you that SQL falls apart almost immediately on anything longer than 20-30 lines.

It has many, many problems.

And the standard is one of them.


I wonder what the problem is, how exactly is it falling apart: a) you need to write a lot of "clean up" sub-queries and fragments or b) the question you ask is more complicated and does not fit into the algebra of SQL.

If it's more the first: if your dataset was perfectly cleaned up (say, 3NF, with all data values cleaned up) — how longer would SQL go for you without falling apart?

As I write this comment I realize that it could be worded as "is the problem with the logic or the data"? Thanks,


I've had to help some of our data scientists debug their queries after we migrated to Hive 3, and it's mostly the way large queries are structured. You often want to add a single derived column that is then referenced by additional columns (like, convert date of birth to age, then create age buckets and an is_minor flag, or calculate the age when they opened their first account).

CTEs are the best way to do this but get clunky and repetitive as you have to give each one a name. CROSS LATERAL JOIN is completely illegible. Something like LET column = expression would have improved that:

    FROM customers
    LET age = EXTRACT(YEAR FROM current_date() - dob)
    LET age_bucket = CASE ... END
    LET date_1st_ac = (SELECT MIN ...)
    ...
    SELECT ...
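
For contrast, the chained-CTE version of roughly the same derivation, in PostgreSQL-flavoured SQL (only customers/dob come from the snippet above; the rest is made up):

    WITH with_age AS (
        SELECT c.*,
               EXTRACT(YEAR FROM current_date) - EXTRACT(YEAR FROM c.dob) AS age
        FROM customers c
    ),
    with_bucket AS (
        SELECT w.*,
               CASE WHEN w.age < 18 THEN 'minor'
                    WHEN w.age < 65 THEN 'adult'
                    ELSE 'senior' END AS age_bucket
        FROM with_age w
    )
    SELECT * FROM with_bucket;   -- every derived column needs its own named step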
The other problem lies in the fact that an SQL query is one huge query. You have to take it apart to sanity-check the intermediate result: okay, let me quickly turn this CTE into the main query and run it, no, let me try again with another CTE, okay, found it, now I have to convert everything back to run the whole query again.

In comparison, Spark dataframes are all made equal, I can name any one of them, examine their schema, run them to look at the intermediate result, all without introducing optimization barriers.


I'd say there are problems on 3 levels:

1. All the little things, inconsistencies and annoyances. You know, NULL behaviour, identifier usage, things that you just have to memorise (everything is non-orthogonal).

2. Language-level. No real abstraction, no code reuse, weird sublanguages for trivial stuff, no libraries/modules/packages.

3. Standard-level. I mean, this is not a standard anybody can use. It is unreadable! I mean I can go a very long way if necessary - I'm used to reading the original C++ and C stds. But ANSI SQL... A bottomless pit. Databases do implement subsets of something ANSI-ish but these vary a lot.

All of this results in templated queries, 1000s of lines of queries that are only slightly different, and no chance for a truly db-agnostic language. I mean... Have you seen write-once 3000 LOC queries? This is what I mean by "fall apart"!

I like Codd's original relational algebra, even love it. It is both simple and practical. Most databases use RelAlg-like internal representations that optimisers work on for a reason!

But we're stuck with SQL and it will never go away.


SQL is just too old and too powerful to be replaced by something. There might be implementations that improve on some subset of its functionality. But nothing can possibly cover everything it does without itself ending up feeling kludgy.


> However, in practice, SQL is surprisingly easy to write and expressive.

SQL is not expressive. Compare SQL queries to LINQ queries, and you'll see a notable difference in expressiveness, abstraction and reuse.


> SQL is surprisingly easy to write and expressive

Counter-example: using only majority-supported ISO SQL string functions and operators (SUBSTRING, CHARINDEX, LIKE, NOT LIKE, LEN, etc - but not any regular-expressions), for a string column that contains URI query-string style name+value pairs, e.g. "foo=bar&baz=qux", try to extract out the value for the key "baz" - note that the source column value could also be "&baz=qux&foo=bar&" or "baz=qux&foo=bar&".

Doing that is trivial in most conventional programming languages: just use a couple of local-variables to store CHARINDEX values then pass those into SUBSTRING to get the desired "qux" value.

...but in SQL (at least to my knowledge) there's no concept of local-variables in a SELECT query, instead you need to pass values along via derived-tables or CTEs - or move the logic to a scalar UDF - or give up trying to be succinct and do it with dozens of identical CHARINDEX call-sites and pray that the RDBMS is smart enough to memoize UDFs in a query step; none of those options are appealing even in their own right - so we're kinda stuck (and if you want to be succinct you'll need not only UDF support, but also inlined and concurrent execution of UDFs otherwise your query performance will tank).

Instead, what if ISO SQL had the ability to define in-query "mini" pure-functions, which represent a (monadic?) sequence of operations, including mutable local variables, over its input parameters - this would solve the verbosity/DRY problem entirely - and without needing you to split a query's logic into two (i.e. the main SELECT query but also many CREATE FUNCTION statements). I note that the existing CTE syntax could be adapted for defining those functions, for example a query to extract querystring values could look like this:

    WITH FUNCTION getQueryStringValue( input varchar, key varchar ) RETURNS varchar NULL AS (
        SET key      = CONCAT( key, '=' );
        LET keyIdx   = CHARINDEX( haystack: input, needle: key );
        IF keyIdx < 0 THEN RETURN NULL;

        LET valueIdx         = keyIdx + LEN( key );
        LET nextSeparatorIdx = CHARINDEX( haystack: input, needle: '&', startIndex: keyIdx );
        IF nextSeparatorIdx < 0 THEN SET nextSeparatorIdx = LEN( input );
        LET valueLen         = nextSeparatorIdx - valueIdx;

        LET value            = SUBSTRING( text: input, startIndex: valueIdx, length: valueLen );
        RETURN value;
    ), cte AS (
        SELECT
            getQueryStringValue( src.QueryStringA, 'foo' ) AS fooValue1,
            getQueryStringValue( src.QueryStringB, 'foo' ) AS fooValue2,
            getQueryStringValue( src.QueryStringC, 'foo' ) AS fooValue3
        FROM
            src
    ) SELECT CONCAT( fooValue1, fooValue2, fooValue3 ) AS whatevs FROM cte ORDER BY someCol;

Thoughts?


I don't disagree things like this are pain points, but one of the key things about unlocking the power of relational algebra is having normalized data, and putting multiple things (eg a bunch of key value pairs) in a single value sort of flies in the face of that. You can often get away with it, but it starts breaking down in more complex cases like this.

In the real world of course, sometimes it happens anyway. Maybe you were storing URIs and only later figured out you needed to extract keys from them. But thinking about this sort of thing beforehand as much as possible can save you a lot of heartache later on.

That said, this fails the ISO SQL requirement (I believe), but I'd use the table-valued function STRING_SPLIT in MS SQL, then I can just filter on LIKE 'baz=%', and SELECT the substring after baz=. Table-valued functions are great. The nice thing about this solution is it's very easy to generalize.

I believe MySQL doesn't have a similar function, but you could do something with SUBSTRING and LOCATE - search for the first occurrence of baz=, then you can use that to find the first location of & after baz=, which will allow you to select the substring between baz= and &.
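
A rough sketch of that SUBSTRING/LOCATE approach, using a CTE to hold the intermediate position (MySQL-style functions; src and qs are made-up names, and the '&' padding handles keys at the start or end of the string):

    WITH positioned AS (
        SELECT CONCAT('&', qs, '&') AS qs,
               LOCATE('&baz=', CONCAT('&', qs, '&')) AS key_start
        FROM src
    )
    SELECT CASE
             WHEN key_start = 0 THEN NULL
             ELSE SUBSTRING(qs,
                            key_start + LENGTH('&baz='),
                            LOCATE('&', qs, key_start + LENGTH('&baz='))
                              - (key_start + LENGTH('&baz=')))
           END AS baz_value
    FROM positioned;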


With complex queries and query optimization, I will take SQL over Datalog every single time


How come? I've only played around with Datalog. Any production war stories you have to share?


Let's not erase the work of Stonebraker et al whose efforts through the 70s were a lot closer to Codd's relational algebra than SQL. Ingres's query language, Quel, springs to mind: being directly based on the relational calculus it was far more readily composable than SQL (which might be charitably described as "borrowing ideas from the relational model") was or is.

As with the videotape format war, the eventual industry standard was decided commercially: Oracle was by far the most aggressive, and promoted SQL.


Yes, we can have relational databases without SQL, there are better ways for composability. Some might argue that SQL itself isn't a mess, but it sure does help people make a mess of themselves.


> not very amenable to query construction

I had a period of preferring MongoDB for this reason. Because the query is the bare AST in Mongo, dynamic query construction is simply manipulating some JSON.

Then I used jooq and never looked back.


I’ve long felt that SQL was somehow backwards, but I don’t know what would be better. ‘Go and get two eggs’ Is better than ‘go to the fridge and get eggs, only two’.


> but I don’t know what would be better.

Strictly following the semantics order of the querying, whether forward or reverse.

The ordering issue of SQL is that it uses a mixed ordering, the physical layout of the query in text form doesn’t follow the logical execution of the query, which makes it very confusing.

Comparing to short cutesy sentences is not helpful as a non-trivial query has dozens or hundreds of steps, it’s closer to a build book or checklist.

Furthermore english is far from a paragon of clarity or regularity, and languages which try to follow it (SQL, AppleScript) tend to scale a lot worse with instruction sizes and complexity.


After using MongoDB pipelines I would say that it is a fantastic model.

You can sort of already do this in SQL with CTE* but a nice pipeline syntax would be welcome.

* Actually I believe that you can do more with CTE as you can mix the various steps in arbitrary ways


Wouldn't a better analogy for SQL be 'Get the top two eggs from the fridge'?

It'd translate to

select top 2 eggs from fridge;


That's not standard SQL. This was standardized pretty late, so different RDBMS have various syntaxes, but the standardized one is to add a clause like this:

    FETCH FIRST n ROWS ONLY
In PostgreSQL and MySQL the usual syntax is a LIMIT clause, but PostgreSQL also supports the standard. MSSQL and some others use TOP n directly after SELECT. I think in some others you need to use a window function.


Top n is even deprecated in mssql, in favor of top(n).


Select top(eggs) from fridge limit 2


I would love the next revision of the standard to include QoL stuff like:

- SELECT column aliases being visible from WHERE and GROUP BY clauses, or even from the following expressions in the same SELECT clause

- JOIN other_table ON FK

- GROUP BY ALL

- SELECT * EXCEPT column
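
Some engines already spell a couple of these today; a rough sketch in DuckDB's dialect (table and columns made up; BigQuery writes EXCLUDE as EXCEPT):

    SELECT country, city, count(*) AS n
    FROM visits
    GROUP BY ALL;                      -- groups by country, city

    SELECT * EXCLUDE (raw_payload)     -- everything except one column
    FROM visits;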


ClickHouse supports this and more: https://www.youtube.com/watch?v=zhrOYQpgvkk


Snowflake’s dialect supports the first one. I discovered that accidentally and immediately fell in love.


The SQL Standard appears to me to be some sort of political tool, likely developed and used by Oracle. It is the most significant standard I know of that isn't publicly available.

Seriously, if you want a legal copy, it costs something like 200 Swiss francs. This isn't how serious comp-sci standards work (the TCP people just throw out RFCs). Especially since the SQL standards committee probably don't know how to design a language, given what most SQL implementations look like. I can't be sure though, I'll probably never be in a position to read the official standard.

Reflect on the madness. Postgres? Free. SQL Standard? 200 francs. Postgres makes a better standard than the SQL standard. The ISO standard thing isn't here to support people using or building SQL databases.


I sometimes fail to understand the disconnect in some of us here: you want six figure salaries, AND have standardization committees work for free (and all software and AI models should be provided for free as well).

This would all make sense if you were in favor of fat governments, but that does not seem to be the case either.

The entitlement baffles me. Or am I accidentally combining opinions of different people and making a misjudgment?


I'm not asking for anything; the postgres documentation is more important to me than the standard and already free. I'm pointing out that the standard isn't intended for implementers or users. By ruling out options, my guess is it is intended for governments to select Oracle DBs rather than alternate (OSS?) DBs - or just to put petty impediments in the way of new DBs trying to get started. Otherwise why is the standard the only component of the entire chain that is not free on the internet? It would be bizarre if ONLY the standards committee needs $$$ to operate. They have the easiest job and there are natural incentives for the different SQL implementers to cooperate. Why else would the core postgres maintainers be expected to pay 200 francs for the privilege of getting a copy of a standard that they are implementing for free? That way lies madness.

And we know the standards committee would be willing to work for free, or people are available who would do the work (indeed, we can probably recover the standard from the postgres docs since they mention the things that aren't standards compliant - https://www.postgresql.org/docs/current/features.html). The postgres maintainers do notably more work than the standards committee for free. How hard is it for them and Dr Hipp to throw out a document saying where the common ground is? It is far less work than they've already put into the free docs. The useful parts of the SQL standard are an appendix compared to https://www.sqlite.org/docs.html

At least, I assume. As previously mentioned, I've never read the standard and as far as I can tell nobody is allowed to link it for me.

EDIT I feel like I should link some standard here, so in lieu of the SQL standard which I can't link, here is the TCP standard: https://www.ietf.org/rfc/rfc9293.html


It is entirely possible for people to be paid to produce something that is made available at no cost to certain other people. For example, that hand sanitizer dispenser at the pet store. Yes, somebody is paying for it.

In the case of a standards body, the people that do the work can get paid in various ways. The finished product can be published "for free".

In some research contexts this could very much be expected. If researchers were paid through tax-funded grants, for instance, the results of the research should be available to those tax payers. If they are also available to non-taxpayers, perhaps there is some altruism at play.

It's not an absolute given. But it's not an unreasonable expectation.


Off the top of my head, there are also:

- ISO 8601: dates and times

- ISO/IEC 9899: the C standard

- ISO/IEC 14882: the C++ standard

> This isn't how serious comp-sci standards work (the TCP people just throw out RFCs).

This isn't how most _web_ standards work.


True, although for both C and C++, standards drafts are prepared on GitHub. At least for C (but I presume also for C++) you can check out a revision which is functionally equivalent to the ISO standard (but minus the branding).


FWIW, the PDF spec also costs some crazy amount like that for something so ubiquitous.


As of 9 days ago, the PDF 2.0 spec has become freely available: https://www.pdfa.org/announcing-no-cost-access-to-iso-32000-...


It's how most standards work outside of comp-sci though.


title is confusing.

SQL announces: 2023 is finished, on April 13th, 2023.


It’s the space after the colon in the submission title (was that the infamous HN title normalizer?). The original has SQL:2023, which makes sense if you know that the ISO naming scheme introduces the year with a colon (e.g. “C99” is officially ISO 9899:1999, “C95”—that is ANSI C plus wchar.h and wctype.h—is ISO 9899:1990/Amd 1:1995).


It made perfect sense to me at first but now that you mention it, the title is quite funny. I guess if "SQL" says 2023 is over, who are we to argue - happy new year everyone!


Cool. Let's see if Postgres will decide to implement this one. We're 12 years on from 2011 with no implementation.


It's practically ripped straight from the PostgreSQL documentation.

PostgreSQL already supports:

* ORDER BY in grouped table (F868)

* GREATEST and LEAST (T054)

* String padding functions (T055)

* Multi-character TRIM functions (T056)

* Optional string types maximum length (T081)

* Enhanced cycle mark values (T133)

* JSON data type (T801)

* Enhanced JSON data type (T802)

* String-based JSON (T803)

* SQL/JSON simplified accessor (T860–T864)

The upcoming release of PostgreSQL 16 will have:

* Non-decimal integer literals (T661)

* Underscores in numeric literals (T662)

PostgreSQL currently lacks:

* UNIQUE null treatment (F292)

* ANY_VALUE (T626)

* Hex integer literals in SQL/JSON path language (T840)

* SQL/JSON item methods (T865–T878)

* JSON comparison (T879–T882) - though the nonstandard JSONB has this

* Property Graph Queries

---

As with many standards, the SQL standard likes to include things that DBMSs already have in the wild and have found practically useful.


Postgres has support for specifying NULL behavior in UNIQUE constraints since version 15.0 (released in 2022), with the same syntax as described in the article.

See the end of the documentation on unique constraints: https://www.postgresql.org/docs/15/ddl-constraints.html#DDL-...


Ooh, fancy

I'm out of date :)


> SQL/JSON simplified accessor (T860–T864)

Where is the doc for this one? I can find Postgres’ own JSON(B) operators, as well as jsonpath (sql/json path language) support, but not SQL-level json paths, specifically the dot syntax.


Yeah, postgres uses proprietary syntax for this AFAIK. Would be nice to see t860-t864 supported.

Then there's the whole graph data stuff that seems to come straight from Oracle?

https://pgql-lang.org/


Man, there's few pieces of software that I love quite like Postgres.


Some of this has already been implemented in PostgreSQL 9.6! lpad / ltrim, for example.


I wonder if MySQL will have a new version for these features or if they will continue to milk version 8.


Honest question; why use MySQL over postgresql?


Vitess / PlanetScale


Would love to see a more composable UNION in the future.

Instead of: select * from a UNION select * from b UNION select * from c

I'd like to do: select * from UNION (a, b, c)


If SQL ever standardizes that table_name alone is equivalent to SELECT * FROM table_name, you will be able to write: a UNION b UNION c


I think we need a transpiler that allows writing SQL 2023 while having specific backends for Postgres and MySQL, like what Babel / TS did for JS.

This will make the standard process a lot easier, as community can try out new features first before actually building a standard.

It really feels like they are shooting in the dark, with major focus on features and light on the actual DX of devs using SQL day to day.


My fear with yet another transpiler is that we start working on layers of tech. I like Postgres for its directness w/ the DB. You query, you get it.


Definitely agree! I've never taken a liking to SQL but using transpiler libraries has really improved the DX, and made me not dislike it so much.


I can’t wait for Property Graph Queries


Any_Value is a strange addition. I'm struggling to think of a scenario where I need a value from a group of records, but don't care which one.

Why not just use min, max or even row_number to specify an order? If the value isn't important, do you really need to present it?


Here's a common example:

Given the following tables:

  CREATE TABLE person (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL
  );

  CREATE TABLE pet (
    id INT PRIMARY KEY,
    name VARCHAR NOT NULL,
    owner_id INT NOT NULL
  );
It is common to want to join them, like so:

  SELECT owner.id, owner.name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id
This doesn't work in standard SQL, because all columns in the SELECT list have to either be aggregated or included in the GROUP BY. owner.name is neither. That is a bit silly though because we know each result row will only have one unambiguous value for the owner name, since the GROUP BY is on a unique column from the same table as the owner name.

We can solve this with ANY_VALUE:

  SELECT owner.id, ANY_VALUE(owner.name) AS name, COUNT(pet.id) AS numberOfPets 
  FROM person AS owner
  LEFT OUTER JOIN pet ON owner.id = pet.owner_id
  GROUP BY owner.id


I run into the equivalent problem in Kotlin sometimes. I want to group things by a key, but that results in type `Map<K, List<V>>` because the compiler can't be sure that there won't be more than one `V` for the field being used for grouping... which is why there's also `associateBy`, which does return `Map<K, V>` but silently keeps only the last value if there happens to be more than one per key.

Not relevant, but I thought it may help other readers understand the problem.


The common solution in this case is simply to also GROUP BY owner.name.

That said, if owner.name has different values and you didn’t actually care which one you’d get ANY_VALUE could maybe be useful.

Though in that case you could also just use MIN or MAX, which already exist.


Yes, but all these solutions require the optimizer to be smart enough to optimize them away: "oh, owner.id is the primary key, so any value from owner will appear at most once, I don't have to group by them or track the maximum value, I can pick and store any value".

I think all optimizers are smart enough already, as this is a super common pattern. I honestly would have preferred this logic to be made more explicit: if you group by the primary key (or by the table name, as someone suggested), you can use all columns from the table without aggregate functions.

This kinda breaks down when you introduce subqueries and CTEs, which is where myopic optimizers can screw the pooch as well, so any_value can be of use there.


We use it to select any value from an aggregated list of values which we know are identical. It requires fewer operations than e.g. max.


What I'd really want in that example is a function that errored if they weren't all the same.


Any circumstance where it would be needed should have given a syntax error irrespective of the data present (all matching or with differences) – everything in the projection list needing to be in the grouping clause or an aggregate.

For a very simple but slightly odd (why would you ask exactly that?) example:

> SELECT FirstName, WorkEmail FROM Person GROUP BY WorkEmail

should give an error in current SQL implementations. Despite the fact that in the data in question there is no ambiguity as the email addresses are never the same between two people, the database has no way of knowing that. The error message in SQL Server in this case is “Column 'Person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”. In this case you could declare WorkEmail to be unique, but that should still error as the standard doesn't take this into account. The new ANY_VALUE is counted as an aggregate so “fixes” the syntax error like so:

> SELECT ANY_VALUE(FirstName), WorkEmail FROM Person GROUP BY WorkEmail

Basically it is a way of telling the query planner that you know your data better than it does, and it shouldn't worry its little head and take the first thing it sees to project for that column.

mySQL actually does this by default, or at least it used to, using the first thing it sees for columns that are neither aggregates nor part of the grouping clause, and you'll find any number of people asking why this doesn't work in other databases (SQL Server, postgres, …) when they try porting things over.

--

ANY_VALUE() feels icky to me, similar to how the related mysql behaviour always did. It is explicitly giving an opening for the output to change arbitrarily for a given query and that makes me think it is going to be the source of difficult to find bugs, for instance when some code relies on a particular output but changes in data balance hit a tipping point and make the query planner take a different route meaning ANY_VALUE() returns something other than it has in that place for months/years.

I can think of examples where it could be useful or at least makes things no worse (you need to get a little more complex than my single table example above), i.e. where trying to disambiguate with another aggregate like MAX still gives effectively arbitrary results, trying with ROW_NUMBER() tricks adds complexity (both for the human reading the code and the query planner), and adding the column to the GROUP BY would cause duplicate rows otherwise, but I'd rather not have it because of the issues that it could easily create in the wrong hands.


It's actually super useful if you've got a table with two columns like "sku" and "product_name", where the value in both columns has one unique value from the other column paired with it. You can group by sku, and then grab any value for product_name, since they're all the same. Using min/max/etc involves extra computations that you don't need to do.

You could group by the combination of sku+product_name, but it's kind of unwieldy. You could also normalize it to not have this issue, but when you're getting this data from someone else, there's not much you can do.
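
Something like this, with the new standard function (table and columns made up):

    SELECT sku,
           ANY_VALUE(product_name) AS product_name,   -- every value in the group is the same
           count(*)                AS row_count
    FROM order_lines
    GROUP BY sku;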


I guess SQL is so widely used that there are always some niche cases. I guess it is useful if you just want to show one example value where there is no numerical order, and you want fetching this value to be superfast.

Like a movie database, and you want to show an example title for every genre.

But in a real-world scenario, I would want a popular movie, not literally any. If you have an Academy Award winners table, I guess, then everyone is popular.


Still no explicit support for enums... best thing about Postgres in the brief period I used it was that. Looking at tables full of integer values and trying to remember what each mean does everyone's head in IME.


  ORDER BY in grouped table (F868)
Finally! This will let one have control over selected columns whose value would otherwise be randomly selected from the group


this whole nulls distinct v not distinct is a side-effect of the standard not being more explicit in what nulls really mean -- so to me adding this distinction in the unique constraint looks like a code smell if one could apply that term to standards.


whoa Property Graph Queries being added to the standard is huge imo


Will this be part of the ANSI SQL standard?


The ISO standard is all there is. From https://blog.ansi.org/2018/10/sql-standard-iso-iec-9075-2016...

'''

The “ANSI Standard” for SQL

There is no such thing as “ANSI Standards,” as ANSI does not develop standards. Instead, there’s American National Standards and other documents written by ANSI-approved committees of standards developing organizations. Nonetheless, we get a lot of inquiries for the “ANSI standard” for SQL. It’s worth noting that, while this phrase is misleading and inaccurate for numerous reasons, it is referring to existing standard documents. SQL, just like many great things that outlasted the 70s, has a powerful history auspiciously intertwined with standards. At the conception of SQL’s specifications was ANSI (just another great feat throughout our 100-year history).

The Current Edition of ISO/IEC 9075 for SQL

If you don’t want to read all the history on SQL that we’ve detailed below, to make things short, SQL was standardized in ANSI X3.135 in 1986, and, within a few months, it was adopted by ISO as ISO 9075-1987. While most vendors modify SQL to meet their needs, they generally base their programs off the current version of this standard. The international standard (now ISO/IEC 9075) has been revised periodically ever since, most recently in 2016. It exists in 9 parts: [...]

'''


Thank you for enlightening me with that detail. I've taken for granted that there is a generally standard version of SQL and that the token term for it is ANSI SQL. I've worked in the T-SQL (Microsoft), PL/SQL (Oracle), and DB2 SQL (IBM) dialects, and now I feel at home with PostgreSQL ("ANSI SQL"), which generally works across other DB platforms.


> There is no such thing as American National Standards Institute standards.

:/


This is the latest SQL Standard.



"has finished" ffs


I wonder what the revenue per developer is for the SQL server team. It must be enormous at this point. Obviously declining as Postgres is so good but lots of companies / DBAs still demand SQL server (and it is not cheap).


The article is about SQL, the language, not Microsoft SQL Server.


Sorry, I'm dumb.


Funny story. at Microsoft it is common to call anything having to do with "SQL Server" simply "SQL". It was even on internal tchotchkes, like a "Linux <3 SQL" fridge magnet, presumably alluding to Linux support for Microsoft SQL Server. Since I came to them by the way of Citus acquisition and the Postgres world generally, this was first confusing (where SQL, to me, always meant the family of implementations, and the standard) and then amusing...but not quite endearing, in its imperiousness.

On the other hand, they could have done worse than to name their database product the both bland and imperious "SQL Server:" they could have repeated their performance in naming "WinCE" as is the customary abbreviation of "Windows," plus "Compact Edition" which, if you simply type it out, spells "wince."


How many devs are needed to keep SQL Server running? Dozens? Hundreds? (1000?). Revenue has to be in the billions right?


I'm thinking less than 100 at this point, with revenue in billions.


It saddens me to see how misguided some groups' attempts at language creation are. Contrast these 2 quotes:

(greatest/least) "To be clear, in most programming languages, these functions would merely be called max and min. But in SQL, those names are already used for aggregate functions. Since there is no syntactic difference between normal functions and aggregate functions, you have to pick two different names."

"After 2-3 years the language had grown by accretion.... Eventually I found it was shrinking, once you look at enough different applications you come to general notions, these generalizations allowed me to take out whole chunks of special things and furthermore the general ideas are usually much simpler to understand than the special cases."

The easiest path for someone creating a language or API is to add a new layer/function for everything. It solves the immediate problem, it allows you to create something new, and it can't break existing code. As a Java programmer I often find myself creating an abstraction layer without thinking. What's made this much worse, actually, is Google and Stack Overflow: people can't google general solutions, but they can google for "convert php string to camel case". However, when I look at beautiful implementations of languages, they are beautiful in their exposed simplicity.



