We use it to select any value from an aggregated list of values which we know ar...

OscarCunningham · on April 14, 2023

What I'd really want in that example is a function that errored if they weren't all the same.

dspillett · on April 14, 2023

Any circumstance where it would be needed should have given a syntax error irrespective of the data present (all matching or with differences) – everything in the projection list needing to be in the grouping clause or an aggregate.

For a very simple but slightly odd (why would you ask exactly that?) example:

> SELECT FirstName, WorkEmail FROM Person GROUP BY WorkEmail

should give an error in current SQL implementations. Despite the fact that in the data in question there is no ambiguity as the email addresses are never the same between two people, the database has no way of knowing that. The error message in SQL Server in this case is “Column 'Person.FirstName' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause”. In this case you could declare WorkEmail to be unique, but that should still error as the standard doesn't take this into account. The new ANY_VALUE us counted as an aggregate so “fixes” the syntax error like so:

> SELECT ANY_VALUE(FirstName), WorkEmail FROM Person GROUP BY WorkEmail

Basically it is a way of telling the query planner that you know your data better than it does, and it shouldn't worry its little head and take the first thing it sees to project for that column.

mySQL actually does this by default, or at least it used to, using the first thing it sees for columns that are neither aggregates nor part of the grouping clause, and you'll find any number of people asking why this doesn't work in other databases (SQL Server, postgres, …) when they try porting things over.

--

ANY_VALUE() feels icky to me, similar to how the related mysql behaviour always did. It is explicitly giving an opening for the output to change arbitrarily for a given query and that makes me think it is going to be the source of difficult to find bugs, for instance when some code relies on a particular output but changes in data balance hit a tipping point and make the query planner take a different route meaning ANY_VALUE() returns something other than it has in that place for months/years.

I can think of examples where it could be useful or at least makes things no worse (you need to get a little more complex than my single table example above), i.e. where trying to disambiguate with another aggregate like MAX still gives effectively arbitrary results, trying with ROW_NUMBER() tricks adds complexity (both for the human reading the code and the query planner), and adding the column to the GROUP BY would cause duplicate rows otherwise, but I'd rather not have it because of the issues that it could easily create in the wrong hands.