People defer thinking about what correct and incorrect actually
look like across a wide range of scenarios, and instead choose
to discover it through trial and error.
LLMs are _still_ terrible at deriving even the simplest logical
entailments. I've had the latest and greatest Claude and GPT derive 'B
instead of '(not B) from '(and A (not B)) whenever 'A and 'B are anything
but the simplest of English sentences.
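For what it's worth, the entailment in question is trivial to check mechanically. Here's a minimal truth-table sketch in Python (the names and helper are mine, purely for illustration): a premise entails a conclusion iff every assignment that satisfies the premise also satisfies the conclusion.

```python
from itertools import product

# Semantic entailment by brute-force truth table over A and B:
# premise |= conclusion  iff  every assignment satisfying the premise
# also satisfies the conclusion.
def entails(premise, conclusion):
    return all(conclusion(a, b)
               for a, b in product([True, False], repeat=2)
               if premise(a, b))

premise = lambda a, b: a and not b              # '(and A (not B))
print(entails(premise, lambda a, b: not b))     # True:  '(not B) follows
print(entails(premise, lambda a, b: b))         # False: 'B does not
```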
I shudder to think what they decide the correct interpretation of a
spec written in prose is.