Key takeaway: LLMs are abysmal at planning and reasoning. You can give them the rules of a planning task and ask them for a result, but, in large part, the correctness of their logic (when it occurs) depends on additional semantic information rather than just the abstract rules. The authors showed this by mapping the nouns in the rules and the input description of a task onto a completely different domain. After those simple substitutions, performance fell apart. Current LLMs are mostly pattern matchers with bounded generalization ability.
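To make the noun-mapping concrete, here is a rough sketch of what that kind of substitution looks like. The specific word pairs and the sentence are made up for illustration; the point is that the abstract rule structure is untouched while the familiar surface semantics disappear.

    # Illustrative only -- not the paper's actual evaluation harness.
    # Swap a planning domain's nouns/verbs for words from an unrelated
    # domain, leaving the underlying rules structurally identical.
    OBFUSCATION_MAP = {
        "unstack": "recall",   # longer keys first so "stack" doesn't clobber "unstack"
        "stack": "launch",
        "pick up": "summon",
        "block": "star",
        "table": "galaxy",
    }

    def obfuscate(prompt: str) -> str:
        for original, replacement in OBFUSCATION_MAP.items():
            prompt = prompt.replace(original, replacement)
        return prompt

    print(obfuscate("pick up the red block and stack it on the blue block"))
    # -> "summon the red star and launch it on the blue star"

If the model were reasoning purely from the stated rules, this renaming should not matter; the reported collapse in performance suggests it is leaning on the semantics of the familiar words.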
People also fall apart on things like statistical reasoning when you switch domains (I think it's the Leda Cosmides evolutionary psychology work that goes into this, though there may be a more famous experiment).