Key takeaway: LLMs are abysmal at planning and reasoning. You can give them the rules of a planning task and ask them for a result, but, in large part, the correctness of their logic (when it occurs) depends on additional semantic information rather than just the abstract rules. The authors showed this by mapping the nouns in the rules and the input description of a task onto a completely different domain. After those simple substitutions, performance fell apart. Current LLMs are mostly pattern matchers with bounded generalization ability.
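To make the noun-mapping concrete, here is a rough sketch of what that kind of substitution looks like. The specific word pairs and the sentence are made up for illustration; the point is that the abstract rule structure is untouched while the familiar surface semantics disappear.

    # Illustrative only -- not the paper's actual evaluation harness.
    # Swap a planning domain's nouns/verbs for words from an unrelated
    # domain, leaving the underlying rules structurally identical.
    OBFUSCATION_MAP = {
        "unstack": "recall",   # longer keys first so "stack" doesn't clobber "unstack"
        "stack": "launch",
        "pick up": "summon",
        "block": "star",
        "table": "galaxy",
    }

    def obfuscate(prompt: str) -> str:
        for original, replacement in OBFUSCATION_MAP.items():
            prompt = prompt.replace(original, replacement)
        return prompt

    print(obfuscate("pick up the red block and stack it on the blue block"))
    # -> "summon the red star and launch it on the blue star"

If the model were reasoning purely from the stated rules, this renaming should not matter; the reported collapse in performance suggests it is leaning on the semantics of the familiar words.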
People also fall apart on things like statistical reasoning when you switch domains (I think it's the Leda Cosmides evolutionary psychology work that goes into this, though there may be a more famous experiment).