Indeed, consistency (with lengthy dialogue histories, but also with the game state and game plans) was a huge challenge for us. We spent a lot of time working on techniques for detecting and filtering these kinds of low-quality messages. See the Supplementary Materials of the paper for full details, but TL;DR: we built a suite of classifiers for detecting common mistakes (trained to discriminate between human messages and counterfactuals) and used them as an ensemble that acted as a filter on top of message generation.
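To make the idea concrete, here's a minimal sketch of that kind of ensemble filter. This is not the actual system: the classifier names, thresholds, and state fields are all hypothetical, and the two toy "classifiers" are simple heuristics standing in for trained discriminators (each scoring one failure mode). A candidate message survives only if every classifier's "low quality" score stays under its threshold.

```python
# Hypothetical sketch of ensemble-based message filtering. Each "classifier"
# below is a stand-in for a trained model that scores one failure mode;
# in a real system these would be learned discriminators, not heuristics.

from typing import Callable, Dict, List, Tuple

# A classifier maps (message, game_state) -> probability the message is bad.
Classifier = Callable[[str, Dict], float]

def contradicts_state(message: str, state: Dict) -> float:
    # Toy stand-in: flag messages that mention a power the speaker
    # has no agreement with in the structured state.
    mentioned = [p for p in state["powers"] if p.lower() in message.lower()]
    bad = [p for p in mentioned if p not in state["allies"]]
    return 1.0 if bad else 0.0

def too_repetitive(message: str, state: Dict) -> float:
    # Toy stand-in: penalize messages that repeat a recent message verbatim.
    return 1.0 if message in state["recent_messages"] else 0.0

# (classifier, rejection threshold) pairs; thresholds are made up here.
ENSEMBLE: List[Tuple[Classifier, float]] = [
    (contradicts_state, 0.5),
    (too_repetitive, 0.5),
]

def filter_candidates(candidates: List[str], state: Dict) -> List[str]:
    """Keep only candidate messages that pass every classifier."""
    return [
        msg for msg in candidates
        if all(clf(msg, state) < thresh for clf, thresh in ENSEMBLE)
    ]

state = {
    "powers": ["France", "Germany", "England"],
    "allies": ["France"],
    "recent_messages": ["Hello again!"],
}
candidates = [
    "France, let's keep our DMZ in Burgundy.",  # consistent -> kept
    "Germany, as agreed, I'll support you.",    # contradicts state -> dropped
    "Hello again!",                             # verbatim repeat -> dropped
]
print(filter_candidates(candidates, state))
# -> ["France, let's keep our DMZ in Burgundy."]
```

The key design point is that generation and filtering are decoupled: the language model proposes freely, and the ensemble vetoes, so adding a new classifier for a newly observed failure mode doesn't require retraining the generator.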
Did you capture conversational "transactions" as structured data in the game state, or was the chat history itself the only storage for that aspect of the game?
I would think you could avoid much of this issue by creating a more sophisticated structured game model and using the language model only for converting between structured and unstructured data.
They do have a structured game model, although it doesn't capture everything in the chat. The language model still had lots of problems with consistency even with the structured game model as input.
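For illustration, the "structured state as input" arrangement might look something like the sketch below. Everything here (the `GameState` fields, the prompt format, the Diplomacy notation chosen) is a hypothetical example, not the paper's actual representation; the point is that serializing structured state into the prompt conditions the model on it but doesn't enforce that the generated free-form message stays consistent with it, which is why post-hoc filtering is still needed.

```python
# Hypothetical sketch: serialize a structured game state into the dialogue
# model's prompt. Field names and the text format are invented for this example.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class GameState:
    phase: str
    units: Dict[str, List[str]]                       # power -> unit locations
    agreements: List[Tuple[str, str, str]] = field(   # (power_a, power_b, terms)
        default_factory=list
    )

    def to_prompt(self) -> str:
        """Flatten the structured state into text the LM is conditioned on."""
        lines = [f"PHASE {self.phase}"]
        for power, locs in sorted(self.units.items()):
            lines.append(f"UNITS {power}: {', '.join(locs)}")
        for a, b, terms in self.agreements:
            lines.append(f"AGREED {a}-{b}: {terms}")
        return "\n".join(lines)

state = GameState(
    phase="S1901M",
    units={"FRANCE": ["A PAR", "A MAR", "F BRE"]},
    agreements=[("FRANCE", "ENGLAND", "DMZ in the Channel")],
)
prompt = state.to_prompt()
print(prompt)
```

Nothing in this setup prevents the model from then generating a message that contradicts the `AGREED` line, so the structured input reduces but doesn't eliminate consistency errors.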
Congrats on getting the related research published.
Feels like one hack would have been to force the dialogue into an extractable form that stored a state model relevant to the game, plus additional hacks like asking the opposing player to restate their understanding of prior agreements. Disclosure: I have no idea how the game Diplomacy works, so this might be irrelevant.
Beyond that, I have no idea how Facebook manages its AI research, but a quick Google search confirms my memory that Meta/Facebook has done prior research on enabling AI memory capabilities related to recall, forgetting, etc., which I mention just in case you were not aware.