Yes, I have similar concerns. These models regurgitate strings they have already seen, including earlier benchmarks. When you try to evaluate their actual ability to reason about the text, however, they perform poorly.
(Our experiments with GPT-3 are here: https://doi.org/10.5220/0012007500003470)