Either you don't know which training data was used for, say, gpt-oss, or the training data may be included in some open dataset like The Pile. I think this test is very unreliable, and even if someone came to such a conclusion, it's not clear what the value of that conclusion would be, or whether that someone could be trusted.
My intuition tells me it is vanishingly unlikely that any of the major AI labs - including the Chinese ones - have fine-tuned someone else's model and claimed that they trained it from scratch and got away with it.
Maybe I'm wrong about that, but I've never heard any of the AI training experts (and they're a talkative bunch) raise that as a suspicion.
There have been allegations of distillation - where models are partially trained on outputs from other models, e.g. using OpenAI models to generate training data for DeepSeek (a rough sketch of what that looks like follows the quote below). That's not the same as starting with someone else's open model weights and continuing to train from them - until recently (gpt-oss) OpenAI didn't release their model weights.
> An unnamed OpenAI executive is quoted in a letter to the committee, claiming that an internal review found that “DeepSeek employees circumvented guardrails in OpenAI’s models to extract reasoning outputs, which can be used in a technique known as ‘distillation’ to accelerate the development of advanced model reasoning capabilities at a lower cost.”
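For anyone who hasn't seen it, "distillation" in this sense is just supervised fine-tuning on another model's outputs. Here's a minimal sketch assuming Hugging Face transformers for the student; `teacher_generate` is a hypothetical stand-in for whatever API wrapper would be used, and none of this reflects any lab's actual pipeline:

```python
# Minimal sketch of output-based distillation: generate training data
# with a "teacher" model, then fine-tune a "student" on it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def collect_teacher_outputs(teacher_generate, prompts):
    # teacher_generate: any callable returning the teacher's text
    # completion for a prompt (e.g. a thin wrapper around a hosted API).
    return [(p, teacher_generate(p)) for p in prompts]

def distill_step(student, tokenizer, prompt, completion, optimizer):
    # Plain supervised fine-tuning on the teacher's output: the student
    # learns to imitate the teacher's completions, token by token.
    batch = tokenizer(prompt + completion, return_tensors="pt")
    loss = student(**batch, labels=batch["input_ids"].clone()).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Usage with a small open model as the student (illustrative only):
student = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
pairs = collect_teacher_outputs(lambda p: " ...teacher's answer...",
                                ["Why is the sky blue?"])
for prompt, completion in pairs:
    distill_step(student, tokenizer, prompt, completion, optimizer)
```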
Additionally, it would be interesting to know whether there are dynamics in the opposite direction: US companies (OpenAI, xAI) could now incorporate Chinese open-weight models into their core models as one or several expert towers.
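To make the "expert towers" idea concrete: in a mixture-of-experts layer, each token is routed across several parallel feed-forward "experts", and in principle one expert's weights could be initialized from another model's feed-forward block. Here's a minimal PyTorch sketch; the dimensions and the `load_foreign_expert` helper are illustrative assumptions, not anything a lab has confirmed doing:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Dense (softmax-weighted) mixture-of-experts layer, kept simple
    for illustration; production MoE usually routes sparsely (top-k)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # routing network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def load_foreign_expert(self, idx, state_dict):
        # Hypothetical: drop in FFN weights taken from another model,
        # assuming its hidden sizes happen to match this layer's.
        self.experts[idx].load_state_dict(state_dict)

    def forward(self, x):  # x: (batch, seq, d_model)
        weights = F.softmax(self.gate(x), dim=-1)              # (B, S, E)
        outs = torch.stack([e(x) for e in self.experts], -1)   # (B, S, D, E)
        return (outs * weights.unsqueeze(-2)).sum(-1)          # (B, S, D)

# Smoke test on random activations:
layer = MoELayer()
print(layer(torch.randn(2, 8, 512)).shape)  # torch.Size([2, 8, 512])
```

This is the dense variant for readability; real MoE models would add top-k selection over the gate scores so only a few experts run per token.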