Is training a model on second-hand data a form of copyright laundering? Second-hand data is data generated by a model that was itself trained on copyrighted content.
Let's say I train a diffusion model on ten million images generated by diffusion models that have seen copyrighted data, making sure to remove near-duplicates from my training set. My model will learn the styles, but not the exact compositions, of the original dataset. It won't be able to replicate any original work, because it has never seen one.
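The near-duplicate removal step could be sketched with a simple perceptual average hash (aHash): hash each image, then drop any image whose hash is within a few bits of one already kept. This is a minimal illustration, not a production pipeline; it assumes images have already been downscaled to 8x8 grayscale grids (a real system would decode and resize with an imaging library).

```python
def average_hash(pixels):
    """64-bit perceptual hash: bit i is 1 if pixel i is above the mean.

    `pixels` is assumed to be a flat list of 64 grayscale values (0-255)
    from an 8x8 downscaled image.
    """
    mean = sum(pixels) / len(pixels)
    return sum(1 << i for i, p in enumerate(pixels) if p > mean)

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedup(images, threshold=5):
    """Keep an image only if its hash differs from every kept hash
    by more than `threshold` bits."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(hamming(h, k) > threshold for k in hashes):
            kept.append(img)
            hashes.append(h)
    return kept
```

A near-identical image produces a hash within a bit or two of the original and gets filtered out, while a structurally different image survives; the threshold trades off how aggressively "near" duplicates are defined.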
Is this a neat way of separating ideas from their expression? Copyright is supposed to cover only expression. This kind of information laundering follows the definition to the letter, taking only the part that is permissible to take (the ideas) while hiding the original expression.