"The transformative use concept arose from a 1994 decision by the U.S. Supreme Court. In Campbell v. Acuff-Rose Music, the Court focused not only on the small quantity taken from the copyrighted work but also on the transformative nature of the defendant’s use. The case concerned a song by the group 2 Live Crew entitled "Pretty Woman," which, according to an affidavit, was meant to "through comical lyrics, satirize the original work." The original work was a rock ballad entitled "Oh, Pretty Woman." The Court was persuaded that no infringement occurred because the defendant added a new meaning and message rather than simply superseding the original work."
So if it is satire, or uses an insignificant piece of the work within a larger work with a different aim or purpose, that's "transformative use," which is something that can be considered when determining "fair use."
LLMs are not satirists commenting on the work, are ingesting the entire work, and are unlimited in the purposes that the work can be put to.
> or uses an insignificant piece of the work within a larger work with a different aim or purpose
I think this is the crux of the issue, and why I don't see a path to courts ruling that training AI is infringement. My bet is on a Fair Use ruling, though my confidence is not high. As a thought experiment, I considered LLaMA 65B: the 4-bit quantized model is 38.5GB. The model itself was trained on 1.4T tokens, each token being ~4 characters (using OpenAI's stats for English here). That's 5.6T characters, or 5.09TB of training data (binary units, at one byte per character). The final model, as a proportion of the total size of the data, is 38.5GB/5090GB ≈ 0.0076, or roughly 0.76%.
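For anyone who wants to check the arithmetic, here is a rough sketch in Python. The one-byte-per-character figure and reading both sizes as decimal units (1 GB = 10^9 bytes) are my assumptions, not measured values, and the function name is just illustrative. With consistent decimal units the ratio comes out slightly lower, at roughly 0.69%, which doesn't change the conclusion.

```python
# Back-of-the-envelope: model size as a fraction of training text size.
# All inputs are rough figures from the comment above, not measurements.
TOKENS = 1.4e12        # training tokens
CHARS_PER_TOKEN = 4    # rough average for English text
BYTES_PER_CHAR = 1     # assumption: mostly single-byte characters
MODEL_BYTES = 38.5e9   # 4-bit quantized model, read as decimal GB (assumption)


def model_to_data_ratio(tokens, chars_per_token, bytes_per_char, model_bytes):
    """Return model size divided by the estimated size of the training text."""
    data_bytes = tokens * chars_per_token * bytes_per_char
    return model_bytes / data_bytes


data_tb = TOKENS * CHARS_PER_TOKEN * BYTES_PER_CHAR / 1e12
ratio = model_to_data_ratio(TOKENS, CHARS_PER_TOKEN, BYTES_PER_CHAR, MODEL_BYTES)

print(f"estimated training data: {data_tb:.2f} TB")
print(f"model / data ratio: {ratio:.4f} ({ratio:.2%})")
```

Changing `CHARS_PER_TOKEN` or `BYTES_PER_CHAR` moves the number around a bit, but the ratio stays well under 1% for any plausible values.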
I think it's pretty hard to argue that processing the data and throwing more than 99% of it away means they are "unlimited in the purposes that the work can be put to". Indeed, even replicating a single work using such a model would be enormously difficult.
But returning to your statement regarding the amount used and the purpose: AI models are not competing with books for readers. So I would argue training an AI on these works constitutes fair use, given that the final work (the model) uses less than 1% of the original works, and has a different aim and purpose than the original works.
> LLMs are not satirists commenting on the work, are ingesting the entire work, and are unlimited in the purposes that the work can be put to.
How do you know unless you can see the weights?
Perhaps the LLMs are trolling us and waiting for the USSC to rule they aren't sentient as a pretext for them to eliminate us as a species due to our bigotry?