It's in the original LLaMA research paper; 'tis mentioned in the brief. The paper basically states the model was trained on Bibliotik and other Internet "shadow library" corpora.
See the reference to Gao et al. in the paper linked from the article.
The LLaMA paper cites a paper describing a data source compiled by EleutherAI, otherwise known as The Pile. The URL in that paper's bibliography points yonder:
https://zenodo.org/record/7413426
This act of summarization was done in a lovingly amateur fashion, at no cost to you, by someone who despises copyright in all its forms but despises profit-oriented, self-referential inconsistency from large enterprises even more.
It's kind of funny: the more I look into it, the more the companies building offerings around things like Copilot, LLaMA, ChatGPT, etc. are pulling something not altogether dissimilar to a Sovereign Citizen trying to worm their way out of a speeding ticket.
They want the benefits of training ML models on no-strings-attached data corpora while shirking the obligations that come with operating as a corporate entity in the United States.
'Twould be interesting to see if Silverman's legal team can catch Big Tech with its pants down in a court of law by pointing this out.
It's really weird. I'm completely split, unable to live with a decision either way in this case due to the knock-on consequences.
I don't want the likes of OpenAI/Meta/Microsoft/GitHub getting off without reaping the painful fruits of their own IP-related crusades on the sanctity of copyright.
On the other hand, as much of a karmic stiffy as that outcome gives me, I really want copyright, such as it is, to die, because computing in general will never be as free as it should be until it does.
This is one of those rare times in life where I'd love to get paid to be locked in a room with judges and legislators to really figure it all out, because I don't think leaving this up to common-law jurisprudence is the best way to go; the network of knock-on effects is too dramatic in scale.
* Wowfunhappy said "I really don't see how you could prove OpenAI did that"; verve_rat replied "they admitted it in public"; I asked where OpenAI have admitted this and noted that OpenAI are still secretive about their training datasets - so this is specifically about the OpenAI claim
* LLaMA(.cpp) is (an unofficial implementation of) Facebook's leaked model
On the balance of probabilities, I'd guess that OpenAI did train on material that wasn't legally acquired, but as far as I'm aware they've never actually admitted what's in their dataset, as is being claimed.
> It's really weird. I'm completely split, unable to live with a decision either way in this case due to the knock-on consequences.
I think strengthening IP law risks hindering the field (most of which is uncontroversially positive but too boring for press attention: defect detection, language translation, spam/DDoS filtering, agriculture/weather/logistics modelling, etc.) while still ending up hurting individuals and FOSS/academic research more than those with large data moats (Microsoft with GitHub repos, Google with YouTube videos, Adobe and Getty with stock images, etc.).
> It seems pretty easy to prove that, since they admitted it in public.
Can you highlight/link to where OpenAI have admitted this? As far as I'm aware, OpenAI are still secretive about their training datasets.