I only spent a few minutes skimming the paper, but:
1) there are a lot of papers claiming to be the successor to the Transformer, and not all of them are cited; e.g., the MetaFormer is missing
https://arxiv.org/abs/2111.11418.
Another candidate that wasn't compared against (nor was it argued why such a comparison wouldn't make sense) is the Hopfield Network
https://arxiv.org/abs/2008.02217.
So until a more solid Related Work section is written (their section is actually called "Relation to and Differences from Previous Methods") I reserve the right to be skeptical whether their model is the "best" successor to the Transformer.
2) they say in the abstract "We theoretically derive the connection between recurrence and attention" but I couldn't find a longer theorem-proof section. So either this is done only in a cursory manner, or the proof is very easy.
Recurrence and attention have been around for a long time as concepts, so surely there are already proofs of this fact in similar contexts (I am not working in this particular area of Machine Learning, so I don't know the SOTA by heart, but I strongly suspect that these aspects have been discussed previously; the Hopfield Network paper I linked to unearths some theoretical facts about attention, for example).
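To illustrate what I mean by "easy": once the softmax is dropped, the parallel attention-style sum and a recurrent state update are literally the same computation. Here is a minimal numpy sketch - my own, not taken from the paper, with a made-up decay factor - checking that the two forms agree:

    import numpy as np

    T, d = 8, 4
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, T, d))
    decay = 0.9  # exponential decay per step; 1.0 gives plain causal linear attention

    # Parallel, attention-like form: decayed causal mask applied to Q K^T, then times V
    scores = Q @ K.T
    mask = np.array([[decay ** (t - s) if s <= t else 0.0 for s in range(T)] for t in range(T)])
    out_parallel = (scores * mask) @ V

    # Recurrent form: carry a d x d state S_t = decay * S_{t-1} + k_t v_t^T, output o_t = q_t S_t
    S = np.zeros((d, d))
    out_recurrent = np.zeros((T, d))
    for t in range(T):
        S = decay * S + np.outer(K[t], V[t])
        out_recurrent[t] = Q[t] @ S

    print(np.allclose(out_parallel, out_recurrent))  # True

Whether the paper's actual derivation amounts to much more than this, I can't say from a skim.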
So, based on my very cursory reading, this paper seems like an interesting approach, but I do see some holes in the execution. Time will tell whether the Retentive Network will become mainstream or not.
Ok, this was my five minute review of the paper. Now I have to urgently return to completing my actual reviews for NeurIPS, haha.
> 1) there are a lot of papers claiming to be the successor to the Transformer, and not all of them are cited; e.g., the MetaFormer is missing https://arxiv.org/abs/2111.11418. Another candidate that wasn't compared against (nor was it argued why such a comparison wouldn't make sense) is the Hopfield Network https://arxiv.org/abs/2008.02217.
Neither of those papers is NLP-applicable? And I think it's perfectly fair to focus on the alternatives (i.e., H3 and RWKV) that have been able to scale up to LLM levels and perplexity, which neither of the alternatives you mention has. Should they just cite every 'is All You Need' paper?
No, but it would be reasonable to have a small section in the paper that illustrates why the models that seem most pertinent at first sight (like the ones I cited) are actually not applicable.
The onus is on the authors to place their research in context and provide compelling arguments - not on the reader to guess why their model was compared against model A, but not model B.
What do I mean by "pertinent"? Of course it is not necessary to cite every "All You Need" paper.
But:
(A) I'd argue it would be necessary to cite those "All You Need" papers that have either gathered a fair amount of citations or media attention (which is the case for both of the papers I linked to), or are meaningful descendants (in the "has been cited by" tree) of those papers.
As I said, this is not really my field - but I would say there is some chance that, among the hundreds of papers that have cited the papers I linked, some have been scaled up to LLM levels and use basically the same MetaFormer/Hopfield architecture.
(B) If the above isn't the case and none of those models have been scaled to LLM levels - that's fine too. But then please tell the readers that you did due diligence and found that there actually is this gap in the literature (of course, feel free to close it yourself and then be the first one to train one of those models to such scales; that's the reward for doing a solid literature review - and who knows, maybe you stumble upon an even better model that will get you many citations).
(C) If you cannot perform a comprehensive literature search, but the models you compare against cover 90% of the models that are out there (in production or research), and you can back that claim up - then, of course, you're safe too, and I'd be very happy to be able to congratulate you on really having managed to achieve a breakthrough.
(D) Even if none of these things apply and it just costs too much in terms of computational power to train many other potentially competing models, or is too cumbersome to carry out a comprehensive literature review - that's also fine. You can then simply constrain your paper more and consider a more precisely defined slice of models, for which you can then actually do a thorough literature review and comparison. You'd also need to adapt your title, though, so that it reflects the more precise scope. And please don't take this negatively: as a reader, I'd much rather have a model that is proven to the highest scientific standard to be state-of-the-art within a narrower scope than a broader claim with only a moderate amount of evidence backing it up.
So, PLEASE, don't leave us guessing!
If you have a new candidate model that you claim is the "successor" - a strong word - but compare it to just 6 other models, and, importantly, you don't let the reader know which of the options (A) to (C) applies, then you have to go with (D).
Machine learning is already too full of papers whose titles are overly broad. Somehow, other scientific disciplines manage much more sober title formulations, yet ML insists on colorful titles that usually are not particularly informative (and yes, I consider "XYZ is All You Need" to be an example of such a bad title).
Critiques of omission are the easiest ones to levy and are informationally asymmetric in that the person leveling them will typically have more information on the topic being omitted than the person being critiqued. Hopfield networks & Metaformers are largely irrelevant in the domain of potential transformer replacements for LLM. Unless they have been shown to be relevant in any way, I don't see why the paper need cite them (and thus build some recursive case that future papers must cite these papers).
I'll just say I'm glad that I'm not submitting papers to be reviewed any more :)
> Unless they have been shown to be relevant in any way, I don't see why the paper need cite them.
Fair. Your argument then falls precisely into category (C) of the four mutually exclusive options I outlined above.
But you'd then need to argue why the 6 models you compared against constitute the comprehensive sample of relevant competitors to test against - and not just some arbitrary set of recent models that happen to be dominated by the newly proposed model.
(And maybe that is indeed the case; then it should be easy enough to update the arxiv draft by incorporating a section where you argue along those lines.)
If this is true, it would be something close to an insane situation: one of the largest datasets, which the largest companies are (probably) using to train their models - many of the best LLMs have technical reports that raise more questions than they answer - being forced to live an obscure existence on torrents.
From a scientific point of view this is very problematic, because few safeguards exist to guarantee that the dataset is not tampered with (as would be the case if you uploaded it to Zenodo, which provides some guarantee of immutability).
How about trying to upload the Pile to Zenodo?
Only half-joking :D
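The kind of safeguard I have in mind is nothing fancy - just a published digest you can check a download against. A minimal sketch (the shard name and digest below are hypothetical placeholders):

    import hashlib

    SHARD = "pile_shard_00.jsonl.zst"  # hypothetical file name
    PUBLISHED_SHA256 = "<digest from an immutable, trusted release page>"

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        # Stream the file in 1 MiB chunks so large shards don't need to fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256_of(SHARD) == PUBLISHED_SHA256)

The check itself is trivial; the hard part is having a trusted, immutable place to publish the digests - which is exactly what something like Zenodo would provide.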
Could you share more about copyright? For example, aren't you worried that, with all kinds of lawsuits happening [1] and copyright issues that were found in existing datasets [2], you might get threatening letters from a lawyer some day?
I'm the author of [3], where we introduced one of the first natural-language datasets that test graduate mathematics for LLMs, but we took some of the prompts from a copyrighted book and therefore thought about excluding them. Having them in the public dataset would be really nice though, hence I'm keen to hear about your experience.
I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?
I think a lot of hackers shy away from doing impactful work because of fear. Sometimes those fears are justified, but it's remarkable how often things that seem like a big deal turn out not to matter. My advice for ambitious devs would be to do what seems interesting, and don't worry too much about threatening letters. Usually the worst thing that happens is that you agree to stop doing whatever generated the threat.
Personally, I'm not worried. It would be a damn shame if academics come under fire merely for trying to operate on the cutting edge of science. None of us were trying to make money; we just wanted to make something interesting.
> I'd also be keen to hear how your challenge against the DMCA on sharing LLaMA's weights goes?
Thanks! I think we might be putting up a website for it soon, if only to explain ourselves. In the meantime – I hate this phrase, since I don't want followers – the only way to keep informed is to follow my Twitter, and perhaps keep an eye on my HN comments.
You'll probably hear about it either way though, since it's a groundbreaking case. No one has tested the copyrightability of ML models before.
It’s the other way around — Meta DMCA’d llama-dl, my github repo, claiming they control copyright of llama. Our assertion is that ML weights are uncopyrightable, much like a phone book - training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time.
I can send you the draft we’ve prepared if you’re interested — drop me an email. But I’ll probably set up a site for this, if only to clear up our motives and expected outcomes.
Ah, I mis-interpreted 'I’ll be participating in a legal action against Meta' as you guys bringing a counter suit of some sort. Thanks for clearing that up.
Copyright lawsuits are usually a case of who has the most stamina and hence the biggest wallet. Your funding will be a very important part of the outcome, regardless of the legal merits of your defense. You may also want to get out from under any kind of control by GH, because they are strongly connected to OpenAI through Microsoft and hence have a stake in getting rid of any reasonably competent open-source LLM.
Make sure you know what you are in for, lawsuits with large counterparties are a rodeo and even if you win they can make your life miserable with endless appeals. You will have to be prepared to spend years on this. Much good luck and if you set up a site do post the link.
"training a model on the same dataset in the same way usually gives more or less the same model, even if the weights are completely different each time."
Couldn't you use a similar type of argument to say that different implementations of the same software API are basically the same software even though the instructions are completely different each time? And, since they do the same thing (aka compatible), that software implementations can't be subject to copyright for that reason? I don't think that holds up, esp in a pro-copyright country.
Here's one I came up with in case your lawyers can use it. My original goal was a license for proprietary content to be used in LLM's where the creators were worried about verbatim extraction or whether their content was sufficiently mixed in with other data. It was about motivating them to let us train on such data. I'll start with those terms:
"1. Percentage of total data. The copywritten work must not be larger than N% of total, training data put into a model. If it's tiny enough, one might be able to argue it only adds so much weigh to the outputs. What if it's the only data of its type, though?
2. Merged with similar data. The copyrighted work must be one of multiple examples of the same type of data. For instance, there might be many examples given to the model about what files are, how to generate them, how to do it in Python, and specific examples in Python. When it generates Python code, any or all of this might have contributed to it.
3. Ratio of data set size to number of parameters. The content owners might want the training data to exceed the number of parameters by a multiplier N. For instance, at least 10GB (a multiplier of 10) or 100GB (a multiplier of 100) going into a 1B-parameter model.
4. Diverse data. The content owner might want a wide range of data on many topics to go into the model. They might even specify certain data sets, a minimum number of topics, or even a number of word vectors per word used (their keywords). Once again, the odds that the model is just repeating one piece of data go down as the amount of data and the number of similar words in the model go up."
So, basically, you'd be trying to set a standard under which model creators can put anything they legally have access to into their LLMs. Are the LLMs then carrying the owners' I.P. or something novel? If novel, we're safe from lawsuits. If LLMs and their outputs are not copyrightable, we'd be doubly safe in that situation. So maybe use criteria like the above to decide what's novel, where anything within certain numbers or combinations would count as novel automatically, by law or court precedent. What do you think?
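To make points 1 and 3 concrete, here is a rough sketch of how such thresholds could be checked mechanically - every name, unit, and threshold here is hypothetical, purely to illustrate the idea:

    def meets_criteria(work_bytes: int,
                       total_training_bytes: int,
                       n_parameters: int,
                       max_share: float = 0.01,      # point 1: work is at most N% of the training data
                       min_multiplier: float = 10.0  # point 3: training data exceeds parameter count by N
                       ) -> bool:
        share_ok = work_bytes <= max_share * total_training_bytes
        ratio_ok = total_training_bytes >= min_multiplier * n_parameters
        return share_ok and ratio_ok

    # e.g. a 50 MB book inside 100 GB of training data for a 1B-parameter model
    print(meets_criteria(50_000_000, 100_000_000_000, 1_000_000_000))  # True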
Congratulations, great paper! It should have been put on HN earlier ;)
I have a few questions:
* you say (page 4): "We then perform standard instruction finetuning on the base LLaMA-7B model". Could you perhaps provide a reference to the _exact_ finetuning approach you used? I'm afraid different groups of people have a different notion of "standard" (see for example pages 131-155 of https://arxiv.org/abs/2302.08575 for various fine-tuning approaches), and without knowing exactly how the fine-tuning was carried out, it can be very difficult to reproduce your research and results exactly.
* the idea of using AST Sub-Tree Matching is nice. Could you please let me know which function in which file from your GitHub repository this is implemented in?
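I couldn't locate it myself. For anyone else reading along, here is a minimal sketch of what such a check might look like using Python's ast module - this is my own guess at the idea, not the authors' implementation:

    import ast

    def subtree_matches(candidate_code: str, pattern_expr: str) -> bool:
        """Check whether the AST of `pattern_expr` appears as a subtree of `candidate_code`."""
        pattern = ast.parse(pattern_expr, mode="eval").body
        target = ast.dump(pattern, annotate_fields=False)
        tree = ast.parse(candidate_code)
        return any(ast.dump(node, annotate_fields=False) == target for node in ast.walk(tree))

    print(subtree_matches("def add(a, b):\n    return a + b", "a + b"))  # True
    print(subtree_matches("def add(a, b):\n    return a - b", "a + b"))  # False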
I'm still not sure though about some nitpicky things:
- do you change all the weights, or just the ones from the last layer, when fine-tuning? (see the short sketch after this list for the two options I have in mind)
- do you just train on the _code_ field from the JSON file with the self-instruct data, or do you also use the other fields to train (or do you use the other fields just for downstream evaluation purposes)?
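To spell out the first question, here is a tiny PyTorch sketch of the two regimes, with a toy model standing in for LLaMA-7B (names and sizes are made up for illustration):

    import torch.nn as nn

    # Toy stand-in for the real model, just to show the two fine-tuning regimes.
    model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64), nn.Linear(64, 1000))

    # Regime 1: full fine-tuning - every parameter keeps requires_grad=True (the default).
    trainable_full = [p for p in model.parameters() if p.requires_grad]

    # Regime 2: freeze everything, then unfreeze only the last layer.
    for p in model.parameters():
        p.requires_grad = False
    for p in model[-1].parameters():
        p.requires_grad = True
    trainable_head_only = [p for p in model.parameters() if p.requires_grad]

    print(len(trainable_full), len(trainable_head_only))  # 5 parameter tensors vs 2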
I think it could be a major selling point of your paper if, on GitHub (or in an appendix to your preprint, if you update it on arXiv), you had a section where you document the training process in detail.
In February I published a paper on mathematics + ChatGPT (https://arxiv.org/abs/2301.13867) and my colleagues, who were also on LinkedIn, told me it garnered quite a bit of attention.
So I signed up to LinkedIn on 3rd Feb. 2023, thinking that now would be a good time to step out of the academic ivory tower.
The thing is that I used my Gmail account to sign into LinkedIn. Which worked fine, and I was a happy user for about a month. I then went travelling, and when I returned one month later to log in again with Gmail - surprise! A new account was created on the spot.
> OpenAI's chosen not to release any real details about GPT-4
Actually, they have released some details about it, in this 99-page technical report https://arxiv.org/abs/2303.08774 (which is actually two papers stitched together, once you read it; oddly enough using different fonts).
But I'm not sure if this content qualifies as "real details".
> Given both the competitive landscape and the safety implications of large-scale models like GPT-4, this report contains no further details about the architecture (including model size), hardware, training compute, dataset construction, training method, or similar. We are committed to independent auditing of our technologies, and shared some initial steps and ideas in this area in the system card accompanying this release. We plan to make further technical details available to additional third parties who can advise us on how to weigh the competitive and safety considerations above against the scientific value of further transparency.
In other words, "Stable Diffusion wasn't supposed to happen, so we're making all our methodology trade secret[0], if you want to Do Science then agree to this massive NDA and have enough skin in the game for us to cut you."
[0] Presumably at some point OpenAI will have to 'relent' to independent discovery by patenting AI architectures and refusing to license them
I offer machine/deep learning consultancy: happy to help you understand state-of-the-art research and translate models implemented in recent papers into production code.
I can only work part-time (either regularly a few days per week or larger but fixed blocks of time), because I have ongoing research projects in machine learning that I need to finish, though I can work on these remotely if I need to relocate.
I have two distinct graduate degrees, one in CS and one in pure mathematics, and an eye for detail, which is crucial when it comes to doing a sensible implementation of certain neural networks.
Remote: Yes
Willing to relocate: Inside EU
Technologies: LLMs -> Python
Résumé/CV: On request
Email: sfriedr_job3323@posteo.co
........................
I earned two distinct graduate degrees, one in CS and one in pure mathematics, before I started with AI/ML.
I can work part-time (either regularly a few days per week or larger but fixed blocks of time).
Tell me what your needs are, so we can see if we can make it work.