Something I've noticed with open-weight models is the rush to judgment as soon as they are released. But most people aren't actually running these models in full fp16 with the code supplied; they're using quantized versions with tip-of-tree patches to libraries like llama.cpp to get them running. Posts like this just show that it takes a while for the software side of a model to get all the kinks worked out. We saw this with Mixtral (new architecture), CodeLlama-70b (new, very strict prompt format), and now Gemma.
In some ways it makes me so excited realizing how early this technology still is! There's going to be so much innovation and so many cool things built over the next several years, and so much new stuff to learn!
Oh yes, that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision-based - i.e. it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
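For anyone curious, here's a tiny sketch of both effects in plain PyTorch (my own toy check, not Unsloth's actual test code) - the exact vs tanh-approximate GELU gap, and how much extra error bfloat16 adds on top:

    import torch

    x = torch.randn(4096, dtype=torch.float32) * 10

    # Exact (erf-based) vs tanh-approximate GELU already differ slightly in float32
    exact  = torch.nn.functional.gelu(x, approximate="none")
    approx = torch.nn.functional.gelu(x, approximate="tanh")
    print("exact vs tanh GELU (fp32):", (exact - approx).abs().max().item())

    # Running the exact GELU in bfloat16 loses far more precision than float32
    exact_bf16 = torch.nn.functional.gelu(x.to(torch.bfloat16), approximate="none").float()
    print("fp32 vs bf16 exact GELU:  ", (exact - exact_bf16).abs().max().item())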
The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity tends to measure whether the output looks correct; it doesn't measure performance (roughly equivalent to "intelligence") when you need more nuance.
I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to the performance of the models.
So I'd be very skeptical of people trying to evaluate the performance (i.e. intelligence level) of models with anything other than the stack (preferably down to the hardware) suggested by the party that released the model.
Oh, I actually missed this! I normally follow LocalLlama a lot, but I just haven't kept up recently!
In terms of quantization losing accuracy - this actually does happen. The perplexity seems fine because perplexity is generally calculated from a single forward pass of the model, i.e. not via generation. This means perplexity effectively measures the accuracy of just the first predicted token. Imagine you have 99% accuracy and 1% error due to quantization: over 100 generated tokens, the chance of getting them all right is 0.99^100 ≈ 36.6%. So over long contexts, quantization can definitely cause problems.
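To make that concrete, here's roughly how perplexity is usually computed - one teacher-forced forward pass, never from the model's own generations (a sketch assuming a Hugging Face-style causal LM whose forward pass returns .logits):

    import torch
    import torch.nn.functional as F

    def perplexity(model, input_ids):
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]  # predictions for positions 1..n
        targets = input_ids[:, 1:]                       # the tokens that actually follow
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return torch.exp(ce)                             # perplexity = exp(cross-entropy)

    # Compounding-error intuition from above: 99% per-token accuracy over 100 tokens
    print(0.99 ** 100)  # ~0.366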
Creating quantization-aware approaches where long contexts don't get affected sadly becomes a computational challenge. In terms of Unsloth specifically, finetuning is 2x faster on both 16-bit and quantized models :)
While we're on this topic, I wonder whether you have comments about this --
Given that a sentence has a lot of redundant data (grammatical constructs, etc.), saying a model has 99% accuracy might not mean much if it diverges on the "critical" tokens -- for example the keyword in a paragraph, or the relatively surprising twist in an article.
That's kind of how I interpret "to me it seems that a low perplexity just means that the model is able to produce coherent, readable sentences" (from the LocalLlama post). A model that can write English can have a low perplexity since it's averaged out, but if it can't recall the "correct" token at the critical point, it will still underperform despite the low perplexity.
Intuitively this might depend on whether "intelligence" depends on the precision in the bits. It's super hard to measure, which is why even subjective anecdotes or bare assertions like the ones in the post are still interesting.
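A toy example of what I mean by "averaged out" - one badly-missed critical token barely moves perplexity when the other 99 tokens are confidently predicted (numbers made up purely for illustration):

    import math

    good = [0.9] * 100           # probability the model assigns to each true token
    bad  = [0.9] * 99 + [0.001]  # same, but one catastrophically missed "critical" token

    ppl = lambda probs: math.exp(-sum(math.log(p) for p in probs) / len(probs))
    print(ppl(good))  # ~1.11
    print(ppl(bad))   # ~1.19 - barely worse, despite the critical miss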
Hey! I agree - if it can't recall the correct token at a "critical point", then even if perplexity is low, the sentence definitely becomes unusable.
The main issue is that perplexity is just exp(CE_loss), so minimizing cross-entropy loss is essentially the same as minimizing perplexity. And CE is just the negative log probability of the next token.
We need some new loss function which, say, also minimizes the loss on the 2nd or 3rd token ahead - that could probably be more effective. Sadly it's more computationally expensive, and in the long run it might turn out to be equivalent to just minimizing CE.
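Purely as a sketch of the idea (my own toy construction, not anything Unsloth actually ships), one crude version would be to add a penalty for the token a couple of positions ahead, predicted from the same logits:

    import torch
    import torch.nn.functional as F

    def lookahead_ce(logits, input_ids, skip=2, alpha=0.5):
        vocab = logits.size(-1)
        # standard next-token cross entropy
        next_ce = F.cross_entropy(
            logits[:, :-1].reshape(-1, vocab), input_ids[:, 1:].reshape(-1)
        )
        # extra penalty on the token `skip` positions ahead, from the same logits
        ahead_ce = F.cross_entropy(
            logits[:, :-skip].reshape(-1, vocab), input_ids[:, skip:].reshape(-1)
        )
        return next_ce + alpha * ahead_ce

In practice you'd want a proper multi-token prediction head or extra forward passes, which is exactly the computational cost mentioned above.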
This gave me lots of confidence in Unsloth when I first read it.
I'll admit I was a little skeptical of Unsloth, since anything that boasts a free perf improvement just by dropping in some middleware makes me suspicious. Especially from such a small team.
I assumed it was just introducing some hacks that create an inexact implementation of attention, or some faster-but-inaccurate CUDA kernels, or something.
But now I believe this small team really knows their stuff :)
I know the founder personally - he interned at Nvidia and contributed many performance improvements. He's the real deal, just really enthusiastic, so it may come off as boastfulness ;)
Daniel is one of the best engineers I have ever worked with. An engineer in the true sense of wanting to know how something works and figuring out ways to improve it!
They’ve had their work applauded by Karpathy and Jeremy Howard as well, which is about the best endorsement you could ever get for open source AI stuff:
I’ve been using the library since it started out and it works really well. Daniel is also super helpful and responsive in their Discord, assisting everyone from the most basic users to breaking down complex ML math stuff.
Thanks to Andrej and Jeremy as well :) And also thanks to community members like you! It makes me super happy to keep making Unsloth better so appreciate it a lot!
Oh thanks! I get that a lot :) But ye, there are no approximations at all! Just special maths hacks with no degradation, rewriting everything, creating a custom backprop engine, sprinkling Triton / CUDA everywhere, and more :)
But thanks for believing in me + my bro even more :) Appreciate it a lot!
Incredible work by the author stepping through all the nitty-gritty details and showing how easy it is to miss something subtle that could degrade performance.
Yep! The goal was to implement Gemma in Unsloth to make finetuning faster and use less VRAM, and my reimplementation seems to get different results than the current ones.
Ye it was indeed very gruelling - but very fun!! I used torch.dist everywhere, read all the implementations side by side to compare them, and had to manually inspect losses, plot them, etc. It's a bit hard to automate sadly, since new archs cause new issues.
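The torch.dist part looks roughly like this (a simplified sketch with placeholder tensor names, not the actual debugging harness):

    import torch

    def compare(name, a, b, tol=1e-3):
        diff = torch.dist(a.float(), b.float()).item()  # L2 distance between the two tensors
        print(f"{name}: dist={diff:.6f} {'OK' if diff < tol else 'MISMATCH'}")

    # e.g. compare layer 0 hidden states from a reference impl and the reimplementation
    # compare("layer0.hidden", ref_hidden_states[0], my_hidden_states[0])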
I was just thinking how terrible the website was because it doesn't degrade gracefully. There's no information if you don't successfully execute everything associated with the page - just a blank white page with nothing on it. For a blog post with text and images this is really bad. The text and images should be there in the HTML, with the dynamic elements loaded on top.
Even when I load it in the browser I use for banking, etc., I still get errors - the JS doesn't run quite right and I get "NameError: name 'torch' is not defined", "NameError: name 'FastLanguageModel' is not defined", etc.
Oh ye you'll have to click "Runtime" -> "Run All". I think you probably forgot to execute the installation cell.
Apologies for the website rendering issues :( I normally find Colab to be reasonably responsive, so presumably the Javascript components are breaking :( Many apologies :(
If I load it in a private tab it doesn't ask, but it does ask when I use a browser session where I'm already logged in to Google (I haven't ever loaded Colab in this particular account).
Edit: the comment below refers to Gemini, not Gemma. As such the first paragraph is largely irrelevant, and only the second one applies.
To me, it feels as though the boat has been missed somewhat. The restrictions on Gemini make it unhelpful, but more than that, Claude 3 has really blown me away with its code suggestions. It's performing better than Mistral Large, GPT4 and Gemma in my tests, especially on large chunks of code. It also returns the whole thing with the changes applied, making it much easier to plug and play. Astonishingly, it also manages to combine ideas much better than any other LLM I've seen to date.
I suspect these fixes and the knowledge gained will be helpful to the community however, and will help improve the next iteration of models.
Claude 3 is very capable, but it is (likely) a 1T-class model, not something that can be run on the edge, while 7B-class models can already run on phones and can easily be fine-tuned for specialized work where they perform comparably to those big general models.
If you are talking to one model, by all means use the best one you have available. (Personally, Claude not having a code interpreter / not being able to self-evaluate code still often makes it less useful than ChatGPT, or even smaller open models like OpenCodeInterpreter - OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 w/ CI, on HumanEval+ and MBPP+ [1][2].) Recently I've been swapping between GPT4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).
One issue with Gemma that doesn't get mentioned enough IMO is that while it claims to be 7B, it's really 8.54B parameters. It also has a gigantic tokenizer, so memory-wise, even quantized, it is going to use significantly more than comparable 7B models. Once you're getting to 9B, you have other options, such as the new Yi-9B, or if you want Apache-licensed (stacked Mistral) models, SOLAR-10.7B or the new bigstral-12b-32k.
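Rough weight-only arithmetic on that 8.54B figure (real usage adds the KV cache, activations, and - for finetuning - gradients on that huge embedding):

    params = 8.54e9  # Gemma's actual parameter count, per the comment above
    for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB for the weights alone")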
Ye, the gigantic tokenizer does eat up a lot of VRAM. Gemma does use tied embeddings (i.e. lm_head == embeddings), which makes them take 50% less space, but it still requires more VRAM since you have to add the gradients up at the end.
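For reference, weight tying just means sharing one matrix between the embedding and the output head - a minimal sketch (sizes are roughly Gemma-7B-like, for illustration only):

    import torch.nn as nn

    vocab_size, hidden = 256_000, 3072        # roughly Gemma-7B-sized, illustrative only
    embed   = nn.Embedding(vocab_size, hidden)
    lm_head = nn.Linear(hidden, vocab_size, bias=False)
    lm_head.weight = embed.weight             # tied: one parameter tensor serving both roles

The shared tensor is stored once, but during training it accumulates gradients from both the embedding lookup and the output projection - which is the "add the gradients up at the end" cost.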
Why are you comparing Claude 3, a ~14B and ~>200B model, to Gemma, a 2-7B model? Of course it's going to do worse. The question for smol models is whether they can do well enough given a performance budget.
Does anyone know if the major dealbreaker “Additional Terms” apply to Gemma? Because I don’t want to touch anything Google related with a 100 foot pole given the following:
> Use restrictions
You may not use the Services to develop machine learning models or related technology.
Law tends to go by plain English meaning. E.g. here, you understand that the idea isn't to ban people from interacting with Gemini, but rather to stop them from using it to develop new models (i.e. using its outputs as inputs for training another model).
The part I find problematic is "Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma." - it's a bit vague - I guess it's not enforceable!
It's probably CYA for any liabilities stemming from issues discovered/not fixed downstream after Google has addressed them. It's hard for Google to enforce offensively, but great for defense!
Ye I guess that's a good point! Sadly some people I chatted to don't really like this, since it constrains and confuses the finetuning aspect of Gemma - i.e. if we finetune on top of it and then Gemma v2 is released, do we have to do another finetune on top of the latest release?