Something I've noticed with open-weight models is the rush to judgment as soon as they are released. But most people aren't actually running these models in full fp16 with the code supplied; they're using quantized versions with tip-of-tree patches to libraries like llama.cpp to get them running. Posts like this just show that it takes a while for the software side of a model to get all the kinks worked out. We saw this with Mixtral (new architecture), CodeLlama-70b (new, very strict prompt format), and now Gemma.
In some ways it makes me so excited realizing how early this technology still is! There's going to be so much innovation and so many cool things built over the next several years, and so much new stuff to learn!
Oh yes, that's a fair point on precision! In fact the majority of issues for Gemma (other than the approximate vs exact GELU issue) are precision-based - i.e. it's fine in float32, but loses a lot of accuracy in the bfloat16 or float16 domain!
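For anyone curious, here's a tiny sketch of both effects in plain PyTorch (my own toy check, not Unsloth's actual test code) - the exact vs tanh-approximate GELU gap, and how much extra error bfloat16 adds on top:

    import torch

    x = torch.randn(4096, dtype=torch.float32) * 10

    # Exact (erf-based) vs tanh-approximate GELU already differ slightly in float32
    exact  = torch.nn.functional.gelu(x, approximate="none")
    approx = torch.nn.functional.gelu(x, approximate="tanh")
    print("exact vs tanh GELU (fp32):", (exact - approx).abs().max().item())

    # Running the exact GELU in bfloat16 loses far more precision than float32
    exact_bf16 = torch.nn.functional.gelu(x.to(torch.bfloat16), approximate="none").float()
    print("fp32 vs bf16 exact GELU:  ", (exact - exact_bf16).abs().max().item())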
The argument is that while perplexity is used as evidence that quantized models perform almost as well as the original float weights, perplexity tends to measure whether the output looks correct; it doesn't measure performance (roughly equivalent to "intelligence") when you need more nuance.
I haven't been able to observe this myself - perhaps I haven't been playing with language models enough (or haven't tried to stretch their abilities to their limits) - but from a theoretical perspective what they say makes a lot of sense. Even at the inference stage, the fine details of the inference software's implementation and parameters could make a big difference to the performance of the models.
So I'd be very skeptical of people trying to evaluate the performance (i.e. intelligence level) of models with anything other than the stack (preferably down to the hardware) suggested by the party that released the model.
Oh, I actually missed this! I normally follow LocalLlama a lot, but I just haven't kept up recently!
In terms of quantization losing accuracy - this actually does happen. The perplexity seems fine because perplexity is generally calculated from a single forward pass of the model, i.e. not via generation. This means perplexity effectively measures the accuracy of just the first predicted token. Imagine you have 99% accuracy and 1% error due to quantization: over 100 generated tokens, the chance of getting them all right is 0.99^100 ≈ 36.6%. So over long contexts, quantization can definitely cause problems.
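To make that concrete, here's roughly how perplexity is usually computed - one teacher-forced forward pass, never from the model's own generations (a sketch assuming a Hugging Face-style causal LM whose forward pass returns .logits):

    import torch
    import torch.nn.functional as F

    def perplexity(model, input_ids):
        with torch.no_grad():
            logits = model(input_ids).logits[:, :-1, :]  # predictions for positions 1..n
        targets = input_ids[:, 1:]                       # the tokens that actually follow
        ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return torch.exp(ce)                             # perplexity = exp(cross-entropy)

    # Compounding-error intuition from above: 99% per-token accuracy over 100 tokens
    print(0.99 ** 100)  # ~0.366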
Creating quantization-aware approaches where long contexts don't get affected sadly becomes a computational challenge. In terms of Unsloth specifically, finetuning is 2x faster on both 16-bit and quantized models :)
While we're on this topic, I wonder whether you have comments about this --
Given that a sentence has a lot of redundant data (grammatical constructs, etc.), saying a model has 99% accuracy might not mean much if it diverges on the "critical" tokens -- for example the keyword in a paragraph, or the relatively surprising twist in an article.
That's kind of how I interpret "to me it seems that a low perplexity just means that the model is able to produce coherent, readable sentences" (from the LocalLlama post). A model that can write English can have a low perplexity since it's averaged out, but if it can't recall the "correct" token at the critical point, it will still underperform despite the low perplexity.
Intuitively this might depend on whether "intelligence" depends on the precision in the bits. It's super hard to measure, which is why even subjective anecdotes or bare assertions like the ones in the post are still interesting.
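A toy example of what I mean by "averaged out" - one badly-missed critical token barely moves perplexity when the other 99 tokens are confidently predicted (numbers made up purely for illustration):

    import math

    good = [0.9] * 100           # probability the model assigns to each true token
    bad  = [0.9] * 99 + [0.001]  # same, but one catastrophically missed "critical" token

    ppl = lambda probs: math.exp(-sum(math.log(p) for p in probs) / len(probs))
    print(ppl(good))  # ~1.11
    print(ppl(bad))   # ~1.19 - barely worse, despite the critical miss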
Hey! I agree - if it can't recall the correct token at a "critical point", then even if perplexity is low, the sentence definitely becomes unusable.
The main issue is that perplexity is just exp(CE_loss), so minimizing cross-entropy loss is essentially the same as minimizing perplexity. And CE is just the negative log probability of the next token.
We need some new loss function which, say, also minimizes the loss on the 2nd or 3rd token ahead - that could probably be more effective. Sadly it's more computationally expensive, and in the long run it might turn out to be equivalent to just minimizing CE.
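Purely as a sketch of the idea (my own toy construction, not anything Unsloth actually ships), one crude version would be to add a penalty for the token a couple of positions ahead, predicted from the same logits:

    import torch
    import torch.nn.functional as F

    def lookahead_ce(logits, input_ids, skip=2, alpha=0.5):
        vocab = logits.size(-1)
        # standard next-token cross entropy
        next_ce = F.cross_entropy(
            logits[:, :-1].reshape(-1, vocab), input_ids[:, 1:].reshape(-1)
        )
        # extra penalty on the token `skip` positions ahead, from the same logits
        ahead_ce = F.cross_entropy(
            logits[:, :-skip].reshape(-1, vocab), input_ids[:, skip:].reshape(-1)
        )
        return next_ce + alpha * ahead_ce

In practice you'd want a proper multi-token prediction head or extra forward passes, which is exactly the computational cost mentioned above.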
This gave me lots of confidence in Unsloth when I first read it.
I'll admit I was a little skeptical of Unsloth, since anything that boasts a free perf improvement just by dropping in some middleware makes me suspicious. Especially from such a small team.
I assumed it was just introducing some hacks that create an inexact implementation of attention, or some faster-but-inaccurate CUDA kernels, or something.
But now I believe this small team really knows their stuff :)
I know the founder personally - he interned at Nvidia and contributed many performance improvements. He's the real deal, just really enthusiastic, so it may come off as boastfulness ;)
Daniel is one of the best engineers I have ever worked with. An engineer in the true sense of wanting to know how something works and figuring out ways to improve it!
They’ve had their work applauded by Karpathy and Jeremy Howard as well, which is about the best endorsement you could ever get for open source AI stuff:
I’ve been using the library since it started out and it works really well. Daniel is also super helpful and responsive in their Discord, assisting everyone from the most basic users to breaking down complex ML math stuff.
Thanks to Andrej and Jeremy as well :) And also thanks to community members like you! It makes me super happy to keep making Unsloth better so appreciate it a lot!
Oh thanks! I get that a lot :) But ye, there are no approximations at all! Just special maths hacks with no degradation, rewriting everything, creating a custom backprop engine, sprinkling Triton / CUDA everywhere, and more :)
But thanks for believing in me + my bro even more :) Appreciate it a lot!
Incredible work by the author stepping through all the nitty-gritty details and showing how easy it is to miss something subtle that could degrade performance.
Yep! The goal was to implement Gemma in Unsloth to make finetuning faster and use less VRAM, and my reimplementation seems to get different results than the current ones.
Ye it was indeed very gruelling - but very fun!! I used torch.dist everywhere, read all the implementations side by side to compare them, and had to manually inspect losses, plot them, etc. It's a bit hard to automate sadly, since new archs cause new issues.
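The torch.dist part looks roughly like this (a simplified sketch with placeholder tensor names, not the actual debugging harness):

    import torch

    def compare(name, a, b, tol=1e-3):
        diff = torch.dist(a.float(), b.float()).item()  # L2 distance between the two tensors
        print(f"{name}: dist={diff:.6f} {'OK' if diff < tol else 'MISMATCH'}")

    # e.g. compare layer 0 hidden states from a reference impl and the reimplementation
    # compare("layer0.hidden", ref_hidden_states[0], my_hidden_states[0])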
I was just thinking how terrible the website was because it doesn't degrade gracefully. There's no information if you don't successfully execute everything associated with the page - just a blank white page with nothing on it. For a blog post with text and images this is really bad. The text and images should be there in the HTML, with the dynamic elements loaded on top.
Even when I load it in the browser I use for banking, etc., I still get errors - the JS doesn't run quite right and I get "NameError: name 'torch' is not defined", "NameError: name 'FastLanguageModel' is not defined", etc.
Oh ye you'll have to click "Runtime" -> "Run All". I think you probably forgot to execute the installation cell.
Apologies for the website rendering issues :( I normally find Colab to be reasonably responsive, so presumably the Javascript components are breaking :( Many apologies :(
If I load it in a private tab it doesn't ask, but it does ask when I use a browser session where I'm already logged in to Google (I haven't ever loaded Colab in this particular account).
Edit: the comment below refers to Gemini, not Gemma. As such the first paragraph is largely irrelevant, and only the second one applies.
To me, it feels as though the boat has been missed somewhat. The restrictions on Gemini make it unhelpful, but more than that, Claude 3 has really blown me away with its code suggestions. It's performing better than Mistral Large, GPT4 and Gemma in my tests, especially on large chunks of code. It also returns the whole thing with the changes applied, making it much easier to plug and play. Astonishingly, it also manages to combine ideas much better than any other LLM I've seen to date.
I suspect these fixes and the knowledge gained will be helpful to the community however, and will help improve the next iteration of models.
Claude 3 is very capable, but it is (likely) a 1T-class model, not something that can be run on the edge, while 7B-class models can already run on phones and can easily be fine-tuned for specialized work where they perform comparably to those big general models.
If you are talking to one model, by all means use the best one you have available. (Personally, Claude not having a code interpreter / not being able to self-evaluate code still often makes it less useful than ChatGPT, or even smaller open models like OpenCodeInterpreter - OpenCodeInterpreter-DS-33B outperforms all models, including GPT-4 w/ CI, on HumanEval+ and MBPP+ [1][2].) Recently I've been swapping between GPT4, Claude 3 Opus, and Phind for coding and finding that sometimes one will do better than another on specific tasks (sadly my GPUs are currently busy, but I really want to queue OCI-DS-33B up and do a shootout soon).
One issue with Gemma that doesn't get mentioned enough IMO is that while it claims to be 7B, it's really 8.54B parameters. It also has a gigantic tokenizer, so memory-wise, even quantized, it is going to use significantly more than comparable 7B models. Once you're getting to 9B, you have other options, such as the new Yi-9B, or if you want Apache-licensed (stacked Mistral) models, SOLAR-10.7B or the new bigstral-12b-32k.
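Rough weight-only arithmetic on that 8.54B figure (real usage adds the KV cache, activations, and - for finetuning - gradients on that huge embedding):

    params = 8.54e9  # Gemma's actual parameter count, per the comment above
    for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.1f} GB for the weights alone")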
Ye, the gigantic tokenizer does eat up a lot of VRAM. Gemma does use tied embeddings (i.e. lm_head == embeddings), which makes them take 50% less space, but it still requires more VRAM since you have to add the gradients up at the end.
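For reference, weight tying just means sharing one matrix between the embedding and the output head - a minimal sketch (sizes are roughly Gemma-7B-like, for illustration only):

    import torch.nn as nn

    vocab_size, hidden = 256_000, 3072        # roughly Gemma-7B-sized, illustrative only
    embed   = nn.Embedding(vocab_size, hidden)
    lm_head = nn.Linear(hidden, vocab_size, bias=False)
    lm_head.weight = embed.weight             # tied: one parameter tensor serving both roles

The shared tensor is stored once, but during training it accumulates gradients from both the embedding lookup and the output projection - which is the "add the gradients up at the end" cost.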
Why are you comparing Claude 3, a ~14B and ~>200B model, to Gemma, a 2-7B model? Of course it's going to do worse. The question for smol models is whether they can do well enough given a performance budget.
Does anyone know if the major dealbreaker “Additional Terms” apply to Gemma? Because I don’t want to touch anything Google related with a 100 foot pole given the following:
> Use restrictions
You may not use the Services to develop machine learning models or related technology.
Law tends to go by plain English meaning. E.g. here, you understand that the idea isn't to ban people from interacting with Gemini, but rather to stop them from using it to develop new models (i.e. using its outputs as inputs for training another model).
The part I find problematic is "Google may update Gemma from time to time, and you must make reasonable efforts to use the latest version of Gemma." - it's a bit vague - I guess it's not enforceable!
It's probably CYA for any liabilities stemming from issues discovered/not fixed downstream after Google has addressed them. It's hard for Google to enforce offensively, but great for defense!
Ye I guess that's a good point! Sadly some people I chatted to don't really like this, since it constrains and confuses the finetuning aspect of Gemma - i.e. if we finetune on top of it and then Gemma v2 is released, do we have to do another finetune on top of the latest release?