Yeah, burying this on page 8 is a bit suspect imo (the eval datasets are listed on page 3, so if you were familiar with them you would have a hint then).
Distilling into a student that predicts "anchor layers" and then acts as a backbone for classification is perfectly cool on its own; no need to stretch the title/abstract so much.
agreed re: title/abstract stretching. good work stands on its own without needing hype. "we found a nifty way to distill llama-70b using a much smaller student transformer model; the key is using intermediate activation layers in a compressed representation" would be about as effective at selling it while being more immediately approachable IMO
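fwiw, here's roughly how i picture the anchor-layer distillation, as a toy sketch. every name, dimension, layer index, and the MSE loss below is my guess from this thread, not the paper's actual recipe:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # all of these numbers are illustrative guesses, not the paper's setup
    ANCHOR_LAYERS = (20, 40, 60)                     # teacher layers to match
    D_TEACHER, D_STUDENT, D_CODE = 8192, 1024, 256   # D_CODE = compressed repr

    class AnchorStudent(nn.Module):
        """Small student that predicts a compressed code per anchor layer."""
        def __init__(self, vocab_size=32000):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, D_STUDENT)
            layer = nn.TransformerEncoderLayer(D_STUDENT, nhead=8,
                                               batch_first=True)
            self.backbone = nn.TransformerEncoder(layer, num_layers=4)
            # one prediction head per anchor layer
            self.heads = nn.ModuleDict(
                {str(l): nn.Linear(D_STUDENT, D_CODE) for l in ANCHOR_LAYERS})

        def forward(self, tokens):
            h = self.backbone(self.embed(tokens))    # (B, T, D_STUDENT)
            return h, {l: self.heads[str(l)](h) for l in ANCHOR_LAYERS}

    # frozen projections that compress each teacher activation down to D_CODE
    compress = nn.ModuleDict(
        {str(l): nn.Linear(D_TEACHER, D_CODE, bias=False)
         for l in ANCHOR_LAYERS}).requires_grad_(False)

    def distill_loss(student, tokens, teacher_acts):
        # teacher_acts: {layer: (B, T, D_TEACHER)}, captured from the frozen
        # teacher with forward hooks (not shown)
        _, preds = student(tokens)
        return sum(F.mse_loss(preds[l], compress[str(l)](teacher_acts[l]))
                   for l in ANCHOR_LAYERS)

after distillation you'd presumably hang a classifier off pooled backbone features (e.g. nn.Linear(D_STUDENT, n_classes) on h.mean(dim=1)), which is where the classification results would come from.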
That limitation is already accounted for in how the title is meant to be read. The 224× compression result is specifically about the structure of intermediate activations on classification tasks. The paper makes that explicit in multiple places, including the Limitations section, where generation is identified as an entirely separate challenge.
The title reflects the strongest verified result in the domain the method currently supports, not a universal claim across all modalities. In other words, the compression result is real, but it shouldn't be interpreted as applying to generative decoding... yet.
I really wanted Pixel Buds to fit this use case, but have found the experience incredibly crap. "Hey Google, let's chat live" is like some mad lottery.
…we found an off-the-shelf keyboard that could work, but we couldn't get it because it was 999 euros. So: let's make 7 iterations of our own keyboard with our Formlabs 3D printer, create silicone molds for each key, print legends with our UV printer, and we're done. Glad he did though, looks awesome!
Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use a device_batch_size of 8 on a 5090, but it didn't work on my 4090. However, it's way slower than expected; the gap implies one H100 is ~250x faster than a 4090, which can't be right, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics aren't accurate in this config.
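Concretely, something like this (train.py is a placeholder for whatever the repo's actual training script is, and whether device_batch_size is a CLI flag or a config value depends on the repo):

    # single GPU: shrink torchrun to one process...
    torchrun --standalone --nproc_per_node=1 train.py --device_batch_size=4
    # ...or skip torchrun and launch the script directly
    python train.py --device_batch_size=4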
> Generation tasks. Method applies to classification only. Preliminary decoder experiments show perplexity increases.