Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.
1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.
2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.
3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.
4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.
6. You don't actually need a softmax layer at the end, since here they're just picking the top token, and they can do that pre-softmax since the argmax won't change. It's also odd that they bring softmax up here, when the most prominent use of softmax in transformers is actually inside the attention component (see the sketch after this list).
7. Really shortchanges the feedforward component. It may be simple, but it's really important for making the whole thing work.
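Regarding (5) and (6): a minimal numpy sketch of single-head attention (no masking, no multi-head split, no learned projections; names and shapes are purely illustrative) shows where the softmax actually sits and what the output really is:

    # Minimal single-head scaled dot-product attention, just to show where
    # softmax actually lives in a transformer. Shapes/names are illustrative.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d_k), i.e. after the learned query/key/value projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)        # softmax over keys, per query position
        return weights @ V                        # weighted mix of value vectors

    Q = K = V = np.random.randn(5, 16)            # toy self-attention input
    out = attention(Q, K, V)                      # (5, 16): one mixed vector per position

The output is a new sequence of vectors (one per position), a re-mixing of value vectors, not words being "moved closer" in the word embedding.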
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper, but applied via complex multiplication instead of addition.
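Roughly, in code (a simplified numpy sketch of the rotary idea; real implementations differ in details like caching and frequency layout):

    # Rough sketch of rotary position embeddings (RoPE) via complex multiplication,
    # to contrast with the additive sin/cos scheme. Names are illustrative.
    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, d) with d even; positions: (seq_len,) integer positions
        d = x.shape[-1]
        freqs = 1.0 / base ** (np.arange(0, d, 2) / d)    # (d/2,) rotation frequencies
        angles = positions[:, None] * freqs[None, :]      # (seq_len, d/2)
        rot = np.exp(1j * angles)                         # unit complex rotations
        xc = x[:, 0::2] + 1j * x[:, 1::2]                 # pair up adjacent dims as complex
        xr = xc * rot                                     # rotate each pair by its position
        out = np.empty_like(x)
        out[:, 0::2], out[:, 1::2] = xr.real, xr.imag
        return out

    q = np.random.randn(6, 8)                             # toy queries
    q_rotated = rope(q, np.arange(6))

It's applied to the queries and keys (rather than added to the token embeddings), so the dot product between a query and a key ends up depending on their relative offset.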
One way to think about the positional embedding: in the same way you can hear two pieces of music overlaid on each other, you can add the vocab embedding and the pos embedding together and the model is able to pick them apart.
If you asked yourself to identify when someone’s playing a high note or low note (pos embedding) and whether they’re playing Beethoven or Lady Gaga (vocab embedding) you could do it.
That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
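A tiny numpy sketch of the additive scheme from the original paper (the sizes and the embedding table here are made up for illustration):

    # Sinusoidal positional encoding is simply *added* to the token embedding.
    import numpy as np

    d_model, vocab_size, max_len = 64, 1000, 128
    token_emb = np.random.randn(vocab_size, d_model) * 0.02   # stand-in for a learned table

    def sinusoidal_positions(max_len, d_model, base=10000.0):
        pos = np.arange(max_len)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angles = pos / base ** (i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
        return pe

    pos_emb = sinusoidal_positions(max_len, d_model)

    token_ids = np.array([5, 42, 7, 42])                  # same token id can repeat
    x = token_emb[token_ids] + pos_emb[:len(token_ids)]   # addition, not multiplication
    # The two occurrences of token 42 now differ only by their positional
    # component, which later layers can separate -- the "overlaid music" idea above.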
> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.
But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.
Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?
In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.
Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ (not necessarily softmax) probability renormalization trained into the model is essential for a lot of techniques.
To expand on this, one of the most common tricks is Nucleus sampling. Roughly, you zero out the lowest probabilities such that the remaining sum to just above some threshold you decide (often around 80%).
The idea is that this is more general than eg changing the temperature of the softmax, or using top-k where you just keep the k most probable outcomes.
Note that if you do Nucleus sampling (aka top-p) with the threshold p=0% you just pick the maximum likelihood estimate.
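Roughly what that looks like over a single logits vector (the threshold and names are just illustrative):

    # Rough nucleus (top-p) sampling sketch over a vector of logits.
    import numpy as np

    def nucleus_sample(logits, p=0.8, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # softmax -> probabilities
        order = np.argsort(probs)[::-1]              # most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
        keep = order[:cutoff]
        kept = probs[keep] / probs[keep].sum()       # renormalise the surviving mass
        return rng.choice(keep, p=kept)

    logits = np.random.randn(50)                     # stand-in for a model's output logits
    next_token_id = nucleus_sample(logits)

With p near 0 only the single most probable token survives, which degenerates into greedy (argmax) decoding, as noted above.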
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word predictor. If they are going to include the final softmax, they should explain why it's useful. It would be improved by being simpler (skip the softmax) or more comprehensive (present a use case for softmax); complexity without reason is bad pedagogy.
When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on a "I am able to build it and I understand why I am doing it" level. I struggled to get my fingers on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets into transformers by the sixth lecture; the previous lectures give a lot more context on how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good, it builds from the ground up until transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
The uploads of the 2023 MIT 6.S191 course from Alexander Amini (et al.) are ongoing, released periodically since mid-March. (They published the lesson about Reinforcement Learning yesterday.)
The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").
I know we don't have access to the details at OpenAI, but it does seem like there have been significant changes to the BPE token size over time. There seems to be a push towards much larger tokens than the previous ~3-char tokens (at least judging by behavior).
BPE is not set to a certain length, but a target vocabulary size. It starts with bytes (or characters) as the basic unit in which everything is split up and merges units iteratively (choosing the most frequent pairing) until the vocab size is reached. Even 'old' BPE models contain plenty of full tokens. E.g. RoBERTa:
(You have to scroll down a bit to get to the larger merges and imagine the lines without the spaces, which is what a string would look like after a merge.)
I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):
RoBERTa base (English): 1.08
RobBERT (Dutch): 1.21
roberta-base-ca-v2 (Catalan): 1.12
ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68
In all these cases, the median token length in pieces was 1.
(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use 3-char pieces; they were 1 piece per token for most tokens.)
As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algo, this Huggingface course does a nice job of explaining it [2]. Plus the original paper has a very readable Python example [3].
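For intuition, here is a toy character-level version of that training loop (made-up corpus and vocab size; real byte-level BPE also handles bytes, pre-tokenization and special tokens):

    # Toy BPE training loop: merge the most frequent adjacent pair
    # until the target vocab size is reached (or no pairs remain).
    from collections import Counter

    corpus = ["low", "lower", "lowest", "newer", "wider"]
    target_vocab_size = 30

    words = [list(w) for w in corpus]          # start with characters as the basic units
    vocab = {c for w in words for c in w}
    merges = []

    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent adjacent pair
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # apply the merge everywhere
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words

    print(merges[:5])   # e.g. learns ('l', 'o'), ('lo', 'w'), ... on this toy corpus

You can see why the learned pieces get longer as the vocab budget grows: frequent whole words eventually become single tokens.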
I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, eg using a softmax, so that’s the right way of thinking about it.
This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.
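A toy check of that point (made-up logits, purely illustrative):

    # Softmax doesn't change which token is the argmax, which is why a pure
    # pick-the-top predictor doesn't need it.
    import numpy as np

    logits = np.array([2.0, -1.0, 0.5, 3.1])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    assert np.argmax(logits) == np.argmax(probs)   # softmax is monotonic

    # The normalised probs only matter once you want to *sample* from the
    # distribution (temperature, top-k, nucleus, ...) or compare likelihoods
    # across sequences.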
One more for the list of issues above: 8. Nothing about the residual connections.
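For anyone unfamiliar, a minimal runnable sketch of what that residual structure looks like (pre-norm variant; the sublayers here are crude stand-ins, not real attention/FFN):

    # Residual connections wrap the attention and feed-forward sublayers,
    # so each block adds an update to its input rather than replacing it.
    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def sublayer_stub(x):
        # stand-in for self-attention or the feed-forward network
        return np.tanh(x)

    def transformer_block(x):
        x = x + sublayer_stub(layer_norm(x))   # residual around "attention"
        x = x + sublayer_stub(layer_norm(x))   # residual around "feed-forward"
        return x

    x = np.random.randn(4, 8)                  # (seq_len, d_model), toy sizes
    print(transformer_block(x).shape)          # residuals keep the shape: (4, 8)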