Skimming it, there are a few things about this explanation that rub me just slightly the wrong way.
1. Calling the input token sequence a "command". It probably only makes sense to think of this as a "command" on a model that's been fine-tuned to treat it as such.
2. Skipping over BPE as part of tokenization - but almost every transformer explainer does this, I guess.
3. Describing transformers as using a "word embedding". I'm actually not aware of any transformers that use actual word embeddings, except the ones that incidentally fall out of other tokenization approaches sometimes.
4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
5. "what attention does is it moves the words in a sentence (or piece of text) closer in the word embedding" No, that's just incorrect.
6. You don't actually need a softmax layer at the end, since here they're just picking the top token, and they can do that pre-softmax since the argmax won't change. It's also odd that they bring softmax up here, when the most prominent use of softmax in transformers is actually inside the attention component (see the sketch after this list).
7. Really shortchanges the feedforward component. It may be simple, but it's really important for making the whole thing work.
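Regarding (5) and (6): a minimal numpy sketch of single-head attention (no masking, no multi-head split, no learned projections; names and shapes are purely illustrative) shows where the softmax actually sits and what the output really is:

    # Minimal single-head scaled dot-product attention, just to show where
    # softmax actually lives in a transformer. Shapes/names are illustrative.
    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d_k), i.e. after the learned query/key/value projections
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)        # softmax over keys, per query position
        return weights @ V                        # weighted mix of value vectors

    Q = K = V = np.random.randn(5, 16)            # toy self-attention input
    out = attention(Q, K, V)                      # (5, 16): one mixed vector per position

The output is a new sequence of vectors (one per position), a re-mixing of value vectors, not words being "moved closer" in the word embedding.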
> 4. Describing positional embeddings as multiplicative. They are generally (and very counterintuitively to me, but nevertheless) additive with token embeddings.
Worth noting that rotary position embeddings, used in many recent architectures (LLaMA, GPT-NeoX, ...), are very similar to the original sin/cos position embedding in the transformer paper, but applied via complex multiplication instead of addition.
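Roughly, in code (a simplified numpy sketch of the rotary idea; real implementations differ in details like caching and frequency layout):

    # Rough sketch of rotary position embeddings (RoPE) via complex multiplication,
    # to contrast with the additive sin/cos scheme. Names are illustrative.
    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, d) with d even; positions: (seq_len,) integer positions
        d = x.shape[-1]
        freqs = 1.0 / base ** (np.arange(0, d, 2) / d)    # (d/2,) rotation frequencies
        angles = positions[:, None] * freqs[None, :]      # (seq_len, d/2)
        rot = np.exp(1j * angles)                         # unit complex rotations
        xc = x[:, 0::2] + 1j * x[:, 1::2]                 # pair up adjacent dims as complex
        xr = xc * rot                                     # rotate each pair by its position
        out = np.empty_like(x)
        out[:, 0::2], out[:, 1::2] = xr.real, xr.imag
        return out

    q = np.random.randn(6, 8)                             # toy queries
    q_rotated = rope(q, np.arange(6))

It's applied to the queries and keys (rather than added to the token embeddings), so the dot product between a query and a key ends up depending on their relative offset.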
One way to think about the positional embedding: in the same way you can hear two pieces of music overlaid on each other, you can add the vocab embedding and the pos embedding together and the model is able to pick them apart.
If you asked yourself to identify when someone’s playing a high note or low note (pos embedding) and whether they’re playing Beethoven or Lady Gaga (vocab embedding) you could do it.
That’s why it’s additive and why it wouldn’t make much sense for it to be multiplicative.
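A tiny numpy sketch of the additive scheme from the original paper (the sizes and the embedding table here are made up for illustration):

    # Sinusoidal positional encoding is simply *added* to the token embedding.
    import numpy as np

    d_model, vocab_size, max_len = 64, 1000, 128
    token_emb = np.random.randn(vocab_size, d_model) * 0.02   # stand-in for a learned table

    def sinusoidal_positions(max_len, d_model, base=10000.0):
        pos = np.arange(max_len)[:, None]
        i = np.arange(0, d_model, 2)[None, :]
        angles = pos / base ** (i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
        return pe

    pos_emb = sinusoidal_positions(max_len, d_model)

    token_ids = np.array([5, 42, 7, 42])                  # same token id can repeat
    x = token_emb[token_ids] + pos_emb[:len(token_ids)]   # addition, not multiplication
    # The two occurrences of token 42 now differ only by their positional
    # component, which later layers can separate -- the "overlaid music" idea above.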
> Transformer block: Guesses the next word. It is formed by an attention block and a feedforward block.
But the diagram shows transformer blocks chained in sequence. So the next transformer block in the sequence would only receive a single word as the input? Does not make sense.
Before going and digging into these, could you also explain what the necessary background is for this stuff to be meaningful?
In spite of having done a decent amount with neural networks, I'm a bit lost at how we suddenly got to what we're seeing now. It would be really helpful to understand the progression of things because I stepped away from this stuff for maybe 2 years and we seem to have crossed an ocean in the intervening time.
Selecting the likeliest token is only one of many sampling options, and it's extremely poor for most tasks, more so when you consider the relationships between multiple executions of the model. _Some_ (not necessarily softmax) probability renormalization trained into the model is essential for a lot of techniques.
To expand on this, one of the most common tricks is Nucleus sampling. Roughly, you zero out the lowest probabilities such that the remaining sum to just above some threshold you decide (often around 80%).
The idea is that this is more general than eg changing the temperature of the softmax, or using top-k where you just keep the k most probable outcomes.
Note that if you do Nucleus sampling (aka top-p) with the threshold p=0% you just pick the maximum likelihood estimate.
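Roughly what that looks like over a single logits vector (the threshold and names are just illustrative):

    # Rough nucleus (top-p) sampling sketch over a vector of logits.
    import numpy as np

    def nucleus_sample(logits, p=0.8, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                         # softmax -> probabilities
        order = np.argsort(probs)[::-1]              # most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, p) + 1  # smallest prefix with mass >= p
        keep = order[:cutoff]
        kept = probs[keep] / probs[keep].sum()       # renormalise the surviving mass
        return rng.choice(keep, p=kept)

    logits = np.random.randn(50)                     # stand-in for a model's output logits
    next_token_id = nucleus_sample(logits)

With p near 0 only the single most probable token survives, which degenerates into greedy (argmax) decoding, as noted above.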
That's true, but they didn't go into any other applications in this explainer and were presenting it strictly as a next-word predictor. If they are going to include the final softmax, they should explain why it's useful. It would be improved by being simpler (skip the softmax) or more comprehensive (present a use case for softmax); complexity without reason is bad pedagogy.
When I first tried to understand transformers, I superficially understood most material, but I always felt that I did not really get it on a "I am able to build it and I understand why I am doing it" level. I struggled to get my fingers on what exactly I did not understand. I read the original paper, blog posts, and watched more videos than I care to admit.
https://karpathy.ai/zero-to-hero.html If you want a deeper understanding of transformers and how they fit into the whole picture of deep learning, this series is far and away the best resource I found. Karpathy gets into transformers by the sixth lecture; the previous lectures give a lot more context on how deep learning works.
I agree that Karpathy's YouTube video is an excellent resource for understanding Transformers from scratch. It provides a hands-on experience that can be particularly helpful for those who want to implement the models themselves. Here's the link to the video titled "Let's build GPT: from scratch, in code, spelled out": https://youtu.be/kCc8FmEb1nY
I endorse all of this and will further endorse (probably as a follow-up once one has a basic grasp) "A Mathematical Framework for Transformer Circuits" which builds a lot of really useful ideas for understanding how and why transformers work and how to start getting a grasp on treating them as something other than magical black boxes.
"This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (not results). It covers what transformers are, how they are trained, what they are used for, their key architectural components, and a preview of the most prominent models."
This hour-long MIT lecture is very good, it builds from the ground up until transformers. MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention: https://youtube.com/watch?v=ySEx_Bqxvvo
The uploads of the 2023 MIT 6.S191 course from Alexander Amini (et al.) are ongoing, released periodically since mid-March. (They published the lesson about Reinforcement Learning yesterday.)
The original paper is very good but I would argue it's not well optimized for pedagogy. Among other things, it's targeting a very specific application (translation) and in doing so adopts a more complicated architecture than most cutting-edge models actually use (encoder-decoder instead of just one or the other). The writers of the paper probably didn't realize they were writing a foundational document at the time. It's good for understanding how certain conventions developed and important historically - but as someone who did read it as an intro to transformers, in retrospect I would have gone with other resources (e.g. "The Illustrated Transformer").
I know we don't have access to the details at OpenAI, but it does seem like there have been significant changes to the BPE token size over time. There seems to be a push towards much larger tokens than the previous ~3-char tokens (at least judging by behavior).
BPE is not set to a certain length, but a target vocabulary size. It starts with bytes (or characters) as the basic unit in which everything is split up and merges units iteratively (choosing the most frequent pairing) until the vocab size is reached. Even 'old' BPE models contain plenty of full tokens. E.g. RoBERTa:
(You have to scroll down a bit to get to the larger merges and imagine the lines without the spaces, which is what a string would look like after a merge.)
I recently did some statistics. Average number of pieces per token (sampled on fairly large data, these are all models that use BBPE):
RoBERTa base (English): 1.08
RobBERT (Dutch): 1.21
roberta-base-ca-v2 (Catalan): 1.12
ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68
In all these cases, the median token length in pieces was 1.
(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use 3-char pieces; they were 1 piece per token for most tokens.)
As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algo, this Huggingface course does a nice job of explaining it [2]. Plus the original paper has a very readable Python example [3].
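For intuition, here is a toy character-level version of that training loop (made-up corpus and vocab size; real byte-level BPE also handles bytes, pre-tokenization and special tokens):

    # Toy BPE training loop: merge the most frequent adjacent pair
    # until the target vocab size is reached (or no pairs remain).
    from collections import Counter

    corpus = ["low", "lower", "lowest", "newer", "wider"]
    target_vocab_size = 30

    words = [list(w) for w in corpus]          # start with characters as the basic units
    vocab = {c for w in words for c in w}
    merges = []

    while len(vocab) < target_vocab_size:
        pairs = Counter()
        for w in words:
            pairs.update(zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]    # most frequent adjacent pair
        merged = a + b
        merges.append((a, b))
        vocab.add(merged)
        # apply the merge everywhere
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words

    print(merges[:5])   # e.g. learns ('l', 'o'), ('lo', 'w'), ... on this toy corpus

You can see why the learned pieces get longer as the vocab budget grows: frequent whole words eventually become single tokens.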
I agree except for (6). A language model assigns probabilities to sequences. The model needs normalised distributions, eg using a softmax, so that’s the right way of thinking about it.
This is true in general but not in the use case they presented. If they had explained why a normalized distribution is useful it would have made sense - but they just describe this as pick-the-top-answer next-word predictor, which makes the softmax superfluous.
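A toy check of that point (made-up logits, purely illustrative):

    # Softmax doesn't change which token is the argmax, which is why a pure
    # pick-the-top predictor doesn't need it.
    import numpy as np

    logits = np.array([2.0, -1.0, 0.5, 3.1])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    assert np.argmax(logits) == np.argmax(probs)   # softmax is monotonic

    # The normalised probs only matter once you want to *sample* from the
    # distribution (temperature, top-k, nucleus, ...) or compare likelihoods
    # across sequences.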
One more for the list of issues above: 8. Nothing about the residual connections.
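For anyone unfamiliar, a minimal runnable sketch of what that residual structure looks like (pre-norm variant; the sublayers here are crude stand-ins, not real attention/FFN):

    # Residual connections wrap the attention and feed-forward sublayers,
    # so each block adds an update to its input rather than replacing it.
    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(-1, keepdims=True)
        var = x.var(-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    def sublayer_stub(x):
        # stand-in for self-attention or the feed-forward network
        return np.tanh(x)

    def transformer_block(x):
        x = x + sublayer_stub(layer_norm(x))   # residual around "attention"
        x = x + sublayer_stub(layer_norm(x))   # residual around "feed-forward"
        return x

    x = np.random.randn(4, 8)                  # (seq_len, d_model), toy sizes
    print(transformer_block(x).shape)          # residuals keep the shape: (4, 8)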