I’ve been telling people for years that a reasonably workable initial, simplified mental model of a large language model is a Markov chain generator with an unlimited, weighted corpus trained in. Very few people who know LLMs have said anything to critique that thought beyond that it’s a coarse description that downplays the details. Since “simplified” is right there in the initial statement and it isn’t meant to capture detail, I say: if it walks like a really big duck and honks instead of quacking, then maybe it’s a goose or a swan, both of which are pretty duck-like birds.
It's not a Markov chain because it doesn't obey the Markov property.
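(For reference, the Markov property says that the distribution of the next state depends only on the current state, not on how that state was reached:

    P(X_{t+1} | X_t, X_{t-1}, ..., X_1) = P(X_{t+1} | X_t)

)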
What it is, and what I assume you mean, is a next-word prediction model based solely on the previous sequence of words, up to some limit. It literally is that, because it was designed to be that.
Okay, so we have a function that converts a buffer of tokens to a distribution of next tokens. We give the function a buffer, get back the next-token-distribution, from which we sample to get our next token. We append that to the buffer (and possibly roll off the first token) to obtain a new buffer. If we consider entire buffers to be states (which is a perfectly reasonable thing to do), then we have a stochastic process that moves us from one state to another. How is that not a Markov chain?
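A minimal Python sketch of that framing (next_token_distribution stands in for whatever the model computes; the names and the dict-based distribution are illustrative):

    import random

    def step(buffer, next_token_distribution, context_limit):
        """One transition of the chain: the entire buffer is the state."""
        # The next state depends only on the current buffer; nothing about how
        # the buffer was produced is carried forward.
        dist = next_token_distribution(buffer)      # dict mapping token -> probability
        tokens, probs = zip(*dist.items())
        nxt = random.choices(tokens, weights=probs, k=1)[0]
        new_buffer = buffer + (nxt,)
        if len(new_buffer) > context_limit:         # roll off the oldest token
            new_buffer = new_buffer[1:]
        return new_buffer

Buffers are tuples here so that each state is a single hashable value, which is exactly the "state" in the Markov-chain reading.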
That's fair, and yes, it is a Markov chain if you frame it that way, albeit with the interesting property that the state is very, very high-dimensional, and the current and previous states never differ by more than one token.
What about a language model requires knowledge of previous states to choose the next state based on the current state? Are you asserting that only the last word chosen is the current state? Couldn’t the list of symbols chosen so far be the current state, so long as no data is carried forward about how that list has been generated?
Anything can be a Markov chain if you define "memory of the past" as a component of the current state and include a big enough memory to hold all of the past for any practical amount of time. Personally, I feel like in most contexts it's not helpful to think of LLMs that way, since their state space (context length) is so large that they are extremely far removed from what we normally think of as Markov chains. One might as well define humans as Markov chains, since the brain has a finite memory.
Memory of the past implies the current state includes information about how the buffer got the way it is. I’m not defining anything in that way. I’m saying that if you have a buffer of N tokens, the next token N+1 can be chosen based on N tokens, not 1 token and not on the calculations or past state changes that went into creating the current state of the buffer.
So, I have heard a number of people say this, and I feel like I'm the person in your conversations saying it's a coarse description that downplays the details. What I don't understand is: what, specifically, do we gain from thinking of it as a Markov chain?
Like, what is one insight, beyond "LLMs are Markov chains," that you've derived from thinking of LLMs as Markov chains? I'm genuinely very curious.
It depends on whether you already had experience using large Markov models for practical purposes.
Around 2009, I had developed an algorithm for building the Burrows–Wheeler transform at (what was back then) a very large scale. If you have the BWT of a text corpus, you can use it to simulate a Markov model with any context length. I tried that with a Wikipedia dump, which was amusing for a while but not interesting enough to develop further.
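(The BWT/FM-index is what makes the context lookups fast; a naive, count-based Python sketch of the same idea, with illustrative names and no BWT, looks roughly like this:)

    import random
    from collections import Counter

    def next_char_distribution(corpus, context):
        """Empirical distribution of the character following `context` in `corpus`.
        A BWT/FM-index answers this query quickly; here we just scan the text."""
        counts = Counter()
        pos = corpus.find(context)
        while pos != -1:
            end = pos + len(context)
            if end < len(corpus):
                counts[corpus[end]] += 1
            pos = corpus.find(context, pos + 1)
        total = sum(counts.values())
        return {c: n / total for c, n in counts.items()} if total else {}

    def generate(corpus, seed, order, length):
        """Markov text generation: condition on the last `order` characters."""
        out = seed
        for _ in range(length):
            dist = next_char_distribution(corpus, out[-order:])
            if not dist:
                break
            chars, probs = zip(*dist.items())
            out += random.choices(chars, weights=probs, k=1)[0]
        return out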
Then, around 2019, I was working in genomics. We were using pangenomes based on thousands of (human) haplotypes as reference genomes. The problem was that adding more haplotypes also added rare variants and rare combinations of variants, which could be misleading and eventually started decreasing accuracy in the tasks we were interested in. The standard practice was dropping variants that were too rare (e.g. <1%) in the population. I got better results with synthetic haplotypes generated by downsampling the true haplotypes with a Markov model (using the BWT-based approach). The distribution of local haplotypes within each context window was similar to the full set of haplotypes, but the noise from rare combinations of variants was mostly gone.
Other people were doing haplotype inference with Markov models based on similarly large sets of haplotypes. If you knew, for a suitably large subset of variants, whether each variant was likely absent, heterozygous, or homozygous in the sequenced genome, you could use the model to get a good approximation of the genome.
When ChatGPT appeared, the application was surprising (even though I knew some people who had been experimenting with GPT-2 and GPT-3). But it was less surprising on a technical level, as it was close enough to what I had intuitively considered possible with large Markov models.
This is pretty stupid, because the equation for self-attention is trivial enough that it makes sense to just go straight into the math and not play pretend games.
softmax(Q K^T / sqrt(d_k)) V
Q = X W_Q
K = X W_K
V = X W_V
X is a matrix where every row is the embedding of a token.
K and V matrices need to be cached.
Also, you need to calculate all of it, for every layer, even if you just want a single token of output.
Tokens after the first one can reuse the cached K and V matrices and don't have to recalculate everything from scratch. Causal attention is the concept of masking the calculation so you only look at past tokens.
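To make the caching concrete, here is a single-head sketch in NumPy (illustrative names, row-per-token convention as above; a real model has many heads, layers, and output projections):

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def decode_step(x_new, W_q, W_k, W_v, k_cache, v_cache):
        """One generation step of causal attention with a KV cache.
        x_new: (1, d_model) embedding of the newest token.
        k_cache, v_cache: (t, d_k) and (t, d_v) projections of all earlier tokens."""
        q = x_new @ W_q                           # query for the new token only
        k = np.vstack([k_cache, x_new @ W_k])     # reuse cached K, append one new row
        v = np.vstack([v_cache, x_new @ W_v])     # reuse cached V, append one new row
        d_k = W_q.shape[1]
        # Causality falls out for free here: the cache holds only past tokens,
        # so the single new query never sees anything "in the future."
        scores = softmax(q @ k.T / np.sqrt(d_k))  # (1, t+1) attention weights
        out = scores @ v                          # (1, d_v) attended output
        return out, k, v

Each call computes projections only for the newest token; the returned k and v become the caches for the next call.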
My critique boils down to this: Transformers are too simple to simplify them. You can only lose information if you do that.
The problem is that it doesn’t stay simplified; once you leave the room those people start using that model to explain details of actual LLM behavior. I’m constantly arguing with people who use this mental model to explain why LLMs can’t do things that they absolutely can do.
I frequently see people underestimate the danger of LLMs to humanity in the long term and to people’s jobs in the medium term because they follow this chain of reasoning:
- An LLM is basically a fancy Markov chain (highly dubious even if there’s a tiny kernel of truth in there)
- Therefore Markov chains and LLMs have a similar skill ceiling (definitely wrong)
- My job could never be done by a Markov chain, it’s much too complex (this is presented as self-evident; no one ever feels the need to back it up)
- Therefore, since LLMs are basically Markov chains, and a Markov chain could never do my job, I have nothing to worry about.
I’ll grant that you’re not necessarily responsible for these people using your model in a way you didn’t intend. But I do think at this point it’s time to start pushing people to stop trying to reason mechanically about how these things work.
1. An LLM is absolutely a Markov chain. That is not meant to dismiss the fact that they are a remarkable feat of generating and compressing the representation of really cool Markov chains.
2. Because they are Markov chains, their skill ceiling is bounded by whatever the skill ceiling of Markov chains happens to be. What LLMs demonstrate is that the skill ceiling of Markov chains is higher than previously understood.
3. If the skill ceiling of Markov chains is high enough, then one could take over some or all of someone's job.
I think there’s an equivocation being accidentally made between n-gram models and Markov processes, because “Markov chain” is used to mean both things.
N-gram models are not useful in many of the ways LLMs are.
N-gram models are very limited.
On the other hand, basically any process can be considered a Markov process if you make the state include enough.
So calling both the very limited n-gram models and the nearly unlimited Markov processes by the same name, “Markov chain,” is just super confusing.
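A toy illustration of the two senses (illustrative Python, not a claim about any particular model): in an n-gram model the state is a short, fixed window and the transitions are an explicit lookup table; in the "LLM as Markov process" reading the state is the entire context window, so the state space is astronomically larger and the transitions are implicit in the weights.

    from collections import Counter, defaultdict

    # Sense 1: an n-gram model (here n=2). The state is the single previous token
    # and the transition probabilities are an explicit table built from counts.
    def bigram_model(tokens):
        counts = defaultdict(Counter)
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
        return {prev: {t: c / sum(ctr.values()) for t, c in ctr.items()}
                for prev, ctr in counts.items()}

    # Sense 2: the "LLM as Markov process" framing. For vocabulary size V and
    # context length L, the state space has on the order of V ** L states,
    # which is why no explicit transition table could ever be written down.

For example, bigram_model("the cat sat on the mat".split()) maps "the" to {"cat": 0.5, "mat": 0.5}.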