I always wonder what happens when LLMs have finally destroyed every source of information they crawl. After Stack Overflow and forums are gone, and when there's no open source code left to improve upon, won't they just cannibalize themselves and slowly degrade?
Some studies have shown that direct feedback loops do cause collapse, but many researchers argue that it’s not a risk at real-world data scales.
In fact, a lot of recent advancements in the open-weight model space have been due to training on synthetic data. At least 33% of the data used to train Nvidia’s recent Nemotron 3 Nano model was synthetic. They use it as a way to get high-quality agent capabilities without doing tons of manual work.
That's not quite the same thing, I think. The risk here is that the sources of training information vanish as well, not necessarily the feedback loop aspect.
For example all the information on the web could be said to be a distillation of human experiences, and often it ended up online due to discussions happening during problem solving. Questions were asked of the humans and they answered with their knowledge from the real world and years of experience.
If no one asks humans anymore, they just ask LLMs, then no new discussions between humans are occurring online and that experience doesn't get syndicated in a way models can train on.
That is essentially the entirety of Stack Overflow's existence until now. You can pretty strongly predict that no new software experience will be put into Stack Overflow from now on. So what of new programming languages or technologies and all the nuances within them? Docs never have all the answers, so models will simply lack the nuanced information.
Sorry, I guess it's not very clear from my post: the data points aren't what's missing, it's the insights. An insight comes from the melding of a lifetime of experiences; you can't just stick a bunch of sensors on humans and reach the same insights. You could think of it as the expression of the latent space in our own brains. We're also not little one-way input boxes; we run world sims in our brains all day long.
If you were to frame human brains as their own world models, Stack Overflow was a very lossy distillation of the insights in those brains.
I don't think you reach those insights by simply piping in data from the world (also, that sounds expensive to do at a worthwhile scale).
Synthetic data. Like AlphaZero playing randomized games against itself, a future coding LLM would come up with new projects, or feature requests for existing projects, or common maintenance tasks for itself to execute. Its value function might include ease of maintainability, and it could run e2e project simulations to make sure it actually works.
AlphaZero playing games against itself was useful because there's an objective measure of success in a game of Go: at the end of the game, did I have more points than my opponent? So you can "reward" the moves that do well, and "punish" the moves that do poorly. And that objective measure of success can be programmed into the self-training algorithm, so that it doesn't need human input in order to tell (correctly!) whether its model is improving or getting worse. Which means you can let it run in a self-feedback loop for long enough and it will get very good at winning.
What's the objective measure of success that can be programmed into the LLM to self-train without human input? (Narrowing our focus to only code for this question). Is it code that runs? Code that runs without bugs? Code without security holes? And most importantly, how can you write an automated system to verify that? I don't buy that E2E project simulations would work: it can simulate the results, but what results is it looking for? How will it decide? It's the evaluation, not the simulation, that's the inescapably hard part.
Because there's no good, objective way for the LLM to evaluate the results of its training in the case of code, self-training would not work nearly as well as it did for AlphaZero, which could objectively measure its own success.
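To make the contrast concrete, here's a minimal sketch. All the names (`go_reward`, `pass_rate`, `CASES`, the two `add` functions) are made up for illustration, and "fraction of tests passed" is just one assumed proxy, not how any lab actually scores code. The Go reward is fully defined by the rules; the code reward is only as good as the checks someone wrote, and a candidate that memorizes the checks scores the same as a correct one.

```python
def go_reward(my_points, opponent_points):
    """Objective by construction: the rules of the game decide the winner."""
    return 1.0 if my_points > opponent_points else -1.0


def pass_rate(candidate, cases):
    """Assumed proxy reward for code: fraction of (args, expected) cases that pass."""
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


# Hypothetical test cases for an "add two numbers" task.
CASES = [((2, 3), 5), ((0, 0), 0), ((-1, 1), 0)]


def honest_add(a, b):
    return a + b


def memorized_add(a, b):
    # Hard-codes the known cases and is wrong almost everywhere else.
    return {(2, 3): 5, (0, 0): 0, (-1, 1): 0}.get((a, b), 0)


if __name__ == "__main__":
    print(pass_rate(honest_add, CASES))     # both candidates get the same score
    print(pass_rate(memorized_add, CASES))
    print(memorized_add(10, 10))            # but only one of them generalizes
```

The evaluator can't tell the two candidates apart, which is the point: the simulation is easy, deciding what counts as success is not.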
You don't need synthetic data; people are posting vibe-coded projects on GitHub every day and they are being added to the next model's training set. I expect that in 4-5 years, humans will just not be able to do things that are not in the training set. Anything novel or fun will be locked down to creative agencies and the few holdouts who managed to survive.
That's a valid thought. As AI generates a lot of content, some of which may be hallucinations, the next cycle of training will probably use the old data plus the new AI slop, and the final result will degrade as a consequence.
Unless the AIs find out where mistakes occur, and find this out in the code they themselves generate, your conclusion seems logically valid.
Hallucinations generally don't matter at scale. Unless you're feeding back 100% synthetic data into your training loop it's just noise like everything else.
Is the average human 100% correct with everything they write on the internet? Of course not. The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
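A toy way to see the difference between a 100% synthetic loop and a mixed one. This is only a sketch: a Gaussian stands in for "the model", refitting mean and standard deviation stands in for "training", and the function name, sample size, and generation count are arbitrary choices of mine, so it says nothing quantitative about real LLMs.

```python
import random
import statistics


def simulate(generations, sample_size, synthetic_fraction, seed=0):
    """Refit a Gaussian each generation on a mix of its own samples and real data.

    The "real" distribution is N(0, 1). Returns the model's final standard deviation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_synth = round(sample_size * synthetic_fraction)
    n_real = sample_size - n_synth
    for _ in range(generations):
        # Synthetic part: samples drawn from the current model.
        data = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        # Real part: fresh samples from the true distribution.
        data += [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
    return sigma


if __name__ == "__main__":
    print("100% synthetic:", simulate(1000, 20, 1.0))
    print(" 50% synthetic:", simulate(1000, 20, 0.5))
```

With only its own output to learn from, the fitted spread drifts toward zero over many generations; with fresh real data mixed in every round, it keeps getting pulled back toward the real distribution.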
> The absurd value of LLMs is that they can somehow manage to extract the signal from that noise.
Say what? LLMs absolutely cannot do that.
They rely on armies of humans to tirelessly filter, clean, and label data that is used for training. The entire "AI" industry relies on companies and outsourced sweatshops to do this work. It is humans that extract the signal from the noise. The machine simply outputs the most probable chain of tokens.
So hallucinations definitely matter, especially at scale. It makes the job of humans much, much harder, which in turn will inevitably produce lower quality models. Garbage in, garbage out.
I think you're confused about the training steps for LLMs. What the industry generally calls pre-training is when the LLM learns the job of predicting the most probable next token given a huge volume of data. A large percentage of that data has not been cleaned at all because it just comes directly from web crawling. It's not uncommon to open up a web crawl dataset that is used for pretraining and immediately read something sexual, nonsensical, or both really.
LLMs really do find the signal in this noise because even just pre-training alone reveals incredible language capabilities but that's about it. They don't have any of the other skills you would expect and they most certainly aren't "safe". You can't even really talk to a pre-trained model because they haven't been refined into the chat-like interface that we're so used to.
The hard part after that for AI labs was getting together high quality data that transforms them from raw language machines into conversational agents. That's post-training and it's where the armies of humans have worked tirelessly to generate the refinement for the model. That's still valuable signal, sure, but it's not the signal that's found in the pre-training noise. The model doesn't learn much, if any, of its knowledge during post-training. It just learns how to wield it.
To be fair, some of the pre-training data is more curated. Like collections of math or code.
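For anyone unsure what "predicting the most probable next token" means, here's the objective in miniature. It's a bigram counter over whitespace-split words, which is my simplification; real LLMs use neural networks over subword tokens, and the tiny corpus below is invented. Note the one junk line in the data doesn't change the most probable continuation.

```python
from collections import Counter, defaultdict

# Invented "crawl": mostly sensible lines plus one piece of spam.
CORPUS = [
    "tortoises eat leafy greens",
    "tortoises eat leafy weeds",
    "tortoises eat leafy greens",
    "tortoises eat discount pills",
]


def train_bigram(corpus):
    """Count, for every token, which tokens follow it and how often."""
    counts = defaultdict(Counter)
    for line in corpus:
        tokens = line.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def most_probable_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]


if __name__ == "__main__":
    model = train_bigram(CORPUS)
    print(most_probable_next(model, "eat"))
    print(most_probable_next(model, "leafy"))
```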
No, I think you're confused, and doubling down on it, for some reason.
Base models (after pre-training) have zero practical value. They're absolutely useless when it comes to separating signal from noise, using any practical definition of those terms. As you said yourself, their output can be nonsensical, based solely on token probability in the original raw data.
The actual value of LLMs comes after the post-training phase, where the signal is injected into the model from relatively smaller amounts of high quality data. This is the data processed by armies of humans, without which LLMs would be completely worthless.
So whatever capability you think LLMs have to separate signal from noise is exclusively the product of humans. When that job becomes harder, the quality of LLMs will go down. Unless we figure out a way to automate data cleaning/labeling, which seems like an unsolvable problem, or for models to filter it during inference, which is what you're wrongly implying they already do. LLMs could assist humans with cleaning/labeling tasks, but that in itself has many challenges, and is not a solution to the model collapse problem.
I'm not saying that pre-trained-only models are useless. They've clearly extracted a ton of knowledge from the corpus. The interface may seem strange because it's not what we're accustomed to, but they still prove valuable. Code completion models, for example, are just LLMs that have been pre-trained exclusively on code. They work very well despite their simplicity because... the model has extracted the signal from the noise.
You have a strange definition of "signal" and "noise".
Code completion models can be useful because they output the most probable chain of tokens given a specific input, same as any LLM. There is no "signal" there besides probability. Besides, even those models are fine-tuned to follow best practices, specific language idioms, etc.
When we talk about "signal" in the context of general knowledge, we refer to information that is meaningful and accurate for a specific context and input. So if the user asks for proof of the Earth being flat, the model doesn't give them false information from a random blog. Of course, LLMs still fall short at this, but post-training is crucial to boost the signal away from the noise. There's nothing inherent in the way LLMs work to make them do this. It is entirely based on the quality of the training data.
Are you sure about that? There's a lot of slop on the internet. Imagine I ask you to predict the next token after reading an excerpt from a blog on tortoises. Would you have predicted that it's part of an ad for boner pills? Probably not.
That's not even the worst scenario. There are plenty of websites that are nearly meaningless. Could you predict the next token on a website whose server is returning information that has been encoded incorrectly?
I guess there’ll be less collaboration and less sharing with the outside world. People will still collaborate and share, but within smaller circles. It’ll bring an end to the era of the "sharing is caring" internet, as it doesn’t benefit anyone but a few big players.
This only makes sense if the percentage of LLM hallucinations is much higher than the percentage of things written online that are flat-out wrong (it's definitely not).
Does it matter? Hypothetically, if these pre-training datasets disappeared, you could distill from the smartest current model, or have it write textbooks.
Great read! I'm doing something similar with my game engine. I use a FixedBufferAllocator for static allocation and initialize/allocate all my systems and entities with the necessary size at the start. The only exception currently is asset loading because this can be quite dynamic at times.
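For readers who haven't seen the pattern, this is roughly the idea behind a fixed-buffer (bump) allocator, sketched in Python. It is only an analogy under my own assumptions: the class and method names are made up, it is not the Zig `FixedBufferAllocator` API, and Python's own object allocation still happens behind the scenes.

```python
class FixedBuffer:
    """Hands out slices of one buffer that is allocated once, up front."""

    def __init__(self, size):
        self._buf = bytearray(size)          # the only large allocation
        self._view = memoryview(self._buf)   # lets us slice without copying
        self._offset = 0

    def alloc(self, nbytes):
        """Return a writable view of the next `nbytes` bytes, or fail loudly."""
        end = self._offset + nbytes
        if end > len(self._buf):
            raise MemoryError("fixed buffer exhausted")
        chunk = self._view[self._offset:end]
        self._offset = end
        return chunk

    def reset(self):
        """Release everything at once by rewinding the offset."""
        self._offset = 0


if __name__ == "__main__":
    arena = FixedBuffer(1024)
    entities = arena.alloc(512)  # e.g. storage reserved for entities at startup
    systems = arena.alloc(256)   # e.g. storage reserved for systems
    entities[0] = 42             # writes land in the shared backing buffer
```

Running out of space becomes an immediate, visible error at a known call site instead of unbounded growth at runtime.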
This also works well for games. I use a FixedBufferAllocator that allocates everything except assets upfront (systems, entities, etc.). Tigerstyle is a good starting point for efficient and debuggable software. Thanks for the article!
This game has the biggest comeback story of any game in gaming history.
I'm really looking forward to their next title and how much of "new" tech they will be showing in it. As far as I understand it will be the same engine as No Man's Sky. But there might be even more content and also this time it seems to be only one big planet?
Keep up the good work! I feel that your content is bringing a lot of educational value to the tech community. I would even say that the visualisations extend the understanding of tech-averse people and lead to more people understanding our rather complex ecosystems.
Thank you