m_ke's comments

Was it just me, or did Opus start producing incredibly long responses before the crash? I was asking basic questions and it wouldn't stop trying to spit out full codebases' worth of unrelated code. For some very simple questions about database schemas it ended up compacting twice in a 3-message conversation.

What a lot of people don’t know is that SWE-bench is over 50% Django code, so all of the top labs hyper optimize to perform well on it.

I know Python is more prevalent in SWE-bench than any other language, but more than 50% Django sounds like a big stretch. Citation?

Edit: it's about 37%, and Python-only. https://arxiv.org/pdf/2310.06770v3


We really need new hardware optimized for sparse compute. Deep Learning models would work way better with much higher dimensional sparse vectors, but current hardware only excels at dense GEMMs and structured sparsity.


For what it's worth, we think it's unfortunately quite unlikely that frontier models will ever be trained with extreme unstructured sparsity, even with custom sparsity optimized hardware. Our main hope is that understanding sub-frontier models can still help a lot with ensuring safety of frontier models; an interpretable GPT-3 would be a very valuable object to have. It may also be possible to adapt our method to only explaining very small but important subsets of the model.


Yeah, it's not happening anytime soon, especially with the whole economy betting trillions of dollars on brute force scaling of transformers on Manhattan-sized GPU farms that will use more energy than most Midwestern states.

Brains do it somehow, so sparsely / locally activated architectures are probably the way to go long term, but we're decades away from that being commercially viable.


As the lead author, why do you think so?


I'm not an expert at hardware, so take this with a grain of salt, but there are two main reasons:

- Discrete optimization is always going to be harder than continuous optimization. Learning the right sparsity mask is fundamentally a very discrete operation. So even just matching fully continuous dense models in optimization efficiency is likely to be difficult. Though perhaps we can get some hope from the fact that MoE is also similarly fundamentally discrete, and it works in practice (we can think of MoE as incurring some penalty from imperfect gating, which is more than offset by the systems benefits of not having to run all the experts on every forward pass). Also, the optimization problem gets harder when the backwards pass needs to be entirely sparsified computation (see appendix B).

- Dense matmuls are just fundamentally nicer to implement in hardware. Systolic arrays have nice predictable data flows that are very local. Sparse matmuls with the same number of flops nominally need only (up to a multiplicative factor) the same memory bandwidth as an equivalent dense matmul, but they need to be able to route data from any memory unit to any vector compute unit. The locality of dense matmuls means that the computation of each tile only requires a small slice of both input matrices, so we only need to load those slices into shared memory; on the other hand, because GPU-to-GPU transfers are way slower, when we op-shard matmuls we replicate the data that is needed. Sparse matmuls would need either more replication within each compute die, or more all-to-all internal bandwidth. This means spending way more die space on huge crossbars and routing, though thankfully the crossbars consume much less power than the actual compute, so perhaps this could match dense in energy efficiency and not make thermals worse.
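
To make the locality point concrete, here is a toy NumPy/SciPy sketch (nothing hardware-level; all shapes, the tile size, and the density are arbitrary illustrative choices): a tiled dense matmul only ever touches contiguous slabs of its inputs, while an unstructured CSR matvec has to gather from scattered addresses.

```python
# Toy locality comparison (NumPy/SciPy only, nothing hardware-level).
import numpy as np
import scipy.sparse as sp

n, tile = 1024, 128
A = np.random.randn(n, n).astype(np.float32)
B = np.random.randn(n, n).astype(np.float32)

# Dense, tiled: each output tile needs one contiguous row-slab of A and one
# contiguous column-slab of B, which is what systolic arrays / shared memory like.
C = np.zeros((n, n), dtype=np.float32)
for i in range(0, n, tile):
    for j in range(0, n, tile):
        C[i:i + tile, j:j + tile] = A[i:i + tile, :] @ B[:, j:j + tile]
assert np.allclose(C, A @ B, atol=1e-2)

# Unstructured sparse: a CSR matvec does far fewer flops, but each row gathers x
# at arbitrary column indices -- the any-to-any routing problem described above.
W = sp.random(n, n, density=0.01, format="csr", dtype=np.float32)
x = np.random.randn(n).astype(np.float32)
y = np.zeros(n, dtype=np.float32)
for row in range(n):
    start, end = W.indptr[row], W.indptr[row + 1]
    y[row] = W.data[start:end] @ x[W.indices[start:end]]  # scattered loads
assert np.allclose(y, W @ x, atol=1e-3)
```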

It also seems very likely that once we create the interpretable GPT-1 (or 2, or 3) we will find that making everything unstructured sparse was overkill, and there are much more efficient pretraining constraints we can apply to models to 80/20 the interpretability. In general, a lot of my hope routes through learning things like this from the intermediate artifact (interpretable GPT-n).

To be clear, it doesn't seem literally impossible that with great effort, we could create custom hardware, and vastly improve the optimization algorithms, etc, such that weight-sparse models could be vaguely close in performance to weight-dense models. It's plausible that with better optimization the win from arbitrary connectivity patterns might offset the hardware difficulties, and I could be overlooking something that would make the cost less than I expect. But this would require immense effort and investment to merely match current models, so it seems quite unrealistic compared to learning something from interpretable GPT-3 that helps us understand GPT-5.


Yes it would require completely new hardware and most likely ditching gradient descent for alternative optimization methods, though I'm not convinced that we'd need to turn to discrete optimization.

Some recent works that people might find interesting:

- Evolution Strategies at the Hyperscale - https://eshyperscale.github.io/

- Introducing Nested Learning: A new ML paradigm for continual learning - https://research.google/blog/introducing-nested-learning-a-n...

- Less is More: Recursive Reasoning with Tiny Networks - https://arxiv.org/abs/2510.04871

- Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs - https://arxiv.org/abs/2511.16664


A note on the hardware part: it does not require NN-specific hardware akin to neuromorphic chips. Sparse-compute-oriented architectures have already been developed for other reasons, such as large-scale graph analysis or inference. It will still require significant effort to use them to train large models, but it would not be starting from scratch.


Yes! I've been advocating for it inside the industry for a decade, but it is an uphill battle. Researchers can't easily publish that kind of work (even Google researchers) because they don't have hardware that can realistically train decently large models. The hardware companies don't want to take the risk of rethinking the CPU or accelerator architecture for sparse compute because there are no large existing customers.


There also need to be tools that can author that code!

I'm starting to dust off some ideas I developed over a decade ago to build such a toolkit. I recently realized, "egads, my stuff can express almost every major GPU/CPU optimization that's relevant for modern deep learning... I need to do a new version with an eye towards adoption in that area." Plus every flavor of sparse.

I also need to figure out whether some of the open-core ideas I have in mind would be attractive to early-stage investors who focus on the so-called deep tech end of the space. It definitely looks like I'll have to take the ye olde "ask friends and acquaintances if they can point me to those folks" approach, since cold outreach has historically been full of fail.


> Deep Learning models would work way better with much higher dimensional sparse vectors

Citations?


There has been plenty of evidence over the years. I don't have my bibliography handy right now, but you can find it by looking for sparse training or lottery ticket hypothesis papers.

The intuition is that ANNs make better predictions on high-dimensional data, that sparse weights let you train the sparsity pattern as you train the weights, that the effective part of dense models is actually sparse (cf. pruning/sparsification research), and that dense models grow too much in compute complexity to further increase model dimension sizes.
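
For what it's worth, here is the kind of toy magnitude-pruning experiment that motivates the lottery-ticket intuition (my own sketch, not from any specific paper; the 90% sparsity level and the sizes are arbitrary):

```python
# Magnitude pruning in the lottery-ticket spirit (toy sketch, arbitrary sizes).
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(512, 512))

sparsity = 0.9
threshold = np.quantile(np.abs(W), sparsity)
mask = (np.abs(W) > threshold).astype(W.dtype)  # the sparsity pattern itself could be trained
W_pruned = W * mask

print(f"kept {mask.mean():.1%} of the weights")  # ~10%

# On dense hardware this buys nothing: the matmul below still performs all
# 512*512 MACs, which is the whole argument for sparse-compute hardware.
x = rng.normal(size=512)
y = W_pruned @ x
```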


If you can share that bibliography I'd love to read it. I have the same intuition, and a few papers seem to support it, but more of them, and more explicit ones, would be much better.


I could not find any evidence that sparse models work better than dense models.


What do you mean by "work better" here? If it's better accuracy, then no, they are not better at the same weight dimensions.

The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude. More dimensions leading to better results does not seem to be under a lot of contention; the open questions are more about quantifying that. It simply hasn't been shown experimentally because the hardware is not there to train it.


> The big thing is that sparse models allow you to train models with significantly larger dimensionality, blowing up the dimensions by several orders of magnitude.

Do you have any evidence to support this statement? Or are you imagining some not yet invented algorithms running on some not yet invented hardware?


Sparse matrices can increase in dimension while keeping the same number of non-zeros; that part is self-evident. Sparse-weight models can be trained: you are probably already aware of RigL and SRigL, and there is similar related work on unstructured and structured sparse training. You could argue that those adapt their algorithms to be executable on GPUs and that none of them train at 100x or 1000x dimensions. Yes, that is the part that requires access to sparse compute hardware acceleration, which exists as prototypes [1] or is extremely expensive (Cerebras).

[1] https://dl.acm.org/doi/10.1109/MM.2023.3295848
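
To make the "constant nonzeros, growing dimension" point concrete, here is a rough scipy.sparse sketch (dimensions, nonzero count, and density are arbitrary illustrative choices; real sparse training obviously involves far more than a matvec):

```python
# Growing dimension at (roughly) constant nonzeros with scipy.sparse (illustration only).
import numpy as np
import scipy.sparse as sp

nnz = 1_000_000
for dim in (10_000, 100_000, 1_000_000):
    rows = np.random.randint(0, dim, size=nnz)
    cols = np.random.randint(0, dim, size=nnz)
    vals = np.random.randn(nnz).astype(np.float32)
    W = sp.csr_matrix((vals, (rows, cols)), shape=(dim, dim))  # duplicate coords get summed

    x = np.random.randn(dim).astype(np.float32)
    y = W @ x  # matvec cost tracks nnz, not dim^2

    storage = W.data.nbytes + W.indices.nbytes + W.indptr.nbytes
    print(f"dim={dim:>9,}  nnz={W.nnz:,}  storage={storage / 1e6:.1f} MB")
```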


Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication. If you don’t want to do matrix multiplication you first need to come up with new algorithms, tested in software. This reminds me of what Numenta tried to do with their SDRs - note they didn’t quite succeed.


> Unstructured sparsity cannot be implemented in hardware efficiently if you still want to do matrix multiplication.

Hard disagree. It certainly is an order of magnitude harder to design hardware for sparse-by-sparse matrix multiplication, yes; it requires a paradigm shift to do sparse compute efficiently, but there are hardware architectures, both in research and commercially available, that do it efficiently. The same kind of architecture is needed to scale op-graph compute. You see solutions at the smaller scale in FPGAs and reconfigurable/dataflow accelerators, and at larger scale in Intel's PIUMA and Cerebras. I've been involved in co-design work on GraphBLAS on the software side and one of the aforementioned hardware platforms: the main issue with developing SpMSpM hardware lies more with the necessary capital and engineering investment being prioritized for current frontier AI model accelerators, not with a lack of proven results.


All of the best open source LLMs right now use mixture-of-experts, which is a form of sparsity. They only use a small fraction of their parameters to process any given token.

Examples:

- GPT OSS 120b

- Kimi K2

- DeepSeek R1


Mixture of experts sparsity is very different from weight sparsity. In a mixture of experts, all weights are nonzero, but only a small fraction get used on each input. On the other hand, weight sparsity means only very few weights are nonzero, but every weight is used on every input. Of course, the two techniques can also be combined.
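
A minimal sketch of the distinction, with made-up shapes and no claim to match any real model:

```python
# MoE-style activation sparsity vs. weight sparsity (toy shapes only).
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 64, 8, 2

# Mixture of experts: every weight is nonzero, but each input only touches top_k experts.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate = rng.normal(size=(n_experts, d))

def moe_forward(x):
    scores = gate @ x
    active = np.argsort(scores)[-top_k:]          # router picks a few experts
    return sum(experts[e] @ x for e in active) / top_k

# Weight sparsity: very few weights are nonzero, but every surviving weight
# participates on every single input.
W = rng.normal(size=(d, d))
W *= rng.random((d, d)) < 0.05                    # keep ~5% of the entries

x = rng.normal(size=d)
y_moe = moe_forward(x)
y_sparse = W @ x
```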


Correct. I was more focused on giving an example of sparsity being useful in general, because the comment I was replying to didn't specifically mention which kind of sparsity.

For weight sparsity, I know the BitNet 1.58 paper has some claims of improved performance by restricting weights to be either -1, 0, or 1, eliminating the need for multiplying by the weights, and allowing the weights with a value of 0 to be ignored entirely.
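
Roughly, the ternary-weight idea looks like this (my own toy sketch, not the BitNet implementation; sizes and the weight distribution are arbitrary):

```python
# Ternary weights in {-1, 0, 1}: the "matmul" becomes adds/subtracts, and the
# zero weights can be skipped entirely (toy sketch, not the BitNet code).
import numpy as np

rng = np.random.default_rng(0)
d = 256
W = rng.choice([-1, 0, 1], size=(d, d), p=[0.25, 0.5, 0.25])
x = rng.normal(size=d)

y = np.empty(d)
for i in range(d):
    pos = x[W[i] == 1].sum()    # +1 weights: just add
    neg = x[W[i] == -1].sum()   # -1 weights: just subtract
    y[i] = pos - neg            # 0 weights: no work at all

assert np.allclose(y, W @ x)
```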

Another kind of sparsity, while we're on the topic, is activation sparsity. I think there was an Nvidia paper that used a modified ReLU activation function to set more of the model's activations to 0.


“Useful” does not mean “better”. It just means “we could not do dense”. All modern state of the art models use dense layers (both weight and inputs). Quantization is also used to make models smaller and faster, but never better in terms of quality.

Based on all examples I’ve seen so far in this thread it’s clear there’s no evidence that sparse models actually work better than dense models.


Yes, mixture of experts is basically structured activation sparsity. You could imagine concatenating the expert matrices into a huge block matrix and multiplying by an input vector where only the coefficients corresponding to activated experts are nonzero.

From that perspective, it's disappointing that the paper only enforces modest amounts of activation sparsity, since holding the maximum number of nonzero coefficients constant while growing the number of dimensions seems like a plausible avenue to increase representational capacity without correspondingly higher computation cost.
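
For concreteness, the block-matrix view looks roughly like this (toy NumPy sketch with arbitrary sizes and routing):

```python
# The block-matrix view of MoE: a wide weight matrix times a block-sparse input.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts = 32, 16

experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
W_big = np.concatenate(experts, axis=1)        # shape (d, d * n_experts)

x = rng.normal(size=d)
active = [3, 11]                               # pretend the router chose these two

h = np.zeros(d * n_experts)                    # block-sparse coefficient vector
for e in active:
    h[e * d:(e + 1) * d] = x

# Holding the number of nonzero blocks fixed, n_experts (and hence the total
# dimension) can grow without growing the per-token compute.
y_block = W_big @ h
y_moe = sum(experts[e] @ x for e in active)
assert np.allclose(y_block, y_moe)
```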


https://transformer-circuits.pub/2022/toy_model/index.html

https://arxiv.org/abs/1803.03635

EDIT: I don't have time to write it up, but here's Gemini 3 with a short explanation:

To simulate the brain's efficiency using Transformer-like architectures, we would need to fundamentally alter three layers of the stack: the *mathematical representation* (moving to high dimensions), the *computational model* (moving to sparsity), and the *physical hardware* (moving to neuromorphic chips).

Here is how we could simulate a "Brain-Like Transformer" by combining High-Dimensional Computing (HDC) with Spiking Neural Networks (SNNs).

### 1. The Representation: Hyperdimensional Computing (HDC)

Current Transformers use "dense" embeddings—e.g., a vector of 4,096 floating-point numbers (like `[0.1, -0.5, 0.03, ...]`). Every number matters. To mimic the brain, we would switch to *Hyperdimensional Vectors* (e.g., 10,000+ dimensions), but make them *binary and sparse*.

  * **Holographic Representation:** In HDC, concepts (like "cat") are stored as massive randomized vectors of 1s and 0s. Information is distributed "holographically" across the entire vector. You can cut the vector in half, and it still retains the information (just noisier), similar to how brain lesions don't always destroy specific memories.
  * **Math without Multiplication:** In this high-dimensional binary space, you don't need expensive floating-point matrix multiplication. You can use simple bitwise operations (see the sketch after this list):
      * **Binding (Association):** XOR operations (`A ⊕ B`).
      * **Bundling (Superposition):** Majority rule (voting).
      * **Permutation:** Bit shifting.
  * **Simulation Benefit:** This allows a Transformer to manipulate massive "context windows" using extremely cheap binary logic gates instead of energy-hungry floating-point multipliers.
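
A toy sketch of those three operations on binary hypervectors (my own illustration, not from any HDC library; the 10,000-dimension choice and the cat/dog/fish example are arbitrary):

```python
# Toy HDC operations on binary hypervectors (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
D = 10_000

def random_hv():
    return rng.integers(0, 2, size=D, dtype=np.uint8)

def bind(a, b):            # binding / association
    return a ^ b

def bundle(*vs):           # bundling / superposition via majority vote
    return (np.sum(vs, axis=0) > len(vs) / 2).astype(np.uint8)

def permute(a, shift=1):   # permutation via cyclic shift
    return np.roll(a, shift)

def similarity(a, b):      # 1.0 = identical, ~0.5 = unrelated
    return 1.0 - np.mean(a ^ b)

# Store several bound pairs in one vector, then recover a filler by unbinding.
cat, dog, fish, animal = (random_hv() for _ in range(4))
memory = bundle(bind(cat, animal), bind(dog, animal), bind(fish, animal))

recovered = bind(memory, cat)              # XOR is its own inverse
print(similarity(recovered, animal))       # noticeably above the ~0.5 chance level
print(similarity(recovered, random_hv()))  # ~0.5
```
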
### 2. The Architecture: "Spiking" Attention Mechanisms

Standard Attention is $O(N^2)$ because it forces every token to query every other token. A "Spiking Transformer" simulates the brain's "event-driven" nature.

  * **Dynamic Sparsity:** Instead of a dense matrix multiplication, neurons would only "fire" (send a signal) if their activation crosses a threshold. If a token's relevance score is low, it sends *zero* spikes. The hardware performs *no* work for that connection.
  * **The "Winner-Take-All" Circuit:** In the brain, inhibitory neurons suppress weak signals so only the strongest "win." A simulated Sparse Transformer would replace the Softmax function (which technically keeps all values non-zero) with a **k-Winner-Take-All** function (sketched after this list).
      * *Result:* The attention matrix becomes 99% empty (sparse). The system only processes the top 1% of relevant connections, similar to how you ignore the feeling of your socks until you think about them.
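
A minimal sketch of k-winner-take-all versus softmax (illustrative only; k and the number of keys are arbitrary):

```python
# k-winner-take-all vs. softmax over one query's attention logits (toy numbers).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def k_wta(z, k):
    # Keep only the k largest scores, renormalize them, zero out everything else.
    out = np.zeros_like(z)
    top = np.argpartition(z, -k)[-k:]
    out[top] = softmax(z[top])
    return out

scores = np.random.randn(1024)            # logits against 1024 keys
dense = softmax(scores)                   # every entry is (technically) nonzero
sparse = k_wta(scores, k=8)               # ~99% of entries are exactly zero

print((dense > 0).mean(), (sparse > 0).mean())   # 1.0 vs ~0.008
```
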
### 3. The Hardware: Neuromorphic Substrate

Even if you write sparse code, a standard GPU (NVIDIA H100) is bad at running it. GPUs like dense, predictable blocks of numbers. To simulate the brain efficiently, we need *Neuromorphic Hardware* (like Intel Loihi or IBM NorthPole).

  * **Address Event Representation (AER):** Instead of a "clock" ticking every nanosecond forcing all neurons to update, the hardware is asynchronous. It sits idle (consuming nanowatts) until a "spike" packet arrives at a specific address.
  * **Processing-in-Memory (PIM):** To handle the high dimensionality (e.g., 100,000-dimensional vectors), the hardware moves the logic gates *inside* the RAM arrays. This eliminates the energy cost of moving those massive vectors back and forth.
### Summary: The Hypothetical "Spiking HD-Transformer"

| Feature | Standard Transformer | Simulated "Brain-Like" Transformer |
| :--- | :--- | :--- |
| *Dimension* | Low (~4k), Dense, Float32 | *Ultra-High* (~100k), Sparse, Binary |
| *Operation* | Matrix Multiplication (MACs) | *Bitwise XOR / Popcount* |
| *Attention* | Global Softmax ($N^2$) | *Spiking k-Winner-Take-All* (Linear) |
| *Activation* | Continuous (RELU/GELU) | *Discrete Spikes* (Fire-or-Silence) |
| *Hardware* | GPU (Synchronous) | *Neuromorphic* (Asynchronous) |


I’m not sure why you’re talking about efficiency when the question is “do sparse models work better than dense models?” The answer is no, they don’t.

Even the old LTH paper you cited trains a dense model and then tries to prune it without too much quality loss. Pruning is a well known method to compress models - to make them smaller and faster, not better.


Before we had proper GPUs everyone said the same thing about Neural Networks.

Current model architectures are optimized to get the most out of GPUs, which is why we have transformers dominating as they're mostly large dense matrix multiplies.

There's plenty of work showing transformers improve with inner dimension size, but it's not feasible to scale them up further because it blows up parameter and activation sizes (including KV caches), so people turn to low-rank ("sparse") decompositions like MLA.
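
As a back-of-the-envelope illustration of why low-rank factorizations are attractive (generic NumPy sketch, not MLA itself; d and r are arbitrary):

```python
# Parameter/FLOP count for a dense d x d projection vs. a rank-r factorization.
import numpy as np

d, r = 4096, 256
rng = np.random.default_rng(0)
A = rng.normal(size=(d, r)) / np.sqrt(r)
B = rng.normal(size=(r, d))

dense_params = d * d              # ~16.8M parameters
lowrank_params = d * r + r * d    # ~2.1M parameters
print(dense_params, lowrank_params)

x = rng.normal(size=d)
y = A @ (B @ x)                   # 2*d*r multiply-adds instead of d*d
```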

Lottery ticket hypothesis shows that most of the weights in current models are redundant and that we could get away with much smaller sparse models, but currently there's no advantage to doing so because on GPUs you still end up doing dense multiplies.

Plenty of mech interp work shows that models are forced to commingle different concepts to fit them into the "low" dimensional vector space. (https://www.neelnanda.io/mechanistic-interpretability/glossa...)

https://arxiv.org/abs/2210.06313

https://arxiv.org/abs/2305.01610


Yes, we know that large dense layers work better than small dense layers (up to a point). We also know how to train large dense models and then prune them. But we don’t know how to train large sparse models to be better than large dense models. If someone figures it out then we can talk about building hardware for it.


It isn't directly what you are asking for, but there is a similar relationship at work with respect to L_1 versus L_2 regularization. The number of samples required to train a model is O(log(d)) for L_1 and O(d) for L_2 where d is the dimensionality [1]. This relates to the standard random matrix results about how you can approximate high dimensional vectors in a log(d) space with (probably) small error.

At a very handwaving level, it seems reasonable that moving from L_1 to L_0 would have a similar relationship in learning complexity, but I don't think that has ever been addressed formally.

[1] https://www.andrewng.org/publications/feature-selection-l1-v...
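
As a hand-wavy numerical companion to that point (my own toy scikit-learn setup, not the experiment from [1]; all sizes and regularization strengths are arbitrary):

```python
# L1 vs L2 with many more features than samples and a sparse ground truth.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
d, n, k = 1000, 100, 10                     # features, samples, true nonzeros

w_true = np.zeros(d)
w_true[rng.choice(d, size=k, replace=False)] = rng.normal(size=k)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.01 * rng.normal(size=n)

lasso = Lasso(alpha=0.01).fit(X, y)         # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)          # L2 penalty

print(np.count_nonzero(lasso.coef_))        # a small fraction of d: L1 yields a sparse model
print(np.count_nonzero(ridge.coef_))        # essentially all d: L2 keeps every feature
```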


My last dive into matrix computations was years ago, but the need was the same back then. We could sparsify matrices pretty easily, but the infrastructure was lacking. Some things never change.


On the software side I can recommend SuiteSparse:GraphBLAS (https://github.com/DrTimothyAldenDavis/GraphBLAS). It is hard to make a sparse linear algebra framework, but Tim Davis has been doing a great job collecting the various optimal algorithms into a single framework that acts more like an algebra than a collection of kernels.


Soon OpenAI will make its own chips and Nvidia its own foundational models


Scaling Laws are the ultimate VC ponzi scheme vehicle.

Keep raising 10x more for each round of scaling and very quickly you get large enough to be able to bully anyone into playing with you.

Sama got big enough to be able to twist any arm he wants and soon OpenAI will be too large to fail.


The Dutch East India Company wasn't too big to fail. There is nothing too big to fail and if anything too big is guaranteed to fail.


Too big to fail means that as long as the government backing it is stable enough to support it, it likely will be.

Sure, Bank of America could fail and the US government could fail to back it, but that would likely mean massive upheaval in the United States.


Yes, the US can fail first and take OpenAI down with it.


Literally nothing would happen if OpenAI ceased to exist tomorrow. Everyone would migrate to Claude and Gemini and the sun would rise on time.


You could say the same about the car manufacturers, Lehman, and every company in the valley.

The whole US economy is getting propped up by AI spending, and to continue up the scaling-law ladder each new iteration requires 10x more investment. If OpenAI is not able to raise, it's over for anyone adjacent to them (NVIDIA, AMD, Anthropic, etc.).

It's AGI or bust for OpenAI, because as you said there's no margin competing on API tokens since open source is almost as good for most use cases.


This is very wrong. You're comparing OpenAI to Lehman Bros? The collapse of OpenAI would not have even a remotely comparable impact to the great recession. Were you even alive at that time? I struggle to imagine how you could make this assertion.

https://en.wikipedia.org/wiki/Subprime_mortgage_crisis


Worse than the bubble is what's happening to the rest of the economy. If you remove AI related spending the US economy is trending in a really bad direction.

This bubble popping will definitely take down crypto with it and rip through other adjacent industries.


I used to work on video generation models and was shocked at how hard it was to find any videos online that were not hosted on YouTube, and YouTube has made it impossibly hard to download more than a few videos at a time.


> YouTube has made it impossibly hard to download more than a few videos at a time

I wonder why. Perhaps because people use bots to mass-crawl contents from youtube to train their AI. And Youtube prioritizes normal users who only watch a few videos at most at the same time, over those crawling bots.

Who knows?


I wonder how Google built their empire. Who knows? I’m sure they didn’t scrape every page and piece of media on the internet and train models on it.

My point was that the large players have a monopoly hold on large swaths of the internet and are using it to further advantage themselves over the competition. See Veo 3 as an example: YouTube creators didn't upload their work to help Google train a model to compete with them, but Google did it anyway, and creators didn't have a choice because all eyeballs are on YouTube.


> how Google built their empire. Who knows

By scraping every page and directing the traffic back to the site owners. That was how Google built their empire.

Are they abusing the empire's power now? In multiple ways, such as the AI overview stuff. But don't pretend that crawling Youtube and training video generation models is the same as what Google (once) brought to the internet. And it's ridiculous to expect Youtube to make it easy for crawlers.


You have to feed it multiple arguments, with rate limiting and long wait times. I'm not sure if there have been recent updates other than the JS interpreter, but I've had to spin up a Docker instance of a browser to feed it session cookies as well.


Yeah, we had to roll through a bunch of proxy servers on top of all the other tricks you mentioned to reliably download at a decent pace.


What are your thoughts on the load scrapers are putting on website operators?


What are your thoughts on the load website operators are putting on themselves to block scrapers?


[flagged]


Unusually well-argued post, hard to disagree with...

What exactly is the problem? That they worked on video generation models? That they only used YouTube? That they downloaded videos from YouTube? That they downloaded multiple videos from YouTube?


They’re all already doing this and doing it more will go unnoticed


Yeah, it's kinda hard to see companies being more aggressive than they already are about outsourcing. I know companies that fired their entire tech org, from the CTO down, and moved it to India.


When I was looking for work early this year, I was told that most of the Google NYC roles were listed for internal transfers and that most of the actual hiring was in Warsaw, with thousands of open roles (which I was told by Google recruiters at a conference in Europe).


This is true for most SV tech companies with multiple offices (including nyc) because there are a shitload of men trying to move out of SF.

Post-pandemic most single men in Silicon Valley have realized that the region is terrible for anything but settling down with a family.


If someone is transferring from SF to NYC they wouldn't have to advertise the position. I think the OP is referring to transferring people into the country on L1.


I was told that they were actually required to list them even if it’s someone transferring internally.

It was for a few specific ML research roles that I was interested in, of which there were very few in NYC, and during the interview process I was told that they would go to internal candidates.


Yeah, it's even worse than that. These big cos will be incentivized to move whole teams out of the US, since it will be easier to hire from other countries for offices in Paris / Zurich / Warsaw / etc.


Isn't that already the case, though? Offshoring has been a thing for decades, but companies clearly prefer to have employees on site, in the US, if possible.

Yes, this new fee will make that more expensive to do, but I'm not convinced it will no longer be worth it for most companies.


It could all pop today if GPT-5 doesn't benchmark-hack hard on some new made-up task.


I don't see how it would "all pop" - same as with the internet bubble, even if the massive valuations disappear, it seems clear to me that the technology is already massively disruptive and will continue growing its impact on the economy even if we never reach AGI.


Exactly like the internet bubble. I've been working in Deep Learning since 2014 and am very bullish on the technology, but the trillions of dollars required for the next round of scaling will not be there if GPT-5 is not on the exponential growth curve that sama has been painting for the last few years.

Just like the dot com bubble we'll need to wash out a ton of "unicorn" companies selling $1s for $0.50 before we see the long term gains.


> Exactly like the internet bubble.

So is this just about a bit of investor money lost? Because the internet obviously didn't decline at all after 2000, and even the investors who lost a lot but stayed in the game likely recouped their money relatively quickly. As I see it, the lesson from the dot-com bust is that we should stay in the game.

And as for GPT-5 being on the exponential growth curve - according to METR, it's well above it: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...


I wouldn't say "well above" when the curve falls well within the error bars. I wonder how different the plot would look if they reported the median as their point estimate rather than mean.


I don't expect GPT-5 to be anything special, it seems OpenAI hasn't been able to keep its lead, but even current level of LLMs to me justifies the market valuations. Of course I might eat my words saying that OpenAI is behind, but we'll see.


> I don't expect GPT-5 to be anything special

because ?


Because everything past GPT-3.5 has been pretty unremarkable? I doubt anyone in the world would be able to tell the difference in a blind test between 4.0, 4o, 4.5 and 4.1.


I would absolutely take you up on a blind test between 4.0 and 4.5 - the improvement is significant.

And while I do want your money, we can just look at LMArena, which does blind testing to arrive at an Elo-based score and shows 4.0 with a score of 1318 while 4.5 has 1438 - it's roughly twice as likely to be judged better on an arbitrary prompt, and the difference is more significant on coding and reasoning tasks.


> Doubt anyone in the world would be able to tell a difference in a blind test between 4.0, 4o, 4.5 and 4.1.

But this isn't 4.6, it's 5.

I can tell the difference between 3 and 4.


That's a very Spinal Tap argument for why it will be more than just an incremental improvement.


Well, word on the street is that the OSS models released this week were Meta-style benchmaxxed and their real-world performance is incredibly underwhelming.

