y2244's comments

Pick up a used 3090 with more ram.

It holds its value, so you won't lose much, if anything, when you resell it.

But otherwise, as said, install Ollama and/or Llama.cpp and run the model using the --verbose flag.

This will print out the tokens-per-second result after each prompt is returned.

Then find the best model that gives you a tokens-per-second speed you are happy with.
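
For a concrete (if rough) way to measure this, here is a minimal Python sketch that goes through Ollama's local HTTP API instead of the CLI. It assumes Ollama is running on its default port 11434, that a model tagged "llama3" has already been pulled (substitute your own), and that the response carries the documented eval_count/eval_duration fields:

    import json
    import urllib.request

    payload = {
        "model": "llama3",  # hypothetical tag; substitute whatever model you pulled
        "prompt": "Explain the difference between RAM and VRAM in two sentences.",
        "stream": False,    # return a single JSON object instead of a token stream
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)

    # eval_count = generated tokens, eval_duration = nanoseconds spent generating
    # (field names per Ollama's API docs; treat them as an assumption).
    tps = result["eval_count"] / (result["eval_duration"] / 1e9)
    print(f"~{tps:.1f} tokens/sec")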

And as also said, 'abliterated' models are less censored versions of normal ones.


And I think LM Studio has non-commercial use restrictions.


The investor list includes Altman and Ilya:

https://www.cerebras.ai/company


Their CEO is a felon who pleaded guilty to accounting fraud:

https://milled.com/theinformation/cerebras-ceos-past-felony-...

Experienced investors will not touch them:

https://www.nbclosangeles.com/news/business/money-report/cer...

I estimated last year that they can only produce about 300 chips per year, and that is unlikely to change because there are far bigger customers for TSMC ahead of them in priority for capacity. Their technology is interesting, but it is heavily reliant on SRAM, and SRAM scaling is dead. Unless they get a foundry to stack layers for their wafer-scale chips or design a round chip, they are unlikely to be able to improve their technology very much past the WSE-3. Compute might somewhat increase in the WSE-4, if there is one, but memory will not increase much, if at all.

I doubt the investors will see a return on investment.


>Their CEO is a felon who plead guilty to accounting fraud [...]

Whoa, I didn't know that.

I know he's very close to another guy I know firsthand to be a criminal. I won't write the name here for obvious reasons; also, it's not my fight to fight.

I always thought it was a bit weird of them to hang around together, because I never got that vibe from Feldman, but ... now that I know about this, second strike I guess ...


CNBC lists several other red flags (one customer generating >80% of revenue, non-top-tier investment bank/auditor).

see https://www.cnbc.com/2024/10/11/cerebras-ipo-has-too-much-ha...

The IPO was supposed to happen in autumn 2024.


While the CEO stuff is a problem, I don't think the other stuff matters.

Per unit of chip area, the WSE-3 is only a little more expensive than the H200. While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed, you will not be using more area on the WSE-3. In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

The only effect is that the WSE-3s will have a minimum demand before they make sense, whereas an H200 will make sense even with little demand.


I did the math last year to estimate how many wafers per year Nvidia had, and from my recollection it was >50,000. Cerebras, with their ~300 per year, is not able to handle the inference needs of the market. It does not help that all of their memory must be inside the wafer, which limits the amount of die area they have for actual logic. They have no prospect for growth unless TSMC decides to bless them or they switch to another foundry.

> While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3.

You need ~20 wafers to run the Llama 4 Maverick model on Cerebras hardware. This is close to a million mm^2. The Nvidia hardware that they used in their comparison should have less than 10,000 mm^2 of die area, yet can run it fine thanks to the external DRAM. How is the WSE-3 not using more die area?
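
A quick back-of-the-envelope check of that claim in Python, using Cerebras' published 46,225 mm^2 die size for the WSE-3 and simply taking the <10,000 mm^2 GPU estimate above at face value:

    WSE3_AREA_MM2 = 46_225        # Cerebras' published WSE-3 die size
    WAFERS_NEEDED = 20            # per the comment above
    GPU_SETUP_AREA_MM2 = 10_000   # the parent comment's upper-bound estimate

    cerebras_total = WSE3_AREA_MM2 * WAFERS_NEEDED
    print(f"Cerebras: {cerebras_total:,} mm^2")                                # ~924,500 mm^2
    print(f"Ratio vs GPU setup: ~{cerebras_total / GPU_SETUP_AREA_MM2:.0f}x")  # ~92x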

> In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

This makes no sense to me. Inference software loads the model once and then uses it multiple times. This should be the same for both Nvidia and Cerebras.


Yes, on an ordinary GPU it loads the weights to GPU memory, but then those weights must be moved from GPU memory onto the chip. But on these, the weights can presumably be kept on chip entirely (that's basically their whole point), so with Cerebras there's no need to ever move weights onto the chip.

Of course these guys depend on getting chips, but so does everybody. I don't know how difficult it is, but all sorts of entities get TSMC 5nm. Maybe they'll get TSMC 3nm and 2nm later than NVIDIA, but it's also possible that they don't.


The WSE-3 is divided into 900,000 PEs, each of which has only 48kB of RAM:

https://hc2024.hotchips.org/assets/program/conference/day2/7...

Similarly, the SMs in Blackwell have up to 228kB of RAM:

https://docs.nvidia.com/cuda/archive/12.8.0/pdf/Blackwell_Tu...

If you need anything else, you need to load it from elsewhere. In the WSE-3, that would be from other PEs. In Blackwell, that would be from on-package DRAM. Idle time in Blackwell can be mitigated by parallelism, since each SM has enough SRAM for multiple kernels to run in parallel. I believe the WSE-3 is quick enough that they do not need that trick.
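
For rough context, multiplying those per-core figures out gives the wafer-level totals usually quoted; treat the result as approximate, since the core count is a rounded figure:

    PE_COUNT = 900_000       # WSE-3 processing elements (rounded figure from the slides)
    SRAM_PER_PE_KB = 48

    per_wafer_gb = PE_COUNT * SRAM_PER_PE_KB / 1e6
    print(f"On-wafer SRAM: ~{per_wafer_gb:.0f} GB")   # ~43 GB, close to the ~44 GB Cerebras quotes
    print(f"20 wafers: ~{20 * per_wafer_gb:.0f} GB")  # ~864 GB, in line with the 880GB figure below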

The other guy said “you will not be using more area in the WSE-3”. I do not see this die area efficiency. You need many full wafers (around 20 with Llama 4 Maverick) to do the same thing with the WSE-3 that can be done with a fraction of a wafer with Blackwell. Even if you include the DRAM’s die area, Nvidia’s hardware is still orders of magnitude more efficient in terms of die area.

The only advantage Cerebras has as far as I can see is that they are fast on single queries, but they do not dare advertise figures for their total throughput, while Nvidia will happily advertise those. If they were better than Nvidia at throughput numbers, Cerebras would advertise them, since that is what matters for having mass market appeal, yet they avoid publishing those figures. That is likely because in reality, they are not competitive in throughput.

To give an example of Nvidia advertising throughput numbers:

> In a 1-megawatt AI factory, NVIDIA Hopper generates 180,000 tokens per second (TPS) at max volume, or 225 TPS for one user at the fastest.

https://blogs.nvidia.com/blog/ai-factory-inference-optimizat...

Cerebras strikes me as being like Bugatti, which designs cars that go from start to finish very fast at a price that could buy dozens of conventional vehicles, while Nvidia strikes me as being like Toyota, which designs far slower vehicles, but can manufacture them in a volume that is able to handle a large amount of the world’s demand for transport. Bugatti cannot make enough vehicles to bring a significant proportion of the world from A to B regularly, while Toyota can. Similarly, Cerebras cannot make enough chips to handle any significant proportion of the world’s demand for inference, while Nvidia can.


I don't really see how NVIDIA shipping so many chips matters. If more people want Cerebras chips they will presumably be manufactured.

I agree that Cerebras manufacture <300 wafers per year. Probably around 250-300, calculated from $1.6-2 million per unit and their 2024 revenue.

I don't really see how that matters though. I don't see how core counts matter, but I assume that Cerebras is some kind of giant VLIW-y thing where you can give different instructions to different subprocessors.

I imagine that the model weights would be stored in little bits on each processor and that it does some calculation and hands it on.

Then you never need to load the weights; the only thing you're passing around is activations, with them going from wafer 1, to wafer 2, etc., up to wafer 20. When this is running at full speed, I believe this can be very efficient, better than a small GPU like those made by NVIDIA.

Yes, a lot of the area will be on-chip memory/SRAM, but a lot of it will also be logic and that logic will be computing things instead of being used to move things from RAM to on-chip memory.

I don't have any deep knowledge of this system, really, nothing beyond what I've explained here, but I believe that Mistral are using these systems because they're completely superb and superior to GPUs for their purposes, and they will have made a carefully weighed decision based on actual performance and actual cost.


You replied really quickly, when I had thought I could sneak in a revision that dropped the estimates for production numbers. In any case, the Cerebras WSE-3 is extremely inefficient for what it does. Inference is memory bandwidth bound, such that peak performance for a single query should be close to the memory bandwidth divided by the size of the weights. Despite having 2,600x the memory bandwidth, they can only perform 2.5 times faster. 1,000x of their supposed memory bandwidth is wasted. There are extreme inefficiencies in their architecture. Meanwhile, Nvidia is often at >80% of what memory bandwidth divided by weights predicts their hardware can do.
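
To illustrate that "memory bandwidth divided by weights" estimate, here is a sketch with illustrative numbers (an HBM part at roughly 4.8 TB/s and a hypothetical dense 70B-parameter model in 8-bit weights; neither figure is taken from this thread):

    hbm_bandwidth_tb_s = 4.8   # roughly H200-class HBM bandwidth (assumption)
    weights_gb = 70            # hypothetical 70B-parameter model at 1 byte/param

    # At batch size 1, every generated token has to stream essentially all of
    # the weights through the compute units once, so bandwidth sets the ceiling:
    peak_tok_s = hbm_bandwidth_tb_s * 1000 / weights_gb
    print(f"Roofline: ~{peak_tok_s:.0f} tokens/sec for a single query")  # ~69 tokens/sec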

Mistral is a small fish in the grander scheme of things. I would assume that using Cerebras is a way to try to differentiate themselves in a market where they are largely ignored, and being that small is also why Cerebras can handle their needs at all. If they grow to OpenAI levels, there is no chance of Cerebras being able to handle the demand for them.

Finally, I had researched this out of curiosity last year. I am posting remarks based on that.


Inference is memory bandwidth bound on a GPU, which has very little on-chip memory.

On WSE-3s, however, there's enough memory that the model can actually be stored on-chip, provided that you have a sufficient number of them. 20 are enough for some of the largest open models.

This, depending on how it's set up, allows more efficient use of what logic is available, for actually doing computations instead of just loading and unloading the weights. This can potentially make a system like this much more efficient than a GPU.

It doesn't matter whether Mistral are small fish or not. I don't agree that they are small fish, but whether or not they are, they are experts. They are very capable people. They haven't chosen Cerebras to be different; they've chosen it because they believe it's the best way to do inference.


Your “more efficient” remarks are nonsensical to me. Your “loading and unloading weights” remark would be slightly less nonsensical if you called it the von Neumann bottleneck, but unfortunately for you, their hardware is so bottlenecked internally that they are getting less than 0.1% of the performance that their supposedly high memory bandwidth should give them. Nvidia, on the other hand, routinely gets 80% or higher. Calling less than 0.1% of theoretical performance efficient is not only strange, but outright wrong. That said, efficiency usually considers other metrics such as cost, power consumption and throughput.

If you do the math, you will find that Cerebras loses in all of them. They need 460 kW from 20 CS-3 nodes to do inference for Llama 4 Maverick, while a single DGX B200 node only needs 14.4 kW. If you buy 32 DGX B200 nodes so that power consumption is the same and naively give each a full copy of the model, you would get 32,000 T/sec aggregate at a batch size of 1, while the 20-node CS-3 cluster only gets 2,500 T/sec aggregate at a batch size of 1. That is after having spent only $16 million for the 32 DGX B200 nodes versus $40 million for the 20 CS-3 nodes.

Each DGX B200 node has 1.4TB of memory, while the CS-3 cluster has only 880GB of memory. The CS-3 cluster will run out of memory as you scale the batch size and context length. Now, if you buy another 15 CS-3 nodes, you could match the memory of a single DGX B200, but then you could just store partial models on each DGX B200 like how Cerebras stored partial models on each CS-3, and suddenly you have more memory to scale to higher batch sizes on the Nvidia hardware. At some point, you will likely become compute bound and cannot keep scaling up the batch size, but that is hard to predict without actually testing for it. The prediction for what the WSE-3 could do based on advertised memory bandwidth was off by a factor of >1000 when given real data. It seems reasonable to think that what it can do as far as compute goes will similarly be limited to well below the theoretical capability.
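
For anyone who wants to check that arithmetic, here it is reduced to a few lines, using only the figures quoted in this thread (power from Cerebras' own comparison page linked below, prices and per-node throughput as stated above):

    CS3_NODES, CS3_KW, CS3_PRICE_M = 20, 23.0, 2.0   # 23 kW per node, ~$2M each (~$40M total)
    DGX_KW, DGX_PRICE_M = 14.4, 0.5                  # 14.4 kW per node, ~$0.5M each

    cs3_power_kw = CS3_NODES * CS3_KW                # 460 kW
    dgx_nodes = round(cs3_power_kw / DGX_KW)         # ~32 nodes at the same power budget

    print(f"CS-3 cluster: {cs3_power_kw:.0f} kW, ~${CS3_NODES * CS3_PRICE_M:.0f}M")
    print(f"DGX B200: {dgx_nodes} nodes for the same power, ~${dgx_nodes * DGX_PRICE_M:.0f}M")

    # Aggregate batch-size-1 throughput, using the per-node figures claimed above:
    print(f"Claimed aggregate: {dgx_nodes * 1000} T/s (DGX B200) vs 2,500 T/s (CS-3)")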

Note that my numbers for power consumption were from Cerebras:

https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...

Interestingly, the peak number for the DGX B200 is based on the power supplies for the DGX B200 and is actually 0.1 kW higher than Nvidia’s specification that puts it at 14.3kW:

https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-...

PSU peak output is always in excess of the maximum power usage capability of the hardware, but I did not know how Cerebras determined their 23kW figure, so I went with the Cerebras figure for Nvidia, even though I know it is unrealistically high. This likely gave Cerebras the benefit of a handicap on Nvidia’s hardware in the comparison, such that reality is even more in favor of Nvidia.

Calling Cerebras’ hardware the best way of doing inference is ridiculous. We are talking about doing mostly linear algebra. There is no best way of doing it. Pointing at Mistral to say that Cerebras has the best way is an absurd appeal to authority. None of the major players are using them, since Cerebras is incapable of handling their needs. The instant responses are nice and are a way for Mistral to differentiate itself, but their models are not as good as those from others and few people use them, which is why Cerebras has the capacity to handle their needs.

From a historical standpoint, Cerebras is very similar to Thinking Machines Corporation, which went out of business after 11 years when there was a market downturn and they could not secure business. Cerebras is hemorrhaging money and is only in business because they found investors willing to cover their losses. Once they run out of people willing to give them money (likely during the next AI winter), they will become insolvent, no matter how good their technology is. When the next AI winter hits, Mistral will likely become insolvent too, since they are similarly hemorrhaging money and depend on investors to cover their losses.

By the way, you are lecturing someone who actually has worked on code for doing inference:

https://github.com/ryao/llama3.c


You are clearly on some sort of bender against Cerebras. I can tell from your comments that you are the same guy with the same objections from Twitter, LinkedIn, and Reddit. Why are you obsessed with them? I mean, sure, you seem to know your stuff, but some of your assumptions as to why they aren't viable are clearly stretches on the negative side (not that they are impossible, it is just that you don't have the info, and a company in a cutthroat competitive business has no obligation to share its proprietary business information). And being so well informed, you ought to know these are stretches, except you are blinded by some emotion for some reason. Sure, their solution has downsides (which implementation is perfect?), but they will have opportunities to improve in future iterations as they adapt to what the market actually wants rather than what they projected years ago before there was a clear signal. For now, they are a startup with an interesting solution that has some momentum in the marketplace. It remains to be seen how they fare, but your certainty that they won't succeed is certainly not warranted by the data. And oh, their customers include Mistral, Perplexity, Meta, and IBM. All of those know that the CEO pleaded guilty to accounting charges 18 years ago, after which he has worked continuously in the tech industry, including at AMD. A bunch of blue-chip tech investors from OpenAI, AMD, the present CEO of Intel, etc. invested with him knowing this. Please give it a rest.


Yes, I don't optimize inference at all myself.

I will have to think through your comment, but won't be able to do so properly this month.


OpenAI wanted to buy them. G42, the largest player in the Middle East, owns a big chunk. You are simply wrong about big investors not touching them, but my guess is they will be bought soon by Meta or Apple.


> Apple

I can't imagine Apple being interested.

Their priority is figuring out how to optimise Apple Silicon for LLM inference so it can be used in laptops, phones and data centres.


I can only imagine Apple being interested. Their NPU hardware is slower than Qualcomm's, their GPUs have been lagging behind Nvidia in all fields since launch, and they refuse to work with any industry leaders to ship a COTS solution. They don't have many options left on the table; "figuring out how to optimize Apple Silicon" has been the plan for 6 years now and no CUDA-killers have come out of the woodwork since then.

Either Apple entirely forfeits AI to the businesses capable of supplying it, or they change their tactics and do what Apple does best: grossly overpay for a moonshot startup that promises "X for the iPhone". I don't know if that implicates Cerebras, but clearly Apple didn't retain the requisite talent to compete for commercial AI inference capacity.


Cerebras’ technology works by using an entire wafer as a chip, and the power draw is 23kW, if I recall correctly. Their technology cannot be scaled down and only works when scaling up. They could not be more useless for Apple’s purposes. Acquiring them would only give Apple a bunch of chip design engineers who might or might not be able to make a decent NPU that uses DRAM.

That said, Apple has some talented people already and they likely just need to iterate to make their designs better. Bringing new people on board would just slow progress (see The Mythical Man-Month).


Now that Redis have u-turned, is it not worth Valkey and Redis having a chat and seeing if they can merge and combine their efforts?


It's not really a U-turn because they haven't returned to their original license. More of a tack by moving to AGPL.


They didn't move to the AGPL, just added it as an option on the side.


Because if it broke bad again, it could just be forked again, so no risk?


It's AGPL-licensed now, so forking it is fairly different from how it was when Valkey forked: it'll have to keep that license, and the AGPL is one that quite a lot of companies don't want to touch (whether or not you think that judgement is 'correct').


All of the cloud providers are already in on BSD-licensed Valkey, so which community is realistically going to take over AGPL Redis next time around? The much smaller group of Redict proponents?

There's only so much manpower to go around and much of it is otherwise preoccupied now.


antirez, is that you? ;)


Momentum is a very scary thing. I don’t want to be trained to install valkey-server instead of redis-server again (or vice versa).


How often do you (re)install redis-server for this to be a problem? It's just a thing you type once in a build script for me.

