Hacker News | pongogogo's comments

It's hard to tell from the data; it's so concentrated within a handful of companies who are all buying from each other that the contagion risk feels low. At the same time, it feels very clearly overvalued, and the size of the inflows is huge.


The contagion risk is huge. As the article points out in several ways, the AI bubble is the only part of the US economy where number go up.

Every single bank, fund, and retail investor is heavily, if not existentially, exposed to this house of cards. Absurd promises are being made with national economy-sized volumes of cash.

This is going to take everyone down when it blows.


The post mentions an approach of using a large model to generate labels and then distilling these into a smaller model to lower cost (though it doesn't provide an example).
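Since the post doesn't give an example, here is a hedged, stdlib-only sketch of the idea: an expensive "teacher" model labels raw texts once, then a tiny bag-of-words "student" is fitted so inference no longer needs the large model. The `teacher` callable is a stand-in assumption for any LLM API call, not something from the post.

```python
# Sketch: distil an expensive teacher model into a cheap keyword-overlap student.
from collections import Counter, defaultdict

def distil(texts, teacher):
    labels = [teacher(t) for t in texts]        # expensive LLM calls, run once
    word_counts = defaultdict(Counter)          # per-label word statistics
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())

    def student(text):
        # score each label by word overlap: a cheap replacement for the teacher
        words = text.lower().split()
        return max(word_counts,
                   key=lambda lbl: sum(word_counts[lbl][w] for w in words))
    return student

# usage: teacher could be any large-model call returning a label string
teacher = lambda t: "positive" if "good" in t else "negative"
student = distil(["good film", "good plot", "bad film", "bad acting"], teacher)
print(student("good acting"))  # -> positive
```

In practice the student would be a small fine-tuned model rather than keyword counts, but the shape of the pipeline (label once with the big model, train the small one on those labels) is the same.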


I've been meaning to write a post like this for a while but you've done a much better job.


I think this is a really interesting paper from Cohere. It really feels that, at this point in time, you can't trust any public benchmark, and you really need your own private evals.


Any tips on coming up with good private evals?


Yes, I wrote something up here on how Andrej Karpathy evaluated Grok 3 -> https://tomhipwell.co/blog/karpathy_s_vibes_check/

I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.
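To make that concrete, here is a minimal private-eval harness sketch. `run_model` is a placeholder for whatever model client you use; the prompts and checkers are the part you keep private. The case names here are illustrative assumptions, not from the linked post.

```python
# Sketch: a private eval is just prompts plus checkers you never publish.
EVAL_CASES = [
    # (prompt, checker) -- pick tasks the current model *fails*, so you can
    # watch how failures change across model generations
    ("What is 17 * 24?", lambda out: "408" in out),
    ("Name a word that rhymes with 'orange'.", lambda out: "sporange" in out.lower()),
]

def score(run_model):
    """Fraction of private cases the model passes."""
    results = [check(run_model(prompt)) for prompt, check in EVAL_CASES]
    return sum(results) / len(results)

# usage: compare generations by scoring each model's client side by side
print(score(lambda prompt: "408"))  # toy stub passes 1 of 2 cases -> 0.5
```

Because the cases never appear in training data, the score stays meaningful across releases in a way public leaderboards can't.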


Yup, in my private evals I have repeatedly found that DeepSeek has the best models for everything, and yet in a lot of these public ones it always seems like someone else is on top. I don't know why.


Publishing them might help you find out.


^ This.

If I had to hazard a guess, as a poor soul doomed to maintain several closed- and open-source models acting agentically: I think you are hyper-focused on chat/trivia use cases. (DeepSeek has a very, very hard time with tool calling, and they say as much themselves in their API docs.)
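One cheap way to eval tool calling yourself is to check whether a model emits parseable JSON with the right argument names for a declared tool. This is an illustrative sketch under assumed names (`get_weather`, the flat `{"name", "arguments"}` shape), not DeepSeek's or anyone's actual API format.

```python
# Sketch: validate a model's raw tool-call output against a simple tool spec.
import json

TOOL_SPEC = {"name": "get_weather", "required": ["city"]}

def valid_tool_call(raw: str) -> bool:
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False  # model chatted instead of calling the tool
    return (call.get("name") == TOOL_SPEC["name"]
            and all(k in call.get("arguments", {})
                    for k in TOOL_SPEC["required"]))

print(valid_tool_call('{"name": "get_weather", "arguments": {"city": "Paris"}}'))  # True
print(valid_tool_call("I will check the weather for you!"))                        # False
```

Running a batch of these against each model quickly surfaces the chat-vs-agentic gap the comment describes.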


Hey Mark, I actually found this post via yours so thanks!


The big news here is the training cost: $5.576M total, equivalent to 2.788M training hours on H800 GPUs at $2 per hour. This for a model that is (according to DeepSeek's own benchmarks) SOTA for open source.


From the same team that built Chatbot Arena (which used to be called LMSys).


Super handy. At a meta level, I like the way it's been quickly built as well, with a few choice prompts and a vector DB.


Here's a direct link to the database as well if you just want the juice -> https://www.zenml.io/llmops-database


Rule 1 is golden and oft forgotten.


I think that's harsh. He's right, and well ahead of his time.

