More

fmap · 2025-09-26T18:38:38 1758911918

Isn't this an old idea? E.g., here's a textbook on optimization algorithms for matrix manifolds https://press.princeton.edu/absil and here's a library that implements this in python for the Stiefel manifold that's the subject of this blog post: https://pymanopt.org/docs/stable/manifolds.html#module-pyman...

What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.

cs702 · 2025-09-26T19:16:50 1758914210

I don't think it's been tried at scale, with large models.

It remains to be seen if it works better than conventional training schemes.

godelski · 2025-09-26T20:27:18 1758918438

  > Isn't this an old idea?

So are neural networks. So is attention.

What's your point? Sometimes things need to be retried. Sometimes there are small subtle details that make or break an idea. So what's the point of acting dismissively? If an old idea that didn't work now works, then what's the problem? Besides, progress is typically iterative, not through leaps and bounds. The vast majority of things that look like leaps are just because we don't see the steps between.

The reason I'm saying this is because that sentiment is often used to pass over working solutions and slows down their progress. So even if unintentional it should cause us to rethink how we respond. Otherwise we end up with such silly claims like "Einstein just used Tensors" and "Nash just used topology". In some sense these are accurate, but they are too high level descriptions (and these are real dismissals. Which again, so what? If it works, it works?).

Why is "novelty" even a question? Novelty is only ever in the eyes of the beholder.

  > What is novel about the approach in the blog post? Serious question, I really can't tell after reading the post.

Honestly, I do not know, but I'll give you my best read on it.

1) Scale: Don't underestimate the importance of this. While I don't think scale is all you need, it certainly is a critical factor.

2) Different optimization: I may be missing something, but it looks like they are using a different optimizer. They mention that they're using the muon optimizer constraining to a Stiefel manifold. Neither of those things are unique on their own, but is their combination? This is where I'm uncertain because such a thing would be easy to miss. Maybe someone did and was unsuccessful with it. Maybe someone did, but was not at scale. Maybe someone did, it worked, and just nobody noticed (that happens a lot!).

So I think this is quite similar to how 99% of progress and breakthroughs are made: putting together ideas that seem unrelated and inventing some glue to generalize the process. At a high level this always looks like you're just putting existing things together, but that glue is really hard to make. And to continue that analogy, if we do a good enough job gluing things together then to anyone but an expert it'll look like there is no glue. It can be surprisingly difficult to tell if something is glued, welded, mated, milled, printed, or whatever. It usually takes a very keen eye to determine the answer non-destructively.

fmap · 2025-09-26T20:46:55 1758919615

Apologies if this came across the wrong way. I really do want to know what the novel contributions of the post are, because the author implies that something about what they're doing is solving previously open problems:

> I figured out how to solve manifold Muon in the square case late last year, but I was unable to solve the full rectangular case and thus posed the problem as an open problem on the Modula docs. Jianlin Su solved the problem this summer

It sounds like the generalisation of projected gradient decent to "Muon" is what they're focusing on, but the derivation is all about the retraction map on the Stiefel manifold? I think I'm missing some background here.

godelski · 2025-09-26T22:46:45 1758926805

  > Apologies if this came across the wrong way

I was uncertain but your other statements made me think that sentiment was unintentional. I just want to push back against it because it is too common and misused even with good intentions. I hope you don't see this as me saying anything about your character. Honestly, impressions are that you do care.

  > It sounds like the generalisation of projected gradient decent to "Muon"

I'm not a niche expert here, but do have knowledge in adjacent/overlapping domains. It sounds like you're in a similar boat? I ask because this pulls back to what I was trying to say about sometimes needing an expert eye.

If it helps, here's the "paper" for the Muon optimizer[0] and here's a follow-up[1]. Muon is definitely a gradient decent technique, but so are Adam, SGD, Ada, and many more[2].

The big thing of Muon is using NewtonSchulz5. So you update parameters with θ_{t-1} - η[NS_5(μB_{t-1} + ∇L(θ_{t-1}))] (I bracketed so you can see that this is just a specific version of θ_{t-1} - ηF(∇L(θ_{t-1}),...) which the standard gradient descent -- θ - η∇L(θ) -- is in that class of functions, right?). So we should be careful to over generalize and say that this is just gradient descent. You could even say [1] is "just [0] but with weight-decay" (or go look at the Adam and AdamW algos ;)

But one thing I should add is that gradient descent algorithms aren't topologically aware. I was able to find this post which asks a related question, trying to find what the conditions are for a surface's geodesic to align with gradient descent (note Newton differs from GD too). I don't think this paper is creating a solution where the GD formulation results in following a geodesic to the minimum, but my take is that it is working towards that direction. And to clarify, we'd want to follow the geodesic because that gives us the shortest or most energy efficient path (which ever perspective you want to use). In optimization we want to try to accomplish these two things (and more!): 1) take the "best" path to the optima, 2) find the best optima. Unfortunately these are ill-defined and there's not always objective answers to them. But in an ideal gradient descent algorithm we'd want it to go to the global minimum and take the fastest path, right? So with that it helps to be aware of the geometry (part of why people look at the Hessian but that comes at the cost of increased computation even if the additional information can get us there in fewer steps. So that's not (always) "the best").

I know this isn't a full answer and maybe with more reading I'll have a better one for you. But I'm hoping my answer can at least help you see some of the underlying nuanced problems that (_I think_) the authors are trying to get at. Hopefully I'm not too far off base lol. I'm hoping someone with more expertise can jump in and provide corrections/clarifications in the mean time.

[0] https://kellerjordan.github.io/posts/muon/

[1] https://arxiv.org/abs/2502.16982

[2] (far from a complete list) https://docs.pytorch.org/docs/stable/optim.html#algorithms

[3] (I think similar types of questions may also be fruitful) https://mathoverflow.net/questions/42617/functions-whose-gra...

fmap · 2025-05-20T16:17:05 1747757825

The reason deep learning is alchemy is that none of these deep theories have predictive ability.

Essentially all practical models are discovered by trial and error and then "explained" after the fact. In many papers you read a few paragraphs of derivation followed by a simpler formulation that "works better in practice". E.g., diffusion models: here's how to invert the forward diffusion process, but actually we don't use this, because gradient descent on the inverse log likelihood works better. For bonus points the paper might come up with an impressive name for the simple thing.

In most other fields you would not get away with this. Your reviewers would point this out and you'd have to reformulate the paper as an experience report, perhaps with a section about "preliminary progress towards theoretical understanding". If your theory doesn't match what you do in practice - and indeed many random approaches will kind of work (!) - then it's not a good theory.

theahura · 2025-05-20T16:23:20 1747758200

It's true that there is no directly predictive model of deep learning, and it's also true that there is some trial and error, but it is wrong to say that therefore there is no operating theory at all. I recommend reading Ilyas 30 papers (here's my review of that set: https://open.substack.com/pub/theahura/p/ilyas-30-papers-to-...) to see how shared intuitions and common threads are clearly developed over the last decade+

niemandhier · 2025-05-21T08:33:02 1747816382

That is a great list, do you know of something similar that is more recent?

fmap · 2025-05-06T21:31:21 1746567081

The usual way to check whether a definition is correct is to prove properties about it that you think should hold. TLA+ has good support for this, both with model checking as well as simple proofs.

fmap · 2025-04-04T12:01:19 1743768079

> even if this doesn’t lead to AGI, at the very least it’s likely the final “warning shot” we’ll get before it’s suddenly and irreversibly here.

I agree that it's good science fiction, but this is still taking it too seriously. All of these "projections" are generalizing from fictional evidence - to borrow a term that's popular in communities that push these ideas.

Long before we had deep learning there were people like Nick Bostrom who were pushing this intelligence explosion narrative. The arguments back then went something like this: "Machines will be able to simulate brains at higher and higher fidelity. Someday we will have a machine simulate a cat, then the village idiot, but then the difference between the village idiot and Einstein is much less than the difference between a cat and the village idiot. Therefore accelerating growth[...]" The fictional part here is the whole brain simulation part, or, for that matter, any sort of biological analogue. This isn't how LLMs work.

We never got a machine as smart as a cat. We got multi-paragraph autocomplete as "smart" as the average person on the internet. Now, after some more years of work, we have multi-paragraph autocomplete that's as "smart" as a smart person on the internet. This is an imperfect analogy, but the point is that there is no indication that this process is self-improving. In fact, it's the opposite. All the scaling laws we have show that progress slows down as you add more resources. There is no evidence or argument for exponential growth. Whenever a new technology is first put into production (and receives massive investments) there is an initial period of rapid gains. That's not surprising. There are always low-hanging fruit.

We got some new, genuinely useful tools over the last few years, but this narrative that AGI is just around the corner needs to die. It is science fiction and leads people to make bad decisions based on fictional evidence. I'm personally frustrated whenever this comes up, because there are exciting applications which will end up underfunded after the current AI bubble bursts...

afpx · 2025-04-05T13:13:45 1743858825

If you gather up a couple million of the smartest people on earth along with a few trillion dollars, and you add in super ambitious people eager to be culturally deified, you significantly increase the chance for breakthroughs. It's all probabiliities, though. But, right now there's no better game to bet on.

tim333 · 2025-04-04T13:08:42 1743772122

>There is no evidence or argument for exponential growth

I think the growth you are thinking of, self improving AI, needs the AI to be as smart as a human developer/researcher to get going and we haven't got there yet. But we quite likely will at some point.

maerF0x0 · 2025-04-04T16:05:29 1743782729

and the article specifically mentions the fictional company (clearly designed to generalize the Google/OpenAI's of the world) are supposedly (according to the article) working on building that capability. First by augmenting human researchers, later by augmenting itself.

gwd · 2025-04-04T13:05:54 1743771954

> Someday we will have a machine simulate a cat, then the village idiot... This isn't how LLMs work.

I think you misunderstood that argument. The simulate the brain thing isn't a "start from the beginning" argument, it's an "answer a common objection" argument.

Back around 2000, when Nick Bostrom was talking about this sort of thing, computers were simply nowhere near powerful enough to come even close to being smart enough to outsmart a human, except in very constrained cases like chess; we did't even have the first clue how to create a computer program to be even remotely dangerous to us.

Bostrom's point was that, "We don't need to know the computer program; even if we just simulate something we know works -- a biological brain -- we can reach superintelligence in a few decades." The idea was never that people would actually simulate a cat. The idea is, if we don't think of anything more efficient, we'll at least be able to simulate a cat, and then an idiot, and then Einstein, and then something smarter. And since we almost certainly will think of something more efficient than "simulate a human brain", we should expect superintelligence to come much sooner.

> There is no evidence or argument for exponential growth.

Moore's law is exponential, which is where the "simulate a brain" predictions have come from.

> It is science fiction and leads people to make bad decisions based on fictional evidence.

The only "fictional evidence" you've actually specified so far is the fact that there's no biological analog; and that (it seems to me) is from a misunderstanding of a point someone else was making 20 years ago, not something these particular authors are making.

I think the case for AI caution looks like this:

A. It is possible to create a superintelligent AI

B. Progress towards a superintelligent AI will be exponential

C. It is possible that a superintelligent AI will want to do something we wouldn't want it to do; e.g., destroy the whole human race

D. Such an AI would be likely to succeed.

Your skepticism seems to rest on the fundamental belief that either A or B is false: that superintelligence is not physically possible, or at least that progress towards it will be logarithmic rather than exponential.

Well, maybe that's true and maybe it's not; but how do you know? What justifies your belief that A and/or B are false so strongly, that you're willing to risk it? And not only willing to risk it, but try to stop people who are trying to think about what we'd do if they are true?

What evidence would cause you to re-evaluate that belief, and consider exponential progress towards superintelligence possible?

And, even if you think A or B are unlikely, doesn't it make sense to just consider the possibility that they're true, and think about how we'd know and what we could do in response, to prevent C or D?

Vegenoid · 2025-04-04T19:22:24 1743794544

> Moore's law is exponential, which is where the "simulate a brain" predictions have come from.

To address only one thing out of your comment, Moore's law is not a law, it is a trend. It just gets called a law because it is fun. We know that there are physical limits to Moore's law. This gets into somewhat shaky territory, but it seems that current approaches to compute can't reach the density of compute power present in a human brain (or other creatures' brains). Moore's law won't get chips to be able to simulate a human brain, with the same amount of space and energy as a human brain. A new approach will be needed to go beyond simply packing more transistors onto a chip - this is analogous to my view that current AI technology is insufficient to do what human brains do, even when taken to their limit (which is significantly beyond where they're currently at).

fmap · 2025-04-04T14:05:58 1743775558

> The idea is, if we don't think of anything more efficient, we'll at least be able to simulate a cat, and then an idiot, and then Einstein, and then something smarter. And since we almost certainly will think of something more efficient than "simulate a human brain", we should expect superintelligence to come much sooner.

The problem with this argument is that it's assuming that we're on a linear track to more and more intelligent machines. What we have with LLMs isn't this kind of general intelligence.

We have multi-paragraph autocomplete that's matching existing texts more and more closely. The resulting models are great priors for any kind of language processing and have simple reasoning capabilities in so far as those are present in the source texts. Using RLHF to make the resulting models useful for specific tasks is a real achievement, but doesn't change how the training works or what the original training objective was.

So let's say we continue along this trajectory and we finally have a model that can faithfully reproduce and identify every word sequence in its training data and its training data includes every word ever written up to that point. Where do we go from here?

Do you want to argue that it's possible that there is a clever way to create AGI that has nothing to do with the way current models work and that we should be wary of this possibility? That's a much weaker argument than the one in the article. The article extrapolates from current capabilities - while ignoring where those capabilities come from.

> And, even if you think A or B are unlikely, doesn't it make sense to just consider the possibility that they're true, and think about how we'd know and what we could do in response, to prevent C or D?

This is essentially https://plato.stanford.edu/entries/pascal-wager/

It might make sense to consider, but it doesn't make sense to invest non-trivial resources.

This isn't the part that bothers me at all. I know people who got grants from, e.g., Miri to work on research in logic. If anything, this is a great way to fund some academic research that isn't getting much attention otherwise.

The real issue is that people are raising ridiculous amounts of money by claiming that the current advances in AI will lead to some science fiction future. When this future does not materialize it will negatively affect funding for all work in the field.

And that's a problem, because there is great work going on right now and not all of it is going to be immediately useful.

0x008 · 2025-04-05T08:26:22 1743841582

I think the idea with LLMs leading to AGI is more like:

Natural language is a fuzzy context aware state machine of some sorts that can theoretically represent any arbitrarily complex state in the outside world given enough high quality text.

And by reiterating and extrapolating the rules found in human communication an AI could by the sheer ability to simulate infinitely long discussions discover new things, given the ability to independently verify outcomes.

hannasanarion · 2025-04-04T17:17:01 1743787021

> So let's say we continue along this trajectory and we finally have a model that can faithfully reproduce and identify every word sequence in its training data and its training data includes every word ever written up to that point. Where do we go from here?

This is a fundamental misunderstanding of the entire point of predictive models (and also of how LLMs are trained and tested).

For one thing, ability to faithfully reproduce texts is not the primary scoring metric being used for the bulk of LLM training and hasn't been for years.

But more importantly, you don't make a weather model so that it can inform you of last Tuesday's weather given information from last Monday, you use it to tell you tomorrow's weather given information from today. The totality of today's temperatures, winds, moistures, and shapes of broader climatic patterns, particulates, albedos, etc etc etc have never happened before, and yet the model tells us something true about the never-before-seen consequences of these never-before-seen conditions, because it has learned the ability to reason new conclusions from new data.

Are today's "AI" models a glorified autocomplete? Yeah, but that's what all intelligence is. The next word I type is the result of an autoregressive process occurring in my brain that produces that next choice based on the totality of previous choices and experiences, just like the Q-learners that will kick your butt in Starcraft choose the best next click based on their history of previous clicks in the game combined with things they see on the screen, and will have pretty good guesses about which clicks are the best ones even if you're playing as Zerg and they only ever trained against Terran.

A highly accurate autocomplete that is able to predict the behavior and words of a genius, when presented with never before seen evidence, will be able to make novel conclusions in exactly the same way as the human genius themselves would when shown the same new data. Autocomplete IS intelligence.

New ideas don't happen because intelligences draw them out of the aether, they happen because intelligences produce new outputs in response to stimuli, and those stimuli can be self-inputs, that's what "thinking" is.

If you still think that all today's AI hubbub is just vacuous hype around an overblown autocomplete, try going to Chatgpt right now. Click the "deep research" button, and ask it "what is the average height of the buildings in [your home neighborhood]"?, or "how many calories are in [a recipe that you just invented]", or some other inane question that nobody would have ever cared to write about ever before but is hypothetically answerable from information on the internet, and see if what you get is "just a reproduced word sequence from the training data".

gwd · 2025-04-04T20:37:35 1743799055

> We have multi-paragraph autocomplete that's matching existing texts more and more closely.

OK, I think I see where you're coming from. It sounds like what you're saying is:

E. LLMs only do multi-paragraph autocomplete; they are and always will be incapable of actual thinking.

F. Any approach capable of achieving AGI will be completely different in structure. Who knows if or when this alternate approach will even be developed; and if it is developed, we'll be starting from scratch, so we'll have plenty of time to worry about progress then.

With E, again, it may or may not be true. It's worth noting that this is a theoretical argument, not an empirical one; but I think it's a reasonable assumption to start with.

However, there are actually theoretical reasons to think that E may be false. The best way to predict the weather is to have an internal model which approximates weather systems; the best way to predict the outcome of a physics problem is to have an internal model which approximates the physics of the thing you're trying to predict. And the best way to predict what a human would write next is to have a model of a human mind -- including a model of what the human mind has in its model (e.g., the state of the world).

There is some empirical data to support this argument, albeit in a very simplified manner: They trained a simple LLM to predict valid moves for Othello, and then probed it and discovered an internal Othello board being simulated inside the neural network:

https://thegradient.pub/othello/

And my own experience with LLMs better match the "LLMs have an internal model of the world" theory than the "LLMs are simply spewing out statistical garbage" theory.

So, with regard to E: Again, sure, LLMs may turn out to be a dead end. But I'd personally give the idea that LLMs are a complete dead end a less than 50% probability; and I don't think giving it an overwhelmingly high probability (like 1 in a million of being false) is really reasonable, given the theoretical arguments and empirical evidence against it.

With regard to F, again, I don't think this is true. We've learned so much about optimizing and distilling neural nets, optimizing training, and so on -- not to mention all the compute power we've built up. Even if LLMs are a dead end, whenever we do find an architecture capable of achieving AGI, I think a huge amount of the work we've put into optimizing LLMs will put is way ahead in optimizing this other system.

> ...that the current advances in AI will lead to some science fiction future.

I mean, if you'd told me 5 years ago that I'd be able to ask a computer, "Please use this Golang API framework package to implement CRUD operations for this particular resource my system has", and that the resulting code would 1) compile out of the box, 2) exhibit an understanding of that resource and how it relates to other resources in the system based on having seen the code implementing those resources 3) make educated guesses (sometimes right, sometimes wrong, but always reasonable) about details I hadn't specified, I don't think I would have believed you.

Even if LLM progress is logarithmic, we're already living in a science fiction future.

EDIT: The scenario actually has very good technical "asides"; if you want to see their view of how a (potentially dangerous) personality emerges from "multi-paragraph auto-complete", look at the drop-down labelled "Alignment over time", and specifically what follows "Here’s a detailed description of how alignment progresses over time in our scenario:".

https://ai-2027.com/#alignment-over-time

gwd · 2025-04-05T20:26:23 1743884783

> E. LLMs only do multi-paragraph autocomplete; they are and always will be incapable of actual thinking.

FWIW, this guy thinks E is true, and that he has a better direction to head in:

https://www.youtube.com/watch?v=ETZfkkv6V7Y

HN discussion about a related article I didn't read:

https://news.ycombinator.com/item?id=43562768

vonneumannstan · 2025-04-04T15:04:13 1743779053

>All of these "projections" are generalizing from fictional evidence - to borrow a term that's popular in communities that push these ideas.

This just isn't correct. Daniel and others on the team are experienced world class forecasters. Daniel wrote another version of this in 2021 predicting the AI world in 2026 and was astonishingly accurate. This deserves credence.

https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-...

>he arguments back then went something like this: "Machines will be able to simulate brains at higher and higher fidelity.

Complete misunderstanding of the underlying ideas. Just in not even wrong territory.

>We got some new, genuinely useful tools over the last few years, but this narrative that AGI is just around the corner needs to die. It is science fiction and leads people to make bad decisions based on fictional evidence.

You are likely dangerously wrong. The AI field is near universal in predicting AGI timelines under 50 years. With many under 10. This is an extremely difficult problem to deal with and ignoring it because you think it's equivalent to overpopulation on mars is incredibly foolish.

https://www.metaculus.com/questions/5121/date-of-artificial-...

https://wiki.aiimpacts.org/doku.php?id=ai_timelines:predicti...

loganmhb · 2025-04-04T15:55:20 1743782120

I respect the forecasting abilities of the people involved, but I have seen that report described as "astonishingly accurate" a few times and I'm not sure that's true. The narrative format lends itself somewhat to generous interpretation and it's directionally correct in a way that is reasonably impressive from 2021 (e.g. the diplomacy prediction, the prediction that compute costs could be dramatically reduced, some things gesturing towards reasoning/chain of thought) but many of the concrete predictions don't seem correct to me at all, and in general I'm not sure it captured the spiky nature of LLM competence.

I'm also struck by the extent to which the first series from 2021-2026 feels like a linear extrapolation while the second one feels like an exponential one, and I don't see an obvious justification for this.

vonneumannstan · 2025-04-07T13:32:21 1744032741

Yeah funny how that works: https://149909199.v2.pressablecdn.com/wp-content/uploads/201...

Workaccount2 · 2025-04-04T15:43:56 1743781436

>2025:...Making models bigger is not what’s cool anymore. They are trillions of parameters big already. What’s cool is making them run longer, in bureaucracies of various designs, before giving their answers.

Dude was spot on in 2021, hot damn.

fennecfoxy · 2025-04-10T11:14:02 1744283642

I mean MoE/agent work was being done in 2021 I'm pretty sure. Definitely more accurate than most predictions but perhaps not revolutionary to state that the tail follows the dog.

arduanika · 2025-04-05T19:20:52 1743880852

https://www.wikihow.com/Leave-a-Cult

whiplash451 · 2025-04-04T14:08:18 1743775698

> there are exciting applications which will end up underfunded after the current AI bubble bursts

Could you provide examples? I am genuinely interested.

fmap · 2025-04-05T06:54:32 1743836072

I'm personally very excited about the progress in interactive theorem proving. Before the current crop of deep learning heuristics there was no generally useful higher-order automated theorem proving system. Automated theorem proving could prove individual statements based on existing lemmas, but that only works in extremely restricted settings (propositional or first-order logic). The problem is that in order to apply a statement of the form "for all functions with this list of properties, ..." you need to come up with a function that's related to what you're trying to prove. This is equivalent to coming up with new lemmas and definitions, which is the actually challenging part of doing mathematics or verification.

There has finally been progress here, which is why you see high-profile publications from, e.g., Deepmind about solving IMO problems in Lean. This is exciting, because if you're working in a system like Coq or Lean your progress is monotone. Everything you prove actually follows from the definitions you put in. This is in stark contrast to, e.g., using LLMs for programming, where you end up with a tower of bugs and half-working code if you don't constantly supervise the output.

---

But well, the degree of excitement is my own bias. From other people I spoke to recently: - Risk-assessment diagnostics in medicine. There are a bunch of tests that are expensive and complex to run and need a specialist to evaluate. Deep learning is increasingly used to make it possible to do risk assessments with cheaper automated tests for a large population and have specialists focus on actual high-risk cases. Progress is slow for various reasons, but it has a lot of potential. - Weather forecasting uses a sparse set of inputs: atmospheric data from planes, weather baloons, measurements at ground stations, etc. This data is then aggregated with relatively stupid models to get the initial conditions to run a weather simulation. Deep learning is improving this part, but while there has been some encouraging initial progress this needs to be better integrated with existing simulations (purely deep learning based approaches are apparently a lot worse at predicting extreme weather events). Those simulations are expensive, they're running on some of the largest supercomputers in the world, which is why progress is slow.

whiplash451 · 2025-04-04T14:10:46 1743775846

There is no need to simulate Einstein to transform the world with AI.

A self-driving car would already be plenty.

skydhash · 2025-04-04T17:08:26 1743786506

And a self driving car is not even necessary if we’re thinking about solving transportation problems. Train and bus are better at solving road transportation at scale.

fmap · 2024-12-30T09:53:33 1735552413

Adding to this, inference is getting cheaper and more efficient all the time. The investment bubble is probably the biggest reason why inference hardware is so expensive at the moment and why startups in this sector are only targeting large scale applications.

Once this bubble bursts, local inference will become even more affordable than it already is. There is no way that there will be a moat around running models as a service.

---

Similarly, there probably won't be a "data moat". The whole point of large foundation models is that they are great priors. You need relatively few examples to fine tune an LLM or diffusion model to get it to do what you want. So long as someone releases up to date foundation models there is no moat here either.

fmap · on July 25, 2024

Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

By contrast, this kind of open ended formalization is something where progress used to be extremely slow and incremental. I worked in an adjacent field 5 years ago and I cannot stress enough that these results are simply out of reach for traditional automated reasoning techniques.

Real automatic theorem proving is also useful for a lot more than pure mathematics. For example, it's simple to write out an axiomatic semantics for a small programming language in lean and pose a question of the form "show that there exists a program which satisfies this specification".

If this approach scales it'll be far more important than any other ML application that has come out in the last few years.

cubefox · on July 25, 2024

> Agreed, this is a big step forward. Geometry problems are in a different class, since you can translate them into systems of polynomial equations and use well known computer algebra algorithms to solve them.

The blog post indicates the opposite. The geometry problem in the IMO problem set was solved by AlphaGeometry 2, which is an LLM based on Google's Gemini. LLMs are considered relatively general systems. But the other three solved problems were proved by AlphaProof, which is a narrow RL system that is literally based on AlphaZero, the Go and Chess AI. Only its initial (bootstrapping) human training data (proofs) were formalized and augmented by an LLM (Gemini).

dash2 · on July 26, 2024

AlphaZero is more general than a Go and Chess AI, right? Isn't it a general self-play algorithm?

sapiogram · on July 26, 2024

Only slightly more general. It only works for games that are zero-sum, deterministic, have no hidden information, and discrete game state and moves. Other examples include connect-4.

cubefox · on July 26, 2024

So finding Lean proofs can be conceptualized as a zero-sum game?

Another basic requirement is that valid moves / inference steps and the winning condition can be efficiently verified using some non-AI algorithm. Otherwise there would not be a reward signal for the reinforcement learning algorithm. This is different from answering most natural language questions, where the answer can't be checked trivially.

fmap · on July 26, 2024

Theorem proving can be formulated as a game, see e.g., https://plato.stanford.edu/entries/logic-games/ and interactive theorem provers can verify that a proof is correct (and related sub problems, such as that a lemma application is valid).

Conceptually, if you're trying to show a conjunction, then it's the other player's turn and they ask you for a proof of a particular case. If you're trying to show a disjunction then it's your turn and you're picking a case. "Forall" is a potentially infinite conjunction, "exists" is a potentially infinite disjunction.

In classical logic this collapses somewhat, but the point is that this is still a search problem of the same kind. If you want to feel this for yourself, try out some proofs in lean or coq. :)

sapiogram · on July 26, 2024

I don't think AlphaZero is related to this work, apart from both being NN-based. AlphaZero and its training pipeline fundamentally only works for "chess-like" two-player games, where the agent can play against itself and slowly improve through MCTS.

adroniser · on July 26, 2024

"AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go."

redman25 · on July 26, 2024

I think that it can only play “games” for which it has perfect information about the state of the “game”.

lacker · on July 26, 2024

AlphaGeometry is not an LLM

cubefox · on July 26, 2024

It is an LLM combined with a symbolic deduction engine.

fmap · on June 26, 2024

You don't get new theorems if you remove assumptions. Rather, you get the ability to add different assumptions.

The Banach-Tarski paradox shows that classical set theory makes the wrong assumptions to intrinsically model measure theory and probability.

There are other systems which don't suffer from this paradox and hence don't need all the machinery of sigma algebras and measurable sets.

I wish there was a good accessible book/article/blog post about this, but as is you'd have to Google point-free topology or topos of probability (there are several).

skhunted · on June 26, 2024

I think the following is a valid question. At least I hope it isn’t completely stupid.

Is there a known set theory of the form ZF+(something) which relatively consistent with ZFC in which additive, isometry invariant measures exist?

I guess what you are saying is that the only known, reasonable way around this is the topos notion you mentioned.

fmap · on June 22, 2024

This is NYSE specific: https://www.nyse.com/article/parity-priority-explainer

nkaz001 · on June 22, 2024

Thank you! I know that NASDAQ and CME also have designated market makers, but I couldn't find any specific privileges they have over non-market makers, apart from the different fee schedules. Is the ability of NYSE designated market makers to handle order fills a common practice across other exchanges?

fmap · on May 20, 2024

For this particular question we have a great answer in the form of homotopy type theory. It handles all the complications that the author mentions. This is one of the reasons type theorists were excited about HoTT in the first place.

The only problem is that the author is working in Lean and apparently* dismissive of non-classical type theory. So now we're back to lamenting that equality in type theory is broken...

*) I'm going by second hand accounts here, please correct me if you think this statement is too strong.

cwzwarich · on May 20, 2024

> For this particular question we have a great answer in the form of homotopy type theory. It handles all the complications that the author mentions. This is one of the reasons type theorists were excited about HoTT in the first place.

Arguably, Homotopy type theory still doesn't solve the problem, because while it strengthens the consequences of isomorphism it doesn't distinguish between "isomorphism" and "canonical isomorphism", whereas (some?) mathematicians informally seem to think that there's a meaningful difference.

> The only problem is that the author is working in Lean and apparently* dismissive of non-classical type theory. So now we're back to lamenting that equality in type theory is broken...

In my opinion, Lean made a number pragmatic choices (impredicative Prop, non-classical logic including the axiom of global choice, uniqueness of identity proofs, etc.) that enable the practical formalization of mathematics as actually done by the vast majority of mathematicians who don't research logic, type theory, or category theory.

It's far from established that this is possible at all with homotopy type theory, yet alone whether it would actually be easier or more economical to do so. And even if this state of affairs is permanent, homotopy type theory itself would still be an interesting topic of study like any other field of mathematics.

lupire · on May 20, 2024

"Canonical" refers largely to the names we give to things, so that we maximize the extent to which an isomorphism maps the object named A in X to the object named A in Y. And also to choosing an isomorphism that embeds in a different isomorphism (of superstructures) that isn't strictly the topic of the current question.

I don't think anyone thinks canonical isomorphisms are mathematically controversial (except some people having fun with studying scenarios where more than one isomorphism is equally canonical, and other meta studies), they are a convenient communication shorthand for avoiding boring details.

https://en.m.wikipedia.org/wiki/Isomorphism_theorems

moomin · on May 20, 2024

I think the original author makes a good case that a) the shorthand approach isn’t going to fly as we formalise b) different mathematicians mean different things by canonical and these details matter when trying to combining results and that therefore 3) it would be better to provide a proper definition of what you’re talking about and give it an unambiguous name.

euiq · on May 20, 2024

let me guess: you haven't read the article, have you?

xigoi · on May 21, 2024

> Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".

https://news.ycombinator.com/newsguidelines.html

fmap · on May 20, 2024

> it doesn't distinguish between "isomorphism" and "canonical isomorphism"

You can distinguish these concepts in (higher )category theory, where isomorphisms are morphisms in groupoids and canonical isomorphisms are contractible spaces of morphisms. These sound like complicated concepts, but in HoTT you can discover the same phenomena as paths in types (i.e., equality) and a simple definition of contractibility which looks and works almost exactly like unique existence.

> In my opinion, Lean made a number pragmatic choices (impredicative Prop, non-classical logic including the axiom of global choice, uniqueness of identity proofs, etc.) that enable the practical formalization of mathematics as actually done by the vast majority of mathematicians who don't research logic, type theory, or category theory.

The authors of Lean are brilliant and I'm extremely happy that more mathematicians are looking into formalisations and recognizing the additional challenges that this brings. At the same time, it's a little bit depressing that we had finally gotten a good answer to many (not all!) of these additional challenges only to then retreat back to familiar ground.

Anyway, there were several responses to my original comment and instead of answering each individually, I'm just going to point to the article itself. The big example from section 5 is that of localisations of R-algebras. Here is how this whole discussion changes in HoTT:

1) We have a path R[1/f][1/g] = R[1/fg], therefore the original theorem is applicable in the case that the paper mentions.

2) The statements in terms of "an arbitrary localisation" and "for all particular localisations" are equivalent.

3) ...and this is essentially because in HoTT there is a way to define the localisation of an R-algebra at a multiplicative subset. This is a higher-inductive type and the problematic aspects of the definition in classical mathematics stem from the fact that this is not (automatically) a Set. A higher-inductive definition is a definition by a universal property, yet you can work with this in the same way as you would with a concrete construction and the theorems that the article mentions are provable.

---

There is much more that can be said here and it's not all positive. The only point I want to make is that everybody who ever formalised anything substantial is well aware of the problem the article talks about: you need to pick the correct definitions to formalise, you can't just translate a random math textbook and expect it to work. Usually, you need to pick the correct generalisation of a statement which actually applies in all the cases where you need to use it.

Type theory is actually especially difficult here, because equality in type theory lacks many of the nice properties that you would expect it to have, on top of issues around isomorphisms and equality that are left vague in many textbooks/papers/etc. HoTT really does solve this issue. It presents a host of new questions and it's almost certain that we haven't found the best presentation of these ideas yet, but even the presentation we have solves the problems that the author talks about in this article.

analytiqhub · on May 21, 2024

Homotopy type theory is definitely the way to go. This 2016 talk from Vladimir Voevodsky is a good entrypoint for the general audience:

https://www.youtube.com/watch?v=gOBqROtRoPA

aureianimus · on May 21, 2024

I'd love to hear a little bit more on what you think the downsides are? (Or a recommendation for a resource to read up on this?)

zozbot234 · on May 21, 2024

I'd also like to know more about that. My comment points out some things about HoTT that might be described as "downsides" by some but are more like things that people might expect from HoTT given that it 'handles' type equality in a rigorous yet flexible way, but aren't actually feasible in a rigorous treatment, i.e. the system is not going to pick up all the issues that are "brushed aside" in an informal treatment and resolve them on its own. Its behavior will point out to you that these issues exist, but you'll still have to do a lot of work dotting the i's and crossing the t's, as with any formalization.

zozbot234 · on May 20, 2024

> It handles all the complications that the author mentions. This is one of the reasons type theorists were excited about HoTT in the first place.

It's not that simple, Buzzard actually seems to be quite familiar with HoTT. The HoTT schtick of defining equality between types as isomorphism does 'handle' these issues for you in the sense that they are treated rigorously, but there's no free lunch in that. You still need to do all the usual work establishing compatibility of values, functions etc. with the relevant isomorphisms, so other than the presentation being somewhat more straightforward there is not much of a gain in usability.

zozbot234 · on May 21, 2024

Sorry about not providing a reference re: Kevin Buzzard's previously-stated thoughts on HoTT, but see https://ncatlab.org/nlab/show/Kevin+Buzzard

jhanschoo · on May 20, 2024

Can you elaborate more on how you think HoTT can solve the issue Buzzard mentions that when formalizing informal mathematics, the author frequently has a model (or, a particular construction) in mind and requires properties present in the model, and then claims that this is true for all "canonically equivalent" objects without actually spending time explicating the abstract nonsense required to properly specify the properties that canonically equivalent objects preserve, for their developement? (see: a product vs. the product)

Edit: and furthermore, the situation where the obvious choice of universal property (read: level of isomorphism) was a poor one, in their attempt to formalize localizations.

YetAnotherNick · on May 20, 2024

I think 99.9% of the people talking about HoTT haven't used HoTT in practice. They just see isomorphism is equivalent to equivalence(or something similar), comes up with a mental model and think this is the main contribution of HoTT, whereas isomorphism is a well studied thing even before formal set theory was a thing. HoTT gets weird and anyone who has tried any tool to prove anything in HoTT knows this, and this is the reason why it didn't got adopted even though many leading tools like lean worked hard to implement it.

staunton · on May 20, 2024

Have you used it? If yes, could you elaborate? Or if you haven't, can you point to any recent source on what it's like to be using HoTT in 2024?

YetAnotherNick · on May 20, 2024

[1] is the only video which I would say I found any bit accessible. But to be honest even after spending days, I couldn't grok it. And in my observation, different implementations of HoTT are quite different from each other.

[1]: https://www.youtube.com/watch?v=x4cz1OgpU3M

bumbledraven · on May 20, 2024

For reference, the paper mentions the term "homotopy" exactly once, in this sentence:

> The set-theoretic definition is too strong (Cauchy reals and Dedekind reals are certainly not equal as sets) and the homotopy type theoretic definition is too weak (things can be equal in more than one way, in contrast to Grothendieck’s usage of the term).

voxl · on May 20, 2024

Classical versus non-classical has nothing to do with it. HoTT does not work to capture what a mathematician means by "unique isomorphism" because witnesses of equality in HoTT are very much not unique.

nicklecompte · on May 20, 2024

I think HOTT helps remove a lot of the footguns and computational hairiness around using type-theoretic mathematics. But the issues Buzzard is talking about are really more profound than that. This relates to a much deeper (but familiar and nontechnical) philosophical problem: you assign a symbol to a thing, and read two sentences that refer to the same symbol ("up to isomorphism"), but not necessarily the same thing. If the sentences contradict each other, is it because they disagree on the meaning of the symbol, or do they disagree on the essence of the thing? There is a strong argument that most debates about free will and consciousness are really about what poorly-understood thing(s) we are assigning to the symbol. Worse, this makes such debates ripe for bad faith, where flaws in an argument are defended by implicitly changing the meaning of the symbol.

In the context of """simple""" mathematics, preverbal toddlers and chimpanzees clearly have an innate understanding of quantity and order. It's only after children fully develop this innate understanding that there's any point in teaching them "one," "two," "three," and thereby giving them the tools for handling larger numbers. I don't think it makes sense to say that toddlers understand the Peano axioms. Rather, Peano formulated the axioms based on his own (highly sophisticated) innate understanding of number. But given he spent decades of pondering the topic, it seems like Peano's abstract conception of "number" became different from (say) Kronecker's, or other constructivists/etc. Simply slapping the word "integer" on two different concepts and pointing out that they coincide for quantities we can comprehend doesn't actually do anything by itself to address the discrepancy in concept revealed by Peano allowing unbounded integers and Kronecker's skepticism. (The best argument against constructivism is essentially sociological and pragmatic, not "mathematically rational.")

Zooming out a bit, I suspect we (scientifically-informed laypeople + many scientists) badly misunderstand the link between language and human cognition. It seems more likely to me that we have extremely advanced chimpanzee brains that make all sorts of sophisticated chimpanzee deductions, including the extremely difficult question of "what is a number?", but to be shared (and critically investigated) these deductions have to be squished into language, as a woefully insufficient compromise. And I think a lot of philosophical - and metamathematical - confusion can be understood as a discrepancy between our chimpanzee brains having a largely rigorous understanding of something, but running into limits with our Broca's and Wernicke's areas, limits which may or may not be fixed by "technological development" in human language. (Don't even get me started on GPT...)

psychoslave · on May 21, 2024

The level of insights and clearness in your message, and most comments in this thread actually, is surprising great compared to what I often read on HN.

Thank you very much for this

Ericson2314 · on May 20, 2024

The fact that Buzzard is writing this paper at all to me shows a much higher degree of respect than before --- see how humble the overview is. I think this is great progress!

fmap · on March 19, 2024

Others have already mentioned this, but Alan Kay spent his entire career experimenting with approaches to radically simplify computing. The progress reports from vpri give a good overview of a recent project that has since wrapped up, e.g.: https://tinlizzie.org/VPRIPapers/tr2012001_steps.pdf

YeGoblynQueenne · on March 19, 2024

>> Others have already mentioned this, but Alan Kay spent his entire career experimenting with approaches to radically simplify computing.

Yes. And yet, he created OOP [1]. Strange.

________________

[1] Not on his own.

jdlshore · on March 19, 2024

You’re saying this like it’s a “gotcha,” which makes me believe you’re confusing OOP-as-practiced-in-enterprise with OOP-as-invented-by-Kay.

The sad truth is that most great ideas are doomed to drown in a sea of mediocrity and misinterpretation.

YeGoblynQueenne · on March 19, 2024

I'm not confusing anything. OOP is a horribly OOT over-engineering of a problem looking for a solution, from the get go. That it morphed and twisted into a bloated mess when it was taken up by people who didn't get it to begin with is just the inevitable result of starting with a big pile of no need to do all that in the first place.

Phiwise_ · on March 20, 2024

[Citation Needed]