SimulaVR[1][2] is releasing our standalone (belt-strappable) compute packs this year, which will (i) come pre-installed with our FOSS Linux VR Desktop compositor and (ii) work with AR headsets like the Rokid Max series (and potentially the XReal headsets). So basically: you'll get full Linux Desktop apps in AR (not just Android ones), with actual VR window management (not just "dumb monitor mode").
[1] I know we're taking forever D: But we intend for this to be a way to release an intermediate product (which we've been making anyway for our full headsets).
rr and gdb are heavily focused on DWARF-based debugging. As long as Haskell has only basic DWARF debugging support, I wonder how much rr/gdb can do.
Though I do see a lot of promise for the future. rr can make premature evaluation in the debugger (many expressions get evaluated earlier than they would in a real Haskell program, simply because a user wants to inspect a value) matter much less, since that evaluation can be executed in a diversion session.
Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked system card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?
I think this is with and without "tools." They explain it in the system card:
> We evaluate SWE-bench in two settings:
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
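For what it's worth, my reading of that Agentless pass@1 description is roughly the following (a sketch of the metric as described, not OpenAI's actual evaluation code):

```python
# Sketch of the Agentless pass@1 averaging described above (my reading,
# not OpenAI's code). Each instance has several sampled attempts; each
# attempt either produced a valid patch (which then passed or failed) or not.
def pass_at_1(instances):
    scores = []
    for attempts in instances:  # attempts: list of (valid_patch, passed) booleans
        valid = [passed for valid_patch, passed in attempts if valid_patch]
        # No valid patch on any attempt -> the instance counts as incorrect.
        scores.append(sum(valid) / len(valid) if valid else 0.0)
    return sum(scores) / len(scores)
```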
So am I to understand that they used their internal tooling scaffold for the o3-mini (tools) results only? Because if so, I really don't like that.
While it's nonetheless impressive that o3-mini scored 61% on SWE-bench when combined with their tool scaffolding, the Agentless comparison with other models is less impressive: 40% vs. 35% against o1-mini, if you look at the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).
It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.
Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
YC usually says “a startup is the point in your life where tricks stop working”.
Sam Altman is somehow finding this out now, the hard way.
Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except as media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).
My guess is this cheap mini-model is coming out now because DeepSeek very recently shook the stock market with its low price and relatively good performance.
> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.
I have no idea what an "internal tools scaffold" is, but the graph on the card that they link directly to labels that result "o3-mini (tools)", while the blog post compares it against the other models.
Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
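I'd guess the loop looks something like the sketch below (purely illustrative; the actual internal scaffold isn't public, and all of these names are made up):

```python
# Illustrative sketch of an iterate-on-execution-output scaffold.
# All names here (model.generate_patch, repo.apply, repo.run_tests,
# repo.revert) are hypothetical, not OpenAI's actual tooling.
def solve(issue, repo, model, max_steps=10):
    history = [issue]
    for _ in range(max_steps):
        patch = model.generate_patch(history)   # propose an edit
        repo.apply(patch)                       # apply it to the checkout
        result = repo.run_tests()               # actually run the code
        if result.passed:
            return patch                        # done: tests pass
        history.append(result.output)           # feed execution output back in
        repo.revert(patch)                      # try again from a clean state
    return None
```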
Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.
Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".
Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.
And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...
If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.
I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.
Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.
The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why, you can always put them into an AI and ask it to summarise them for you.
OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.
And they still are even when they're just blatantly wrong or simply lazy; it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.
They are implying the release was rushed and that they had to reduce the functionality of the model in order to make sure it didn't teach people how to make dirty bombs.
The problem is that they don't make the LLM better at instruction following; they just make it unable to produce furry erotica even if Amazon wants it to.
The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".
The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for grunt-work classification tasks (such as "based on comments, what is the public response to this policy?").
1. Suppose we have a `./nixos_binary_program_with_glibc-newer` compiled on a NixOS machine against bleeding edge `glibc-newer`.
2. `./nixos_binary_program_with_glibc-newer` will have the `/nix/store/glibc-newer/ld-linux.so` path hardcoded into its ELF interpreter field, which is used when the program launches to find all of the program's shared libraries, and so forth. (And this is a fact that `ldd` will obfuscate!)
3. When `./nixos_binary_program_with_glibc-newer` is distributed to machines which use `glibc-older` instead of `glibc-newer`, the hardcoded `ld-linux.so` path from (2) won't exist, leading to a launch error.
4. (3) will also happen on machines which don't use nix in the first place.
=======Will's Solution========
1. Use `patchelf` to hardcode a standard FHS `ld-linux.so` location into `nixos_binary_program_with_glibc-newer`'s ELF interpreter field (using e.g. `/lib64/ld-linux-x86-64.so.2` as the path)
2. Use a metaloader to launch `nixos_binary_program_with_glibc-newer` with an augmented `RPATH` which has a bunch of different `/nix/store/glibc-newer` paths, so that nix machines can find a suitable `ld-linux.so` to launch the program with.
This will make `nixos_binary_program_with_glibc-newer` work on any machine, including both non-nix machines and nix machines (which might be running older versions of glibc by default)!
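A rough sketch of what those two steps might look like (assumed commands and paths; this isn't Will's actual tooling, and the `/nix/store` glob is a guess):

```python
#!/usr/bin/env python3
# Hypothetical sketch of the approach above, not the actual tooling.
import glob
import os
import subprocess
import sys

BINARY = "./nixos_binary_program_with_glibc-newer"
FHS_LOADER = "/lib64/ld-linux-x86-64.so.2"

# Step 1: hardcode the standard FHS loader path into the ELF interpreter
# field, so ordinary (non-nix) machines can launch the binary directly.
subprocess.run(["patchelf", "--set-interpreter", FHS_LOADER, BINARY], check=True)

# Step 2 ("metaloader"): on nix machines the FHS path may not exist, so
# search /nix/store for a glibc loader (glob pattern is a guess) and launch
# the binary through it, pointing it at that glibc's libraries as well.
if os.path.exists(FHS_LOADER):
    os.execv(BINARY, [BINARY])
else:
    loaders = glob.glob("/nix/store/*glibc*/lib/ld-linux-x86-64.so.2")
    if not loaders:
        sys.exit("no ld-linux.so found in /nix/store")
    env = dict(os.environ, LD_LIBRARY_PATH=os.path.dirname(loaders[0]))
    os.execve(loaders[0], [loaders[0], BINARY], env)
```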
Assuming a 45 liter tank you'd spend €84.60. Mind that a LOT of that is tax. If you strip off Value Added Tax -- equivalent to sales tax if I understand correctly, though a fair bit higher -- you'd pay €70.50. That still leaves excise tax, which is levied specifically on goods the government wants to discourage. Cigarettes, Alcohol, Gas. Without that you'd pay €34.99.
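Just deriving the per-litre numbers implied by those figures (45 liter tank assumed):

```python
# Per-litre breakdown implied by the figures above (45 L tank assumed).
full, ex_vat, ex_excise, litres = 84.60, 70.50, 34.99, 45
print(full / litres)                    # ~1.88 EUR/L at the pump
print((full - ex_vat) / ex_vat)         # ~0.20 -> implied VAT rate of ~20%
print((ex_vat - ex_excise) / litres)    # ~0.79 EUR/L of excise duty
print(ex_excise / litres)               # ~0.78 EUR/L before any tax
```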
I'm not sure how the tax breakdown works in Texas, but I'd not be surprised if it'd be one of the main drivers of the price difference.
Understood. We opted to make tradeoffs that optimize for VR pixel density over headset lightness (since we think pixel density/text readability is ultimately a bigger barrier for long sessions than weight).
With that said, our compute pack itself is detachable, so it can be offloaded from the head.
[1] I know we're taking forever D: But we intend for this to be a way to release an intermediate product (which we've been making anyway for our full headsets).
[2] Our next blog update will be about this. Here's a video preview: https://youtube.com/shorts/Y67D8DkqScU?si=LpdSpjmfGn2k2rxP