SimulaVR[1][2] is releasing our standalone (belt-strappable) compute packs this year, which will (i) come pre-installed with our FOSS Linux VR Desktop compositor and (ii) work with AR headsets like the Rokid Max series (and potentially the XReal headsets). So basically: you'll get full Linux Desktop apps in AR (not just Android ones), with actual VR window management (not just "dumb monitor mode").
[1] I know we're taking forever D: But we intend for this to be a way to release an intermediate product (which we've been making anyway for our full headsets).
rr and gdb are heavily focused on DWARF-based debugging. As long as Haskell has only basic DWARF debugging support, I wonder how much rr/gdb can do.
Though I do see a lot of promise for the future. rr can make premature evaluation in the debugger (many expressions get evaluated earlier than they would in a real Haskell program, simply because a user wants to inspect a value) matter much less, since that evaluation can be executed in a diversion session.
Did anyone else notice that o3-mini's SWE-bench score dropped from 61% in the leaked system card earlier today to 49.3% in this blog post, which puts o3-mini back in line with Claude on real-world coding tasks?
I think this is with and without "tools." They explain it in the system card:
> We evaluate SWE-bench in two settings:
> *• Agentless*, which is used for all models except o3-mini (tools). This setting uses the Agentless 1.0 scaffold, and models are given 5 tries to generate a candidate patch. We compute pass@1 by averaging the per-instance pass rates of all samples that generated a valid (i.e., non-empty) patch. If the model fails to generate a valid patch on every attempt, that instance is considered incorrect.
> *• o3-mini (tools)*, which uses an internal tool scaffold designed for efficient iterative file editing and debugging. In this setting, we average over 4 tries per instance to compute pass@1 (unlike Agentless, the error rate does not significantly impact results). o3-mini (tools) was evaluated using a non-final checkpoint that differs slightly from the o3-mini launch candidate.
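For what it's worth, my reading of that Agentless pass@1 description is roughly the following (a sketch of the metric as described, not OpenAI's actual evaluation code):

```python
# Sketch of the Agentless pass@1 averaging described above (my reading,
# not OpenAI's code). Each instance has several sampled attempts; each
# attempt either produced a valid patch (which then passed or failed) or not.
def pass_at_1(instances):
    scores = []
    for attempts in instances:  # attempts: list of (valid_patch, passed) booleans
        valid = [passed for valid_patch, passed in attempts if valid_patch]
        # No valid patch on any attempt -> the instance counts as incorrect.
        scores.append(sum(valid) / len(valid) if valid else 0.0)
    return sum(scores) / len(scores)
```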
So am I to understand that they used their internal tooling scaffold for the o3-mini (tools) results only? Because if so, I really don't like that.
While it's nonetheless impressive that o3-mini scored 61% on SWE-bench when combined with their tool scaffolding, the Agentless comparison with other models is less impressive: 40% vs. 35% against o1-mini, if you look at the graph on page 28 of their system card PDF (https://cdn.openai.com/o3-mini-system-card.pdf).
It just feels like data manipulation to suggest that o3-mini is much more performant than past models. A fairer picture would still show a performance improvement, but it would look less exciting and more incremental.
Of course the real improvement is cost, but still, it kind of rubs me the wrong way.
YC usually says “a startup is the point in your life where tricks stop working”.
Sam Altman is somehow finding this out now, the hard way.
Most paying customers will find out within minutes whether the models can serve their use case; a benchmark isn't going to change that, except as media manipulation (and even that doesn't work all that well, since journalists don't really know what they are saying and readers can tell).
My guess is this cheap mini-model is coming out now because DeepSeek very recently shook the stock market with its low price and relatively good performance.
> including with the open-source Agentless scaffold (39%) and an internal tools scaffold (61%), see our system card.
I have no idea what an "internal tools scaffold" is, but the graph on the card that they link directly to labels that result "o3-mini (tools)", while the blog post compares it against the other models.
Instead of just generating a patch (copilot style), it generates the patch, applies the patch, runs the code, and then iterates based on the execution output.
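I'd guess the loop looks something like the sketch below (purely illustrative; the actual internal scaffold isn't public, and all of these names are made up):

```python
# Illustrative sketch of an iterate-on-execution-output scaffold.
# All names here (model.generate_patch, repo.apply, repo.run_tests,
# repo.revert) are hypothetical, not OpenAI's actual tooling.
def solve(issue, repo, model, max_steps=10):
    history = [issue]
    for _ in range(max_steps):
        patch = model.generate_patch(history)   # propose an edit
        repo.apply(patch)                       # apply it to the checkout
        result = repo.run_tests()               # actually run the code
        if result.passed:
            return patch                        # done: tests pass
        history.append(result.output)           # feed execution output back in
        repo.revert(patch)                      # try again from a clean state
    return None
```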
Anyone can write very fast software if you don't mind it sometimes crashing or having weird bugs.
Why do people try to meme as if AI is different? It has unexpected outputs sometimes, getting it to not do that is 50% "more alignment" and 50% "hallucinate less".
Just today I saw someone get the Amazon bot to roleplay furry erotica. Funny, sure, but it's still obviously a bug that a *sales bot* would do that.
And given these models do actually get stuff wrong, is it really incorrect for them to refuse to help with things that might be dangerous if the user isn't already skilled, like Claude in this story about DIY fusion? https://www.corememory.com/p/a-young-man-used-ai-to-build-a-...
If somebody wants their Amazon bot to role play as an erotic furry, that’s up to them, right? Who cares. It is working as intended if it keeps them going back to the site and buying things I guess.
I don’t know why somebody would want that, seems annoying. But I also don’t expect people to explain why they do this kind of stuff.
Who determines who gets access to what information? The OpenAI board? Sam? What qualifies as dangerous information? Maybe it’s dangerous to allow the model to answer questions about a person. What happens when limiting information becomes a service you can sell? For the right price anything can become too dangerous for the average person to know about.
The reports are public, and if you don't feel like reading them because they're too long and thorough in their explanations of what and why, you can always put them into an AI and ask it to summarise them for you.
OpenAI is allowed to unilaterally limit the capability of their own models, just like any other software company can unilaterally limit the performance of their own software.
And they still are even when they're just blatantly wrong or simply lazy; it's not like people complain about Google "lobotomising" their web browsers for no longer supporting Flash or Java applets.
They are implying the release was rushed and that they had to reduce the functionality of the model in order to make sure it didn't teach people how to make dirty bombs.
The problem is that they don't make the LLM better at instruction following; they just make it unable to produce furry erotica even if Amazon wants it to.
The advice I've always been given in (admittedly: small) business startup sessions was "focus on quality rather than price because someone will always undercut you on price".
The models are in a constant race on both price and quality, but right now they're so cheap that paying for the best makes sense for any "creative" task (like writing software, even if only to reduce the number of bugs the human code reviewer needs to fix), while price sensitivity only matters for grunt-work classification tasks (such as "based on comments, what is the public response to this policy?").
1. Suppose we have a `./nixos_binary_program_with_glibc-newer` compiled on a NixOS machine against bleeding edge `glibc-newer`.
2. `./nixos_binary_program_with_glibc-newer` will have the `/nix/store/glibc-newer/ld-linux.so` path hardcoded into its ELF interpreter field, which is used when the program launches to find all of the program's shared libraries, and so forth. (And this is a fact that `ldd` will obfuscate!)
3. When `./nixos_binary_program_with_glibc-newer` is distributed to machines which use `glibc-older` instead of `glibc-newer`, the hardcoded `ld-linux.so` path from (2) won't exist, leading to a launch error.
4. (3) will also happen on machines which don't use nix in the first place.
=======Will's Solution========
1. Use `patchelf` to hardcode a standard FHS `ld-linux.so` location into `nixos_binary_program_with_glibc-newer`'s ELF interpreter field (using e.g. `/lib64/ld-linux-x86-64.so.2` as the path)
2. Use a metaloader to launch `nixos_binary_program_with_glibc-newer` with an augmented `RPATH` which has a bunch of different `/nix/store/glibc-newer` paths, so that nix machines can find a suitable `ld-linux.so` to launch the program with.
This will make `nixos_binary_program_with_glibc-newer` work on any machine, including both non-nix machines and nix machines (which might be running older versions of glibc by default)!
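A rough sketch of what those two steps might look like (assumed commands and paths; this isn't Will's actual tooling, and the `/nix/store` glob is a guess):

```python
#!/usr/bin/env python3
# Hypothetical sketch of the approach above, not the actual tooling.
import glob
import os
import subprocess
import sys

BINARY = "./nixos_binary_program_with_glibc-newer"
FHS_LOADER = "/lib64/ld-linux-x86-64.so.2"

# Step 1: hardcode the standard FHS loader path into the ELF interpreter
# field, so ordinary (non-nix) machines can launch the binary directly.
subprocess.run(["patchelf", "--set-interpreter", FHS_LOADER, BINARY], check=True)

# Step 2 ("metaloader"): on nix machines the FHS path may not exist, so
# search /nix/store for a glibc loader (glob pattern is a guess) and launch
# the binary through it, pointing it at that glibc's libraries as well.
if os.path.exists(FHS_LOADER):
    os.execv(BINARY, [BINARY])
else:
    loaders = glob.glob("/nix/store/*glibc*/lib/ld-linux-x86-64.so.2")
    if not loaders:
        sys.exit("no ld-linux.so found in /nix/store")
    env = dict(os.environ, LD_LIBRARY_PATH=os.path.dirname(loaders[0]))
    os.execve(loaders[0], [loaders[0], BINARY], env)
```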
Assuming a 45 liter tank you'd spend €84.60. Mind that a LOT of that is tax. If you strip off Value Added Tax -- equivalent to sales tax if I understand correctly, though a fair bit higher -- you'd pay €70.50. That still leaves excise tax, which is levied specifically on goods the government wants to discourage. Cigarettes, Alcohol, Gas. Without that you'd pay €34.99.
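Just deriving the per-litre numbers implied by those figures (45 liter tank assumed):

```python
# Per-litre breakdown implied by the figures above (45 L tank assumed).
full, ex_vat, ex_excise, litres = 84.60, 70.50, 34.99, 45
print(full / litres)                    # ~1.88 EUR/L at the pump
print((full - ex_vat) / ex_vat)         # ~0.20 -> implied VAT rate of ~20%
print((ex_vat - ex_excise) / litres)    # ~0.79 EUR/L of excise duty
print(ex_excise / litres)               # ~0.78 EUR/L before any tax
```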
I'm not sure how the tax breakdown works in Texas, but I'd not be surprised if it'd be one of the main drivers of the price difference.
Understood. We opted to make tradeoffs that optimize for VR pixel density over headset lightness (since we think pixel density/text readability is ultimately a bigger barrier for long sessions than weight).
With that said, our compute pack itself is detachable, so it can be offloaded from the head.
[1] I know we're taking forever D: But we intend for this to be a way to release an intermediate product (which we've been making anyway for our full headsets).
[2] Our next blog update will be about this. Here's a video preview: https://youtube.com/shorts/Y67D8DkqScU?si=LpdSpjmfGn2k2rxP