Thief of Time by Terry Pratchett has a great minor bit about characters who name themselves after colors running out of human-made labels, so they have to get increasingly esoteric with the names.
It's fun to see that visualized.
In the medium term, I almost feel less productive using modern GPT-5/Claude Sonnet 4 for software dev than prior models, precisely because they are more hands off and less supervised.
Because they generate so much code that often passes initial tests, looks reasonable, and fails in nonhuman ways, in a pretty opinionated style tbh.
I have less context (and need to spend much more effort and supervision time to get up to speed) when fixing, refactoring, and integrating the solutions than if I were only trusting short, few-line windows at a time.
> I almost feel less productive using modern GPT-5/Claude Sonnet 4 for software dev than prior models, precisely because they are more hands off and less supervised.
That is because you are trained in the old way of writing code: manually crafting software line by line, slowly, deliberately, thoughtfully. New generations of developers will not use the same workflow as you, just like you do not use the same workflow as folks who programmed punch cards.
No, it's because reading code is slower than writing it.
The only way these tools can possibly be faster for non-trivial work is if you don't give enough of a shit about the output to even read it. And if you can do that and still achieve your goal, chances are your goal wasn't that difficult to begin with.
That's why we're now consistently measuring individuals to be slower using these tools even though many of them feel faster.
> No, it's because reading code is slower than writing it.
This feels wrong to me, unless we qualify the statement with: "...if you want the exact same level of understanding of it."
Otherwise, the bottleneck in development would be pull/merge request review, not writing the code in the first place. But it's almost always the other way around: someone works on a feature for 3-5 days, and the pull/merge request doesn't spend anywhere near that long in active review. I don't think you need the same intricate understanding of some code when reviewing it.
It's quite similar with the AI stuff: I often nitpick and want to rework certain bits of code that AI generates (or fix obvious issues with it), but using it for the first version/draft is still easier than approaching the issue from zero. Ofc AI won't make you consistently better, but it will remove some of the friction and reduce the cognitive load.
I have measured it myself within my organization, and I know many peers across companies who have done the same. No, I cannot share the data (I wish I could, truly), but I expect that we will begin to see many of these types of studies emerge before long.
The tools are absolutely useful, but they need to be applied in the right places, and they are decidedly not a silver bullet or general-purpose software engineering tool in the manner they're being billed at present. We still use them despite our findings, but we use them judiciously and where they actually help.
Useful and interesting but likely still dangerous in production without connecting to formal verification tools.
I know o3 is far from state of the art these days, but it's great at finding relevant literature and suggesting inequalities to consider. In actual proofs, though, it can produce convincing-looking statements that are false if you follow the details, or even just the algebra, carefully.
100%. o3 has a strong bias towards "write something that looks like a formal argument that appears to answer the question" over writing something sound.
I gave it a bunch of recent, answered MathOverflow questions - graduate-level maths queries. Sometimes it would get demonstrably the wrong answer, but it would not be easy to see where it had gone wrong (e.g. some mistake in a morass of algebra). A wrong but convincing argument is the last thing you want!
For people who are interested, Kokkos (a C++ library for writing portable kernels) also has a naming scheme for hierarchical parallelism. They use ThreadTeam, Thread (for individual threads within a group), and ThreadVector (for per-thread SIMD).
Just commenting to share; personally I have no naming preference, but the hierarchical abstractions in general are incredibly useful.
I can't reproduce this, possibly it's another environment error or the problem has been fixed in the versions I am using. (uv 0.7.11, zsh 5.9, python 3.13.5, MacOS 15.3.1)
Yep! Optimizing (solving in infinite dimensions) and then discretizing onto a finite basis has typically led to much better and more stable methods than a discretize-then-optimize approach.
Time-scale calculus is a pretty niche theoretical field that looks at blending the analysis of difference and differential equations, but I'm not aware of any algorithmic advances based on it.
Does anyone know why they added minibatch advantage normalization (or when it can be useful)?
The paper they cite, "What matters in on-policy RL", claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch) normalization doesn't seem theoretically motivated from the standpoint of convergence to the optimal policy.
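(For anyone unfamiliar with the term, here's a minimal sketch of what minibatch advantage normalization usually looks like in PPO-style training loops; the names are illustrative, not taken from the paper's code.)

```python
import numpy as np

def normalize_advantages(adv, eps=1e-8):
    """Standardize advantages within the current minibatch.

    The statistics come from the minibatch itself (not the full batch or a
    running estimate), which is the variant being questioned above.
    """
    adv = np.asarray(adv, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)

# Hypothetical use inside a PPO-style update (clipped_surrogate is a stand-in):
# for mb in minibatches:
#     adv = normalize_advantages(mb["advantages"])
#     loss = clipped_surrogate(mb["logp_old"], logp_new, adv)
```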
I am not sure why HN has mostly LANL posts.
Otherwise, though, it is a combination of things. Machine learning applications for NatSec & fundamental research have become more important (see FASST, proposed last year), the current political environment makes AI funding and applications more secure and easier to chase, and some of this is work that has already been going on but is getting greater publicity for both of those reasons.
what are the problems you're talking about? your references seem to refer to reproducing scientific publications, dependency issues, and cell execution ordering.
this project appears to be intended for operational documentation / living runbooks. it doesn't really seem like the same use case.
Agreed. The problem with reproducing Jupyter notebooks in academia is that someone thought a Jupyter notebook is a way to convey information from one person to another. They're an awful model for that.
As an on-the-fly debugging tool, they're great: you get a REPL that isn't actively painful to use, a history (roughly, since the state is live and every cell is not run every time) of commands run, and visualization at key points in the program to check as you go that your assumptions are sound.
agreed - we actually have a dependency system in the works too!
you can define and declare ordering with a dependency specification on the edges of the graph (i.e. A must run before B, but B can run as often as you'd like within 10 mins of A)
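purely as an illustration of the idea (this is not the project's actual API), an edge spec like that could be modeled as:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Edge:
    upstream: str          # step that must have run first
    downstream: str        # step that depends on it
    valid_for: timedelta   # how long the upstream run counts as "fresh"

# "A must run before B, but B can run as often as you'd like within 10 mins of A"
edges = [Edge("A", "B", timedelta(minutes=10))]

def can_run(step, last_run, edges, now=None):
    """Return True if every upstream of `step` ran recently enough."""
    now = now or datetime.now()
    return all(
        e.upstream in last_run and now - last_run[e.upstream] <= e.valid_for
        for e in edges
        if e.downstream == step
    )
```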
There of course should be a way to override the dependency, by explicitly pressing a big scary "[I know what I'm doing]" button.
Another thing is that you'll need branches. As in:
- Run `foo bar baz`
- If it succeeds, run `foo quux`; else run `rm -rf ./foo/bar` and rerun the previous command with the `--force` option.
- `ls ./foo/bar/buur` and make certain it exists.
Different branches can be separated visually; one can be collapsed if another is taken.
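To make the branching idea concrete, a hypothetical sketch (using the placeholder commands from the steps above, not any real tool's API) might look like:

```python
import subprocess

def run(cmd):
    """Run a shell command; return True if it exits with code 0."""
    return subprocess.run(cmd, shell=True).returncode == 0

# Branch step: try the happy path, fall back to the recovery branch.
if run("foo bar baz"):
    run("foo quux")
else:
    run("rm -rf ./foo/bar")
    run("foo bar baz --force")   # rerun the previous command with --force

# Verification step: make certain the path exists before moving on.
if not run("test -e ./foo/bar/buur"):
    raise SystemExit("./foo/bar/buur is missing")
```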
Writing robust runbooks is not that easy. But I love the idea of mixing the explanatory text and various types of commands together.
The use case this addresses is 'ad-hoc activities must be performed without being totally chaotic'.
Obviously a nice one-click/trigger-based CI/CD deployment pipeline is lovely, but uh, this is the real world. There are plenty of cases where that's simply not possible, or not worth the effort to set up.
I think this is great; if I have one suggestion it would just be integrated logging so there's an immutable shared record of what was actually done as well. I would love to be able to see that Bob started the 'recover user profile because db sync error' runbook but didn't finish running it, and exactly when that happened.
If you think it's a terrible idea, then uh, what's your suggestion?
I'm pretty tired of copy-pasting commands from confluence. I think that's, I dunno, unambiguously terrible, and depressingly common.
One-time scripts executed in a privileged remote container also work, but at the end of the day those scripts tend to be specific and have to be invoked with custom arguments, which, guess what, usually turn up as a sequence of operations in a runbook: query db for user id (copy-paste SQL) -> run script with id (copy-paste to terminal) -> query db to check it worked (copy-paste SQL) -> trigger notification workflow with user id if it did (log in to X and click on button Y), etc.
I'm not against this notebook style, I have runbooks in Jupyter notebooks.
I just think it's pretty easy to do things like start a flow back up halfway through the book and not fix some underlying ordering issues.
With scripts, which you tend to have to run top to bottom, you end up having to be more diligent about making sure the initial steps are still OK, because on every test you tend to run everything. Notebook-style environments favor running things piecemeal. Also very helpful! It introduces a much smaller problem in the process of solving the larger issue of making it easier to do this kind of work in the first place.
Literate programming really needs the ability to reorder, otherwise it’s just sparkling notebooks. (Except for Haskell, which is order-independent enough as it is that the distinction rarely matters.)
“A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. *All in a modern, AI-native editor.*”
Why does it need to be in a “modern, AI-native editor”?