I suspect it's just circumstantial - two different design approaches, each with its own advantages and disadvantages.
IMHO the bigger issue with NaN-boxing is that on 64-bit systems it relies on the address space only needing <50 bits or so, as the discriminator is stored on the high bits. It's ok for now when virtual address spaces typically only need 48 bits of representation, but that's already starting to slip with newer systems.
On the other hand, I love the fact that NaN-boxing basically lets you eliminate all heap allocations for doubles.
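For concreteness, here's a minimal sketch of the scheme (an illustrative layout, not any particular engine's): doubles are stored as their own bit pattern, and pointers are hidden inside the unused NaN payload space, squeezed into the low 48 bits.

```rust
// Minimal NaN-boxing sketch (illustrative layout, not any particular engine's).
// A quiet NaN has all exponent bits set plus the top mantissa bit; hardware only
// ever produces that one payload, so the rest of the NaN payload space is free
// to carry a tag and a 48-bit pointer.

const QNAN: u64 = 0x7ff8_0000_0000_0000;    // quiet-NaN prefix
const TAG_PTR: u64 = 0x0001_0000_0000_0000; // example tag bit above the 48-bit payload

#[derive(Debug)]
enum Unboxed {
    Double(f64),
    Ptr(usize),
}

// Doubles are immediates: their own bit pattern is the boxed value.
fn box_double(x: f64) -> u64 {
    x.to_bits()
}

// Pointers only fit if the address needs <= 48 bits -- exactly the assumption
// the parent comment is worried about.
fn box_ptr(p: usize) -> u64 {
    assert!((p as u64) < (1u64 << 48));
    QNAN | TAG_PTR | p as u64
}

fn unbox(v: u64) -> Unboxed {
    if v & QNAN == QNAN && v & TAG_PTR != 0 {
        Unboxed::Ptr((v & 0x0000_ffff_ffff_ffff) as usize)
    } else {
        Unboxed::Double(f64::from_bits(v)) // includes real NaNs and infinities
    }
}
```

(A real engine also needs tags for ints, booleans, etc., but the pointer case is where the address-space assumption bites.)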
I actually wrote a small article a while back on a hybrid approach called Ex-boxing (exponent boxing), which tries to get at the best of both worlds: decouple the boxing representation from virtual address significant bits, and also represent most (almost all) doubles that show up at runtime as immediates: https://medium.com/@kannanvijayan/exboxing-bridging-the-divi...
> IMHO the bigger issue with NaN-boxing is that on 64-bit systems it relies on the address space only needing <50 bits or so, as the discriminator is stored on the high bits.
Is this right? You get 51 tag bits, of which you must use one to distinguish pointer-to-object from other uses of the tag bits (assuming Huffman-ish coding of tags). But objects are presumably a minimum of 8-byte sized and aligned, and on most platforms I assume they'd be 16-byte sized and aligned, which means the low three (four) bits of the address are implicit, giving 53 (54) bit object addresses. This is quite a few years of runway...
There's a bit of time yes, but for an engine that relies on this format (e.g. spidermonkey), the assumptions associated with the value boxing format would have leaked into the codebase all over the place. It's the kind of thing that's far less painful to take care of before you're forced to than after.
But fair point on the aligned pointers - that would give you some free bits to keep using, but it gets ugly.
You're right about the 51 bits - I always get mixed up about whether it's 12 bits of exponent, or the 12 includes the sign. The point is that it puts a hard constraint on a pretty large number of a pointer's high bits being free, as opposed to an alignment requirement for low-bit tagging, which will never run out of bits.
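For contrast, low-bit tagging looks roughly like this (tag assignments made up for illustration): with 8-byte-aligned objects the bottom three bits of every pointer are known-zero, so they can carry the tag no matter how wide the address space grows.

```rust
// Low-bit tagging sketch: 8-byte-aligned heap objects leave the bottom 3 bits
// of every object address zero, so those bits can carry a tag.
// Tag values below are made up for illustration.

const TAG_MASK: usize = 0b111;
const TAG_OBJECT: usize = 0b000;    // aligned pointer, usable as-is
const TAG_SMALL_INT: usize = 0b001; // payload lives in the upper bits

fn tag_small_int(n: isize) -> usize {
    ((n as usize) << 3) | TAG_SMALL_INT
}

fn describe(v: usize) -> &'static str {
    match v & TAG_MASK {
        TAG_OBJECT => "object pointer (use v directly)",
        TAG_SMALL_INT => "small integer (value is v >> 3)",
        _ => "some other immediate",
    }
}
```

The trade-off, as noted above, is that a full 64-bit double no longer fits as an immediate - which is exactly what NaN-boxing buys you.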
I think this is an attempt to try to enrich the locality model in transformers.
One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.
This is obviously not powerful enough to express non-linear relationships - like graph relationships.
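Roughly, the only structural signal an attention score sees from positions is something like a bias indexed by the 1-D offset between two tokens - a toy sketch (relative-bias flavour; details vary by architecture):

```rust
// Toy sketch: the positional signal at an attention site is a function of the
// linear sequence distance |i - j| only, so relationships that aren't a
// function of that 1-D offset (e.g. graph edges) can't be expressed directly.
fn attention_score(q_dot_k: f32, i: usize, j: usize, rel_bias: &[f32]) -> f32 {
    let dist = i.abs_diff(j).min(rel_bias.len() - 1); // clamp long distances
    q_dot_k + rel_bias[dist]
}
```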
This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.
Once upon a time, back when I was a language modeling researcher, I built and finetuned a big (at the time - about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].
As this model allows for mix-and-match of various contexts, one thing I did was to use a word-sorted context. This effectively transforms a position-based context into a word-set based context, where "you and me", "me and you" and "and me you" are the same.
This allowed for longer contexts and better prediction.
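The transformation itself is trivial - roughly this (a toy sketch, not the actual model code):

```rust
// Order-insensitive context key: sort the context words, so "you and me",
// "me and you" and "and me you" all collapse to the same feature.
fn context_key(words: &[&str]) -> String {
    let mut sorted: Vec<&str> = words.to_vec();
    sorted.sort_unstable();
    sorted.join(" ")
}

fn main() {
    assert_eq!(context_key(&["you", "and", "me"]),
               context_key(&["me", "and", "you"])); // both become "and me you"
}
```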
I've saved it to look at in the future. I also remembered Kristina Toutanova's name (your editor). Looking up recent publications, she's done interesting work on analyzing pretraining mixtures.
Well, in your work, what benefit did you get from it? And do you think it would be beneficial today combined with modern techniques? Or obsoleted by other techniques?
(I ask because I'm finding many old techniques are still good or could be mixed with deep learning.)
It was not bad, but I had trouble scaling it to the 1B set, mostly because I didn't have enough time.
I hold the same mindset as you, that many old techniques are misunderstood or underapplied. For example, decision trees, in my experiments, allow for bits-per-byte comparable to LSTM (lstm-compress, or LSTM in the nncp experiments): https://github.com/thesz/codeta
Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system by bootstrapping on top of the barebones one.
> This is obviously not powerful enough to express non-linear relationships - like graph relationships.
The distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode -- see my previous paper on spectral indexing for vector databases for a complete roll-out.
Writing a GC in rust without just dropping the whole business into unsafe is really annoying.
Jason Orendorff has an implementation of a GC in Rust called "cell-gc", which is the only one I've seen so far that seems to "get" how to marry Rust to the requirements of a GC implementation: https://github.com/jorendorff/cell-gc
Still has a lot of unsafe code and macro helpers, but it's laid out well and documented pretty well. Not sure if you've run across it yet.
I'd have to take a contrary view on that. It'll take some time for the technologies to be developed, but ultimately managed JIT compilation has the potential to exceed native compiled speeds. It'll be a fun journey getting there though.
The initial order-of-magnitude jump in perf that JITs provided took us from the 2-5x overhead for managed runtimes down to some (1 + delta)x. That was driven by runtime type inference combined with a type-aware JIT compiler.
I expect that there's another significant, but smaller perf jump that we haven't really plumbed out - mostly to be gained from dynamic _value_ inference that's sensitive to _transient_ meta-stability in values flowing through the program.
Basically you can gather actual values flowing through code at runtime, look for patterns, and then inline / type-specialize those by deriving runtime types that are _tighter_ than the annotated types.
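As a toy sketch of what I mean (in Rust, and not how any shipping JIT actually does it): profile the operands flowing through a hot site, and if one value dominates, emit a guarded fast path specialized on it, with the generic path as the deopt fallback.

```rust
use std::collections::HashMap;

// Toy value-profiling sketch: record operands seen at a hot site during
// warm-up, then specialize on a dominant value behind a guard.
struct ValueProfile {
    counts: HashMap<i64, u64>,
    samples: u64,
}

impl ValueProfile {
    fn new() -> Self {
        ValueProfile { counts: HashMap::new(), samples: 0 }
    }

    fn record(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
        self.samples += 1;
    }

    // "Specialize" only if one value accounts for >90% of samples.
    fn dominant(&self) -> Option<i64> {
        self.counts
            .iter()
            .find(|&(_, &c)| c * 10 > self.samples * 9)
            .map(|(&v, _)| v)
    }
}

fn generic_scale(x: i64, factor: i64) -> i64 {
    x * factor
}

fn scale_at_hot_site(x: i64, factor: i64, profile: &ValueProfile) -> i64 {
    match profile.dominant() {
        // Guarded fast path: with factor pinned to 8, the multiply folds to a shift.
        Some(8) if factor == 8 => x << 3,
        _ => generic_scale(x, factor), // guard failed -> "deopt" to the generic path
    }
}
```

The interesting part is that "factor is almost always 8" may not be derivable from the static types at all - it's only visible in the runtime value stream, and it can change over time, which is why the guard and the fallback matter.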
I think there's a reasonable amount of juice left in combining those techniques with partial specialization and JIT compilation, and that should get us over the hump from "slightly slower than native" to "slightly faster than native".
I get it's an outlier viewpoint though. Whenever I hear "managed jitcode will never be as fast as native", I interpret that as a friendly bet :)
> JIT compilation has the potential to exceed native compiled speeds
The battlecry of Java developers riding their tortoises.
Don’t we have decades of real-world experience showing native code almost always performs better?
For most things it doesn’t matter, but it always rubs me the wrong way when people mention this about JIT since it almost never works that way in the real world (you can look at web framework benchmarks as an easy example)
It's not that surprising to people who are old enough to have lived through the "reality" of "interpreted languages will never be faster than about 2x compiled languages".
The idea that an absurdly dynamic language like JS, where all objects are arbitrary property bags with prototype chains that are runtime-mutable, would execute within 2x of raw native performance was just a matter-of-fact impossibility.
Until it wasn't. And the technology reason it ended up happening was research that was done in the 80s.
It's not surprising to me that it hasn't happened yet. This stuff is not easy to engineer and implement. Even the research isn't really there yet. Most of the modern dynamic-language JIT ideas, which came to the fore in the mid-2000s, were direct adaptations of research work on Self from about two decades prior.
Dynamic runtime optimization isn't too hot in research right now, and it never was to be honest. Most of the language theory folks tend to lean more in the type theory direction.
The industry attention too has shifted away. Browsers were cutting edge a while back and there was a lot of investment in core research tech associated with that, but that's shifting more to the AI space now.
Overall the market value prop and the landscape for it just doesn't quite exist yet. Hard things are hard.
You nailed it -- the tech enabling JS to match native speed was Self research from the 80s, adapted two decades later. Let me fill in some specifics from people whose papers I highly recommend, and who I've asked questions of and had interesting discussions with!
Vanessa Freudenberg [1], Craig Latta [2], Dave Ungar [3], Dan Ingalls, and Alan Kay had some great historical and fresh insights. Vanessa passed recently -- here's a thread where we discussed these exact issues:
Vanessa had this exactly right. I asked her what she thought of using WASM with its new GC support for her SqueakJS [1] Smalltalk VM.
Everyone keeps asking why we don't just target WebAssembly instead of JavaScript. Vanessa's answer -- backed by real systems, not thought experiments -- was: why would you throw away the best dynamic runtime ever built?
To understand why, you need to know where V8 came from -- and it's not where JavaScript came from.
David Ungar and Randall B. Smith created Self [3] in 1986. Self was radical, but the radicalism was in service of simplicity: no classes, just objects with slots. Objects delegate to parent objects -- multiple parents, dynamically added and removed at runtime. That's it.
The Self team -- Ungar, Craig Chambers, Urs Hoelzle, Lars Bak -- invented most of what makes dynamic languages fast: maps (hidden classes), polymorphic inline caches, adaptive optimization, dynamic deoptimization [4], on-stack replacement. Hoelzle's 1992 deoptimization paper blew my mind -- they delivered simplicity AND performance AND debugging.
That team built Strongtalk [5] (high-performance Smalltalk), got acquired by Sun and built HotSpot (why Java got fast), then Lars Bak went to Google and built V8 [6] (why JavaScript got fast). Same playbook: hidden classes, inline caching, tiered compilation. Self's legacy is inside every browser engine.
Brendan Eich claims JavaScript was inspired by Self. This is an exaggeration based on a deep misunderstanding that borders on insult. The whole point of Self was simplicity -- objects with slots, multiple parents, dynamic delegation, everything just another object.
JavaScript took "prototypes" and made them harder than classes: __proto__ vs .prototype (two different things that sound the same), constructor functions you must call with "new" (forget it and "this" binds wrong -- silent corruption), only one constructor per prototype, single inheritance only. And of course == -- type coercion so broken you need a separate === operator to get actual equality. Brendan has a pattern of not understanding equality.
The ES6 "class" syntax was basically an admission that the prototype model was too confusing for anyone to use correctly. They bolted classes back on top -- but it's just syntax sugar over the same broken constructor/prototype mess underneath. Twenty years to arrive back at what Smalltalk had in 1980, except worse.
Self's simplicity was the point. JavaScript's prototype system is more complicated than classes, not less. It's prototype theater. The engines are brilliant -- Self's legacy. The language design fumbled the thing it claimed to borrow.
Vanessa Freudenberg worked for over two decades on live, self-supporting systems [9]. She contributed to Squeak EToys, Scratch, and Lively. She was co-founder of Croquet Corp and principal engineer of the Teatime client/server architecture that makes Croquet's replicated computation work. She brought Alan Kay's vision of computing into browsers and multiplayer worlds.
SqueakJS [7] was her masterpiece -- a bit-compatible Squeak/Smalltalk VM written entirely in JavaScript. Not a port, not a subset -- the real thing, running in your browser, with the image, the debugger, the inspector, live all the way down. It received the Dynamic Languages Symposium Most Notable Paper Award in 2024, ten years after publication [1].
The genius of her approach was the garbage collection integration. It amazed me how she pulled a rabbit out of a hat -- representing Squeak objects as plain JavaScript objects and cooperating with the host GC instead of fighting it. Most VM implementations end up with two garbage collectors in a knife fight over the heap. She made them cooperate through a hybrid scheme that allowed Squeak object enumeration without a dedicated object table. No dueling collectors. Just leverage the machinery you've already paid for.
But it wasn't just technical cleverness -- it was philosophy. She wrote:
"I just love coding and debugging in a dynamic high-level language. The only thing we could potentially gain from WASM is speed, but we would lose a lot in readability, flexibility, and to be honest, fun."
"I'd much rather make the SqueakJS JIT produce code that the JavaScript JIT can optimize well. That would potentially give us more speed than even WASM."
Her guiding principle: do as little as necessary to leverage the enormous engineering achievements in modern JS runtimes [8]. Structure your generated code so the host JIT can optimize it. Don't fight the platform -- ride it.
She was clear-eyed about WASM: yes, it helps for tight inner loops like BitBlt. But for the VM as a whole? You gain some speed and lose readability, flexibility, debuggability, and joy. Bad trade.
This wasn't conservatism. It was confidence.
Vanessa understood that JS-the-engine isn't the enemy -- it's the substrate. Work with it instead of against it, and you can go faster than "native" while keeping the system alive and humane. Keep the debugger working. Keep the image snapshotable. Keep programming joyful. Vanessa knew that, and proved it!
Yeah, I've heard this my whole career, and while it sounds great, it's been long enough that we should be able to list some major examples by now.
What are the real world chances that a) one's compiled code benefits strongly from runtime data flow analysis AND b) no one did that analysis at the compilation stage?
Some sort of crazy off label use is the only situation I think qualifies and that's not enough.
Compiled Lua vs LuaJIT is a major example imho, but maybe it's not especially pertinent given the looseness of the Lua language. I do think it demonstrates that having a tighter type system at runtime than at compile time (which can in turn yield real performance benefits) is a sound concept, however.
The major Javascript engines already have the concept of a type system that applies at runtime. Their JITs will learn the 'shapes' of objects that commonly go through hot-path functions and will JIT against those with appropriate bailout paths to slower dynamic implementations in case a value with an unexpected 'shape' ends up being used instead.
There's a lot of lore you pick up with Javascript when you start getting into serious optimization with it; and one of the first things you learn in that area is to avoid changing the shapes of your objects because it invalidates JIT assumptions and results in your code running slower -- even though it's 100% valid Javascript.
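Very roughly, a "shape" is a layout table that structurally identical objects share, and the optimized code guards on it - here's a toy model (in Rust for illustration, not how any engine actually lays it out):

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Toy model of "shapes" (hidden classes): objects with the same property
// layout share one Shape, so a hot property access can be compiled into a
// shape check plus a fixed-offset slot load instead of a hash lookup.
struct Shape {
    slot_of: HashMap<String, usize>, // property name -> slot index
}

struct Object {
    shape: Rc<Shape>,
    slots: Vec<f64>,
}

// What a monomorphic inline cache conceptually does for `obj.x`:
fn load_x_fast(obj: &Object, cached_shape: &Rc<Shape>, cached_slot: usize) -> Option<f64> {
    if Rc::ptr_eq(&obj.shape, cached_shape) {
        Some(obj.slots[cached_slot]) // guard passed: direct slot load
    } else {
        None // shape differs: fall back to the generic lookup / recompile
    }
}

// The generic path the fast path falls back to:
fn load_generic(obj: &Object, name: &str) -> Option<f64> {
    obj.shape.slot_of.get(name).map(|&i| obj.slots[i])
}
```

Adding or deleting a property transitions the object to a different Shape, which is why mutating object layouts in hot code defeats these caches even though it's perfectly valid Javascript.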
Totally agree on JS, but it doesn't have the same easy same-language comparison that you get from compiled Lua vs LuaJIT. Although I suppose you could pre-compile JavaScript to a binary with e.g. QuickJS, I don't think that's as apples-to-apples a comparison as compiled Lua vs LuaJIT.
Any optimizations discovered at runtime by a JIT can also be applied to precompiled code. The precompiled code is then not spending runtime cycles looking for patterns, or only doing so in the minimally necessary way. So for projects which are maximally sensitive to performance, native will always be capable of outperforming JIT.
It's then just a matter of how your team values runtime performance vs other considerations such as workflow, binary portability, etc. Virtually all projects have an acceptable range of these competing values, which is where JIT shines, in giving you almost all of the performance with much better dev economics.
I think you can capture that constraint as "anything that requires finely deterministic high performance is out of reach of JIT-compiled outputs".
Obviously JITting means you'll have a compiler executing sometimes along with the program which implies a runtime by construction, and some notion of warmup to get to a steady state.
Where I think there's probably untapped opportunity is in identifying these meta-stable situations in program execution. My expectation is that there are execution "modes" that cluster together more finely than static typing would allow you to infer. This would apply to runtimes like wasm too - where the modes of execution would be characterized by the actual clusters of numeric values flowing to different code locations and influencing different code-paths to pick different control flows.
You're right that, on the balance of things, trying to, say, allocate registers at runtime will necessarily allow for less optimization scope than doing it ahead of time.
But, if you can be clever enough to identify, at runtime, preferred code-paths with higher resolution than what (generic) PGO allows (because now you can respond to temporal changes in those code-path profiles), then you can actually eliminate entire codepaths from the compiler's consideration. That tends to greatly affect the register pressure (for the better).
It might be interesting just to profile some wasm executions of common programs to see if there are transient clusterings of control-flow paths that manifest during execution. It'd be a fun exercise...
The American working class doesn't like to acknowledge its own existence or assert its self-worth. There's no real self identity for that class in America.
In fact, a huge number of the people that are in that class would resent you for classifying them in this way. And the same is true for those in the upper middle class, or elites.
Secondly, trying to scope the xenophobia problem to just the working class is itself a bit of a misdirection. Plenty of that comes from the swaths of upper middle class white collar folks. And plenty of it comes from second gen immigrants who are eager to be counted among the natives.
The xenophobia _is_ the substitute American culture provides as a filler for the vacuum left by the lack of any sort of class identity. Everybody falls over themselves demonstrating how they can be "more American" in one way or the other. Who is a "real" American, what their qualities are, whether this particular thing or that particular thing is more or less American, etc. etc.
It's an alternate focus to direct all that shame the culture demands from the poor.
A considerable part of this is the fact that in a society where utilizing these programs is stigmatized to the degree it is in the USA, people who see themselves as honest tend to avoid utilizing them.
And even those who are less than honest, but have a sense of propriety, would understand that the correct, culturally approved time to engage in these activities is AFTER one acquires a significant amount of wealth, when entitlements are knighted to become "economic incentives".
I really don't understand what any of this has to do with "trust", especially of the project or code. If anything, people who want to gain undeserved trust would be incentivized to appear to follow a higher standard of norms publicly. The public comments would be nice and polite and gregarious and professional, and the behaviour that didn't meet that standard would be private.
FWIW I've never programmed a line of code in zig and I don't know who this developer is.
All I got from it was "seems like GitHub is starting to deteriorate pretty hard and this guy's fed up and moving his project and leaving some snark behind".
I think there's still a category theoretic expression of this, but it's not necessarily easy to capture in language type systems.
The notion of `f` producing a lazy sequence of values, `g` consuming them, and possibly that construct getting built up into some closed set of structures (e.g. sequences, or trees, or if you like, dags).
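Concretely, in plain iterator terms the shape I'm thinking of is something like this (a rough sketch, glossing over the categorical framing):

```rust
// `f` produces a lazy sequence, `g` consumes it; nothing runs until `g`
// pulls values through, and the pair composes into larger pipelines.
fn f() -> impl Iterator<Item = u64> {
    (1u64..).map(|n| n * n) // infinite, lazy sequence of squares
}

fn g(xs: impl Iterator<Item = u64>) -> u64 {
    xs.take_while(|&x| x < 100).sum() // the consumer decides how much gets forced
}

fn main() {
    println!("{}", g(f())); // 1 + 4 + 9 + ... + 81 = 285
}
```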
I've only read a smattering of Pi theory, but if I remember correctly it concerns itself more with the behaviour of `f` and `g`, and more generally bridging between local behavioural descriptions of components like `f` and `g` and the global behaviour of a heterogeneous system that is composed of some arbitrary graph of those sending messages to each other.
I'm getting a bit beyond my depth here, but it feels like Pi theory leans more towards operational semantics for reasoning about asynchronicity and something like category theory / monads / arrows and related concepts lean more towards reasoning about combinatorial algebras of computational models.
You'd want to have the alteration reference existing guides to the current implementation.
I haven't jumped in headfirst to the "AI revolution", but I have been systematically evaluating the tooling against various use cases.
The approach that tends to have the best result for me combines a collection of `RFI` (request for implementation) markdown documents to describe the work to be done, as well as "guide" documents.
The guide documents need to keep getting updated as the code changes. I do this manually but probably the more enthusiastic AI workflow users would make this an automated part of their AI workflow.
It's important to keep the guides brief. If they get too long they eat context for no good reason. When LLMs write for humans, they tend to be very descriptive. When generating the guide documents, I always add an instruction to tell the LLM to "be succinct and terse", followed by "don't be verbose". This makes the guides into valuable high-density context documents.
The RFIs are then used in a process. For complex problems, I first get the LLM to generate a design doc, then an implementation plan from that design doc, then finally I ask it to implement it while referencing the RFI, design doc, impl doc, and relevant guide docs as context.
If you're altering the spec, you wouldn't ask it to regen from scratch, but use the guide documents to compute the changes needed to implement the alteration.
Yes, but that is usually more related to pay/benefits. At Google (from what I heard) contractors are put on the bad projects, maintenance work, or support functions. As in, there is a big separation between work done by full-time employees and contractors in most teams.
I think FTE is mostly used as a 'unit'. E.g. if two people work on something 50% of the time, you get one FTE, as there is roughly one full-time employee's worth of effort put in.
Though in this context it just seems to be the number of people working on the code on a consistent basis.
* “Full Time Employee” (which can itself mean “not a part-timer” in a place that employs both, or “not a temp/contractor” [in which case the “full-time” really means “regular/permanent”]) or
* “Full Time Equivalent” (a budgeting unit equal to either a full time worker or a combination of part time workers with the same aggregate [usually weekly] hours as constitute the standard for full-time in the system being used.)