I suspect it's just circumstantial - two different design approaches, each with its own advantages and disadvantages.
IMHO the bigger issue with NaN-boxing is that on 64-bit systems it relies on the address space only needing <50 bits or so, as the discriminator is stored on the high bits. It's ok for now when virtual address spaces typically only need 48 bits of representation, but that's already starting to slip with newer systems.
On the other hand, I love the fact that NaN-boxing basically lets you eliminate all heap allocations for doubles.
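For concreteness, here's a minimal sketch of the scheme (an illustrative layout, not any particular engine's): doubles are stored as their own bit pattern, and pointers are hidden inside the unused NaN payload space, squeezed into the low 48 bits.

```rust
// Minimal NaN-boxing sketch (illustrative layout, not any particular engine's).
// A quiet NaN has all exponent bits set plus the top mantissa bit; hardware only
// ever produces that one payload, so the rest of the NaN payload space is free
// to carry a tag and a 48-bit pointer.

const QNAN: u64 = 0x7ff8_0000_0000_0000;    // quiet-NaN prefix
const TAG_PTR: u64 = 0x0001_0000_0000_0000; // example tag bit above the 48-bit payload

#[derive(Debug)]
enum Unboxed {
    Double(f64),
    Ptr(usize),
}

// Doubles are immediates: their own bit pattern is the boxed value.
fn box_double(x: f64) -> u64 {
    x.to_bits()
}

// Pointers only fit if the address needs <= 48 bits -- exactly the assumption
// the parent comment is worried about.
fn box_ptr(p: usize) -> u64 {
    assert!((p as u64) < (1u64 << 48));
    QNAN | TAG_PTR | p as u64
}

fn unbox(v: u64) -> Unboxed {
    if v & QNAN == QNAN && v & TAG_PTR != 0 {
        Unboxed::Ptr((v & 0x0000_ffff_ffff_ffff) as usize)
    } else {
        Unboxed::Double(f64::from_bits(v)) // includes real NaNs and infinities
    }
}
```

(A real engine also needs tags for ints, booleans, etc., but the pointer case is where the address-space assumption bites.)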
I actually wrote a small article a while back on a hybrid approach called Ex-boxing (exponent boxing), which tries to get at the best of both worlds: decouple the boxing representation from virtual address significant bits, and also represent most (almost all) doubles that show up at runtime as immediates: https://medium.com/@kannanvijayan/exboxing-bridging-the-divi...
> IMHO the bigger issue with NaN-boxing is that on 64-bit systems it relies on the address space only needing <50 bits or so, as the discriminator is stored on the high bits.
Is this right? You get 51 tag bits, of which you must use one to distinguish pointer-to-object from other uses of the tag bits (assuming Huffman-ish coding of tags). But objects are presumably a minimum of 8-byte sized and aligned, and on most platforms I assume they'd be 16-byte sized and aligned, which means the low three (four) bits of the address are implicit, giving 53 (54) bit object addresses. This is quite a few years of runway...
There's a bit of time yes, but for an engine that relies on this format (e.g. spidermonkey), the assumptions associated with the value boxing format would have leaked into the codebase all over the place. It's the kind of thing that's far less painful to take care of before you're forced to than after.
But fair point on the aligned pointers - that would give you some free bits to keep using, but it gets ugly.
You're right about the 51 bits - I always get mixed up about whether it's 12 bits of exponent, or the 12 includes the sign. The point is that it puts a hard constraint on a pretty large number of a pointer's high bits being free, as opposed to an alignment requirement for low-bit tagging, which will never run out of bits.
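For contrast, low-bit tagging looks roughly like this (tag assignments made up for illustration): with 8-byte-aligned objects the bottom three bits of every pointer are known-zero, so they can carry the tag no matter how wide the address space grows.

```rust
// Low-bit tagging sketch: 8-byte-aligned heap objects leave the bottom 3 bits
// of every object address zero, so those bits can carry a tag.
// Tag values below are made up for illustration.

const TAG_MASK: usize = 0b111;
const TAG_OBJECT: usize = 0b000;    // aligned pointer, usable as-is
const TAG_SMALL_INT: usize = 0b001; // payload lives in the upper bits

fn tag_small_int(n: isize) -> usize {
    ((n as usize) << 3) | TAG_SMALL_INT
}

fn describe(v: usize) -> &'static str {
    match v & TAG_MASK {
        TAG_OBJECT => "object pointer (use v directly)",
        TAG_SMALL_INT => "small integer (value is v >> 3)",
        _ => "some other immediate",
    }
}
```

The trade-off, as noted above, is that a full 64-bit double no longer fits as an immediate - which is exactly what NaN-boxing buys you.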
I think this is an attempt to try to enrich the locality model in transformers.
One of the weird things you do in transformers is add a position vector which captures the distance between the token being attended to and some other token.
This is obviously not powerful enough to express non-linear relationships - like graph relationships.
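Roughly, the only structural signal an attention score sees from positions is something like a bias indexed by the 1-D offset between two tokens - a toy sketch (relative-bias flavour; details vary by architecture):

```rust
// Toy sketch: the positional signal at an attention site is a function of the
// linear sequence distance |i - j| only, so relationships that aren't a
// function of that 1-D offset (e.g. graph edges) can't be expressed directly.
fn attention_score(q_dot_k: f32, i: usize, j: usize, rel_bias: &[f32]) -> f32 {
    let dist = i.abs_diff(j).min(rel_bias.len() - 1); // clamp long distances
    q_dot_k + rel_bias[dist]
}
```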
This person seems to be experimenting with doing pre-processing of the input token set, to linearly reorder it by some other heuristic that might map more closely to the actual underlying relationship between each token.
Once upon a time, back when I was a language modeling researcher, I built and finetuned a big (at the time - about 5 billion parameters) Sparse Non-Negative Matrix Language Model [1].
As this model allows for mix-and-match of various contexts, one thing I did was to use a word-sorted context. This effectively transforms a position-based context into a word-set based context, where "you and me", "me and you" and "and me you" are the same.
This allowed for longer contexts and better prediction.
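The transformation itself is trivial - roughly this (a toy sketch, not the actual model code):

```rust
// Order-insensitive context key: sort the context words, so "you and me",
// "me and you" and "and me you" all collapse to the same feature.
fn context_key(words: &[&str]) -> String {
    let mut sorted: Vec<&str> = words.to_vec();
    sorted.sort_unstable();
    sorted.join(" ")
}

fn main() {
    assert_eq!(context_key(&["you", "and", "me"]),
               context_key(&["me", "and", "you"])); // both become "and me you"
}
```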
I've saved it to look at in the future. I also remembered Kristina Toutanova's name (your editor). Looking up recent publications, she's done interesting work on analyzing pretraining mixtures.
Well, in your work, what benefit did you get from it? And do you think it would be beneficial today combined with modern techniques? Or obsoleted by other techniques?
(I ask because I'm finding many old techniques are still good or could be mixed with deep learning.)
It was not bad, but I had trouble scaling it to the 1B set, mostly because I didn't have enough time.
I hold the same mindset as you, that many old techniques are misunderstood or underapplied. For example, decision trees, in my experiments, allow for bits-per-byte comparable to LSTM (lstm-compress, or LSTM in the nncp experiments): https://github.com/thesz/codeta
Adding the position vector is basic, sure, but it's naive to think the model doesn't develop its own positional system by bootstrapping on top of the barebones one.
> This is obviously not powerful enough to express non-linear relationships - like graph relationships.
The distance metric used is based on energy-informed graphs that encode energy relations in a distribution called taumode -- see my previous paper on spectral indexing for vector databases for a complete roll-out.
Writing a GC in rust without just dropping the whole business into unsafe is really annoying.
Jason Orendorff has an implementation of a GC in Rust called "cell-gc", which is the only one I've seen so far that seems to "get" how to marry Rust to the requirements of a GC implementation: https://github.com/jorendorff/cell-gc
Still has a lot of unsafe code and macro helpers, but it's laid out well and documented pretty well. Not sure if you've run across it yet.
I'd have to take a contrary view on that. It'll take some time for the technologies to be developed, but ultimately managed JIT compilation has the potential to exceed native compiled speeds. It'll be a fun journey getting there though.
The initial order-of-magnitude jump in perf that JITs provided took us from the 2-5x overhead for managed runtimes down to some (1 + delta)x. That was driven by runtime type inference combined with a type-aware JIT compiler.
I expect that there's another significant, but smaller perf jump that we haven't really plumbed out - mostly to be gained from dynamic _value_ inference that's sensitive to _transient_ meta-stability in values flowing through the program.
Basically you can gather actual values flowing through code at runtime, look for patterns, and then inline / type-specialize those by deriving runtime types that are _tighter_ than the annotated types.
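As a toy sketch of what I mean (in Rust, and not how any shipping JIT actually does it): profile the operands flowing through a hot site, and if one value dominates, emit a guarded fast path specialized on it, with the generic path as the deopt fallback.

```rust
use std::collections::HashMap;

// Toy value-profiling sketch: record operands seen at a hot site during
// warm-up, then specialize on a dominant value behind a guard.
struct ValueProfile {
    counts: HashMap<i64, u64>,
    samples: u64,
}

impl ValueProfile {
    fn new() -> Self {
        ValueProfile { counts: HashMap::new(), samples: 0 }
    }

    fn record(&mut self, v: i64) {
        *self.counts.entry(v).or_insert(0) += 1;
        self.samples += 1;
    }

    // "Specialize" only if one value accounts for >90% of samples.
    fn dominant(&self) -> Option<i64> {
        self.counts
            .iter()
            .find(|&(_, &c)| c * 10 > self.samples * 9)
            .map(|(&v, _)| v)
    }
}

fn generic_scale(x: i64, factor: i64) -> i64 {
    x * factor
}

fn scale_at_hot_site(x: i64, factor: i64, profile: &ValueProfile) -> i64 {
    match profile.dominant() {
        // Guarded fast path: with factor pinned to 8, the multiply folds to a shift.
        Some(8) if factor == 8 => x << 3,
        _ => generic_scale(x, factor), // guard failed -> "deopt" to the generic path
    }
}
```

The interesting part is that "factor is almost always 8" may not be derivable from the static types at all - it's only visible in the runtime value stream, and it can change over time, which is why the guard and the fallback matter.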
I think there's a reasonable amount of juice left in combining those techniques with partial specialization and JIT compilation, and that should get us over the hump from "slightly slower than native" to "slightly faster than native".
I get it's an outlier viewpoint though. Whenever I hear "managed jitcode will never be as fast as native", I interpret that as a friendly bet :)
> JIT compilation has the potential to exceed native compiled speeds
The battlecry of Java developers riding their tortoises.
Don’t we have decades of real-world experience showing native code almost always performs better?
For most things it doesn’t matter, but it always rubs me the wrong way when people mention this about JIT since it almost never works that way in the real world (you can look at web framework benchmarks as an easy example)
It's not that surprising to people who are old enough to have lived through the "reality" of "interpreted languages will never be faster than about 2x compiled languages".
The idea that an absurdly dynamic language like JS, where all objects are arbitrary property bags with prototype chains that are runtime-mutable, would execute within 2x of raw native performance was just a matter-of-fact impossibility.
Until it wasn't. And the technology reason it ended up happening was research that was done in the 80s.
It's not surprising to me that it hasn't happened yet. This stuff is not easy to engineer and implement. Even the research isn't really there yet. Most of the modern dynamic-language JIT ideas, which came to the fore in the mid-2000s, were direct adaptations of research work on Self from about two decades prior.
Dynamic runtime optimization isn't too hot in research right now, and it never was to be honest. Most of the language theory folks tend to lean more in the type theory direction.
The industry attention too has shifted away. Browsers were cutting edge a while back and there was a lot of investment in core research tech associated with that, but that's shifting more to the AI space now.
Overall the market value prop and the landscape for it just doesn't quite exist yet. Hard things are hard.
You nailed it -- the tech enabling JS to match native speed was Self research from the 80s, adapted two decades later. Let me fill in some specifics from people whose papers I highly recommend, and who I've asked questions of and had interesting discussions with!
Vanessa Freudenberg [1], Craig Latta [2], Dave Ungar [3], Dan Ingalls, and Alan Kay had some great historical and fresh insights. Vanessa passed recently -- here's a thread where we discussed these exact issues:
Vanessa had this exactly right. I asked her what she thought of using WASM with its new GC support for her SqueakJS [1] Smalltalk VM.
Everyone keeps asking why we don't just target WebAssembly instead of JavaScript. Vanessa's answer -- backed by real systems, not thought experiments -- was: why would you throw away the best dynamic runtime ever built?
To understand why, you need to know where V8 came from -- and it's not where JavaScript came from.
David Ungar and Randall B. Smith created Self [3] in 1986. Self was radical, but the radicalism was in service of simplicity: no classes, just objects with slots. Objects delegate to parent objects -- multiple parents, dynamically added and removed at runtime. That's it.
The Self team -- Ungar, Craig Chambers, Urs Hoelzle, Lars Bak -- invented most of what makes dynamic languages fast: maps (hidden classes), polymorphic inline caches, adaptive optimization, dynamic deoptimization [4], on-stack replacement. Hoelzle's 1992 deoptimization paper blew my mind -- they delivered simplicity AND performance AND debugging.
That team built Strongtalk [5] (high-performance Smalltalk), got acquired by Sun and built HotSpot (why Java got fast), then Lars Bak went to Google and built V8 [6] (why JavaScript got fast). Same playbook: hidden classes, inline caching, tiered compilation. Self's legacy is inside every browser engine.
Brendan Eich claims JavaScript was inspired by Self. This is an exaggeration based on a deep misunderstanding that borders on insult. The whole point of Self was simplicity -- objects with slots, multiple parents, dynamic delegation, everything just another object.
JavaScript took "prototypes" and made them harder than classes: __proto__ vs .prototype (two different things that sound the same), constructor functions you must call with "new" (forget it and "this" binds wrong -- silent corruption), only one constructor per prototype, single inheritance only. And of course == -- type coercion so broken you need a separate === operator to get actual equality. Brendan has a pattern of not understanding equality.
The ES6 "class" syntax was basically an admission that the prototype model was too confusing for anyone to use correctly. They bolted classes back on top -- but it's just syntax sugar over the same broken constructor/prototype mess underneath. Twenty years to arrive back at what Smalltalk had in 1980, except worse.
Self's simplicity was the point. JavaScript's prototype system is more complicated than classes, not less. It's prototype theater. The engines are brilliant -- Self's legacy. The language design fumbled the thing it claimed to borrow.
Vanessa Freudenberg worked for over two decades on live, self-supporting systems [9]. She contributed to Squeak EToys, Scratch, and Lively. She was co-founder of Croquet Corp and principal engineer of the Teatime client/server architecture that makes Croquet's replicated computation work. She brought Alan Kay's vision of computing into browsers and multiplayer worlds.
SqueakJS [7] was her masterpiece -- a bit-compatible Squeak/Smalltalk VM written entirely in JavaScript. Not a port, not a subset -- the real thing, running in your browser, with the image, the debugger, the inspector, live all the way down. It received the Dynamic Languages Symposium Most Notable Paper Award in 2024, ten years after publication [1].
The genius of her approach was the garbage collection integration. It amazed me how she pulled a rabbit out of a hat -- representing Squeak objects as plain JavaScript objects and cooperating with the host GC instead of fighting it. Most VM implementations end up with two garbage collectors in a knife fight over the heap. She made them cooperate through a hybrid scheme that allowed Squeak object enumeration without a dedicated object table. No dueling collectors. Just leverage the machinery you've already paid for.
But it wasn't just technical cleverness -- it was philosophy. She wrote:
"I just love coding and debugging in a dynamic high-level language. The only thing we could potentially gain from WASM is speed, but we would lose a lot in readability, flexibility, and to be honest, fun."
"I'd much rather make the SqueakJS JIT produce code that the JavaScript JIT can optimize well. That would potentially give us more speed than even WASM."
Her guiding principle: do as little as necessary to leverage the enormous engineering achievements in modern JS runtimes [8]. Structure your generated code so the host JIT can optimize it. Don't fight the platform -- ride it.
She was clear-eyed about WASM: yes, it helps for tight inner loops like BitBlt. But for the VM as a whole? You gain some speed and lose readability, flexibility, debuggability, and joy. Bad trade.
This wasn't conservatism. It was confidence.
Vanessa understood that JS-the-engine isn't the enemy -- it's the substrate. Work with it instead of against it, and you can go faster than "native" while keeping the system alive and humane. Keep the debugger working. Keep the image snapshotable. Keep programming joyful. Vanessa knew that, and proved it!
Yeah, I've heard this my whole career, and while it sounds great, it's been long enough that we should be able to list some major examples by now.
What are the real world chances that a) one's compiled code benefits strongly from runtime data flow analysis AND b) no one did that analysis at the compilation stage?
Some sort of crazy off label use is the only situation I think qualifies and that's not enough.
Compiled Lua vs LuaJIT is a major example imho, but maybe it's not especially pertinent given the looseness of the Lua language. I do think it demonstrates that having a tighter type system at runtime than at compile time (which can in turn yield real performance benefits) is a sound concept, however.
The major Javascript engines already have the concept of a type system that applies at runtime. Their JITs will learn the 'shapes' of objects that commonly go through hot-path functions and will JIT against those with appropriate bailout paths to slower dynamic implementations in case a value with an unexpected 'shape' ends up being used instead.
There's a lot of lore you pick up with Javascript when you start getting into serious optimization with it; and one of the first things you learn in that area is to avoid changing the shapes of your objects because it invalidates JIT assumptions and results in your code running slower -- even though it's 100% valid Javascript.
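Very roughly, a "shape" is a layout table that structurally identical objects share, and the optimized code guards on it - here's a toy model (in Rust for illustration, not how any engine actually lays it out):

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Toy model of "shapes" (hidden classes): objects with the same property
// layout share one Shape, so a hot property access can be compiled into a
// shape check plus a fixed-offset slot load instead of a hash lookup.
struct Shape {
    slot_of: HashMap<String, usize>, // property name -> slot index
}

struct Object {
    shape: Rc<Shape>,
    slots: Vec<f64>,
}

// What a monomorphic inline cache conceptually does for `obj.x`:
fn load_x_fast(obj: &Object, cached_shape: &Rc<Shape>, cached_slot: usize) -> Option<f64> {
    if Rc::ptr_eq(&obj.shape, cached_shape) {
        Some(obj.slots[cached_slot]) // guard passed: direct slot load
    } else {
        None // shape differs: fall back to the generic lookup / recompile
    }
}

// The generic path the fast path falls back to:
fn load_generic(obj: &Object, name: &str) -> Option<f64> {
    obj.shape.slot_of.get(name).map(|&i| obj.slots[i])
}
```

Adding or deleting a property transitions the object to a different Shape, which is why mutating object layouts in hot code defeats these caches even though it's perfectly valid Javascript.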
Totally agree on JS, but it doesn't have the same easy same-language comparison that you get from compiled Lua vs LuaJIT. Although I suppose you could pre-compile JavaScript to a binary with e.g. QuickJS, I don't think that's as apples-to-apples a comparison as compiled Lua vs LuaJIT.
Any optimizations discovered at runtime by a JIT can also be applied to precompiled code. The precompiled code is then not spending runtime cycles looking for patterns, or only doing so in the minimally necessary way. So for projects which are maximally sensitive to performance, native will always be capable of outperforming JIT.
It's then just a matter of how your team values runtime performance vs other considerations such as workflow, binary portability, etc. Virtually all projects have an acceptable range of these competing values, which is where JIT shines, in giving you almost all of the performance with much better dev economics.
I think you can capture that constraint as "anything that requires finely deterministic high performance is out of reach of JIT-compiled outputs".
Obviously JITting means you'll have a compiler executing sometimes along with the program which implies a runtime by construction, and some notion of warmup to get to a steady state.
Where I think there's probably untapped opportunity is in identifying these meta-stable situations in program execution. My expectation is that there are execution "modes" that cluster together more finely than static typing would allow you to infer. This would apply to runtimes like wasm too - where the modes of execution would be characterized by the actual clusters of numeric values flowing to different code locations and influencing different code-paths to pick different control flows.
You're right that, on the balance of things, trying to, say, allocate registers at runtime will necessarily allow for less optimization scope than doing it ahead of time.
But, if you can be clever enough to identify, at runtime, preferred code-paths with higher resolution than what (generic) PGO allows (because now you can respond to temporal changes in those code-path profiles), then you can actually eliminate entire codepaths from the compiler's consideration. That tends to greatly affect the register pressure (for the better).
It might be interesting just to profile some wasm executions of common programs to see if there are transient clusterings of control-flow paths that manifest during execution. It'd be a fun exercise...
The American working class doesn't like to acknowledge its own existence or assert its self-worth. There's no real self identity for that class in America.
In fact, a huge number of the people that are in that class would resent you for classifying them in this way. And the same is true for those in the upper middle class, or elites.
Secondly, trying to scope the xenophobia problem to just the working class is itself a bit of a misdirection. Plenty of that comes from the swaths of upper middle class white collar folks. And plenty of it comes from second gen immigrants who are eager to be counted among the natives.
The xenophobia _is_ the substitute American culture provides as a filler for the vacuum left by the lack of any sort of class identity. Everybody falls over themselves demonstrating how they can be "more American" in one way or the other. Who is a "real" American, what their qualities are, whether this particular thing or that particular thing is more or less American, etc. etc.
It's an alternate focus to direct all that shame the culture demands from the poor.
A considerable part of this is the fact that in a society where utilizing these programs is stigmatized to the degree it is in the USA, people who see themselves as honest tend to avoid utilizing them.
And even those who are less than honest, but have a sense of propriety, would understand that the correct, culturally approved time to engage in these activities is AFTER one acquires a significant amount of wealth, when entitlements are knighted to become "economic incentives".
I really don't understand what any of this has to do with "trust", especially of the project or code. If anything, people who want to gain undeserved trust would be incentivized to appear to follow a higher standard of norms publicly. The public comments would be nice and polite and gregarious and professional, and the behaviour that didn't meet that standard would be private.
FWIW I've never programmed a line of code in zig and I don't know who this developer is.
All I got from it was "seems like GitHub is starting to deteriorate pretty hard and this guy's fed up and moving his project and leaving some snark behind".
I think there's still a category theoretic expression of this, but it's not necessarily easy to capture in language type systems.
The notion of `f` producing a lazy sequence of values, `g` consuming them, and possibly that construct getting built up into some closed set of structures (e.g. sequences, or trees, or if you like, dags).
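Concretely, in plain iterator terms the shape I'm thinking of is something like this (a rough sketch, glossing over the categorical framing):

```rust
// `f` produces a lazy sequence, `g` consumes it; nothing runs until `g`
// pulls values through, and the pair composes into larger pipelines.
fn f() -> impl Iterator<Item = u64> {
    (1u64..).map(|n| n * n) // infinite, lazy sequence of squares
}

fn g(xs: impl Iterator<Item = u64>) -> u64 {
    xs.take_while(|&x| x < 100).sum() // the consumer decides how much gets forced
}

fn main() {
    println!("{}", g(f())); // 1 + 4 + 9 + ... + 81 = 285
}
```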
I've only read a smattering of Pi theory, but if I remember correctly it concerns itself more with the behaviour of `f` and `g`, and more generally bridging between local behavioural descriptions of components like `f` and `g` and the global behaviour of a heterogeneous system that is composed of some arbitrary graph of those sending messages to each other.
I'm getting a bit beyond my depth here, but it feels like Pi theory leans more towards operational semantics for reasoning about asynchronicity and something like category theory / monads / arrows and related concepts lean more towards reasoning about combinatorial algebras of computational models.
You'd want to have the alteration reference existing guides to the current implementation.
I haven't jumped in headfirst to the "AI revolution", but I have been systematically evaluating the tooling against various use cases.
The approach that tends to have the best result for me combines a collection of `RFI` (request for implementation) markdown documents to describe the work to be done, as well as "guide" documents.
The guide documents need to keep getting updated as the code changes. I do this manually but probably the more enthusiastic AI workflow users would make this an automated part of their AI workflow.
It's important to keep the guides brief. If they get too long they eat context for no good reason. When LLMs write for humans, they tend to be very descriptive. When generating the guide documents, I always add an instruction to tell the LLM to "be succinct and terse", followed by "don't be verbose". This makes the guides into valuable high-density context documents.
The RFIs are then used in a process. For complex problems, I first get the LLM to generate a design doc, then an implementation plan from that design doc, then finally I ask it to implement it while referencing the RFI, design doc, impl doc, and relevant guide docs as context.
If you're altering the spec, you wouldn't ask it to regen from scratch, but use the guide documents to compute the changes needed to implement the alteration.
Yes, but that is usually more related to pay/benefits. At Google (from what I heard) contractors are put on the bad projects, maintenance work, or support functions. As in, there is a big separation between work done by full-time employees and contractors in most teams.
I think FTE is mostly used as a 'unit'. E.g. if two people work on something 50% of the time, you get one FTE, as there is roughly one full-time employee's worth of effort put in.
Though in this context it just seems to be the number of people working on the code on a consistent basis.
* “Full Time Employee” (which can itself mean “not a part-timer” in a place that employs both, or “not a temp/contractor” [in which case the “full-time” really means “regular/permanent”]) or
* “Full Time Equivalent” (a budgeting unit equal to either a full time worker or a combination of part time workers with the same aggregate [usually weekly] hours as constitute the standard for full-time in the system being used.)