You're exactly right. The llguidance library [1,2] seems to have emerged as the go-to solution for this by virtue of being >10X faster than its competition. It's work from some past colleagues of mine at Microsoft Research based on the theory of (regex) derivatives, which we previously used to ship a novel kind of regex engine for .NET. It's cool work and AFAIK should ensure full adherence to a JSON grammar.
llguidance is used in vLLM, SGLang, internally at OpenAI, and elsewhere. At the same time, I also see a non-trivial JSON error rate from Gemini models in large-scale synthetic generations, so perhaps Google hasn't seen the "llight" yet and is using something less principled.
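For anyone wondering how a library can guarantee well-formed output at all: the trick is to mask the sampler, not to repair the text afterwards. Here's a minimal sketch of the idea, with hypothetical helper names rather than llguidance's actual API (the real thing also has to handle tokenizer quirks and compute the masks very fast):

    # Minimal sketch of grammar-constrained decoding; `token_is_allowed` stands in
    # for the grammar engine (llguidance derives these masks from the grammar),
    # so everything here is illustrative, not the real API.
    import math
    import random

    def constrained_sample(logits, vocab, grammar_state, token_is_allowed):
        """Sample one token id, restricted to tokens the grammar can accept next."""
        allowed = [i for i, tok in enumerate(vocab) if token_is_allowed(grammar_state, tok)]
        if not allowed:
            raise RuntimeError("grammar rejects every token (grammar/tokenizer mismatch)")
        # Softmax over the allowed subset only, so invalid output is unreachable.
        m = max(logits[i] for i in allowed)
        weights = [math.exp(logits[i] - m) for i in allowed]
        return random.choices(allowed, weights=weights, k=1)[0]

With a mask like this applied at every decoding step, a syntactically invalid JSON document simply cannot be produced, no matter how confused the model is.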
Cool stuff! I don't get how all the open source inference frameworks have this down but the big labs don't...
Gemini [0] is falsely advertising this:
> This capability guarantees predictable and parsable results, ensures format and type-safety, enables the programmatic detection of refusals, and simplifies prompting.
I live halfway across the world from my folks so I don't see them often. I'd love something that gives me a greater sense of presence than a video call can give.
By firsthand reports of AVP users, it is: apparently it feels like a real presence in your space, like hanging out in person, and their recollections of the conversations weren't of calls but of visits. The main downside is that there are so few other friend/family users, because it's prohibitively expensive, niche, and geeky, and those who do make these VR calls still do them infrequently: it's a hassle to break out the device if you don't use it regularly, it's uncomfortable to wear for too long, and calls typically need to be coordinated in advance.
Still, if I were to have a long-distance relationship with a tolerant partner, or one of us traveled frequently or for long periods, I would be tempted to consider these so we could watch a show or movie and hang out despite the distance.
We don't know that LLMs generating tokens for scenarios involving simulations of consciousness don't already involve such experience. Certainly such threads of consciousness would currently be much less coherent and fleeting than the human experience, but I see no reason to simply ignore the possibility. To whatever degree it is even coherent to talk about the conscious experience of anyone other than yourself (p-zombies and such), I expect that as AIs' long-term coherency improves and AI minds become more tangible to us, people will settle into the same implicit assumption afforded to fellow humans: that there is consciousness behind the cognition.
The very tricky part, then, is to ask whether the consciousness/phenomenological experience that you postulate still happens if, say, we were to compute the outputs of an LLM by hand. While difficult, if every single person on earth did one operation per second, plus some very complicated coordination and results gathering, we could probably predict a couple of tokens for an LLM at some moderate frequency... a couple of tokens a month? A week? A year? A decade? Regardless: would that consciousness still have an experience? Or is there some threshold of speed, coherence, or coloration that would be missing, causing it to fail to emerge?
Impossible to answer.
Btw, I mostly think it's reasonable to hold that consciousness, phenomenology, etc. are possible in silicon, but it's tricky and unverifiable, of course.
> would that consciousness still have an experience?
If the original one did, then yes, of course. You're performing the exact same processing.
Imagine if instead of an LLM the billions of people instead simulated a human brain. Would that human brain experience consciousness? Of course it would, otherwise they're not simulating the whole brain. The individual humans performing the simulation are now comparable to the individual neurons in a real brain. Similarly, in your scenario, the humans are just the computer hardware running the LLM. Apart from that it's the same LLM. Anything that the original LLM experiences, the simulated one does too, otherwise they're not simulating it fully.
You can simulate as much of the human as you need to. So long as consciousness is a physical process (or an emergent property of a physical process), it can be simulated.
The notion that it is not a physical process is an extraordinary claim in its own right, which itself requires evidence.
You can simulate as much of an aircraft as you need to. So long as flying is a physical process, it can be simulated.
But your simulation will never fly you over an ocean; it will never be an aircraft or do what aircraft do. A simulation of heat transfer will not cook your dinner. Your assumption that a simulation of a mind is a mind requires evidence.
> But your simulation will never fly you over an ocean
It will fly over a simulated ocean just fine. It does exactly what aircraft do, within the simulation. By adding “you” to the sentence you've made it an apples to oranges comparison because “you” is definitionally not part of the simulation. I don't see how you could add the same “you” to “it will simulate consciousness just fine”.
It doesn't move real Oxygen and Nitrogen atoms, it doesn't put exhaust gas into the air over the ocean, it doesn't create a rippling sound and pressure wave for a thousand miles behind it, it doesn't drain a certain amount of jet fuel from the supply chain or put a certain amount of money in airline and mechanics' pockets, it doesn't create a certain amount of work for air traffic controllers... The reductio ad absurdum is that a flipbook animation of a stickman aircraft moving over a wiggly line ocean is a very low granularity simulation and "does exactly what aircraft do" - and obviously it doesn't. No amount of adding detail to the simulation moves it one inch closer to doing 'exactly what aircraft do'.
> "I don't see how you could add the same “you” to “it will simulate consciousness just fine”"
By the same reductio ad absurdum, I don't see how you can reject a stickman with a speech bubble drawn over his head as being "a low granularity simulated consciousness": more paper, more pencil graphite, and the stickman will become conscious when there's enough of it. Another position is that adding things to the simulation won't simulate consciousness just fine - won't move it an inch closer to being conscious; it will always be a puppet of the simulator, animated by the puppeteer's code, always wooden Pinocchio and never a real person. What is the difference between these two:
a) a machine with heat and light and pressure sensors, running some code, responding to the state of the world around it.
b) a machine with heat and light and pressure sensors, running some code [converting the inputs to put them into a simulation, executing the simulation, converting the outputs from the simulation], and using those outputs to respond to the state of the world around it.
? What is the 'simulate consciousness' step doing here at all, why is it needed? To hide the flaw in the argument; it's needed to set up the "cow == perfectly spherical massless simulated cow" premise which makes the argument work in English words. Instead of saying something meaningful about consciousness, one states that "consciousness is indistinguishable from perfectly spherical massless simulated consciousness", then states "simply simulate it to as much detail as needed", and that allows all the details to be handwaved away behind "just simulate it even more (bro)".
Pointing out that simulations are not the real thing is the counter-argument. Whether or not the counter-argument can be made by putting "you" into a specific English sentence is not really relevant, that's only to show that the simulated aircraft doesn't do what the real aircraft does. A simulated aircraft flying over a simulated ocean is no more 'real' than drawing two stick figures having a conversation in speech bubbles.
You just wrote a lot of text just to say that you don't accept the simulation as “real”.
That's just semantics. I'm not here to argue what the word “real” means. Of course you can define it in such a way that the simulated aircraft isn't “really” flying over an ocean, and it would be just as valid as any other definition, but it doesn't say anything meaningful or insightful about the simulation.
Nobody contests your point that the simulated aircraft isn't going over a real ocean and isn't generating work for real-life air traffic controllers. But conversely you don't seem to contest the claim that oceans and air traffic controllers could be simulated, too. Therefore, consciousness can be simulated as well, and it would be a simulated consciousness that just doesn't fall into your definition of “real”.
You need to clearly define what constitutes "real" before we can meaningfully talk about the distinction between "real" atoms and simulated ones.
As far as physics go, it's all just numbers in the end. Indeed, the more we keep digging into the nature of reality, the more information theory keeps popping up - see e.g. the holographic principle.
> "As far as physics go, it's all just numbers in the end."
No it isn't; numbers are a map, maps are not the territory. You are asking me to define how a map is different from a city, but you are not accepting that the city is made of concrete and is square kilometers large and the map is made of paper and is square centimeters large as a meaningful difference, when I think it's such an obvious difference it's difficult to put any more clearly.
What constitutes a real atom: a Hydrogen atom capable of combining with Oxygen to make water, capable of being affected by the magnetic field of an MRI scanner, etc.
What constitutes a simulated atom: a pattern of bits/ink/numbers which you say "this is a representation of a Hydrogen atom", capable of nothing, except you putting some more bits/ink/numbers near it and speaking the words "this is it interacting to make simulated water".
Ok, you are saying that a map is different than the territory. That a simulation is meaningfully different.
Do you deny that you could be in a simulation right now, in the Matrix? That what you actually think are molecules of oxygen are actually simulated molecules? That there is no way for you to ever tell the difference?
Is "simulate" the right word there? With a hundred trillion connections between 80 billion neurons, it seems unlikely that it would ever be worth simulating a human brain, because it would be simpler to just build one than to assemble a computer complex enough to simulate it.
Yes that’s my main point - if you accept the first one, then you should accept the second one (though some people might find the second so absurd as to reject the first).
> Imagine if instead of an LLM the billions of people instead simulated a human brain. Would that human brain experience consciousness? Of course it would, otherwise they're not simulating the whole brain.
However, I don't really buy "of course it would," or in other words the materialist premise - maybe yes, maybe no, but I don't think there's anything definitive on the matter of materialism in philosophy of mind. As much as I wish I were fully a materialist, I can never fully internalize how sentience can, uh, emerge from matter... in other words, to some extent I feel that my own sentience is fundamentally incompatible with everything I know about science, which, uh, sucks, because I definitely don't believe in dualism!
With sufficient accuracy it would certainly say to you, honestly, that it's conscious and believe it wholeheartedly. But in practice it would need to be able to describe external sense data a priori, since it isn't necessarily separate from the experiences, and that intrinsically requires you to compute in the world itself rather than only compute on it; in a way it's like having edge compute at the skin's edge. The range of qualia available at each moment will be distinct to each experiencer given the senses available, and there will likely be some overlap in interpretation based on your computing substrate.
In a way, we can articulate the underlying chemputation of the universe as mediated through our senses, reflection, and language; turn a piece off (as it is often non-continuous) and the quality of the experience changes.
But do you believe in something constructive? Do you agree with Searle that computers calculate? But then numbers and calculation are immaterial things that emerge from matter?
For some interesting context: this paper was a precursor to all the work on synthetic data at Microsoft Research that led to the Phi series of SLMs. [1] It was an important demonstration of what carefully curated and clean data could do for language models.
Edge's Password Monitor feature uses homomorphic encryption to match passwords against a database of leaks without revealing anything about those passwords: https://www.microsoft.com/en-us/research/blog/password-monit... So not the first, but definitely cool to see more adoption!
I'm not familiar with how TypeChat works, but Guidance [1] is another similar project that can actually integrate into the token sampling to enforce formats.
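From memory, the user-facing side of that looks roughly like the snippet below; treat the exact API (models.Transformers, gen(regex=...)) as approximate, since it may have drifted since I last touched it.

    # Rough sketch of guidance-style constrained generation, from memory;
    # exact names/signatures may differ in current releases.
    from guidance import models, gen

    lm = models.Transformers("gpt2")  # any local HF model

    # The regex is enforced during token sampling rather than validated afterwards.
    lm += "The year the transistor was invented: " + gen(name="year", regex=r"[0-9]{4}")
    print(lm["year"])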
We built the new engine behind .NET's RegexOptions.NonBacktracking with derivatives. We will have a paper at PLDI this year on the work that went into that.
PCRE semantics was indeed the big thing that required new techniques. Basically, you have to modify the derivation function such that the derivatives model what paths through the pattern a backtracking engine would consider before stopping at the current position.
The big thing derivatives buy you is the ability to apply rewrite rules lazily during construction. For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it. These kinds of rewrites often result in the DFA that gets constructed being minimal or close to it. Of course, you do pay somewhat for the machinery to do this, so best-case construction times suffer compared to traditional NFA + lazy-DFA engines like RE2 or Rust's, but a larger class of patterns get to stay in the DFA world with derivatives.
I hope our work ignites a new interest in regex matching with derivatives. I believe the ability to apply these syntactic rewrites on-the-fly is really powerful and I'd love to see how far folks like you with extensive experience in optimizing regex matchers can take this.
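For anyone curious what "derivatives" means concretely here: you match by repeatedly taking the derivative of the pattern itself with respect to each input character, and the rewrites live in the smart constructors that build those derivatives. A toy sketch below - nothing like the real .NET engine, and with only the most trivial subsumption rule (R|R to just R) - but it shows where richer rules like the nullable-prefix one hook in.

    # Toy Brzozowski-derivative matcher; the rewrite rules sit in the alt/seq
    # constructors, which is where a real engine would apply subsumption checks.
    from dataclasses import dataclass

    class Re: pass

    @dataclass(frozen=True)
    class Empty(Re): pass            # matches nothing

    @dataclass(frozen=True)
    class Eps(Re): pass              # matches the empty string

    @dataclass(frozen=True)
    class Chr(Re): c: str

    @dataclass(frozen=True)
    class Alt(Re): a: Re; b: Re

    @dataclass(frozen=True)
    class Seq(Re): a: Re; b: Re

    @dataclass(frozen=True)
    class Star(Re): a: Re

    def alt(a, b):
        # Eager rewrites: R|R -> R, and drop the empty language.
        if a == b or isinstance(b, Empty): return a
        if isinstance(a, Empty): return b
        return Alt(a, b)

    def seq(a, b):
        if isinstance(a, Empty) or isinstance(b, Empty): return Empty()
        if isinstance(a, Eps): return b
        if isinstance(b, Eps): return a
        return Seq(a, b)

    def nullable(r):
        if isinstance(r, (Eps, Star)): return True
        if isinstance(r, Alt): return nullable(r.a) or nullable(r.b)
        if isinstance(r, Seq): return nullable(r.a) and nullable(r.b)
        return False

    def deriv(r, c):
        # The language of suffixes that remain after consuming character c.
        if isinstance(r, Chr): return Eps() if r.c == c else Empty()
        if isinstance(r, Alt): return alt(deriv(r.a, c), deriv(r.b, c))
        if isinstance(r, Star): return seq(deriv(r.a, c), r)
        if isinstance(r, Seq):
            d = seq(deriv(r.a, c), r.b)
            return alt(d, deriv(r.b, c)) if nullable(r.a) else d
        return Empty()

    def matches(r, s):
        for c in s:
            r = deriv(r, c)
        return nullable(r)

    # (ab)*a matches "aba" but not "ab".
    r = Seq(Star(Seq(Chr("a"), Chr("b"))), Chr("a"))
    assert matches(r, "aba") and not matches(r, "ab")

The regexes you reach by taking derivatives are exactly the DFA states, so any simplification you manage to do in alt/seq directly shrinks the automaton.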
Wow, that's really cool. I can't wait to read the paper. Have y'all written anything else about it?
> For example, when we can prove that a regex R subsumes a regex T, then an alternation R|T can be rewritten to just R, since T is already included in it.
This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.
> but a larger class of patterns get to stay in the DFA world with derivatives.

Can you say more about this?
We have an early tool paper [1] for a previous version of the engine, but that's short and with POSIX semantics, so doesn't include a lot of the interesting stuff. The most relevant bit there is the handling of Unicode.
>This doesn't compose though, right? For example, if you have `sam|samwise`, then you can do that transformation, but if you have `\b(?:sam|samwise)\b` then you can't.
You'd get subsumption when you have something like '(?:sam)?wise|wise', and in fact this kind of subsumption due to a "nullable prefix" is the main one currently detected (because we encountered patterns that motivated it). And you're right: all these rewrites have to be valid regardless of context so that they can be eagerly applied in the AST constructors.
>> but a larger class of patterns get to stay in the DFA world with derivatives.
>Can you say more about this?
Yeah, the easiest example I can point at is from that tool paper [1], where a subsumption-based rewrite for counted repetitions can turn an exponential blow-up into a linear one. Off the top of my head I think a pattern like '.*a[ab]{0,1000}' would have a 2^1000 blow-up when determinized into a DFA but stays linear in size with derivatives. However, that loop subsumption rule isn't valid as-is under PCRE semantics, so it still needs some work to be ported to the .NET engine.
Before we get that PLDI paper out, the best resource is probably just the code under [2]. It's fairly well commented, but of course that's no substitute for a good write-up.
I'd love for someone to do a good quality PyTorch enabled implementation of Sampled AlphaZero/MuZero [1]. RLLib has an AlphaZero, but it doesn't have the parallelized MCTS you really want to have and the "Sampled" part is another twist to it. It does implement a single player variant though, which I needed. This would be amazing for applying MCTS based RL to various hard combinatorial optimization problems. Case in point, AlphaTensor uses their internal implementation of Sampled AlphaZero.
An initial implementation might be doable in 5 hours for someone competent and familiar with RLLib's APIs, but could take much longer to really polish.
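For a sense of scope, the search core itself (PUCT selection plus backup) is tiny; it's batching the network evaluations and parallelizing the tree traversal around it that takes the real effort. A bare-bones sketch with my own names, not RLLib's API:

    # Bare-bones PUCT selection/backup, the core of AlphaZero-style MCTS.
    # My own names, not RLLib's; the Sampled variant would restrict expansion
    # to a sampled subset of actions with renormalized priors.
    import math

    class Node:
        def __init__(self, prior):
            self.prior = prior        # P(s, a) from the policy network
            self.visit_count = 0
            self.value_sum = 0.0
            self.children = {}        # action -> Node

        def q(self):
            return self.value_sum / self.visit_count if self.visit_count else 0.0

    def puct(parent, child, c_puct=1.5):
        # Q(s, a) + c * P(s, a) * sqrt(N(s)) / (1 + N(s, a))
        u = c_puct * child.prior * math.sqrt(parent.visit_count) / (1 + child.visit_count)
        return child.q() + u

    def select_leaf(root):
        # Walk down, always taking the max-PUCT child, until we hit a leaf.
        node, path = root, [root]
        while node.children:
            parent = node
            _, node = max(parent.children.items(), key=lambda kv: puct(parent, kv[1]))
            path.append(node)
        return node, path

    def expand(leaf, action_priors):
        for action, p in action_priors.items():   # action_priors: dict of action -> prior
            leaf.children[action] = Node(prior=p)

    def backup(path, value):
        # Single-player variant: propagate the same value along the whole path
        # (a two-player game would flip the sign at each level).
        for node in path:
            node.visit_count += 1
            node.value_sum += value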
The in-place update work in Koka [1] is super impressive. One of my co-workers, Daan Leijen, leads the Koka project, and hearing his talks about it has been such a privilege. The work around Koka is really convincing me that functional languages will eventually lead the pack in the effort-to-performance trade-off.
Something that came out of the Koka project that everyone should know about is mimalloc [2]: if your built-in malloc is not doing it for you, this is the alternative allocator you should try first. Mimalloc is tiny and it has consistently great performance in concurrent environments.
Ooooh… Koka is a really neat language. I first encountered it in a seminar when we talked about algebraic effect handlers. Koka has these row types which effectively allow you to statically ensure that all your effects get handled at some point, modulo certain ordering constraints on when those effects get handled.
Essentially, you get nice purity around effects but they're way easier to compose (and imo grok) than monads.
If anyone is interested in learning more, definitely take a look at this [1] paper by the aforementioned Leijen. (OP, very cool that you get to work with this guy.) One of the best-written papers I've seen.
I'm really excited for Koka. I still need to dig in and actually try writing some code, but it looks so promising. At a minimum I hope that it inspires other languages to look at these features.
There's definitely bits I don't love, but they're nothing I couldn't get over.
I actually attempted some of 2021's Advent of Code in Koka.
It was fun for the first couple of questions, but then got really painful for a couple of reasons. None of this is a complaint about the project, as it is still a research language and has not had a lot of effort invested into getting others to use it.
What I really liked was how easy it was to write parsers due to the effect system. i.e. you'd write a parser that looked like a direct imperative translation of the BNF grammar or the problem statement, and then the effect system would propagate the actual input through the call graph and each parser would only have to deal with its small part.
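Something like this rough Python analogue, with a contextvar standing in for the ambient "input" effect; obviously not the same thing as Koka's handlers, just the flavor of not having to thread the cursor through every call:

    # Rough Python analogue of the "ambient input" style; in Koka the effect
    # system does this plumbing, here a contextvar fakes it.
    import contextvars

    _input = contextvars.ContextVar("parser_input")   # holds (text, position)

    def run_parser(parser, text):
        token = _input.set((text, 0))
        try:
            return parser()
        finally:
            _input.reset(token)

    def peek():
        text, pos = _input.get()
        return text[pos] if pos < len(text) else ""

    def advance():
        text, pos = _input.get()
        _input.set((text, pos + 1))
        return text[pos]

    def number():                    # number ::= digit+
        digits = ""
        while peek().isdigit():
            digits += advance()
        return int(digits)

    def pair():                      # pair ::= number "," number
        left = number()
        assert advance() == ","
        return (left, number())

    print(run_parser(pair, "12,34"))   # -> (12, 34)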
However several things made me get frustrated and stop which wouldn't have happened with better documentation and more source code to look at.
As someone with no theory background in effect systems, I found the syntax very oblique, e.g. when do I need to mark that my function has a `| e` effect versus a `, e` effect? I'm not a fan of the Haskell style of introducing single-letter free variables without any syntactic convention. For example, in Rust, the tick is used to clearly mark that something is a lifetime annotation, and the rules about how lifetimes propagate, and so when you need to introduce 2 or more lifetimes, are explained clearly (if not in the Rust docs, then somewhere else on the internet). That wasn't the case for Koka's effect syntax.

In addition, I would sometimes stumble onto effect annotations that should have worked according to my limited understanding, but didn't, and the compiler errors in such cases were really meaningless to me. Other times, the errors almost indicated that what I was doing was semantically correct but not yet implemented in the compiler. Also, apparently `_` has special meaning at the beginning of an effect annotation, but I only found this out by trial and error. Eventually I would just give up on annotating any functions until the compiler complained, but this made it difficult for me to understand my own code (it was like writing Python without type annotations).
The API docs are mostly just type signatures, which means that unless it is some simple function, or it is used in the Koka implementation sources, you are left scratching your head about how to use the standard library. Things like `ref()` types could also use much more elaboration.
Also, at the end of the day, Advent of Code solutions are generally pretty small scripts, where there isn't a need for effect systems. It was often easier to just use Python and its solid standard library.
Overall, it is an interesting language and I'll definitely keep an eye on it.
I've done work on low-latency FHE neural network inferencing [1] and we estimated the FLOPS to be approximately 4 times that of an Intel 8087 floating point coprocessor [2]. This was for a LeNet-5 network on the MNIST dataset with multicore evaluation on a workstation class machine.
My view is that this is already fast enough to support use cases that really need the unique capabilities of FHE. Since this work we've been focused on making FHE more usable with compilers and tooling [3]. Currently most FHE is being programmed like that Intel 8087 was: with the equivalent of assembly by directly calling functions in FHE libraries to perform arithmetic and crypto operations. Imagine having to do register allocation by hand for all of your code. The EVA compiler [4] is meant to be like a "C compiler for FHE", hiding low-level crypto concerns and providing common optimizations.
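To make that contrast concrete, here is a purely hypothetical sketch: the library names and calls below are invented for illustration, they're neither a real FHE library's API nor EVA's.

    # "Assembly level": the programmer drives every crypto op by hand, including
    # maintenance like relinearization and rescaling after each multiply.
    def encrypted_dot_by_hand(lib, ct_xs, ct_ws, relin_keys):
        acc = lib.encrypt_zero()
        for ct_x, ct_w in zip(ct_xs, ct_ws):
            prod = lib.multiply(ct_x, ct_w)
            prod = lib.relinearize(prod, relin_keys)   # the "register allocation by hand" part
            prod = lib.rescale(prod)
            acc = lib.add(acc, prod)
        return acc

    # "Compiler level": write ordinary arithmetic over symbolic encrypted values
    # and let the compiler insert the maintenance ops and pick encryption
    # parameters, which is the kind of thing EVA is meant to automate.
    def encrypted_dot_with_compiler(xs, ws):
        return sum(x * w for x, w in zip(xs, ws))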
[1] https://guidance-ai.github.io/llguidance/llg-go-brrr
[2] https://github.com/guidance-ai/llguidance