Built a browser-based AI OS where a master agent creates/evolves specialized sub-agents defined in markdown, executes Python via WebAssembly, and learns from past executions via persistent memory.
Key features:
- Agent reuse & evolution (80% match rule)
- Python runtime in browser (Pyodide: numpy, scipy, matplotlib)
- Memory system that improves over time
- Virtual file system (localStorage)
- Completely client-side
Example: Ask for "FFT signal analysis" → system checks memory → finds/evolves SignalProcessorAgent → generates Python → executes in browser → saves results → records experience → next time runs in seconds.
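For concreteness, here is roughly the kind of Python such a request might generate and run in the Pyodide runtime; the synthetic signal, sample rate, and output filename are made up for illustration:

```python
# Illustrative only: the kind of script the system might generate for
# "FFT signal analysis" and execute client-side in Pyodide.
import numpy as np
import matplotlib.pyplot as plt

fs = 1000                                      # sample rate (Hz), hypothetical
t = np.arange(0, 1.0, 1 / fs)                  # 1 second of samples
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)                 # real FFT
freqs = np.fft.rfftfreq(len(signal), 1 / fs)   # matching frequency bins

plt.plot(freqs, np.abs(spectrum))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude")
plt.title("FFT of synthetic test signal")
plt.savefig("fft_analysis.png")                # persisted to the virtual file system
```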
I love Claude Code, but there are a few capabilities I kept wishing it had. So I built an experimental fork/extension to explore what those
might look like.
Three main additions I wanted:
1. Persistent Domain Memory
Claude Code starts fresh each session. I wanted an environment that remembers domain-specific patterns. LLMos adds a three-volume system
(System/Team/User) where successful workflows automatically become reusable skills. Work on quantum chemistry for a week, and the system
learns molecular Hamiltonians, ansatz selection heuristics, convergence criteria—domain fluency that compounds over time.
2. Self-Improving Sub-Agents
Claude Code has great tool use, but I wanted agents that could observe and improve themselves. LLMos agents literally rewrite their own
code based on what works. Example: A circuit optimizer starts basic, but after 50+ sessions, it's learned adaptive gradient descent, smart
initialization, and error mitigation strategies—all from watching successful runs.
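A minimal sketch of what that self-modification loop could look like, assuming a run log and a "Learned Heuristics" section in the agent's markdown (file paths, field names, and the section are hypothetical; the actual LLMos layout may differ):

```python
# Minimal sketch of an agent updating its own markdown definition
# after a successful run. Paths and the run-record format are invented.
from pathlib import Path
import json

AGENT_FILE = Path("agents/CircuitOptimizerAgent.md")   # hypothetical path
RUN_LOG = Path("runs/latest.json")                      # hypothetical path

run = json.loads(RUN_LOG.read_text())
if run.get("success") and run.get("lesson"):
    definition = AGENT_FILE.read_text()
    if "## Learned Heuristics" not in definition:
        definition += "\n\n## Learned Heuristics\n"
    # Append the distilled lesson, e.g. "warm-start parameters from the
    # previous converged ansatz before running gradient descent".
    definition += f"- {run['lesson']} (from run {run['id']})\n"
    AGENT_FILE.write_text(definition)
```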
3. Client-Side Code Execution
Claude Code writes files but doesn't run them directly. I added Pyodide for browser-based Python execution with live preview. Edit code →
auto-run → see matplotlib plots/quantum circuits in <1 second. No deployment, just pure flow state for scientific computing.
Current focus: Quantum computing (VQE, QAOA, quantum chemistry) because it's a perfect test bed: a rapidly evolving field that demands deep
domain expertise, complex workflows, and high-value automation.
The "evolving OS" concept: Instead of a static tool, what if your development environment learned your field, extracted patterns into
reusable skills, and improved its agents based on what actually works in practice?
Technical: Next.js + Pyodide + Qiskit + OpenRouter. All volumes are Git repos (preserving Claude Code's file-first philosophy). Code
execution is 100% client-side.
GitHub: https://github.com/agustinazwiener/evolving-agents-labs/tree/main/llmunix
Obviously this is rough/experimental—missing lots of polish, limited to Python, quantum-focused. But I'm curious:
- Would persistent domain memory be useful in Claude Code itself?
- Are self-modifying agents too weird, or genuinely helpful?
- Is browser-based execution worth the complexity for scientific/research workflows?
Feedback welcome, especially from Claude Code users or anyone working in specialized technical domains.
When Linus posted Linux 0.01 in 1991, he wrote: "I'm doing a (free) operating system (just a hobby, won't be big and
professional)." It wasn't complete. It wasn't polished. But the core ideas were there.
I've been thinking about what an "operating system" for LLMs would look like. Not an agent framework – an actual OS with
memory hierarchies, execution modes, and something I'm calling a "Sentience Layer."
LLM OS v3.4.0 is my attempt. It's incomplete and probably over-ambitious, but the architecture is interesting:
Four-Layer Stack:
- Sentience Layer – Persistent internal state (valence variables: safety, curiosity, energy, confidence) that influences
behavior. The system develops "moods" based on task outcomes.
- Learning Layer – Five execution modes (CRYSTALLIZED → FOLLOWER → MIXED → LEARNER → ORCHESTRATOR) based on semantic trace
matching
- Execution Layer – Programmatic Tool Calling for 90%+ token savings on repeated patterns
- Self-Modification Layer – System writes its own agents (Markdown) and crystallizes patterns into Python
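To make the Learning Layer concrete, here is a toy sketch of mode selection. The thresholds, trace format, and the string-similarity stand-in for semantic matching are all illustrative assumptions, not the actual implementation:

```python
# Toy mode selection: score a new task against stored traces and pick a
# mode by threshold. Thresholds and the difflib stand-in are assumptions.
from difflib import SequenceMatcher

MODES = [
    (0.95, "CRYSTALLIZED"),   # replay a hardened Python pattern
    (0.80, "FOLLOWER"),       # replay a stored trace step by step
    (0.60, "MIXED"),          # partly replay, partly reason
    (0.40, "LEARNER"),        # reason, but record a new trace
]

def similarity(a: str, b: str) -> float:
    # Stand-in for semantic matching (the real layer would use embeddings).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def select_mode(task: str, traces: list[str]) -> str:
    best = max((similarity(task, t) for t in traces), default=0.0)
    for threshold, mode in MODES:
        if best >= threshold:
            return mode
    return "ORCHESTRATOR"      # novel task: full multi-agent reasoning

print(select_mode("plot VQE convergence for H2", ["plot VQE convergence for LiH"]))
```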
What makes it different:
- Agents are Markdown files the LLM can edit (hot-reloadable, no restart)
- Traces store full tool calls for zero-context replay
- Repeated patterns become pure Python (truly $0 cost)
- Internal state persists across sessions and influences mode selection
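And a sketch of the zero-context replay idea under assumed data structures: a trace is just a list of recorded tool calls, and replaying it dispatches them to local functions without sending anything back through the model. The trace schema and tool registry here are hypothetical:

```python
# Hypothetical trace schema: recorded tool calls that can be re-executed
# without the LLM in the loop (zero context, zero tokens).
trace = [
    {"tool": "read_file", "args": {"path": "data/counts.json"}},
    {"tool": "run_python", "args": {"code": "print(sum(range(10)))"}},
]

TOOLS = {
    "read_file": lambda path: open(path).read(),
    "run_python": lambda code: exec(code),
}

def replay(trace):
    # Every call is already resolved, so nothing is spent re-deriving the plan.
    for step in trace:
        TOOLS[step["tool"]](**step["args"])
```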
Working examples:
- Quantum computing IDE backend (Qiskit Studio)
- Educational platform for kids (Q-Kids Studio)
- Robot control with safety hooks (RoboOS)
Is it production-ready? No. Will it work as envisioned? I'm figuring that out. But the ideas feel right, and building it is
genuinely fun.
GitHub: https://github.com/EvolvingAgentsLabs/llm-os
Looking for feedback on the architecture, collaboration on making it actually work, and honest criticism. What's missing?
What's overengineered? What would you want from an LLM OS?
I'm working on LLM OS, an experimental project that explores treating the LLM as a CPU and Python as the kernel. The goal is to provide OS-level services—like memory hierarchy, scheduler hooks, and security controls—to agentic workflows using the Claude Agent SDK.
Right now, this is mostly a collection of architectural ideas and prototypes rather than a polished framework. I’ve included several complex examples in the repo to explore the potential of this approach:
- Qiskit Studio Backend: Re-imagining a microservices architecture as a unified OS process for quantum computing tasks.
- Q-Kids Studio: Exploring how an OS layer can manage safety, adaptive difficulty, and state in an educational app.
- RoboOS: Testing how kernel-level security hooks can enforce physical safety constraints on a robot arm.
These examples play with concepts like execution caching (Learner/Follower modes) and multi-agent orchestration, but the project is very much in the early stages and is not yet functional for production.
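As one illustration of the kernel-hook idea behind RoboOS, here is a minimal, hypothetical pre-execution hook that vetoes a tool call when a requested joint move leaves a safety envelope. The hook signature, limits, and tool name are invented for the example and are not the Claude Agent SDK API:

```python
# Hypothetical kernel-level hook: inspect a tool call before execution
# and veto it if it violates a physical safety constraint.
JOINT_LIMITS_DEG = {"shoulder": (-90, 90), "elbow": (0, 135)}  # made-up envelope
MAX_SPEED_DEG_S = 30

def pre_execution_hook(tool_name: str, args: dict) -> dict:
    if tool_name != "move_joint":                    # invented tool name
        return {"allow": True}
    lo, hi = JOINT_LIMITS_DEG.get(args["joint"], (0, 0))
    if not lo <= args["target_deg"] <= hi:
        return {"allow": False, "reason": f"{args['joint']} target outside {lo}..{hi} deg"}
    if args.get("speed_deg_s", 0) > MAX_SPEED_DEG_S:
        return {"allow": False, "reason": "speed above safety limit"}
    return {"allow": True}

print(pre_execution_hook("move_joint", {"joint": "elbow", "target_deg": 170}))
```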
I’m sharing this early because I believe the "LLM as OS" analogy has a lot of potential. I'm looking for contributors and feedback to help turn these concepts into a functional reality.
Most agent frameworks struggle with long-term, consolidated memory. They either have a limited context window or use simple RAG, but there's no real process for experience to become institutional knowledge.
Inspired by the recent Google Research paper "Nested Learning: The Illusion of Deep Learning Architectures", we've implemented a practical version of its "Continuum Memory System" (CMS) in our open-source agent framework, LLMunix.
The idea is to create a memory hierarchy with different update frequencies, analogous to brain waves, where memories "cool down" and become more stable over time.
Our implementation is entirely file-based and uses Markdown with YAML frontmatter (no databases):
High-Frequency Memory (Gamma):
Raw agent interaction logs and workspace state from every execution. Highly volatile, short retention. (/projects/{ProjectName}/memory/short_term/)
Mid-Frequency Memory (Beta):
Successful, deterministic workflows distilled into execution_trace.md files. These are created by a consolidation agent when a novel task is solved effectively. Much more stable. (/projects/{ProjectName}/memory/long_term/)
Low-Frequency Memory (Alpha):
Core patterns that have been proven reliable across many contexts and projects. Stored in system-wide logs and libraries. (/system/memory_log.md)
Ultra-Low-Frequency Memory (Delta):
Foundational knowledge that forms the system's identity. (/system/SmartLibrary.md)
A new ContinuumMemoryAgent orchestrates this process, automatically analyzing high-frequency memories and deciding what gets promoted to a more stable, lower-frequency tier.
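A minimal sketch of what that promotion pass could look like, assuming each memory file carries YAML frontmatter with fields like success and reuse_count (field names, paths, and the threshold are illustrative; the actual LLMunix schema may differ):

```python
# Minimal sketch of a consolidation pass: promote short-term (gamma)
# memories with proven reuse into the long-term (beta) tier.
from pathlib import Path
import yaml  # PyYAML

SHORT_TERM = Path("projects/Demo/memory/short_term")
LONG_TERM = Path("projects/Demo/memory/long_term")

def frontmatter(md_text: str) -> dict:
    # Expects files shaped like: "---\nsuccess: true\nreuse_count: 4\n---\nbody..."
    _, header, _ = md_text.split("---", 2)
    return yaml.safe_load(header) or {}

LONG_TERM.mkdir(parents=True, exist_ok=True)
for md_file in SHORT_TERM.glob("*.md"):
    meta = frontmatter(md_file.read_text())
    if meta.get("success") and meta.get("reuse_count", 0) >= 3:
        # "Cool down": copy the trace into the more stable tier.
        (LONG_TERM / md_file.name).write_text(md_file.read_text())
```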
This enables:
Continual Learning: The system gets better and more efficient at tasks without retraining, as successful patterns are identified and hardened into reusable traces.
No Catastrophic Forgetting: Proven, stable knowledge in low-frequency tiers isn't overwritten by new, transient experiences.
Full Explainability: The entire learning process is human-readable and version-controllable in Git, since it's all just Markdown files.
The idea was originally sparked by a discussion with Ismael Faro about how to build systems that truly learn from doing.
We'd love to get your feedback on this architectural approach to agent memory and learning.
We made LLMunix - an experimental system where you define AI agents in markdown once, then a local model executes them. No API calls after setup.
The strange part: it also generates mobile apps. Some are tiny, some bundle local LLMs for offline reasoning. They run completely on-device.
Everything is pure markdown specs. The "OS" boots when an LLM runtime reads the files and interprets them.
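A rough sketch of that boot step, assuming a llama.cpp-based runtime (llama-cpp-python), a made-up agent file, and a made-up model path; the actual LLMunix loader will differ:

```python
# Rough sketch: "boot" an agent by handing its markdown spec to a local
# model as the system prompt. Paths and wiring are assumptions.
from pathlib import Path
from llama_cpp import Llama

agent_spec = Path("agents/SummarizerAgent.md").read_text()        # hypothetical agent
llm = Llama(model_path="models/qwen2.5-3b-instruct-q4.gguf", n_ctx=4096)

reply = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": agent_spec},                 # the markdown IS the agent
        {"role": "user", "content": "Summarize today's notes in three bullets."},
    ],
    max_tokens=256,
)
print(reply["choices"][0]["message"]["content"])
```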
Still figuring out where this breaks. Edge models are less accurate. Apps with local AI are 600MB+. Probably lots of edge cases we haven't hit.
But the idea is interesting: what if workflows could learn and improve locally? What if apps reasoned on your device instead of the cloud?
Try it if you're curious. Break it if you can. Genuinely want to know what we're missing.
What would you build with fully offline AI?
If you'd told me you could define an agent once in markdown and then:
• Have a 2GB local model execute it daily with actual reasoning
• Generate production mobile apps with on-device AI
• All for zero marginal cost
...I would've said "maybe in 5 years."
We built it. It's called LLMunix.
What if you could describe any mobile app ("a personal trainer that adapts," "a study assistant that quizzes me") and get a working prototype with on-device AI in minutes, not months?
What if every workflow you do more than once becomes an agent that improves each time?
What if AI ran locally, privately, adapting to you - not in the cloud adapting to everyone?
I wanted to share a project I've been refining, called llmunix-starter. I've always been fascinated by the idea of AI systems that can adapt and build what they need, rather than relying on a fixed set of pre-built tools. This is my attempt at exploring that.
The template is basically an "empty factory." When you give it a complex goal through Claude Code on the web (which is great for this because it can run for hours), it doesn't look for existing agents. Instead, it writes the markdown definitions for a new, custom team of specialists on the fly.
For example, we tested it on a university bioengineering problem and it created a VisionaryAgent, a MathematicianAgent, and a QuantumEngineerAgent from scratch. The cool part: when we gave it a totally different problem (geological surveying), it queried its "memory" of the first project and adapted the successful patterns, reusing about 90% of the core logic.
I think it's particularly useful for those weird, messy problems where a generic agent just wouldn't have the context—like refactoring a legacy codebase or exploring a niche scientific field.
Try it: https://github.com/EvolvingAgentsLabs/llmos
Started as a weekend project exploring self-improving AI systems. Core features working, some rough edges.
Feedback welcome, especially on the agent evolution approach and memory structure.