For compression and long-running agents, may I suggest https://memtree.dev. We offer a simple API that compresses messages asynchronously, giving you instant responses and a small context, which leads to much higher-quality generations. We're about to release a dashboard that shows what each compressed request looked like: the token distribution across system, memory, and tool messages, memory retrievals, and so on. Is this the type of thing you're looking for?
Something like this needs to be open-sourced. You're going to have a hell of a time trying to get enough trust from people to run all of their prompts through your servers.
For code it's actually quite good so far IME. Not quite as good as Gemini 2.5 Pro but much faster. I've integrated it into polychat.co if you want to try it out and compare with other models. I usually ask 2 to 5 models the same question there to reduce the model overload anxiety.
I discuss how the automation wave is already starting, with white-collar job openings at a 12-year low in the U.S. I also talk about how we cannot simply count on taxing AI to support displaced workers, as countries that don't tax will outcompete those that do. We therefore need international cooperation, based on MAIM, from the original Superintelligence Strategy paper.
I also discuss how bioweapons cannot be avoided by restricting open-weight models, as originally suggested in Dan's paper. Rather, we need to invest heavily in bioweapon defense, in particular using AI for wastewater monitoring and for accelerating metagenomics (disentangling mixed DNA).
One tradeoff of bringing your own API keys is that as you add more model providers, you end up with more billing accounts to manage. Chorus also has no incentive to use your tokens efficiently. We save 67% on Anthropic token costs using prompt caching. We also use cheaper "task models" for conversation titles, tagging, and parts of the RAG pipeline, all of which drastically cuts token costs.
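For the curious, here's a minimal sketch of what a cache-enabled request to Anthropic's Messages API looks like. The `cache_control` marker on the system block follows Anthropic's prompt caching docs; the system prompt text and model name are placeholders, and the payload is built as a plain dict rather than sent over the wire.

```python
# Illustrative only: a large, rarely-changing system prompt is the part
# worth caching across requests.
SYSTEM_PROMPT = "You are a helpful chat assistant. " + "(long, stable instructions...)"

def build_cached_request(user_message: str) -> dict:
    """Build a Messages API payload whose system prompt is marked cacheable."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks a cache breakpoint: repeat requests with an identical
                # prefix up to this block hit the cache and are billed at a
                # reduced per-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }

request = build_cached_request("Suggest a title for this conversation.")
```

The same payload shape works for cheap "task model" calls: only the stable prefix is cached, so per-conversation work (titles, tags) stays in the uncached user turn.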
So it looks like some folks are getting errors with the non-streaming models, i.e. the o1 models. I think their long-running connections with zero packets may cause some networks to drop the requests. I'll look into a heartbeat/keepalive for those.
I've added a "Thinking..." indicator that sends server-sent events to keep the connection open. Would love to hear whether the o1 models now work for anyone they were previously broken for.
It's not unlimited free use, unfortunately. After some free use, we offer monthly usage-tier plans. But there are no hard rate limits like other providers have, since you can always move up to the next usage tier.
Cool! I should say most of PolyChat is open source at https://github.com/open-webui - just the combo models and payment are closed source right now. Open to arguments on making PolyChat fully open source as well!
Sounds like you made most of these changes upstream? What about the background chats and the chat tree overview, are either of them in Open WebUI, or are they also custom to PolyChat? I run OWUI locally and am interested in those features for selfish reasons. If for some reason your multi-model idea doesn’t pan out, I’d love to see it merged upstream, too. Thanks for your contributions!
(Another annoying thing about OWUI is getting logged out every time the image upgrades… is that something else you’ve looked at?)
Background chats are new in v5 of Open WebUI, so you can use them too. Overview has been there for a while as well, but it's kind of hidden in the hamburger menu.
The upgrade/logout issue you're facing is likely due to not setting WEBUI_SECRET_KEY outside of your Docker container. Without it, start.sh generates a fresh key on each new container, which can't decrypt the old cookies, so all previous sessions become unreadable.
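One way to pin the key across upgrades is to generate it once on the host and pass it in on every run. This is a sketch, not official Open WebUI docs; the key file path, port mapping, and volume name are illustrative.

```shell
# Generate a key once and keep it outside the container (illustrative path).
test -f ~/.webui_secret_key || openssl rand -hex 32 > ~/.webui_secret_key

# Pass the same key on every run so old session cookies stay decryptable.
docker run -d \
  -e WEBUI_SECRET_KEY="$(cat ~/.webui_secret_key)" \
  -v open-webui:/app/backend/data \
  -p 3000:8080 \
  ghcr.io/open-webui/open-webui:main
```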
Wow, this is awesome! Thanks for building. I didn't realize there was a protocol for streaming while rendering, though I noticed sumo.ai doing something similar for audio. Gemini with grounding is new to me also, very nice!