I'm biased (work at Cognition) but I think it's worth giving the Windsurf JetBrains plugin a try. We're working harder on polish these days, so happy to hear any feedback.
We're working on AI tools for developers (autocomplete, chat, and more unannounced things). We train our own LLMs from scratch and have over 1M downloads across our surfaces. We have many paying enterprise customers and have raised a total of $93M from Kleiner Perkins, Greenoaks, and Founders Fund.
We're hiring for many roles, but in particular are looking for software generalists and Deployed Engineers, a role heavier on customer interaction than on code (https://jobs.ashbyhq.com/codeium/fd2ca49f-ae99-487c-8a52-75d...). No ML or systems experience required. We also have fall software engineering internships available.
You can see all the open roles and apply here (we pretty much look at every single application): https://codeium.com/careers
Hey swyx :) Great question! We've got a blog post coming soon covering these and other technical details. We've employed a lot of tricks here, debouncing included.
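For anyone unfamiliar: debouncing for autocomplete means waiting until the user pauses typing before firing a request, so a burst of keystrokes costs one inference call instead of many. A minimal sketch in Python (all names hypothetical, not our actual implementation):

```python
import threading
import time

class Debouncer:
    """Delay an action until calls stop arriving for `wait` seconds.

    Each new call cancels the pending timer, so only the final
    keystroke in a burst triggers the (hypothetical) completion request.
    """
    def __init__(self, wait, fn):
        self.wait = wait
        self.fn = fn
        self._timer = None
        self._lock = threading.Lock()

    def call(self, *args):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.wait, self.fn, args)
            self._timer.start()

# Simulate a burst of keystrokes; only the last prefix fires a request.
results = []
d = Debouncer(0.05, lambda text: results.append(text))
for prefix in ["d", "de", "def"]:
    d.call(prefix)
time.sleep(0.2)
print(results)  # ['def']
```

The same idea applies whatever the editor surface is; the interesting tuning question is how long to wait before the request feels laggy.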
Let's just take the topic of measuring GPU usage. This alone is quite tricky -- tools like nvidia-smi will show full GPU utilization even if not all SMs are running. The workload may also change behavior over time, for instance if inputs to the transformer get longer. And it gets even more complicated to measure once you consider optimizations like dynamic batching. If you peek into some ML Ops communities you can get a flavor of these nuances, but I'm not sure there are good exhaustive guides around right now.
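To make the nvidia-smi point concrete: its GPU utilization metric counts an interval as "busy" whenever any kernel is resident, regardless of how many SMs that kernel occupies. A toy simulation (pure Python, illustrative numbers only) of how far that can diverge from real SM usage:

```python
# Toy model: a GPU with 108 SMs runs a stream of kernels over a 100 ms
# window. "GPU util" (roughly what nvidia-smi reports) counts any time
# where a kernel is running; "SM usage" weights that time by how many
# SMs each kernel actually occupies. Numbers are illustrative.
NUM_SMS = 108
window_ms = 100

# (duration_ms, sms_used) per kernel: 90 ms busy, 10 ms fully idle
kernels = [(40, 8), (30, 16), (20, 108)]

busy_ms = sum(d for d, _ in kernels)
gpu_util = busy_ms / window_ms  # time-based "busy" fraction

sm_usage = sum(d * s for d, s in kernels) / (window_ms * NUM_SMS)

print(f"reported GPU util: {gpu_util:.0%}")  # 90%
print(f"actual SM usage:   {sm_usage:.0%}")  # 27%
```

So a fleet that looks "90% utilized" on the dashboard can be doing a fraction of the work the hardware could deliver, which is exactly where the optimization headroom hides.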
I empathize a bit with the cloud providers as they have to upgrade their data centers every few years with new GPU instances and it's hard for them to anticipate demand.
But if you can easily use every trick in the book (CPU version of the model, autoscaling to zero, model compilation, keeping inference in your own VPC, using spot instances, etc.) then it's usually still worth it.
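Of those tricks, autoscaling to zero is the one with the most mechanical flavor: stop paying for replicas after an idle window, cold-start one when traffic returns. A toy sketch of the decision logic (purely illustrative, not any particular autoscaler's policy):

```python
class ScaleToZero:
    """Toy autoscaler: keep 0 replicas when idle, spin one up on demand.

    `idle_timeout` is how many seconds without traffic before scaling
    down. Purely illustrative; a real system also has to weigh the
    cold-start latency of loading model weights against the savings.
    """
    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.replicas = 0
        self.last_request = None

    def handle_request(self, now):
        if self.replicas == 0:
            self.replicas = 1  # cold start a replica
        self.last_request = now

    def tick(self, now):
        # Periodic check: scale to zero after the idle window expires.
        if (self.replicas > 0 and self.last_request is not None
                and now - self.last_request > self.idle_timeout):
            self.replicas = 0  # stop paying for the GPU

s = ScaleToZero(idle_timeout=60)
s.handle_request(now=0)
s.tick(now=30)    # still warm: replicas stays at 1
s.tick(now=120)   # past the idle window: replicas drops to 0
print(s.replicas)  # 0
```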
SWE at Exafunction here! We're not ready to be self-serve, so the contact process is the most straightforward for us to work with companies right now. But we respond fast :)
As to the tech, we have APIs closely resembling common deep learning frameworks, so once you add our Python/C++ client locally, you can change a small amount of code to start remotely using GPUs. We also have the ability to handle arbitrary stateful CUDA code for more complex use cases. On the server side, you can deploy our work scheduler inside your own VPC, so we take over orchestration for you as well.
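The general shape of that client-side pattern is a proxy object that looks like a local model but forwards calls to a remote GPU worker. Here's a minimal sketch with the transport stubbed in-process; all names are hypothetical and this is not Exafunction's actual API, just the pattern:

```python
# Sketch of the remote-execution pattern: a thin client proxy forwards
# model calls to a worker. The "transport" here is an in-process stub;
# a real system would serialize tensors over the network and the
# scheduler would place the call on an actual GPU.

class RemoteWorkerStub:
    """Stands in for a GPU server; owns the 'model' state."""
    def __init__(self):
        self.weight = 2.0  # pretend model parameter

    def run(self, method, payload):
        if method == "forward":
            return [x * self.weight for x in payload]
        raise ValueError(f"unknown method: {method}")

class RemoteModel:
    """Client-side proxy: same call shape as a local model."""
    def __init__(self, transport):
        self.transport = transport

    def forward(self, inputs):
        # Caller code changes minimally: same forward() signature,
        # but execution happens wherever the transport points.
        return self.transport.run("forward", inputs)

model = RemoteModel(RemoteWorkerStub())
out = model.forward([1.0, 2.0, 3.0])
print(out)  # [2.0, 4.0, 6.0]
```

The appeal of the pattern is that swapping the stub for a network transport doesn't change the calling code, which is what lets existing framework-style code move to remote GPUs with small diffs.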
Our customers are currently confidential, but safe to say we've seen a 5-10x decrease in cloud costs (or equivalently, the ability to fit 5-10x larger workloads given a GPU quota). It really depends on the utilization of your current workload.
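The arithmetic behind that utilization point is simple: if a fleet mostly idles, packing the same useful GPU-hours onto fewer, busier instances cuts cost roughly in proportion. A back-of-envelope sketch with made-up numbers (not customer data):

```python
import math

# Illustrative numbers only, not real customer data.
num_gpus = 40
avg_utilization = 0.10     # fleet mostly idle between requests
target_utilization = 0.70  # achievable after consolidation/batching

# Useful work currently delivered per hour, in GPU-hours:
useful_gpu_hours = num_gpus * avg_utilization          # 4.0

# GPUs needed to deliver the same work at the higher utilization:
consolidated = math.ceil(useful_gpu_hours / target_utilization)

savings = num_gpus / consolidated
print(f"{num_gpus} GPUs -> {consolidated} GPUs, ~{savings:.1f}x cheaper")
```

With these numbers you land at 6 GPUs, about a 6.7x cost reduction, which is why the realized multiple depends so heavily on how idle the starting workload is.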
Makes sense. I think it's less about speed of response and more that for many SWEs, we don't necessarily have the authority to reach out and create relationships between our companies and another vendor. I certainly am not going to bother trying to get that process started unless I can do a little legwork before involving other parties at my company.
That's a great point. We've been mostly outbound so far, but this will be a bigger issue for us moving forward. We're thinking about how to lower the barrier to entry for this -- for instance, we can try to publicly release some docs and let anyone try out the system in a local container.