Inference is all about reducing the friction associated with deployment: set up a Docker container (which works on a range of architectures, with TensorRT acceleration), load your model (or use SAM or CLIP, which come out of the box), and start sending HTTP requests to get predictions.
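To make that concrete, here's a minimal sketch of the request side. The image name, port, and route are my assumptions about a typical local setup (a CPU server container listening on port 9001 with a POST route per model id), so check the docs for the exact values:

```python
# Minimal sketch: send one image to a locally running Inference container and
# print the predictions. Assumes the server was started with something like
#   docker run -p 9001:9001 roboflow/roboflow-inference-server-cpu
# and exposes a POST route per model id -- verify the image name, port, and
# route against the docs for your deployment.
import base64
import requests

API_KEY = "YOUR_ROBOFLOW_API_KEY"   # placeholder
MODEL_ID = "your-project/1"         # placeholder: project slug + version number

with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    f"http://localhost:9001/{MODEL_ID}",
    params={"api_key": API_KEY},
    data=image_b64,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)
resp.raise_for_status()
print(resp.json())  # typically a dict with a "predictions" list
```

The point is that the client side stays this small: no model loading, no framework code, just an HTTP call against whatever hardware the container happens to be running on.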
I'd love to know more about how we can improve the docs (I'm a big documentarian; feedback on docs is always sincerely appreciated).
We'll work on adding this to the repo, but I'll try to summarize here.
Why should I use this? If you're looking to deploy computer vision models to production (on edge devices or self-hosting in the cloud), Roboflow Inference is the easiest way to get up and running quickly & to iteratively improve your model's performance over time.
It's also in very active use by many real-world customers in production. That means we've ironed out (and are continuing to find and fix) many of the bugs and edge-cases that you'd invariably encounter building something on your own or using something less mature.
When should you _not_ use this? For just about everything else:
* If you're prototyping & hacking on core model primitives, there are likely too many layers of abstraction here & you should wait to productionize the model until you're happy with the underlying architecture.
* If you're not doing computer vision, you're probably better off with something focused on the specific needs of LLMs or other types of models.
* If you need the absolute highest speed possible you may want to use lower-level tools like DeepStream and Triton (for now). Out of the box we sacrifice a bit of speed to get a better interface & end-user experience. We do have a TensorRT provider that sacrifices some of that convenience for speed & are working on additional optimizations (including potentially integrating with DeepStream/Triton behind the scenes).
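To give a feel for what the TensorRT provider path involves, here's a hedged sketch of checking which ONNX Runtime execution providers are available in your environment. The provider names are standard ONNX Runtime; how Inference actually selects among them is configuration I'm glossing over here:

```python
# Check which ONNX Runtime execution providers are available. In a
# TensorRT-accelerated GPU container you'd expect "TensorrtExecutionProvider"
# to appear ahead of "CUDAExecutionProvider"; on CPU-only machines you'll
# only see "CPUExecutionProvider".
import onnxruntime as ort

available = ort.get_available_providers()
print(available)

# Prefer TensorRT when present, then CUDA, then CPU. This ordering is just an
# illustration of the speed/convenience trade-off above, not necessarily how
# Inference wires it internally.
preferred = [
    p for p in (
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    )
    if p in available
]
print("Would run with:", preferred)
```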