Some context for those who aren't in the loop: ONNX Runtime (https://onnxruntime.ai/) is a standardization format for AI models. Nowadays, it's extremely easy to export models in the ONNX format, especially language models with tools like Hugging Face transformers which have special workflows for it.
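For anyone who hasn't tried it, the export really is just a few lines. A rough sketch with plain torch.onnx.export (the checkpoint name, shapes, and opset below are only illustrative; Hugging Face's optimum-cli gives a more automated path for the same thing):

    # Rough sketch: export a Hugging Face model to ONNX with plain PyTorch.
    # Checkpoint, input names, and opset are illustrative, not prescriptive.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "distilbert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    inputs = tokenizer("hello world", return_tensors="pt")

    torch.onnx.export(
        model,
        (inputs["input_ids"], inputs["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )

(Something like `optimum-cli export onnx --model distilbert-base-uncased out_dir/` does the equivalent in one command.)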
ONNX support in the browser was lacking and limited to CPU, but with a WebGPU backend it may now finally be feasible to run models in the browser on a GPU, which opens up interesting opportunities. That said, from this PR it looks like only a few operations are implemented, so no browser-based GPT yet.
ONNX is the ML interchange format - it's supposed to be a framework-independent way to share ML models and weights. ONNX Runtime is a runtime for running those ONNX-format models. And it's quite performant - probably the most performant ML runtime at this point.
>Some context for those who aren't in the loop: ONNX Runtime (https://onnxruntime.ai/) is a standardization format for AI models.
It's just an IR, one of many - every framework has its own.
>Nowadays, it's extremely easy to export models in the ONNX format, especially language models with tools like Hugging Face transformers which have special workflows for it.
Meh, it's poorly supported by both PyTorch and TF. Why support Microsoft's IR when you have your own?
>probably the most performant ML runtime at this point.
Not even by a long shot - first-party compilers are generally faster because of smoother interop, but even amongst third-party options you have TRT and TVM. TBH I have no idea what anyone uses ONNX for these days (legacy?).
I have to ask: where do you run your models, and how do you keep them... deterministic? We have run models in a variety of environments from cloud to edge, and the only way we've found to do that is with ONNX.
TensorFlow has breaking changes in model behaviour on patch versions!
PyTorch and TensorFlow require a huge Python environment to run, and how on earth do we get that onto a client system without virtualisation?
Worst of all, they both have very significant changes in prediction depending on the CPU (or god forbid GPU) on the system.
Once we export to ONNX we find we get reliable output and performance across runtimes, which seems mandatory for running any kind of product.
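To make the "same output everywhere" point concrete, the consuming side is just an ONNX Runtime session; a minimal sketch (file name, input name, and shape are placeholders for whatever you exported):

    # Minimal sketch of running an exported model with ONNX Runtime.
    # "model.onnx" and the input shape are placeholders; this same code runs
    # unchanged on cloud and edge machines, which keeps results reproducible.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

    input_name = session.get_inputs()[0].name
    batch = np.random.rand(1, 16).astype(np.float32)  # illustrative shape

    outputs = session.run(None, {input_name: batch})
    print(outputs[0].shape)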
Your experience matches mine exactly. ONNX seems to be relatively unknown for some reason. I feel like I'm on crazy pills -- how is everyone else delivering ML models? Are they really shipping multi-GB PyTorch environments? Are they using some sort of Rube Goldberg machine to run things off a janky Jupyter notebook?
In my experience, yes, all of the above. Azure ML Studio is a Jupyter Notebook Rube Goldberg machine builder. It's not cheap and it is not pretty, but it gets stuff done quickly. You are lucky if some developer took the time to properly nearly-productionize the Jupyter notebook as a Flask app or something (with yeah, a multi-gigabyte container). (Even luckier, in an Azure shop, if they skipped Flask and gave you a Functions app. Not quite fewer dependencies or smaller size, but Functions gives you free Application Insights telemetry and that is an operational beauty.)
Every time I've convinced data scientists to just hand off an ONNX file to me for production, everyone comes out pleased: building an ONNX file is easier than productionizing a notebook any other way; the speed and performance of ONNX Runtime are great, and it integrates more easily into C#-based pipelines (potentially avoiding expensive data transfers to/from Python VMs).
The biggest, ugliest hurdle I've seen is how many pre/post-processing steps data scientists tend to accidentally convince themselves "can only be done in Python", either because they don't have time to research alternatives, believe Python to be inherently magical, or found some massive, obscure Huggingface-like corpus with gigabytes of data that would produce the most bloated ONNX files, where "obscured in a Python or VM install step" hides how big the corpus really is.
> Meh it's poorly supported by both PyTorch and TF.
This does not match my experience. Most new model architectures port fine to ONNX from pytorch, only occasionally having to fill in rare functions.
> Not even by a long-shot - first party compilers are generally faster because of smoother interop but even amongst third-party you have TRT and TVM.
Again, in my experience, there is little by way of documentation for any of these platforms, nor as broad support for as many OS and hardware combinations.
> I have no idea what anyone uses ONNX for these days
Huggingface has strong support for ONNX and leverages it for improved performance in places.
It's not poorly supported by PyTorch, but you do have to have it in mind while writing your code. I've been having a hell of a time getting a decent ONNX model out of some PyTorch code that was written with lots of dynamic logic inside the modules; it can be quite some work converting this to the "static graph" design needed to export to ONNX properly.
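To illustrate the kind of rewrite I mean (toy example, not from the actual code): data-dependent Python branches get baked into the trace at export time, so they have to become tensor ops instead.

    # Toy illustration of the dynamic-to-static rewrite, not from any real codebase.
    import torch

    class DynamicBlock(torch.nn.Module):
        def forward(self, x):
            # Python branch on tensor data: the exporter traces only one side,
            # so the exported graph silently hard-codes whichever branch ran.
            if x.mean() > 0:
                return x * 2
            return x - 1

    class StaticBlock(torch.nn.Module):
        def forward(self, x):
            # Same logic as tensor ops: exports to a proper ONNX Where node
            # and stays data-dependent at inference time.
            return torch.where(x.mean() > 0, x * 2, x - 1)

    torch.onnx.export(StaticBlock(), (torch.randn(3, 4),), "static_block.onnx")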
What I'd love to see is a lean, inference-only version of PyTorch that just works with existing code and is smaller, without the overhead of autograd and GPU support.
I was going to answer the same. I find the approach of machine learning compilers that directly compile models to host and device code better than having to bring along a huge runtime. There are exciting projects in this area like TVM Unity [1], IREE [2], or torch.export [3].
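As a rough sketch of the torch.export direction (assuming a recent PyTorch 2.x where torch.export is available): you capture a whole static graph up front that downstream compilers can consume, instead of shipping the Python interpreter along with the model.

    # Rough sketch of torch.export (PyTorch 2.x): capture a full graph of a toy
    # module that ahead-of-time compilers can lower to host/device code.
    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    example_inputs = (torch.randn(2, 16),)
    exported = torch.export.export(TinyModel().eval(), example_inputs)
    print(exported)  # the captured ExportedProgram, ready for AOT compilation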
Nope, AITemplate is just a tracing mechanism plus CUTLASS plus autotuning. You still need a runtime. AIT also works on very few models (it's a replacement for something called "static runtime").
It's surprising to me how many people just shoot from the hip/guess on this stuff. IYKYK, and if you don't, maybe don't guess?
The ONNX format was developed by Microsoft and Facebook and is really well supported by PyTorch, since it was originally designed for exchange between PyTorch and Caffe2.
ONNX Runtime is used by lots of software for inference only, to run on Windows without having to manage vendor-specific providers (for instance, the WinML runtime and ONNX Runtime are the same code).
TVM is also used, but currently I have the impression it's more widely used on embedded devices.
ONNX Runtime also enables provider-specific inference code like TensorRT, CUDA, CoreML, etc., without you having to change your code.
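Concretely, switching hardware is just the provider priority list on the session; a rough sketch (which providers exist depends on the onnxruntime build you install):

    # Rough sketch: same model, same calling code, hardware chosen purely via
    # the execution-provider priority list. Provider availability depends on
    # the onnxruntime package/build you installed.
    import onnxruntime as ort

    preferred = [
        "TensorrtExecutionProvider",  # TensorRT if present
        "CUDAExecutionProvider",      # else plain CUDA
        "CoreMLExecutionProvider",    # or CoreML on Apple hardware
        "CPUExecutionProvider",       # always available as the fallback
    ]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)
    print(session.get_providers())   # which providers actually got enabled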
Also, WONNX can be used both from native apps (through wgpu, which uses Vulkan, DX, or Metal) and on the web (using WebGPU, with WONNX compiled to WebAssembly).
That's an awesome project I didn't know about, and I'll almost certainly use it the next time I need an onnx runtime (just because I want to call it from rust). But I would guess that "more complete and mature" is only true in the narrow scope of "running common networks on webgpu"?
Looks like something a self-taught dev would do (ask how I know), or someone who usually works fully solo. Maybe it's a ML community thing?
I might do something like this if I was in a rush and just needed a save point, especially if I was working on a branch I knew I'd squash and merge with a more meaningful commit message. I certainly wouldn't be proud of it, but I might do it.
The main branch history looks reasonably clean. Looks like they generally squash-on-merge (and probably delete the branch on merge), in that case the detailed commit messages are lost anyway.
Completely valid commit strategy if you ask me (even though I bet some might beg to differ).
If you squash your PRs, there is no point in spending time making individual commits nice and isolated, or polishing their messages, as they'll be aggregated into one anyway.
When you do large PRs you want to checkpoint your work, and the commits get less meaningful descriptions; each one is just the next in a series of steps you couldn't plan ahead, popping up as you go.
The fact that you can do that indicates they could simply be separate PRs merged to the main branch independently instead. That's usually desirable, as it avoids longer-lived PR branches that tend to create conflicts and are harder to review.
It is a good point. In many cases, like solo or small teams, the conflicts are not a big problem, but I see what you are saying. I have also been on teams that used feature flags more heavily, and that is a possible solution too.
As an aside, I love ONNX, and it's the main reason I'm sticking with PyTorch. I was able to develop and train an RL model in Python and then convert it to ONNX and call it from C# production code.
It still took a lot of effort but the final version is very performant and reliable.