Some context for those who aren't in the loop: ONNX Runtime (https://onnxruntime.ai/) is a standardization format for AI models. Nowadays, it's extremely easy to export models in the ONNX format, especially language models with tools like Hugging Face transformers which have special workflows for it.
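For anyone who hasn't tried it, the export really is just a few lines. A rough sketch with plain torch.onnx.export (the checkpoint name, shapes, and opset below are only illustrative; Hugging Face's optimum-cli gives a more automated path for the same thing):

    # Rough sketch: export a Hugging Face model to ONNX with plain PyTorch.
    # Checkpoint, input names, and opset are illustrative, not prescriptive.
    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "distilbert-base-uncased"  # placeholder checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id).eval()

    inputs = tokenizer("hello world", return_tensors="pt")

    torch.onnx.export(
        model,
        (inputs["input_ids"], inputs["attention_mask"]),
        "model.onnx",
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
        },
        opset_version=17,
    )

(Something like `optimum-cli export onnx --model distilbert-base-uncased out_dir/` does the equivalent in one command.)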
ONNX support in the browser was lacking and limited to CPU, but with a WebGPU backend it may now finally be feasible to run models in the browser on a GPU, which opens up interesting opportunities. That said, from this PR it looks like only a few operations are implemented, so no browser-based GPT yet.
ONNX is the ML interchange format - it's supposed to be a framework-independent way to share ML models and weights. ONNX Runtime is a runtime for running those ONNX-format models. And it's quite performant - probably the most performant ML runtime at this point.
>Some context for those who aren't in the loop: ONNX Runtime (https://onnxruntime.ai/) is a standardization format for AI models.
It's just an IR, one of many - every framework has its own.
>Nowadays, it's extremely easy to export models in the ONNX format, especially language models with tools like Hugging Face transformers which have special workflows for it.
Meh, it's poorly supported by both PyTorch and TF. Why support Microsoft's IR when you have your own?
>probably the most performant ML runtime at this point.
Not even by a long shot - first-party compilers are generally faster because of smoother interop, but even amongst third-party options you have TRT and TVM. TBH I have no idea what anyone uses ONNX for these days (legacy?).
I have to ask: where do you run your models, and how do you keep them... deterministic? We have run models in a variety of environments from cloud to edge, and the only way we've found to do that is with ONNX.
TensorFlow has breaking changes in model behaviour on patch versions!
PyTorch and TensorFlow require a huge Python environment to run, and how on earth do we get that onto a client system without virtualisation?
Worst of all, they both have very significant changes in prediction depending on the CPU (or god forbid GPU) on the system.
Once we export to ONNX we find we get reliable output and performance across runtimes, which seems mandatory for running any kind of product.
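To make the "same output everywhere" point concrete, the consuming side is just an ONNX Runtime session; a minimal sketch (file name, input name, and shape are placeholders for whatever you exported):

    # Minimal sketch of running an exported model with ONNX Runtime.
    # "model.onnx" and the input shape are placeholders; this same code runs
    # unchanged on cloud and edge machines, which keeps results reproducible.
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

    input_name = session.get_inputs()[0].name
    batch = np.random.rand(1, 16).astype(np.float32)  # illustrative shape

    outputs = session.run(None, {input_name: batch})
    print(outputs[0].shape)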
Your experience matches mine exactly. ONNX seems to be relatively unknown for some reason. I feel like I'm on crazy pills -- how is everyone else delivering ML models? Are they really shipping multi-GB PyTorch environments? Are they using some sort of Rube Goldberg machine to run things off a janky Jupyter notebook?
In my experience, yes, all of the above. Azure ML Studio is a Jupyter Notebook Rube Goldberg machine builder. It's not cheap and it is not pretty, but it gets stuff done quickly. You are lucky if some developer took the time to properly nearly-productionize the Jupyter notebook as a Flask app or something (with yeah, a multi-gigabyte container). (Even luckier, in an Azure shop, if they skipped Flask and gave you a Functions app. Not quite fewer dependencies or smaller size, but Functions gives you free Application Insights telemetry and that is an operational beauty.)
Every time I've convinced data scientists to just hand off an ONNX file to me for production, everyone comes out pleased: building an ONNX file is easier than productionizing a notebook any other way; the speed and performance of ONNX Runtime are great, and it integrates more easily into C#-based pipelines (potentially avoiding expensive data transfers to/from Python VMs).
The biggest, ugliest hurdle I've seen is how many pre/post-processing steps data scientists tend to accidentally convince themselves "can only be done in Python", either because they don't have time to research alternatives, believe Python to be inherently magical, or found some massive, obscure Huggingface-like corpus with gigabytes of data that would produce the most bloated ONNX files, where "obscured in a Python or VM install step" hides how big the corpus really is.
> Meh it's poorly supported by both PyTorch and TF.
This does not match my experience. Most new model architectures port fine to ONNX from pytorch, only occasionally having to fill in rare functions.
> Not even by a long-shot - first party compilers are generally faster because of smoother interop but even amongst third-party you have TRT and TVM.
Again, in my experience, there is little by way of documentation for any of these platforms, nor as broad support for as many OS and hardware combinations.
> I have no idea what anyone uses ONNX for these days
Huggingface has strong support for ONNX and leverages it for improved performance in places.
It's not poorly supported by PyTorch, but you do have to have it in mind while writing your code. I've been having a hell of a time getting a decent ONNX model out of some PyTorch code that was written with lots of dynamic logic inside the modules; it can be quite some work converting this to the "static graph" design needed to export to ONNX properly.
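To illustrate the kind of rewrite I mean (toy example, not from the actual code): data-dependent Python branches get baked into the trace at export time, so they have to become tensor ops instead.

    # Toy illustration of the dynamic-to-static rewrite, not from any real codebase.
    import torch

    class DynamicBlock(torch.nn.Module):
        def forward(self, x):
            # Python branch on tensor data: the exporter traces only one side,
            # so the exported graph silently hard-codes whichever branch ran.
            if x.mean() > 0:
                return x * 2
            return x - 1

    class StaticBlock(torch.nn.Module):
        def forward(self, x):
            # Same logic as tensor ops: exports to a proper ONNX Where node
            # and stays data-dependent at inference time.
            return torch.where(x.mean() > 0, x * 2, x - 1)

    torch.onnx.export(StaticBlock(), (torch.randn(3, 4),), "static_block.onnx")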
What I'd love to see is a lean, inference-only version of PyTorch that just works with existing code and is smaller, without the overhead of autograd and GPU support.
I was going to answer the same. I find the approach of machine learning compilers that directly compile models to host and device code better than having to bring along a huge runtime. There are exciting projects in this area like TVM Unity [1], IREE [2], or torch.export [3].
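As a rough sketch of the torch.export direction (assuming a recent PyTorch 2.x where torch.export is available): you capture a whole static graph up front that downstream compilers can consume, instead of shipping the Python interpreter along with the model.

    # Rough sketch of torch.export (PyTorch 2.x): capture a full graph of a toy
    # module that ahead-of-time compilers can lower to host/device code.
    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(16, 4)

        def forward(self, x):
            return torch.relu(self.linear(x))

    example_inputs = (torch.randn(2, 16),)
    exported = torch.export.export(TinyModel().eval(), example_inputs)
    print(exported)  # the captured ExportedProgram, ready for AOT compilation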
Nope, AITemplate is just a tracing mechanism plus CUTLASS plus autotuning. You still need a runtime. AIT also works on very few models (it's a replacement for something called "static runtime").
It's surprising to me how many people just shoot from the hip/guess on this stuff. IYKYK, and if you don't, maybe don't guess?
The ONNX format was developed by Microsoft and Facebook and is really well supported by PyTorch, since it was originally designed for exchange between PyTorch and Caffe2.
ONNX Runtime is used by lots of software for inference only, to run on Windows without having to manage vendor-specific providers (for instance, the WinML runtime and ONNX Runtime are the same code).
TVM is also used, but currently I have the impression it's more widely used on embedded devices.
ONNX Runtime also enables provider-specific inference code like TensorRT, CUDA, CoreML, etc., without you having to change your code.
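Concretely, switching hardware is just the provider priority list on the session; a rough sketch (which providers exist depends on the onnxruntime build you install):

    # Rough sketch: same model, same calling code, hardware chosen purely via
    # the execution-provider priority list. Provider availability depends on
    # the onnxruntime package/build you installed.
    import onnxruntime as ort

    preferred = [
        "TensorrtExecutionProvider",  # TensorRT if present
        "CUDAExecutionProvider",      # else plain CUDA
        "CoreMLExecutionProvider",    # or CoreML on Apple hardware
        "CPUExecutionProvider",       # always available as the fallback
    ]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)
    print(session.get_providers())   # which providers actually got enabled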
Also, WONNX can be used both from native apps (through wgpu, which uses Vulkan, DX, or Metal) and on the web (using WebGPU, with WONNX compiled to WebAssembly).
That's an awesome project I didn't know about, and I'll almost certainly use it the next time I need an onnx runtime (just because I want to call it from rust). But I would guess that "more complete and mature" is only true in the narrow scope of "running common networks on webgpu"?
Looks like something a self-taught dev would do (ask how I know), or someone who usually works fully solo. Maybe it's a ML community thing?
I might do something like this if I was in a rush and just needed a save point, especially if I was working on a branch I knew I'd squash and merge with a more meaningful commit message. I certainly wouldn't be proud of it, but I might do it.
The main branch history looks reasonably clean. Looks like they generally squash-on-merge (and probably delete the branch on merge), in that case the detailed commit messages are lost anyway.
Completely valid commit strategy if you ask me (even though I bet some might beg to differ).
If you squash your PRs, there is no point in spending time making individual commits nice and isolated, or polishing their messages, as they'll be aggregated into one anyway.
When you do large PRs you want to checkpoint your work, and the commits get less meaningful descriptions; each one is just the next in a series of steps you couldn't plan ahead, popping up as you go.
The fact that you can do that indicates they could simply be separate PRs merged to the main branch independently instead. That's usually desirable, as it avoids longer-lived PR branches that tend to create conflicts and are harder to review.
It is a good point. In many cases, like solo or small teams, the conflicts are not a big problem, but I see what you are saying. I have also been on teams that used feature flags more heavily, and that is a possible solution too.
As an aside, I love ONNX, and it's the main reason I'm sticking with PyTorch. I was able to develop and train an RL model in Python and then convert it to ONNX and call it from C# production code.
It still took a lot of effort but the final version is very performant and reliable.