I'm currently chipping away at DSC [1], a tensor library I wrote from scratch to play with large language models. Last week I rewrote flash attention in CUDA and got good performance out of it [2].
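The trick that makes flash attention work in a single pass is online softmax: keep a running max and a running sum, and rescale the accumulator whenever the max changes. Here's a minimal single-query-row sketch in plain C++ (the real CUDA kernel tiles this over blocks and shared memory; this shows just the math, not DSC's actual code):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Attention for one query row against n key/value rows of head dim d,
    // computed in one pass with O(d) extra memory: no n x n score matrix.
    void attention_row(const float *q, const float *K, const float *V,
                       float *out, int n, int d) {
        const float scale = 1.0f / std::sqrt((float) d);
        float m = -INFINITY;             // running max of the scores
        float l = 0.0f;                  // running sum of exp(score - m)
        std::vector<float> acc(d, 0.0f); // running weighted sum of V rows

        for (int j = 0; j < n; ++j) {
            float s = 0.0f;
            for (int k = 0; k < d; ++k) s += q[k] * K[j * d + k];
            s *= scale;

            const float m_new = std::max(m, s);
            const float rescale = std::exp(m - m_new); // fix up the old stats
            const float p = std::exp(s - m_new);
            l = l * rescale + p;
            for (int k = 0; k < d; ++k)
                acc[k] = acc[k] * rescale + p * V[j * d + k];
            m = m_new;
        }
        for (int k = 0; k < d; ++k) out[k] = acc[k] / l;
    }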
I'm currently working on DSC, a tensor library I wrote from scratch in C++ with a PyTorch-like API.
Right now it works on both CPU and GPU (AMD and NVIDIA) and can run LLMs like Qwen. I'm currently implementing a native profiler to trace CPU and GPU kernels, and then I'll work on speed. The goal is to be competitive with PyTorch eager mode by the end of the year.
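For a rough idea of what a native tracer can look like, here's a minimal RAII-scoped CPU timer in C++ (names are hypothetical, not DSC's API; GPU kernels run asynchronously, so they'd be timed with device-side events, e.g. cudaEventRecord/hipEventRecord around the launch, instead of host timestamps):

    #include <chrono>
    #include <vector>

    struct trace_event { const char *name; long long begin_us, end_us; };
    static std::vector<trace_event> g_trace; // dumped later, e.g. as Chrome trace JSON

    static long long now_us() {
        using namespace std::chrono;
        return duration_cast<microseconds>(steady_clock::now().time_since_epoch()).count();
    }

    // Records one begin/end pair for the enclosing scope.
    struct trace_scope {
        const char *name;
        long long begin;
        explicit trace_scope(const char *n) : name(n), begin(now_us()) {}
        ~trace_scope() { g_trace.push_back({name, begin, now_us()}); }
    };

    void matmul_kernel(/* ... */) {
        trace_scope ts("matmul_kernel"); // timed for the duration of the call
        // ... actual work ...
    }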
Because I happen to know C++ and I just wanted to build something rather than learn a new language. Zig looks very interesting though, there are already other projects in this space that use it with great success (see: https://github.com/zml/zml).
Yes! This was actually one of my initial goals! I like to work in what you might call C-style C++, where I turn off the C++ features I don't need and just use the ones I actually need, like templates, objects, etc.
I find this style to be easy to reason about when it comes to performance.
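To make that concrete, here's the flavor of code I mean: plain structs with explicit sizes, free functions, no exceptions or inheritance, and templates only where they clearly pay off (illustrative definitions, not DSC's actual ones):

    #include <cstddef>
    #include <cstdint>

    // A plain POD tensor handle: trivially copyable, no hidden allocations.
    struct tensor {
        void   *data;
        int64_t shape[4];
        int     n_dim;
        int     dtype;
    };

    // Templates only where they remove duplication across dtypes.
    template <typename T>
    void fill(tensor *t, T value, size_t n) {
        T *ptr = (T *) t->data;
        for (size_t i = 0; i < n; ++i) ptr[i] = value;
    }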
Right now I can load tensors directly from a safetensors file or from a NumPy array, so I don't really have a custom format in mind, but I do plan to support GGUF files.
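Safetensors is a pleasantly simple format to load: an 8-byte little-endian header length, a JSON header mapping tensor names to dtype/shape/data offsets, then one raw byte buffer. A minimal header reader in C++ (error handling trimmed; real code would parse the JSON and mmap the buffer):

    #include <cstdint>
    #include <cstdio>
    #include <string>

    std::string read_safetensors_header(const char *path) {
        FILE *f = std::fopen(path, "rb");
        if (!f) return {};
        uint64_t header_len = 0;
        std::fread(&header_len, sizeof(header_len), 1, f); // assumes a little-endian host
        std::string header(header_len, '\0');
        std::fread(&header[0], 1, header_len, f);
        std::fclose(f);
        // e.g. {"w":{"dtype":"F32","shape":[2,2],"data_offsets":[0,16]}}
        return header;
    }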
You are absolutely correct! I started working on a sort of compiler a while back but decided to get the basics down first. The templates and switch statements are not really the issue; the problem is going back and forth between C and Python. This is an experiment I did a few months ago: https://x.com/nirw4nna/status/1904114563672354822. As you can see, there is a ~20% perf gain just by generating a naive C++ kernel instead of calling 5 separate kernels in the case of softmax.
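The unfused version dispatches softmax as separate ops (max, subtract, exp, sum, divide), each reading and writing the whole row. A generated fused kernel collapses that into one function, something like this naive C++ sketch:

    #include <algorithm>
    #include <cmath>

    // One fused pass structure instead of 5 separate kernel calls:
    // each element is loaded/stored a couple of times total, not once per op.
    void softmax_fused(const float *x, float *out, int n) {
        float m = x[0];
        for (int i = 1; i < n; ++i) m = std::max(m, x[i]);
        float sum = 0.0f;
        for (int i = 0; i < n; ++i) {
            out[i] = std::exp(x[i] - m);
            sum += out[i];
        }
        const float inv = 1.0f / sum;
        for (int i = 0; i < n; ++i) out[i] *= inv;
    }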
Thanks!
To be honest, it started purely as a learning project. I was really inspired when llama.cpp first came out and tried to build something similar in pure C++ (https://github.com/nirw4nna/YAMI), mostly for fun and to practice low-level coding.
The idea for DSC came when I realized how hard it was to port new models to that C++ engine, especially since I don't have a deep ML background. I wanted something that felt more like PyTorch, where I could experiment with new architectures easily.
As for llama.cpp, it's definitely faster! They have hand-optimized kernels for a whole bunch of architectures, models and data types. DSC is more of a general-purpose toolkit. I'm excited to work on performance later on, but for now I'm focused on getting the API and core features right.
You just need a foundation in C/C++. If you already have that, just start programming; it's way better than reading books/guides/blogs (at least until you're stuck!). Also, you can read the source code of other similar projects on GitHub and get ideas from them; this is what I did at the beginning.
Yes, when I designed the API I wanted to keep a clear distinction between Python and C. At some point I had two APIs, one in Python and the other in high-level C++, and they both shared the same low-level C API. I find this design quite clean and easy to work with when multiple languages are involved. When I get to perf I plan to experiment a bit with nanobind (https://github.com/wjakob/nanobind) and see if there's a noticeable difference compared to ctypes.
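The layering looks roughly like this: a flat extern "C" core that anything can bind (ctypes today, nanobind later), with thin high-level wrappers on top. A sketch with made-up names, not DSC's actual symbols:

    #include <cstdint>

    // Low-level C API: a stable, flat ABI that Python loads via ctypes.
    extern "C" {
        typedef struct dsc_tensor dsc_tensor;
        dsc_tensor *dsc_new(const int64_t *shape, int n_dim);
        void        dsc_add(dsc_tensor *out, const dsc_tensor *a, const dsc_tensor *b);
        void        dsc_free(dsc_tensor *t);
    }

    // High-level C++ wrapper sharing the exact same core.
    struct Tensor {
        dsc_tensor *t;
        Tensor(const int64_t *shape, int n_dim) : t(dsc_new(shape, n_dim)) {}
        ~Tensor() { dsc_free(t); }
        Tensor &operator+=(const Tensor &o) { dsc_add(t, t, o.t); return *this; }
    };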
Thanks for pointing this out! I'll definitely have to investigate other approaches. nanobind looks interesting, but I don't need to expose complex C++ objects, I just need the 'fastest' way of calling into a C API. I guess the go-to for this is CFFI?
It's the same thing: both nanobind and cffi compile the binding. The fact that nanobind lets you expose C++ doesn't prevent you from only exposing C. And IMHO nanobind is better because you don't need to learn another language to use it (i.e. you don't need to learn cffi's DSL).
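For example, a nanobind module that only exposes a plain C function is just a few lines of C++ (module and function names here are made up; see nanobind's docs for the CMake build setup):

    #include <nanobind/nanobind.h>

    // A plain C function, as it would live in the core library.
    extern "C" int dsc_add_i32(int a, int b) { return a + b; }

    // Builds into a Python extension: `import dsc_ext; dsc_ext.add(1, 2)`.
    NB_MODULE(dsc_ext, m) {
        m.def("add", &dsc_add_i32);
    }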
[1]: https://github.com/nirw4nna/dsc
[2]: https://x.com/nirw4nna/status/1968812772944126329