Highlights
Simpler execution via Java argfiles
Improved performance for FP16/INT8 LLM inference on #Nvidia GPUs
Extended reduced-precision type support for GPUs (INT8, FP16)
Zero-copy object support through Project Panama
Support for compressed oops on modern JVMs
New cross-platform SDK distribution (soon on #SDKMAN! https://lnkd.in/d8pGHYy5)
Official TornadoVM dependencies now published on Maven Central.
(https://lnkd.in/dDRZj8ru)
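To illustrate the argfile-based launch: Java argfiles are a standard JDK launcher feature (`java @file`) that bundle long lists of JVM flags into a single file. The file name and flags below are hypothetical placeholders, not the exact set TornadoVM generates:

```shell
# tornado-args.txt (illustrative flags only; the real file is produced
# by the TornadoVM setup and contains its required module/JVM options):
#   --module-path /path/to/tornadovm/modules
#   --enable-preview
#   -XX:+UseCompressedOops
#
# The launcher expands @file arguments in place, so running a TornadoVM
# program collapses to one short command:
java @tornado-args.txt -cp myapp.jar MyApp
```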
We took Llama3.java and ported it to TornadoVM to enable GPU code generation. The first beta version runs on Nvidia GPUs, reaching a bit more than 100 tokens/sec for a 3B model in FP16.
All the inference code offloaded to the GPU is written in pure Java, using the TornadoVM APIs to express the computation.
Runs Llama3 and Mistral models in GGUF format.
It is fully open-source, so give it a try. It currently runs on Nvidia GPUs (OpenCL & PTX), Apple Silicon GPUs (OpenCL), and Intel GPUs and integrated graphics (OpenCL).
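To give a flavor of what expressing the computation in pure Java looks like, here is a minimal sketch of a matrix-vector multiply (the core operation of transformer inference) written in the flat-array, static-method style that TornadoVM kernels use. The TaskGraph wiring in the comments assumes the TornadoVM API and is illustrative only; the snippet itself is plain, runnable Java:

```java
// Sketch: a matrix-vector kernel in the plain static-method style that a
// TornadoVM-like runtime can compile to GPU code. With TornadoVM on the
// classpath, the outer loop would carry its @Parallel annotation and the
// method would be registered in a task graph, roughly (not verified
// against a specific TornadoVM version):
//
//   TaskGraph tg = new TaskGraph("llama")
//       .transferToDevice(DataTransferMode.FIRST_EXECUTION, w, x)
//       .task("matvec", MatVecSketch::matVec, w, x, y, rows, cols)
//       .transferToHost(DataTransferMode.EVERY_EXECUTION, y);
public class MatVecSketch {
    // y = W * x, with W stored row-major in a flat float array.
    static void matVec(float[] w, float[] x, float[] y, int rows, int cols) {
        for (int i = 0; i < rows; i++) { // parallel dimension on the GPU
            float sum = 0f;
            for (int j = 0; j < cols; j++) {
                sum += w[i * cols + j] * x[j];
            }
            y[i] = sum;
        }
    }

    public static void main(String[] args) {
        // 2x3 matrix times a length-3 vector.
        float[] w = {1, 2, 3,
                     4, 5, 6};
        float[] x = {1, 1, 1};
        float[] y = new float[2];
        matVec(w, x, y, 2, 3);
        System.out.println(y[0] + " " + y[1]); // prints: 6.0 15.0
    }
}
```

The key design point is that the kernel stays ordinary Java over primitive arrays, so the same method runs on the JVM or, when wrapped in a task graph, on the GPU.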
Llama Deck is a command-line tool for quickly managing and experimenting with multiple versions of llama inference implementations. It helps you filter and download different llama implementations and llama2-style transformer-based LLM models. We also provide Docker images for some implementations, which can be easily deployed and run through the tool.
A comprehensive analysis of the memory behavior of 30 DaCapo and Renaissance Java applications, using a dual profiling methodology with NUMAProfiler and PerfUtil in MaxineVM, identifying various memory pressures and their impact on the JVM.