This 236B model came out around September 6th.

DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.

From: https://huggingface.co/deepseek-ai/DeepSeek-V2.5



> To utilize DeepSeek-V2.5 in BF16 format for inference, 80GB*8 GPUs are required.
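For a rough sense of scale, a quick back-of-the-envelope in Python (the bits-per-weight figures for the GGUF quants are my approximations, and real files also carry KV-cache and runtime overhead on top of the weights):

    # Rough weight-memory math for a 236B-parameter model at different precisions.
    # Bits-per-weight for the quants are approximate assumptions, not exact figures.
    PARAMS = 236e9

    for name, bits in [("BF16", 16), ("IQ4_XS", 4.25), ("Q3_K_M", 3.9)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name:>7}: ~{gb:,.0f} GB of weights")

    # BF16 comes out around 472 GB, which is why 8 x 80 GB GPUs are quoted;
    # the ~4-bit and ~3-bit quants land near 115-125 GB, within reach of
    # machines with 128 GB of (unified) memory plus some VRAM.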


I wonder if the new MacBook Pro can run it at Q4.


Using https://github.com/kvcache-ai/ktransformers/, an Intel/AMD laptop with 128GB RAM and 16GB VRAM can run the IQ4_XS quant and decode about 4-7 tokens/s, depending on RAM speed and context size.

Using llama.cpp, the decoding speed is about half of that.
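If anyone wants to try the llama.cpp route from Python, a minimal sketch using the llama-cpp-python bindings (the GGUF filename is hypothetical, and n_gpu_layers needs tuning to whatever fits in your 16 GB of VRAM):

    from llama_cpp import Llama

    # Load a quantized GGUF and offload part of the layers to the GPU.
    llm = Llama(
        model_path="DeepSeek-V2.5-IQ4_XS.gguf",  # hypothetical filename
        n_gpu_layers=20,                          # tune to available VRAM
        n_ctx=4096,                               # context length
    )

    out = llm("Explain MoE models in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])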

A Mac with 128GB RAM should be able to run the Q3 quant, with faster decoding but slower prefilling.


What is "prefilling"?


Assuming you already know what context means for LLMs: prefilling is the phase where the model processes the entire prompt (the conversation so far) to build up its KV cache, before it starts decoding new tokens one at a time.
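To make the distinction concrete, a toy sketch (model_step is a pure placeholder, not a real model): prefill walks the whole prompt to populate the cache, then decode produces new tokens one at a time against that cache.

    # Toy illustration of the two phases of LLM inference.
    def model_step(token, kv_cache):
        kv_cache.append(token)          # a real model would append keys/values here
        return (token + 1) % 50000      # placeholder "next token" prediction

    def generate(prompt_tokens, max_new_tokens):
        kv_cache = []

        # Prefill: run the whole prompt through the model so the KV cache
        # covers the existing conversation. Real engines batch this into a
        # few large matrix multiplies, so it is compute-bound.
        next_token = None
        for tok in prompt_tokens:
            next_token = model_step(tok, kv_cache)

        # Decode: generate one token at a time, each step reusing the cache.
        # This phase is dominated by reading weights from memory, which is
        # why RAM bandwidth largely sets the decode tokens/s.
        output = []
        for _ in range(max_new_tokens):
            output.append(next_token)
            next_token = model_step(next_token, kv_cache)
        return output

    print(generate([101, 102, 103], max_new_tokens=5))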



