This 236B model came out around September 6th.

DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.

From: https://huggingface.co/deepseek-ai/DeepSeek-V2.5



> To utilize DeepSeek-V2.5 in BF16 format for inference, 80GB*8 GPUs are required.
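For a rough sense of scale, a quick back-of-the-envelope in Python (the bits-per-weight figures for the GGUF quants are my approximations, and real files also carry KV-cache and runtime overhead on top of the weights):

    # Rough weight-memory math for a 236B-parameter model at different precisions.
    # Bits-per-weight for the quants are approximate assumptions, not exact figures.
    PARAMS = 236e9

    for name, bits in [("BF16", 16), ("IQ4_XS", 4.25), ("Q3_K_M", 3.9)]:
        gb = PARAMS * bits / 8 / 1e9
        print(f"{name:>7}: ~{gb:,.0f} GB of weights")

    # BF16 comes out around 472 GB, which is why 8 x 80 GB GPUs are quoted;
    # the ~4-bit and ~3-bit quants land near 115-125 GB, within reach of
    # machines with 128 GB of (unified) memory plus some VRAM.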


I wonder if the new MacBook Pro can run it at Q4.


Using https://github.com/kvcache-ai/ktransformers/, an Intel/AMD laptop with 128GB RAM and 16GB VRAM can run the IQ4_XS quant and decode about 4-7 tokens/s, depending on RAM speed and context size.

Using llama.cpp, the decoding speed is about half of that.
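If anyone wants to try the llama.cpp route from Python, a minimal sketch using the llama-cpp-python bindings (the GGUF filename is hypothetical, and n_gpu_layers needs tuning to whatever fits in your 16 GB of VRAM):

    from llama_cpp import Llama

    # Load a quantized GGUF and offload part of the layers to the GPU.
    llm = Llama(
        model_path="DeepSeek-V2.5-IQ4_XS.gguf",  # hypothetical filename
        n_gpu_layers=20,                          # tune to available VRAM
        n_ctx=4096,                               # context length
    )

    out = llm("Explain MoE models in one sentence.", max_tokens=64)
    print(out["choices"][0]["text"])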

A Mac with 128GB RAM should be able to run the Q3 quant, with faster decoding but slower prefilling.


What is "prefilling"?


Assuming you already know what context means for LLMs: prefilling is the phase where the model processes the entire prompt (the conversation so far) to build up its KV cache, before it starts decoding new tokens one at a time.
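To make the distinction concrete, a toy sketch (model_step is a pure placeholder, not a real model): prefill walks the whole prompt to populate the cache, then decode produces new tokens one at a time against that cache.

    # Toy illustration of the two phases of LLM inference.
    def model_step(token, kv_cache):
        kv_cache.append(token)          # a real model would append keys/values here
        return (token + 1) % 50000      # placeholder "next token" prediction

    def generate(prompt_tokens, max_new_tokens):
        kv_cache = []

        # Prefill: run the whole prompt through the model so the KV cache
        # covers the existing conversation. Real engines batch this into a
        # few large matrix multiplies, so it is compute-bound.
        next_token = None
        for tok in prompt_tokens:
            next_token = model_step(tok, kv_cache)

        # Decode: generate one token at a time, each step reusing the cache.
        # This phase is dominated by reading weights from memory, which is
        # why RAM bandwidth largely sets the decode tokens/s.
        output = []
        for _ in range(max_new_tokens):
            output.append(next_token)
            next_token = model_step(next_token, kv_cache)
        return output

    print(generate([101, 102, 103], max_new_tokens=5))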



