Depends on the model; if it doesn't fit into VRAM, performance will suffer. Response here is immediate (at ~15 tokens/sec) on a pair of eBay RTX 3090s in an ancient i7-3770 box.
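A quick way to check whether a model actually fits (exact output varies by ollama version, so treat the sample lines as approximate):

  # after a query, see where ollama put the model
  ollama ps
  #   NAME           SIZE    PROCESSOR
  #   llama3.3:70b   ~44 GB  100% GPU    <- fully in VRAM; a CPU/GPU split means it spilled over
  # per-card memory use
  nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv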
£1200, so a little less. Targeted at having 48GB (2x24GB) of VRAM for running the larger models; having said that, a single 12GB RTX 3060 in another box seems pretty close in local testing (with smaller models).
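Rough sizing arithmetic (mine, not gospel): 4-bit quantised weights take roughly 0.5-0.6 GB per billion parameters, plus a few GB for KV cache and overhead, so an ~8B model sits comfortably in the 3060's 12GB while a 70B needs the full 48GB.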
Have been trying forever to find a coherent guide on building a dual-GPU box for this purpose; do you know of any? Like selecting the MB, the case, cooling, power supply and cables, any special voodoo required to pair the GPUs, etc.
I'm not aware of any particular guides; the setup here was straightforward - an old motherboard with two PCIe x16 slots (Asus P8Z77-V or P8Z77 WS), a big enough power supply (Seasonic 850W) and the stock Linux Nvidia drivers. The RTX 3090s are basic Dell models (i.e. not OC'ed gamer versions), and worth noting they only get hot if used continuously - if you're the only one using them, the fans spin up during a query and back down between. A good smoke test is something like: while true; do ollama run llama3.3 "Explain cosmology"; done
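To watch the thermal behaviour while that loop runs, something like this in a second terminal does the job (standard nvidia-smi query fields, refreshed every 5 seconds):

  nvidia-smi --query-gpu=index,temperature.gpu,fan.speed,memory.used --format=csv -l 5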
With llama3.3 70B, two RTX 3090s give you 48GB of VRAM and the model uses about 44GB; so the first start is slow (loading the model into VRAM) but after that response is fast (subject to the comment above about KEEP_ALIVE).
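If the first-start delay matters, you can pay the load cost up front by preloading the model (this is from the ollama FAQ, so check it against your version):

  # empty prompt loads the model into VRAM without generating anything
  ollama run llama3.3 ""
  # or via the API
  curl http://localhost:11434/api/generate -d '{"model": "llama3.3"}'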
Even if your model does fit into VRAM, if it's getting ejected after the idle timeout there will be a startup pause on the next query. Try setting OLLAMA_KEEP_ALIVE to -1 to keep it loaded (see https://github.com/ollama/ollama/blob/main/docs/faq.md#how-d...).
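On a stock Linux install ollama runs as a systemd service, so the usual way to set that (per the same FAQ) is an override, roughly:

  sudo systemctl edit ollama.service
  # in the editor, under [Service]:
  #   Environment="OLLAMA_KEEP_ALIVE=-1"
  sudo systemctl daemon-reload
  sudo systemctl restart ollama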