It will work great with a 40GB GPU, probably a bit less than twice as slow. These a...

utopcell · 2025-10-14T02:34:38 1760409278

How low can this go? Can this run on a 5090 card (32GiB)?

JonathanFly · 2025-10-14T10:33:10 1760437990

Set nproc_per_node=1 instead of 8 (or run the training script directly instead of using torchrun) and set device_batch_size=4 instead of 32. You may be able to use 8 with a 5090, but it didn't work on my 4090. However, it's way slower than expected: one H100 isn't 250x faster than a 4090, so I'm not sure it's training correctly. I'll let it run overnight and see if the outputs make any sense; maybe the metrics are not accurate in this config.
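For a rough sense of how much of that slowdown is just the config change, here is a back-of-the-envelope sketch. It assumes (not confirmed in this thread) that the training script keeps the total batch per optimizer step fixed and makes up the difference with gradient accumulation. The function name and the 8 x 32 reference batch are illustrative, not taken from the script itself.

```python
def micro_steps_per_optimizer_step(total_sequences, nproc_per_node, device_batch_size):
    """Forward/backward passes each GPU must run to cover one optimizer step.

    Assumption: the total batch stays fixed and gradient accumulation
    fills the gap when fewer GPUs or smaller per-device batches are used.
    """
    per_pass = nproc_per_node * device_batch_size  # sequences processed per pass across all GPUs
    if total_sequences % per_pass != 0:
        raise ValueError("total batch must be divisible by nproc_per_node * device_batch_size")
    return total_sequences // per_pass


# Reference config from the comment above: 8 GPUs, device_batch_size=32.
reference_total = 8 * 32  # 256 sequences per optimizer step

print(micro_steps_per_optimizer_step(reference_total, 8, 32))  # 1
print(micro_steps_per_optimizer_step(reference_total, 1, 4))   # 64
print(micro_steps_per_optimizer_step(reference_total, 1, 8))   # 32
```

Under that assumption, a single GPU at device_batch_size=4 needs 64 sequential passes per optimizer step where the 8-GPU run needs one, so a large per-step slowdown is expected even before any per-GPU speed difference. Whether that fully accounts for the gap seen here is not something this arithmetic can settle.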