"Table 2 compares the timings on the four scenes in Figure 1 of our unoptimized RenderFormer (pure PyTorch implementation without DNN compilation, but with pre-caching of kernels) and Blender Cycles with 4,096 samples per pixel (matching RenderFormer’s training data) at 512 × 512 resolution on a single NVIDIA A100 GPU."
> Blender Cycles with 4,096 samples per pixel (matching RenderFormer’s training data)
This seems like an unfair comparison. It would be a lot more useful to know how long it would take Blender to also reach a 0.9526 Structural Similarity Index Measure (SSIM) against the training data. My guess is that with the denoiser turned on, something like 128 samples would be enough, or even fewer on some images. At that point, on an A100 GPU, Blender would be close to, if not beating, the times reported here for these scenes.
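As an illustration of the comparison being proposed: one could render a low-sample, denoised image and check whether it already clears the 0.9526 SSIM bar. The sketch below uses a simplified single-window SSIM (no sliding Gaussian window, unlike the standard implementation in e.g. scikit-image) on synthetic stand-in images; `ssim_global` and both image arrays are hypothetical, not anything from the paper.

```python
import numpy as np

def ssim_global(x, y, data_range=1.0):
    """Simplified SSIM computed over the whole image as one window.

    The standard metric averages SSIM over local Gaussian windows;
    this global variant is only meant to show the formula's structure.
    """
    c1 = (0.01 * data_range) ** 2  # stabilizing constants from the SSIM paper
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Hypothetical stand-ins: 'reference' plays the 4,096-spp ground truth,
# 'candidate' a low-spp denoised render with small residual error.
rng = np.random.default_rng(0)
reference = rng.random((512, 512))
candidate = reference + rng.normal(0.0, 0.01, reference.shape)

print(round(float(ssim_global(reference, reference)), 4))  # identical images -> 1.0
print(ssim_global(reference, candidate) > 0.9526)  # does the candidate clear the bar?
```

In practice one would load the actual renders (e.g. EXR or PNG frames normalized to [0, 1]) and sweep the sample count until the SSIM threshold is met, timing each render.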
Nobody runs 4,096 samples per pixel. In many cases 100-200 (or even fewer with denoising) are enough. You might run up to the low thousands if you want to resolve caustics.