Wow, I've been out of the GPU game for a year or two now, and it's clear how the market has shifted (or how NVIDIA wants it to move). Back in the day we'd keep asking NVIDIA and AMD for half-precision support; it looks like not only have they done that, but there's 8-bit integer support too! I was about to say "well, who the heck would be able to use 8-bit integers for much?" when I saw in TFA: "offering an 8-bit vector dot product with 32-bit accumulate."
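For anyone wondering what that buys you, here's a minimal reference sketch of the semantics (my own illustration; dp4a_ref is a made-up name, though I believe the underlying Pascal instruction is exposed in CUDA as the __dp4a intrinsic): four int8 products get summed into a 32-bit accumulator, which is exactly the inner loop of a quantized neural-net layer.

    import numpy as np

    def dp4a_ref(a, b, acc):
        """Reference semantics of an 8-bit dot product with 32-bit accumulate:
        four int8 pairs are multiplied and summed into an int32 accumulator,
        so the products widen to 32 bits instead of overflowing at 8 bits."""
        a = np.asarray(a, dtype=np.int8).astype(np.int32)
        b = np.asarray(b, dtype=np.int8).astype(np.int32)
        return int(acc) + int(np.dot(a, b))

    # A quantized inference layer chains many of these into one 32-bit sum:
    print(dp4a_ref([127, -128, 5, 1], [127, 127, -2, 0], 0))  # 16129 - 16256 - 10 + 0 = -137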
In case it's not clear from the title (which used to read "...47 INT TOPS"), that's 47 [8-bit integer] tera-operations-per-second. Anandtech says it will "... offer a major boost in inferencing performance, the kind of performance boost in a single generation that we rarely see in the first place, and likely won’t see again." No kidding!
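Back-of-the-envelope on where the 47 comes from, assuming the launch specs (3840 CUDA cores, ~1.53 GHz boost; those numbers are mine, not from the comment) and one DP4A per core per clock counted as 4 multiplies + 4 adds:

    cores = 3840            # P40 CUDA core count (assumed from launch specs)
    boost_ghz = 1.531       # boost clock in GHz (assumed from launch specs)
    int_ops_per_clock = 8   # one DP4A per core per clock = 4 multiplies + 4 accumulates

    tops = cores * boost_ghz * int_ops_per_clock / 1000
    print(f"{tops:.1f} INT8 TOPS")  # ~47.0, i.e. 4x the ~11.8 TFLOPS FP32 rate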
I've read a lot about fp16 being good enough for training, but the thing people don't mention is that just swapping fp32 for fp16 will make things fail, because your deep learning framework doesn't implement everything you use for fp16. And after you fix that, training will probably diverge, because standard practices aren't designed for such a limited range.
Which isn't to say that the Deep Learning stacks won't get there eventually, but at the moment it's not as easy as flipping a switch.
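To make the "limited range" point concrete, here's a small numpy sketch (my own illustration, not tied to any particular framework): gradient values that are perfectly representable in fp32 flush to zero in fp16, and one common mitigation, loss scaling, just multiplies by a constant before the cast and divides it back out in fp32.

    import numpy as np

    grad_fp32 = np.array([1e-4, 2e-8, -1e-9], dtype=np.float32)

    # Naive cast: fp16's smallest subnormal is ~6e-8, so the small entries
    # underflow to (signed) zero and those gradients are simply lost.
    print(grad_fp32.astype(np.float16))

    # Loss scaling (a common mitigation): scale up before the cast,
    # then unscale back in fp32 afterwards.
    scale = 1024.0
    scaled = (grad_fp32 * scale).astype(np.float16)
    print(scaled.astype(np.float32) / scale)  # the small entries survive, approximately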
Fair enough; when I said "most people" I meant companies that are not AmaGooFaceSoft and their international equivalents. Though I can't quite tell if you're doing GPU batch predictions and storing them, or doing them in real time with Spark Streaming.
Unrelated question though: any chance you'll do a blog post/paper about how DSSTNE does automatic model parallelism and gets good sparse performance compared to cuSPARSE etc.?
I wish there were a website where I could type in my device, get its FP32 FLOPS, and then see where that device would have placed on past TOP500 lists.
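A toy sketch of what that lookup could look like, with placeholder entry thresholds (not real TOP500 data; note the real lists rank double-precision LINPACK Rmax, so peak FP32 FLOPS is only a very rough comparison):

    # Placeholder #500 entry thresholds in GFLOPS, purely to show the shape of
    # the lookup; real values would come from the published TOP500 lists.
    entry_threshold_gflops = {2000: 50, 2005: 1200, 2010: 25000, 2015: 165000}

    def last_year_on_list(device_gflops):
        """Return the most recent (placeholder) year whose #500 cutoff the
        device would still have cleared, or None if it never would have."""
        years = [y for y, cutoff in entry_threshold_gflops.items()
                 if device_gflops >= cutoff]
        return max(years) if years else None

    print(last_year_on_list(12000))  # ~12 TFLOPS FP32 peak vs the placeholders -> 2005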
IIRC that was the kind of ad-hoc data discovery and munging that Wolfram Language was demoed as being good for. I thought W-L had an online IDE but I can't find one now.
For some reason my brain skipped "Nvidia" and I assumed this was about a budget/low-end version of the Tesla Model S. They already have the Tesla 60, 60D, 70, 70D, 75, 75D, 85, P85D, P90D, and now P100D. Why not add the P40?
Nvidia needs a "Ludicrous Mode" overclock setting for these cards. Push the button for 2.5 seconds of super-high frame rate! With some cool down time required.
>Nvidia needs a "Ludicrous Mode" overclock setting for these cards. Push the button for 2.5 seconds of super-high frame rate! With some cool down time required.
This actually is a thing with GPU Boost 3.0. The card boosts automatically if temperatures are low, though lately there have been issues with the GPU clock frequency oscillating because there wasn't enough hysteresis.
(This, by the way, means that GPUs are thermally constrained rather than timing-constrained.)
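For what "enough hysteresis" means here: the controller needs separate boost/drop thresholds (or a dwell time) so the clock doesn't flap around a single temperature limit. A minimal sketch of the idea (the threshold and clock numbers are made up for illustration, and this is not how GPU Boost is actually implemented):

    def boost_controller(temps_c, base_mhz=1303, boost_mhz=1531,
                         boost_below_c=78, drop_above_c=84):
        """Toy thermal boost controller with hysteresis: boost only when the
        temperature falls below one threshold and drop only when it rises
        above a higher one, so readings near a single limit don't make the
        clock oscillate."""
        clock, trace = base_mhz, []
        for t in temps_c:
            if clock == base_mhz and t < boost_below_c:
                clock = boost_mhz
            elif clock == boost_mhz and t > drop_above_c:
                clock = base_mhz
            trace.append(clock)
        return trace

    # Readings hovering between the two thresholds leave the clock alone
    # instead of toggling it every sample.
    print(boost_controller([75, 79, 81, 83, 82, 85, 80, 77]))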
Are there any IaaS providers that offer access to GPU hardware? I've always wanted to play around with this stuff (as I'm completely ignorant of GPU tech these days), but I'm not interested in buying hardware.
EDIT... apparently this was not the case when I last googled it years ago. I wish I could delete this comment :(
I don't know why, but to my knowledge no public cloud offered even the Maxwell GPUs.
As cool as these cards are, I really hope they become available on AWS soon. The current AWS GPU instances are so weak we're contemplating buying a physical desktop setup.