scottlegrand's comments

scottlegrand · on Sept 13, 2016

Some of us are...

https://blogs.aws.amazon.com/bigdata/post/TxGEL8IJ0CAXTK/Gen...

Eridrus · on Sept 14, 2016

Fair enough; when I said most people I meant companies who are not AmaGooFaceSoft and their international equivalents. Though I can't quite tell if you're doing GPU batch predictions and storing them or doing them in realtime with Spark Streaming.

Unrelated question though: any chance you will do blog post/paper about how DSSTNE does automatic model parallelism and gets good sparse performance compared to cuSparse/etc?

scottlegrand · on Sept 13, 2016

Or as Urs Hoezel would say: "advancing Moore's Law by 7 years(tm)..." Badum ba bum bum...

And given Frank Seide et al. demonstrated 1-bit SGD in 2014 (https://www.microsoft.com/en-us/research/publication/1-bit-s...) the race to the bottom is just beginning...

scottlegrand · on Aug 18, 2016

Except that up to now at least, CUDA IMO remains the best abstraction for programming multi-core: subsuming away multiple threads, SIMD width, and multiple cores into the language definition. AMBER (http://www.ambermd.org) literally "recompiled and ran" with each succeeding GPU generation since GTX 280 in 2009. 3-5 days of subsequent refactoring then unlocked 80% of the attainable performance gains of each of the subsequent GPUs. DSSTNE (https://github.com/amznlabs/amazon-dsstne) just ran as well, but it's only targeting Kepler and up because the code relies heavily on the __shfl instruction.

So I honestly don't get the Google clang CUDA compiler right now. It's really really cool work, but I don't get why they didn't just lobby NVDA heavily to improve nvcc. With the number of GPUs they buy, I suspect they could have anything they want from the CUDA software teams.

However, if it could compile CUDA for other architectures, sign me up, you'd be my heroes.

For I'd love to see CUDA on Xeon Phi and on AMD GPUs (I know, they're trying). And if Intel poured the same amount of passion and budgeting into building that as they are pouring into fake^H^H^H^Hdeceptive benchmark data and magical powerpoint processors we won't see for at least a year or two (and which IMO will probably disappoint just like the first two), they'd be quite the competitor to NVIDIA, no?

That said, the Intel marketing machine seems to have succeeded in punching NVDA stock in the nose the past few days and in grabbing coverage in Forbes (http://www.forbes.com/sites/aarontilley/2016/08/17/intel-tak...) so maybe they know a thing or two I don't.

scottlegrand · on Aug 17, 2016

Not even wrong. I have two PCs with 4 Titan X (Maxwell) GPUs and a third PC with 4 Titan X (Pascal) GPUs. Both of these systems are available today (I built them myself, total BOM about $7K), and both will destroy 4 Xeon Phi servers at Deep Learning.

The benchmark Intel presented here is as disingenuous as their infamous white paper from 2010: http://pcl.intel-research.net/publications/isca319-lee.pdf

In comparison, a single Knights Landing Xeon Phi will be ~$7K. I know where I put my money. Caveat Emptor.

But Xeon Phi and I go way back here. They've been trying to beat my AMBER GPU code since 2013 or so. Many man years later I believe that a Knight's Corner is now ~35% faster than 2 Xeon CPUs with 1M atoms or more (source: http://adsabs.harvard.edu/abs/2016CoPhC.201...95N)

Meanwhile, the CUDA code has continued to scale with the GPU roadmap and a Titan XP is arguably 9-10x faster than 2 Xeon CPUs. No data is supplied at the low-end for Xeon Phi and I think we can safely assume it's because performance there sucks. (source: http://ambermd.org/gpus/benchmarks.htm)

Xeon Phi? IMO avoid avoid avoid until they start winning head to head 3rd party benchmarking fights like Soumith Chintala's fantastic convnet benchmark data: https://github.com/soumith/convnet-benchmarks

throwaway287391 · on Aug 17, 2016

Yes to all of this. I'm really surprised they didn't compare costs in this blog post. Ignore the DGX-1 row of their table; the really damning comparison is between the 2nd and 4th rows of the table.

With a single 4x GPU server costing around $7k in total (row 4), you get nearly double the performance you get from spending $28k on four Xeon Phi servers (row 2).

And that's assuming you've spent the time and disk replicating your data on all four of those Xeon Phi servers, or went to a likely relatively large amount of engineering effort to ensure that network IO doesn't bottleneck training.

gnufx · on Aug 17, 2016

So, a specific MD code may or may not work well with KNL -- we don't have data. KNL looks quite attractive for other chemistry, given all the vector units, large amount of fast memory, and ability to run realistically-sized examples without the network, or potentially the network-on-chip. We'll see how it pans out.

scottlegrand · on Aug 18, 2016

I prefer to look at it the other way, why don't you point out an existing and important chemistry application where KNL bested its contemporary GPUs, say the best of Knight's Corner versus the best of Kepler (K40 or K80). I'm also open to Knight's Landing versus GP100 (vaporware versus no longer vaporware but hard to get)

I'm genuinely interested here because I can't find this anywhere. I don't think it exists personally.

gnufx · on Aug 19, 2016

KNL is not Knights Corner, and I have limited information on either. I'm interested in data and, more to the point, insight -- not just single benchmark numbers or specific programs, especially if they've had a lot of GPU effort and no tuning for KNL. I don't expect KNL to be particularly good for applications that aren't highly vectorizable, though the memory system may help.

If I manage to access the KNL here, I'll probably run cp2k and gromacs, though single node performance is of limited interest, and ELPA doesn't currently have AVX512-specific support.

scottlegrand · on Aug 20, 2016

Here's your answer for GROMACS (and it sucks)...

http://www.prace-ri.eu/IMG/pdf/wp120.pdf

Even so, right now, little would please me more technologically than a competitive Xeon Phi offering, but while KNL is better than KNC, my inside info says it sucks too (it would have been a lot more interesting, just like Altera's Stratix 10, if it had shipped before GP100 and GP102).

Right now, I have more confidence in AMD GPUs right now than I have in Xeon Phi. This 3rd party benchmark is particularly interesting (and it doesn't look like anyone at NVIDIA is paying any attention to it):

https://techaltar.com/amd-rx-480-gpu-review/2/

Sure, NVIDIA is still in the lead, but not with the ~10x margins they used to have over AMD.

Finally, I figuratively feel like punching the next person who makes the BS scaling argument over raw performance. GPUs scale too if they're coded correctly. And cloud datacenters are the worst place for that given their craptastic ~10 Gb/s interconnect subject to arbitrary network weather effects.

Or butchering Seymour Cray: Your life depends on winning a race, would you bet your life on a 1,350 HP Venom GT or on 20 179 HP Scion FRSs? I mean collectively that's almost 3600 HP, right? Except it's even worse because for GPUs vs CPUs, it's like they priced the Scion FRS like a Venom GT and vice versa.

I wish you luck finding Xeon Phi winning anything but synthetic tests against yesterday's news:

https://www.xcelerit.com/computing-benchmarks/libor/intel-xe...

fnord123 · on Aug 17, 2016

Omnipath also sucks compared to Infiniband. How are they making so many inroads into HPC with these offerings? I mean, aside from their dominant-for-good-reason CPUs.

gnufx · on Aug 17, 2016

Please share the data, particularly for the built-in interfaces on ~70 cores which are supposed to be available this year (given the blanket statement). Omni-path seems to be worrying Mellanox, judging by a recent visit.

scottlegrand · on Aug 10, 2016

I figured Nervana was mostly dead if they were stuck at 28 nm

I figured Intel was down and out in Santa Clara without a strong deep learning play

That all changed today.

Intel has been bouncing all over the place trying to break into deep learning. If they don't screw this up, they just found their way in IMO.

nibnib · on Aug 10, 2016

>I figured Nervana was mostly dead if they were stuck at 28 nm

Why? Is there a competitor that has leveraged a more modern process technology?

scottlegrand · on Aug 10, 2016

Yep, NVIDIA. Nervana's dedicated ASIC will deliver 55 (mostly) int16 TOps in 2017. In contrast, the two Titan XP GPUs I bought last week for a total of $2400 deliver 44 such TOps. Next year, a single Volta GPU will deliver at least 36 so I saw no way for them to win on their own with NVIDIA's GPU roadmap merrily marching along since 2007.

However, getting access to Intel's fabs makes them a lot more interesting and competitive. It's not a slamdunk for Intel yet because they still have to incorporate this into their product line (anyone seen Altera's Stratix 10 yet? Because that was supposed to be 2014's 10+ TFLOP GPU killer), but it's a fantastic acquisition and I wish them the best.

scottlegrand · on June 20, 2016

Not to mention they all solve problems for which the training sets lie on low-dimensional manifolds within a very high-dimensional space. And this brings about arbitrary failures when one goes out of sample and it also serves as the basis for creating adversarial data with ease (use the gradient of the network to mutate the data just a tiny little bit).

I suspect there's a promising future in detecting and potentially correcting for out of sample data, even if the methods for doing so are non-differentiable.

scottlegrand · on May 11, 2016

It's more than that, and it's in use in production at Amazon. 8 TitanX GPUs can contain networks with up to 6 billion weights. As Geoffrey Hinton once said:

"My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

And you're right that it's a specialized framework/engine. But IMO making it more general purpose is a matter of cutting and pasting the right cuDNN code or we can double down on emphasizing sparse data. Amazon OSSed this partially IMO to see what people would want here.

jrapdx3 · on May 11, 2016

> "My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

An interesting quote.

Replicating functioning of the brain, or some major subsystem of it, is no doubt going to require far more than just billions of parameters. The cortex contains >15 billion neurons, but there are also the neurons contained in all the other brain structures. Furthermore, neurons connect via dense dendritic trees, the human brain having on the order of 100 trillion synapses.

Adding to the complexity, neurons have numerous "communication ports", including numerous pre- and postsynaptic neurotransmitter receptors, and a wide range of receptors for endocrine, immune system and other types of signals. Message propagation typically involves as well the layer of complex intracellular "second-messenger" transformations.

While it's highly probably future NNs will be developed that do even more amazing things than now possible, I think the challenge of equaling what real brains do is to say the least enormously daunting.

Somebody smarter than me could probably figure out the magnitude, how many nodes or weights it takes for a NN to function like the brain, though I imagine it will be a really impressive number.

Edit: typos

JackFr · on May 11, 2016

> "My belief is that we’re not going to get human-level abilities until we have systems that have the same number of parameters in them as the brain."

While that may be true, I find this compelling:

"The fundamental unit of biological information processing is the molecule, rather than any higher level structure like a neuron or a synapse; molecular level information processing evolved very early in the history of life."

http://www.softmachines.org/wordpress/?p=1558#more-1558

Edit: formatting

fauigerzigerk · on May 11, 2016

>Replicating functioning of the brain, or some major subsystem of it, is no doubt going to require far more than just billions of parameters.

Maybe, but we shouldn't forget that computers do not suddenly lose their capability to function as exact, deterministic, programmable machines just because they happen to run an ANN.

What I mean is that there may be shortcuts to reduce the number of required nodes dramatically.

If you take the state of an ANN after it was trained to perform some specific task, you can ask the question whether there is a simpler function, i.e. one with much fewer parameters, that approximates the learned function.

Sort of like a human with the Occam's razor gene. I think the fact that the number of neurons does not correlate perfectly with intelligence in animals is an indication that there is room for optimization.

scottlegrand · on May 11, 2016

Absolutely 100% agree, but at the same time, I think we will ultimately need to build and evaluate models that can span the memory of more than one processor. I don't think a single GTX Titan X, GTX 1080 or even a server is enough here.

Additionally, data parallelization and ASGD broadly disallow these larger models (yes I know about send/receive nodes in TensorFlow, but they're not general or automatic enough for researchers IMO) while ASGD makes horribly inefficient use of the very limited bandwidth between processors. All IMO of course. There are hacks and tricks here, but I think those should be late stage optimizations, not requirements to achieve scaling.

Finally, I'm a stickler for deterministic computation as someone who spent a decade writing graphics drivers before joining the CUDA team in 2006, but that's pretty much a "hear me now, believe me later" opinion of mine after tracking down too many bizarro race conditions late into the night in that former life :-). Of course, one person's race condition can sometimes be an ANN's regularizer, but I digress.

I also agree we'll do some amazing things with far fewer neurons and weights than an actual human brain, but I'll bet you good money we end up needing more than 12GB to do it. AlphaGo alone was 200+ GPUs, right?

waleedka · on May 11, 2016

Thanks for the clarification. I'd change "early proof of concept" to "a specialized framework", but the other observations stand, I believe.

It's totally fine that it's a specialized framework, and it doesn't need to become general purpose. I just think the product description should do a better job positioning it and explaining what it's NOT intended for to set expectations correctly.

scottlegrand · on May 11, 2016

Lead author of DSSTNE here...

1. DSSTNE was designed two years ago specifically for product recommendations from Amazon's catalog. At that time, there was no TensorFlow, only Theano and Torch. DSSTNE differentiated from these two frameworks by optimizing for sparse data and multi-GPU spanning neural networks. What it's not currently is another framework for running AlexNet/VGG/GoogleNet etc, but about 500 lines of code plus cuDNN could change that if the demand exists. Implementing Krizhevsky's one weird trick is mostly trivial since the harder model parallel part has already been written.

2. DSSTNE does not yet explicitly support RNNs, but it does have support for shared weights and that's more than enough to build an unrolled RNN. We tried a few in fact. CuDNN 5 can be used to add LSTM support in a couple hundred lines of code. But since (I believe) the LSTM in cuDNN is a black box, it cannot be spread across multiple GPUs. Not too hard to write from the ground up though.

3. There are a huge number of collaborators and people behind the scenes that made this happen. I'd love to acknowledge them openly, but I'm not sure they want their names known.

4. Say what you want about Amazon, and they're not perfect, but they let us build this from the ground up and now they have given it away. Google hired me away from NVIDIA (another one of those offers I couldn't refuse) OTOH blind-allocated me into search in 2011 and would not let me work with GPUs despite my being one of the founding members of NVIDIA's CUDA team because they had not yet seen them as useful. I didn't stay there long. DSSTNE is 100% fresh code, warts and all, and I think Amazon both for letting me work on a project like this and for OSSing the code.

5. NetCDF is a nice efficient format for big data files. What other formats would you suggest we support here?

6. I was boarding a plane when they finally released this. I will be benchmarking it in the next few days. TLDR spoilers: near-perfect scaling for hidden layers with 1000 or so hidden units per GPU in use, and effectively free sparse input layers because both activation and weight gradient calculation have custom sparse kernels.

7. The JSON format made sense in 2014, but IMO what this engine needs now is a TensorFlow graph importer. Since the engine builds networks from a rather simple underlying C struct, this isn't particularly hard, but it does require supporting some additional functionality to be 100% compatible.

8. I left Amazon 4 months ago after getting an offer I couldn't refuse. I was the sole GPU coder on this project. I can count the number of people I'd trust with an engine like this with two hands and most of them are already building deep learning engines elsewhere. I'm happy to add whatever functionality is desired here. CNN and RNN support seem like two good first steps and the spec already accounts for this.

8. Ditto for a Python interface, easily implemented IMO through the Python C/C++ extension mechanism: https://docs.python.org/2/extending/extending.html

Anyway, it's late, and it's turned out to be a fantastic day to see the project on which I spent nearly two years go OSS.

shoyer · on May 11, 2016

Thanks for sharing your story!

Let me comment on file formats as someone familiar with both netCDF and deep learning.

I agree that netCDF is a sane binary file format for this application. It's designed for efficient serialization of large arrays of numbers. One downside is that netCDF does not support streaming without writing the data to intermediate files on disk.

Keep in mind that netCDF v4 is itself just a thin wrapper around HDF5. Given that your input format is basically a custom file format written in netCDF, I would have just used HDF5 directly. The API is about as convenient, and this would skip one layer of indirection.

The native file format for TensorFlow is its own custom TFRecords file format, but it also supports a number of other file formats. TFRecords is much simpler technology than NetCDF/HDF5. It's basically just a bunch of serialized protocol buffers [1]. About all you can do with a TFRecords file is pull out examples -- it doesn't support the fancy multi-dimensional indexing or hierarchical structure of netCDF/HDF5. But that's also most of what you need for building machine learning models, and it's quite straightforward to read/write them in a streaming fashion, which makes it a natural fit for technologies like map-reduce.

[1] https://www.tensorflow.org/versions/r0.8/api_docs/python/pyt...

scottlegrand · on May 11, 2016

Thanks for that! And boy, I wish I had the resources the TensorFlow team has to build standards like this and also to write their own custom CUDA compiler.

I do want the multi-dimensional indexing for RNN data though. Maybe support HDF5 directly is the path forward.

Thanks again!

xiphias · on May 11, 2016

Where do you wok now? It's interesting to hear what offer you couldn't refuse after being in so many places

zellyn · on May 11, 2016

https://www.linkedin.com/in/scott-le-grand-b752111

xiphias · on May 12, 2016

Thanks!