While LeCun and Ng are real-world experts on AI and deep learning, the other two people in the article have very little technical understanding of deep learning research.
The huge triumph of DL has been figuring out that as long as you can pose a problem in a differentiable way and you can obtain a sufficient amount of data, you can efficiently tackle it with a function approximator that can be optimized with first order methods - from that, flows everything.
We have very little idea how to make really complicated problems differentiable. Maybe we will - but right now the toughest problems that we can put in a differentiable framework are those tackled by reinforcement learning, and the current approaches are incredibly inefficient.
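Concretely, the recipe is: write a differentiable loss, then follow its gradient with a first-order method. A toy sketch (the model, numbers, and names are my own illustration, not from the article), fitting a one-parameter model with plain gradient descent:

```python
import numpy as np

# Toy "pose it differentiably, then optimize with a first-order method":
# fit y = w*x by descending the gradient of a differentiable squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

w = 0.0   # single parameter
lr = 0.1  # learning rate
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of the mean squared error
    w -= lr * grad                       # first-order update

# w ends up close to the true slope 3.0
```

Deep learning is this loop scaled up: millions of parameters, a fancier differentiable function, and automatic differentiation instead of a hand-written gradient.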
> The huge triumph of DL has been figuring out that as long as you can pose a problem in a differentiable way and you can obtain a sufficient amount of data, you can efficiently tackle it with a function approximator that can be optimized with first order methods - from that, flows everything.
This isn't really what is responsible for the success of deep learning. Lots and lots of machine learning algorithms existed before deep learning which are essentially optimizing a (sub-)differentiable objective function, most notably the LASSO. Rather, it's that the recursive/hierarchical representations used by DL are somehow a lot better at representing complicated functions than things like e.g. kernel methods. I say "somehow" because exactly why and to what extent this is true is still an active subject of research within theoretical ML. It happens in many areas of math that "working in the right basis" can dramatically improve one's ability to solve certain problems. This seems to be what is happening here, but our understanding of the phenomenon is still quite poor.
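For reference, the LASSO objective mentioned above is only sub-differentiable (the L1 penalty has a kink at zero), yet it's still optimized with first-order machinery. A minimal ISTA-style sketch (the data, penalty strength, and variable names are my own toy choices, not from the comment):

```python
import numpy as np

# Minimal proximal-gradient (ISTA) sketch for the LASSO objective:
#   ||Xw - y||^2 / (2n) + lam * ||w||_1
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, 0.0, -1.5, 0.0, 0.0])  # sparse ground truth
y = X @ w_true + rng.normal(scale=0.1, size=n)

lam, step = 0.05, 0.1
w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y) / n                 # gradient of the smooth part
    z = w - step * grad                          # ordinary first-order step
    w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold

# the L1 penalty drives the truly-zero coefficients to (essentially) exact zero
```

The point being: sub-differentiable objectives plus first-order updates long predate DL, so that combination alone doesn't explain DL's success.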
The "somehow" part reminds me of studying differential equations. A lot of them can't be solved by a mechanical step-by-step procedure - even the equation for an RLC circuit is classically solved by guessing an exponential ansatz that some really smart person figured out. Same goes for deep learning. A lot of these setups, like the "Inception" architecture and LeNet-5, seem like a lot of clever guesswork and intuition. However, unlike differential equations, you can't know that you've actually solved the problem perfectly once you've figured out the answer. It's always just good enough until the next paper comes out claiming a better solution on whatever the latest canonical benchmark is.
Bostrom is an expert on the thing he is talking about, the control problem and the long term of AI. He didn't make any specific claims about the near term future of AI, or deep learning.
>We have very little idea how to make really complicated problems differentiable.
All of the problems that deep learning solves were once called "really complicated" and "non-differentiable". There's nothing inherently differentiable about image recognition, or Go playing, or predicting the next word in a sentence, or playing an Atari game. NNs can excel at these tasks anyway, because they are really good at pattern recognition. Amazingly good. And this is an extremely general ability that can be used as a building block for more complex things.
The long-term future of AI is not a subject that allows for expertise. We simply cannot know what will happen, and in what context AI will develop.
So Bostrom is no more an expert in that field than a corner-store psychic is an expert on predicting her clients' love lives. But both are very comfortable speculating, and have a knack for framing the discussion in a way that appeals to their listeners.
If I take your first sentence as true, then it follows that no one (LeCun, Ng, Bostrom) is an expert, nor do any of them exceed the corner-store psychic's ability to predict what will come of AI.
I don't find that too hard to believe, personally, but it's unfair to single out Bostrom as a charlatan if you really believe that there's no way of knowing what will happen post-development of human-level AI.
I'm singling out Bostrom because he attempts to predict events over a much larger timescale, which diminishes the value of those predictions, because they are contingent on so many other developments we can't know. LeCun and Ng, as I said, are much more modest in their forecasts, aiming at the nearer term. Bostrom is basically dealing in the outcomes of very long chains of events, whose joint probability - the product of each individual event's probability - is very low.
But LeCun and Ng also made long-term predictions. LeCun said it's unlikely that AI will be dangerous, and Ng said it's unlikely to happen in our lifetime. Both of those predictions are just as speculative as Bostrom's.
>Bostrom is basically dealing in the outcomes of very long chains of events, whose joint probability - the product of each individual event's probability - is very low.
That's not how probability works. You can't just assume Bostrom is wrong by default. That might work for futurists like Hanson or Kurzweil, who do that. But Bostrom isn't making a huge number of assumptions. His book presents a large number of possible futures, and the dangers of each.
Bostrom has to stack assumption on tenuous assumption because that's the nature of the problem. But some things get more certain the closer to a genuine superintelligence one gets, such as the assertion that it is the last invention humans need devise.
Bostrom is also using the tools of human philosophy assuming they are general enough to apply to superintelligence. So he comes off as inherently anthropomorphizing even as he warns against doing just that.
He said Superintelligence was a very difficult book to write and that's probably part of what he meant by "difficult."
There is plenty to doubt. One big doubt about the danger of AI is that AI is not an animal. It is not alive like an animal. It has no death like an animal has. It doesn't need a "self." It doesn't propagate genes. It did not evolve through natural selection. So, except for Bostrom's use of whole brain emulation as a yardstick, there aren't many of the commonplace things that make humans dangerous that need to be in an AI.
But if the ideas of "strategic advantage" are in general correct, in the way Bostrom uses them, then Bostrom is right to say we are like a child playing with a bomb.
To quote Wittgenstein: “Whereof one cannot speak, thereof one must be silent.” Writing about the unwritable makes books hard to write. The tools of philosophy, in the 20th century at least, rely heavily on obscurantism and impenetrable jargon, to which Bostrom is no stranger.
I found Superintelligence very readable. Very little jargon. I suspect he is wrong and that superintelligence is the Y2K of the 21st century. I suspect that super-machine-brains will dominate us as much as power plants sneer at our puny energy output. The most alarming scenarios assume anything, like practical nanotech, could be created by bootstrapping from a powerful brain in a box. And once you make that assumption, kaboom.
Joint probability goes both ways: Ng's prediction that X% of workers will become obsolete is just as unlikely as any other future, including Bostrom's, since any long string of probable outcomes is highly unlikely.
We can't know exactly what will happen, but we can make informed speculation. Bostrom studied many different scenarios for the future of AI, and the arguments for and against them. He's even contributed to AI safety research, the study of methods to control future intelligent algorithms.
So yes he absolutely is more informed than a psychic, or even an expert in today's machine learning. And he isn't even making very specific predictions. He's not giving an exact timetable of when we will have AI, or what algorithms it will use. Just a very general prediction that we will get superintelligence someday, and that it will be very dangerous if we don't solve the control problem.
These seem like very reasonable statements, and have strong arguments for them IMO.
Can you translate your great technical summary into a few potential applications to help us understand what might be on the near horizon for AI/DL?
LeCun cites photo recognition, Ng cites autonomous trucks, Nosek cites auto-scaling difficulty for online courses and some kind of magnetic brain implant.
Seems to me that these are all fringe/isolated use-cases - like learning to use your fingers one at a time without learning the concept of how to use them together to grasp an object. Perhaps once we get better with each of these fringe "senses" we'll be able to create some higher level intelligence that can leverage learnings across dimensions that we haven't even considered yet.
I think most people are very bad at predicting the future, and I know that I'm very bad at it, so I won't try to make detailed predictions about the long term effect of AI since I would get it all spectacularly wrong.
However, in the short term, I think the areas where AI has the potential to make the biggest impact are those where we have huge amounts of data (and gathering data is cheap/getting cheaper) and an obvious objective. I am really excited to see what DeepMind will do with their collaboration with the NHS, and I think some incredibly exciting things will happen when people with a deep understanding of biology collaborate with people who really understand deep learning.
The difficulty is that in industry the people that often have an easy time raising funds are impressive bullshit artists that are great at selling stuff (Theranos, Verily) while in academia people often spend too much time in their own comfortable silos, and straddling multiple fields is a very difficult act to pull off if you want to ever get tenure.
TBF, it's extraordinarily difficult for a full-time scientist to do better at PR than a full-time bullshitter. Some (e.g., Ng) pull it off, but they tend to be exceptionally brilliant, exceptionally dedicated to their work, and well-supported by their exceptional employers.
Most academic people tend to spend all their time in their own comfortable silos for the same reason most industry people spend time in their (MUCH more comfortable!) silos -- there's enough to keep everyone busy for 40-60+ hours/wk without leaving the silos, and families/eating/sleeping are all important too.
Not to mention they all solve problems for which the training sets lie on low-dimensional manifolds within a very high-dimensional space. And this brings about arbitrary failures when one goes out of sample and it also serves as the basis for creating adversarial data with ease (use the gradient of the network to mutate the data just a tiny little bit).
I suspect there's a promising future in detecting and potentially correcting for out of sample data, even if the methods for doing so are non-differentiable.
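The gradient-based attack the parent comment describes can be sketched in a few lines. This uses a toy linear classifier instead of a deep net (the weights, input, and epsilon are made up for illustration), but the mechanism - nudge the input a tiny bit along the sign of the loss gradient - is the same one used to generate adversarial examples against deep networks:

```python
import numpy as np

# FGSM-style adversarial perturbation on a toy logistic classifier.
w = np.array([1.0, -2.0, 0.5])   # assumed "trained" weights
x = np.array([0.4, -0.3, 1.0])   # input the model classifies as positive
label = 1.0

def prob(x):
    # model's confidence that x belongs to the positive class
    return 1.0 / (1.0 + np.exp(-w @ x))

# gradient of the logistic loss with respect to the INPUT (not the weights)
grad_x = (prob(x) - label) * w
eps = 0.3
x_adv = x + eps * np.sign(grad_x)  # tiny signed step that increases the loss

# prob(x) is confidently positive; prob(x_adv) drops sharply
```

On a real deep net the only change is that `grad_x` comes from backpropagating the loss to the input layer, which is exactly the "use the gradient of the network to mutate the data" trick mentioned above.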
> The huge triumph of DL has been figuring out that as long as you can pose a problem in a differentiable way and you can obtain a sufficient amount of data, you can efficiently tackle it with a function approximator that can be optimized with first order methods - from that, flows everything.
But this only defines under which circumstances Deep Learning will produce a solution, this doesn't tell us why DL has been so effective. There is little point in an algorithm guaranteed to converge if that algorithm has little applicability.
To me, DL's triumph was being able to make advances in fields that were at a standstill with traditional methods (CV is an obvious one, but natural language processing is a good one as well). This in turn has attracted enough attention that DL is now being considered for a very wide variety of problems. Obviously, DL won't be successful on all of them but that's science as usual.
Bostrom brings a different set of skills to the table; to dismiss him for his lack of technical understanding is to ignore the problem itself: that our technology is so powerful a social force that sometimes even its creators can't fathom its impact.
Oh, much simpler than that - simply that you can pose the problem you are trying to solve in a way that the derivative can be calculated. If you can, you can bring the whole deep learning machinery to bear on it.
Being able to calculate the derivative is very useful (the backprop rule, etc.). However, it isn't actually necessary here, since you can approximate the slope by just sampling points nearby. Strict differentiability isn't the crucial property here, but it does make things faster.
What is crucial, I think, is that the error function is continuous and there is a monotonic path of improvement from most of the space to good solutions in it. Then you can just descend the error function until you reach one.
We don't really understand why that happens in deep learning, that is, why the error functions of deep neural networks tend to not have complex shapes with local minima that you can get stuck in. This is currently being investigated.
This also isn't a problem limited to deep learning. In evolution, the same is true - we are limited to small local improvements, but over large amounts of time nature arrives at amazingly efficient solutions. That suggests that the fitness function space over genomes is also, to some extent, continuous and allows monotonic improvement. Again, we don't really know why, yet.
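The "sample nearby points and keep the best" idea above looks roughly like this - a crude derivative-free descent that only works because the loss is continuous and small improvements compose monotonically (the loss function, step size, and sample count are my own toy choices):

```python
import numpy as np

# Derivative-free local search: no gradients, just sampling nearby points
# and accepting whichever one improves the loss.
rng = np.random.default_rng(0)

def loss(w):
    # smooth toy loss with its minimum at (2, -1)
    return (w[0] - 2.0) ** 2 + (w[1] + 1.0) ** 2

w = np.zeros(2)
for _ in range(300):
    candidates = [w + 0.1 * rng.normal(size=2) for _ in range(8)]
    best = min(candidates, key=loss)
    if loss(best) < loss(w):   # monotonic path: only accept improvements
        w = best

# ends up near the minimum without ever computing a derivative
```

This is essentially what evolution does too: blind local mutations plus selection, relying on the fitness landscape admitting paths of incremental improvement.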
Finite difference approximations to the derivative are not practical at all when doing deep learning. Derivatives computed via backpropagation cost roughly the same as the forward pass, while derivatives approximated using finite differences require 2 passes PER parameter, and run into numerous numerical accuracy issues.
You don't actually need to calculate the derivative accurately. It's enough to sample several points near the current one and continue the search to the best of those.
The point here is that we are descending in a continuous and monotonic loss function. You don't need an accurate derivative or even an estimate of the derivative to do so, although it definitely helps a great deal.
Except that you are talking about the derivative with respect to each weight. In a deep neural network, you'd need several sample points for each weight in order to compute dC/dw. That is a LOT of work for a single step. The only way to do this in a timely fashion is via backprop, which requires a derivative.
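A rough sketch of that cost argument: central finite differences need two function evaluations per weight, while a single backward pass yields every dC/dw at once. (The "loss" below is a stand-in quadratic, not a real network, and the counts are just to make the scaling concrete):

```python
import numpy as np

# Cost of finite differences vs. an analytic ("backprop-style") gradient.
rng = np.random.default_rng(0)
W = rng.normal(size=50)        # pretend these are the network's weights
x = rng.normal(size=50)

def C(W):
    # stand-in for a network's loss: C = sum_i (W_i * x_i)^2
    return np.sum((W * x) ** 2)

evals = 0
h = 1e-5
grad_fd = np.zeros_like(W)
for i in range(W.size):        # one loop iteration PER parameter
    e = np.zeros_like(W)
    e[i] = h
    grad_fd[i] = (C(W + e) - C(W - e)) / (2 * h)
    evals += 2                 # two full evaluations per parameter

grad_exact = 2 * W * x**2      # the analytic answer, all at once

# evals is already 100 for just 50 parameters; real nets have millions
```

Backprop computes the exact-gradient equivalent in one forward plus one backward pass regardless of parameter count, which is why it, and not finite differencing, makes training deep nets feasible.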
Is there a term or subject area where one could learn more about the investigations underway on error functions that tend not to have complex shapes with local minima to get stuck in?
He's talking about the operations needing to be differentiable. So multiplication, addition, etc. are differentiable; tanh and sigmoid functions are differentiable. You need this because backpropagation is an application of the chain rule, so each constituent operation needs to be differentiable.
A lot of interesting advances have come from taking operations that seem like they aren't differentiable and turning them into differentiable operations that can be trained by backpropagation. See LSTMs and neural Turing machines.
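Spelled out on a single neuron (my own toy numbers, not from the thread): every op in the composition is differentiable, so the chain rule carries the gradient backwards through all of them, and a numerical check agrees:

```python
import numpy as np

# Chain rule through one neuron: loss = (tanh(w*x + b) - 1)^2
x, w, b = 0.5, 2.0, -0.3

# forward pass through differentiable ops
z = w * x + b                    # multiply, add
a = np.tanh(z)                   # differentiable nonlinearity
loss = (a - 1.0) ** 2

# backward pass: one local derivative per op, multiplied together
dloss_da = 2 * (a - 1.0)
da_dz = 1 - np.tanh(z) ** 2      # derivative of tanh
dz_dw = x
dloss_dw = dloss_da * da_dz * dz_dw

# numerical sanity check of the chain-rule result
h = 1e-6
num = ((np.tanh((w + h) * x + b) - 1.0) ** 2 - loss) / h
```

Backprop in a deep net is exactly this, applied recursively layer by layer; that's why every constituent operation has to have a usable derivative.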