What are some recent advances in nonconvex optimization research?
What are some recent advances in nonconvex optimization research? originally appeared on Quora  the knowledge sharing network where compelling questions are answered by people with unique insights.Answer by Anima Anandkumar, Faculty at UC Irvine, Machine learning researcher, on Quora.
Nonconvex optimization is now ubiquitous in machine learning. While previously, the focus was on convex relaxation methods, now the emphasis is on being able to solve nonconvex problems directly.
It is not possible to find the global optimum of every nonconvex problem due to NPhardness barrier. An alternate approach is: when can it be solved efficiently (preferably in low order polynomial time). Recent theoretical work has established that many nonconvex problems can be solved nearoptimally, but simple iterative algorithms, e.g. gradient descent with random restarts. The conditions for success turn out be mild and natural for many learning problems.
Some examples of guaranteed nonconvex methods:
 Many latent variable models can be learnt by decomposing higher order moment tensors (which is a nonconvex problem). We are guaranteed to learn a consistent model, when the hidden or latent representation is nondegenerate: the effect of one hidden variable cannot be expressed as a linear combination of effects of others. See our paper. Also see an excellent blog post by Rong Ge.
 Nonnegative matrix factorization becomes tractable under a separability condition, see here. This condition may not always hold, but in many problems such as learning PCFGs in NLP, we can always add more features till we have a separable problem. See here.
 Nonconvex problems tend to work better in practice, but until now theory was only available for convex relaxation methods. This is changing fast: many nonconvex methods are proven to succeed in the regimes where convex relaxation succeeds. For instance, for the problem of robust PCA, where the goal is to find a low rank + sparse decomposition of the matrix, we show that a natural nonconvex method, which is more efficient than the convex method, have the same regimes of success. See here.
 For nonconvex problems, the main drawback is the need for good initialization. This is where practitioners tend to have good intuitions and employ domain specific knowledge to engineer these initializations. This is also now being tackled in theory. For instance, for sparse coding problem, we can initialize by finding solution to an overlapping clustering problem. See here and here.
 The above works are relying on gradient descent, or other local methods, with specific initializations. An alternative method is based on smoothing, where the function is progressively transformed into a coarsegrained objective through local smoothing. With sufficient amount of smoothing, it becomes convex, and its global optimum is used as initialization to a finergrained objective which is nonconvex. Recent works analyze when such smoothing methods succeed for nonconvex optimization.
Avoiding saddle points: Saddle points are critical points, where the gradient vanishes, but it is not a local optimum, meaning there exist directions where the objective value improves. Saddle points slow down stochastic gradient descent (SGD), especially as the number of variables grows. This is a very challenging problem, in fact, it has been argued that saddle points are much more of an issue than local optima; seehere. Recent works have made progress on how to escape saddle points efficiently in high dimensions. In fact, if the SGD is noisy enough, this paper shows that it can escape nondegenerate saddle points, which can be determined by second order derivatives. A more challenging case involves saddle points which require higher order derivatives to escape, and we analyze them in a recent work here.
Discrete optimization: So far, I have discussed continuous optimization. There is also interesting work in the analysis of discrete optimization, especially in the context of probabilistic inference. For instance, Stefano Ermon has been analyzing the effect of random projections on information projections; see here. Gibbs sampling is another popular technique for probabilistic inference, but it is challenging to establish a bound on the mixing time. A recent work by Christopher Re's group is trying to provide new analysis tools for Gibbs sampling. Previously, I have worked on learning Markov graph structure of probabilistic models, and show that simple and efficient methods succeed in the regime of "high temperature." You can read more about it here.
Other resources on nonconvex optimization:
 We had a recent workshop on nonconvex optimization at NIPS. Slides of invited talks can be found at NIPS 2015 Workshop on Nonconvex Optimization for Machine Learning: Theory and Practice. You can read my blog post about it here.
 We have an upcoming workshop as well: Advances in nonconvex analysis and optimization
 You can also see my recent interview with David Beyer, where I talk about convex vs. nonconvex optimization: Learning in higher dimensions
 Sanjeev Arora, Moritz Hardt and Nisheeth Vishnoi are maintaining a blog on nonconvex optimization: Off the convex path
More in News

Boy on motorbike collides with cow
NZ Newswire 
Trump's 'nasty woman' remark irks female voters
Associated Press 
Innocence proven after 46 years
Toronto Star 
Queen’s millionpound Brexit loss
The Telegraph 
Cracking Australia's creepiest cold case
9News.com.au 
Pictures of kids engaged to be married trigger outrage
The Washington Post