Similar to quasi-Newton methods (but not inspired by them), the CMA-ES is a **second
order** approach estimating a positive definite matrix within an
iterative procedure (more precisely: a covariance matrix, that is,
*on convex-quadratic functions*, closely related to the inverse
Hessian). This makes the method feasible on non-separable and/or
badly conditioned problems. In contrast to quasi-Newton methods,
the CMA-ES does not use or approximate gradients and does not even
presume or require their existence. This makes the method feasible
on **non-smooth** and even non-continuous problems, as well as on
multimodal and/or noisy problems. It turns out to be a
particularly reliable and highly competitive evolutionary
algorithm for local optimization (Hansen & Ostermeier
2001) and, surprising at first sight, also for global
optimization (Hansen & Kern
2004, CEC 2005, Hansen 2009 in BBOB-2009).

The CMA-ES has several **invariance
properties**. Two of them, inherited from the plain
evolution strategy, are (i) invariance to order preserving
(i.e. strictly monotonic) transformations of the objective
function value (that is, e.g., ||x||^{2} and
3||x||^{0.2}-100 are *equivalent* objective functions
with identical performance of CMA-ES), and (ii) invariance to
angle preserving (rigid) transformations of the search space
(including rotation, reflection, and translation), if the initial
search point is transformed accordingly. Invariances are highly
desirable: they imply uniform behavior on classes of functions and
therefore imply the generalization of empirical results.

The CMA-ES does not require a tedious
**parameter tuning** for its application. In fact, the choice
of strategy internal parameters is not left to the user
(arguably with the exception of population size λ).
Finding good (default) strategy parameters is considered as part
of the algorithm design, and not part of its
application—the aim is to have a well-performing algorithm
*as is*. The default population size λ is
comparatively small to allow for fast convergence. Restarts with
increasing population size (Auger & Hansen
2005) improve the global search performance. For the
**application** of the CMA-ES, an initial solution, an initial
standard deviation (step-size, variables should be defined such
that the same standard deviations can be reasonably applied to
all variables, see also here)
and, possibly, the termination criteria (e.g. a
function tolerance) need to be set by the user.
The most common applications are model calibration (e.g. curve fitting)
and shape optimisation.

- Four slides on randomized search and the CMA-ES (pdf).
- A written tutorial.
- Tutorial slides (pdf 1.8MB) presented at the GECCO 2013.
- CMA-ES at wikipedia
- Comparison studies
- BBOB 2009 Comparison with a wide variety of other algorithms on 24 benchmark functions
- CEC 2005 Comparison with other Evolutionary Algorithms on a Benchmark Function Set
- Kern et al (2004) compare the CMA-ES with two other Evolution Strategies, (1+1)-ES and CSA-ES, and with two Estimation of Distribution Algorithms, IDEA and MBOA, on 14 separable and non-separable benchmark functions (pdf 827KB).

- Source code page (C, C++, Java, Matlab, Octave, Python, Scilab)
- A list of references to
applications (pdf) of the CMA-ES (not quite exhaustive and
*entirely*outdated though).

- Hansen et
al (1995). On the adaptation of arbitrary normal mutation distributions
in evolution strategies: The generating set adaptation. In L. Eshelman
(Ed.),
*Proceedings of the Sixth International Conference on Genetic Algorithms*, Pittsburgh, pp. 57-64. Morgan Kaufmann.The generating set adaptation (GSA) is the first evolution strategy to our knowledge that learns a problem scaling

*and*is invariant to coordinate system rotations. The GSA implements the principle idea of CMA, to increase the likelihood of selected steps, without using the formalism of a covariance matrix. Global step-size control is done by mutative self-adaptation. The paper also provides an algorithm that learns individual step-sizes and one direction (AII). AII has a linear number of strategy parameters and hence learns a less rich model quicker. - Hansen
and Ostermeier (1996). Adapting arbitrary normal mutation
distributions in evolution strategies: The covariance matrix
adaptation. In
*Proceedings of the 1996 IEEE International Conference on Evolutionary Computation*, pp. 312-317.The first CMA paper, where the covariance matrix adaptation is introduced into the (1,λ)-ES (μ=1). The paper emphasizes on the evolution path and the differences to the generating set adaptation (Hansen et al 1995). Using the formalism of a covariance matrix has three important consequences. First, path length control (cumulative step-size adaptation, CSA) can be used to control the global step-size. This will become particularly relevant when μ is chosen to be greater than 1. Second, the eigensystem, determined by the covariance matrix, can give insights into the underlying problem structure. De facto, it turns out to be particularly useful to monitor the time evolution of eigenvalues of the covariance matrix. Third, the computational expense to

*generate*a new individual reduces from*O(n*to^{3})*O(n*. Because the eigendecomposition can be postponed for^{2})*Ω(n)*iterations, the overall internal expenses become*O(n*.^{2}) - Hansen
and Ostermeier (1997). Convergence properties of evolution strategies
with the derandomized covariance matrix adaptation: The
(μ/μ ES. In_{I}, λ)-*EUFIT'97, 5th Europ.Congr.on Intelligent Techniques and Soft Computing, Proceedings*, pp. 650-654.Introduction of the "multi-membered" (μ/μ,λ)-CMA-ES with intermediate recombination of all μ parental solutions. Investigation of the effect of μ for small λ on unimodal test functions, some including distorted selection. The (3/3,12)-CMA-ES is recommended as the best compromise for being fast and robust.

- Hansen
and Ostermeier (2001). Completely Derandomized Self-Adaptation in
Evolution Strategies.
*Evolutionary Computation*, 9(2), pp. 159-195 (2001).A comprehensive article, where weighted recombination is introduced in the (μ/μ,λ)-CMA-ES. Objectives and rationales behind the strategy are discussed as well as (default) strategy parameter choices in detail. Comparisons with the so-called

*correlated mutations*and scale-up investigations, up to 320-D, on a considerable number of unimodal test functions are shown. On the multimodal Rastrigin function the myth that (local) adaptation to the topography of the function jeopardizes global search properties is disproved. Local adaptation allows for larger step-sizes and consequently is likely to improve global search performance. -
Hansen et al (2003). Reducing
the Time Complexity of the Derandomized Evolution Strategy with
Covariance Matrix Adaptation (CMA-ES).
*Evolutionary Computation*, 11(1), pp. 1-18 (2003).Extension with the so-called rank-μ update of the covariance matrix which scales up the efficiency of the CMA to large population sizes. The rank-μ update reduces the time complexity, i.e. the number

*of generations*to adapt the complete matrix roughly from 10*n*^{2}to 20*n*. The other way around, the learning speed, with respect to the number of objective function evaluations, depends much less on the population size if the rank-μ update is used. Reasonable population sizes range between 5 and 5*n*. For larger population sizes a deficiency of the step-size adaptation is discovered (that has been solved in Hansen and Kern (2004)). Simulations on a number of unimodal test functions and scale-up investigations are presented. -
Hansen and Kern (2004). Evaluating the CMA Evolution Strategy on
Multimodal Test Functions. In
*Eighth International Conference on Parallel Problem Solving from Nature PPSN VIII, Proceedings*, pp. 282-291.In this paper 1) the

*weighted*rank-μ update of the covariance matrix is introduced. 2) the step-size adaptation problem discovered in Hansen et al (2003) is addressed by a remarkably improved setting of the step-size control parameters: for large population sizes the backward time horizon of the evolution path is diminished by increasing the parameter*c*_{σ}and the step-size damping*d*_{σ}is increased. 3) the influence of the population size on the performance is investigated on a number of multimodal functions, revealing that an increased population size can dramatically improve the global search performance. Comparisons to other optimization algorithms reveal that on additively decomposable (and therefore separable) functions the CMA-ES can be vastly outperformed, e.g., by*differential evolution*, while it reveals superior performance on non-separable functions. - Auger
and Hansen (2005). A Restart CMA Evolution Strategy With Increasing
Population Size. In
*IEEE Congress on Evolutionary Computation, CEC 2005, Proceedings*, pp. 1769-1776.The remaining strategy parameter

*population size*, λ, that is most important for the global search performance, is removed by implementing a restart mechanism. Before each restart the population size is increased by a factor of two. Otherwise restarts are conducted independently. Tests on a benchmark function set of 25 functions provided for the session on real-parameter optimization reveal highly competitive performance on local*and*global optimization problems, compared to the other submitted algorithms. -
N. Hansen (2006). The CMA Evolution Strategy: A Comparing Review.
In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.).
Towards a new evolutionary computation. Advances in estimation of
distribution algorithms. Springer, pp. 75-102.
The CMA is approached and analyzed from the view point of

*estimating a search distribution*. Important differences to the so-called*estimation of distribution algorithms*(EDAs) are revealed. 1) in CMA the distribution of the selected*steps*is estimated, while in EDAs the selected*points*are modeled. This leads invariably to larger variances in CMA. In particular, a "pure EDA" converges prematurely on a linear function whereas a "pure CMA" (without step-size control) does not. 2) in EDAs the estimation is usually done from scratch, while in CMA the distribution is updated (at least for population size λ < 4*n*^{2}). Therefore in CMA the estimation can be done reliably*independent of the population size*and, for a small population size, in a lesser number of function evaluations. 3) in CMA an additional global step-size adaptation is conducted, based on a different adaptation principle (cumulative step-size adaptation, or path length control) that often leads to a considerably larger population spread and prevents premature convergence even more effectively. Major concepts and design principles, namely change rates, invariance, and stationarity are discussed. - Jastrebski
and Arnold (2006). Improving evolution strategies through active
covariance matrix adaptation. In
*2006 IEEE World Congress on Computational Intelligence, Proceedings*, pp. 9719-9726Introduces an additional

*negative*rank-μ update of the covariance matrix using the μ worst of the λ individuals. All vectors for both, the original and the negative, rank-μ updates have the same weight. For population sizes up to λ=4*n*a significant speed-up compared to the original equally weighted rank-μ update can be observed. On the discus (or tablet) function the speed-up exceeds a factor of two for large dimensions and small population sizes. While positive definiteness of the covariance matrix cannot be guaranteed anymore, empirically the covariance matrix remains always positive definite. For a larger population size presumably the learning parameter β needs to be modified. - Igel et al
(2006). A Computational Efficient Covariance Matrix Update and a
(1+1)-CMA for Evolution Strategies. In
*Genetic and Evolutionary Computation Conference, GECCO 2006, Proceedings*, pp.453-460.This paper introduces 1) the CMA into the (1+1)-ES using an improved variant of a success rule based step-size adaptation in place of the path length control, and 2) an incremental

*O(n*Cholesky update of the covariance matrix replacing the original^{2})*O(n*covariance matrix update altogether with the iterative^{2})*O(n*eigendecomposition of the covariance matrix (the latter is usually postponed until after^{3})*n/10*generations). The resulting algorithm is an elegant CMA variant and accomplishes what is expected. Nevertheless, its practical relevance is limited. The (1+1)-ES is slightly faster than its ","-counterpart, but more prone to converge to local optima. Applying the Cholesky update prohibits to utilize the evolution path for the covariance matrix update. This is a significant disadvantage as the evolution path considerably improves the performance on many functions and was addressed in Suttorp et.al. (2009). - 2007- to be done.