UP

## The CMA Evolution Strategy

The CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is an evolutionary algorithm for difficult non-linear non-convex black-box optimisation problems in continuous domain. The CMA-ES is considered as state-of-the-art in evolutionary computation and has been adopted as one of the standard tools for continuous optimisation in many (probably hundreds of) research labs and industrial environments around the world. The CMA-ES is typically applied to unconstrained or bounded constraint optimization problems, and search space dimensions between three and a hundred. The method should be applied, if derivative based methods, e.g. quasi-Newton BFGS or conjugate gradient, (supposedly) fail due to a rugged search landscape (e.g. discontinuities, sharp bends or ridges, noise, local optima, outliers). If second order derivative based methods are successful, they are usually faster than the CMA-ES: on purely convex-quadratic functions, f(x)=xTHx, BFGS (Matlabs function fminunc) is typically faster by a factor of about ten (in terms of number of objective function evaluations needed to reach a target function value, assuming that gradients are not available). On the most simple quadratic function f(x)=||x||2=xTx BFGS is faster by a factor of about 30.

Similar to quasi-Newton methods (but not inspired by them), the CMA-ES is a second order approach estimating a positive definite matrix within an iterative procedure (more precisely: a covariance matrix, that is, on convex-quadratic functions, closely related to the inverse Hessian). This makes the method feasible on non-separable and/or badly conditioned problems. In contrast to quasi-Newton methods, the CMA-ES does not use or approximate gradients and does not even presume or require their existence. This makes the method feasible on non-smooth and even non-continuous problems, as well as on multimodal and/or noisy problems. It turns out to be a particularly reliable and highly competitive evolutionary algorithm for local optimization (Hansen & Ostermeier 2001) and, surprising at first sight, also for global optimization (Hansen & Kern 2004, CEC 2005, Hansen 2009 in BBOB-2009).

The CMA-ES has several invariance properties. Two of them, inherited from the plain evolution strategy, are (i) invariance to order preserving (i.e. strictly monotonic) transformations of the objective function value (that is, e.g., ||x||2 and 3||x||0.2-100 are equivalent objective functions with identical performance of CMA-ES), and (ii) invariance to angle preserving (rigid) transformations of the search space (including rotation, reflection, and translation), if the initial search point is transformed accordingly. Invariances are highly desirable: they imply uniform behavior on classes of functions and therefore imply the generalization of empirical results.

The CMA-ES does not require a tedious parameter tuning for its application. In fact, the choice of strategy internal parameters is not left to the user (arguably with the exception of population size λ). Finding good (default) strategy parameters is considered as part of the algorithm design, and not part of its application—the aim is to have a well-performing algorithm as is. The default population size λ is comparatively small to allow for fast convergence. Restarts with increasing population size (Auger & Hansen 2005) improve the global search performance. For the application of the CMA-ES, an initial solution, an initial standard deviation (step-size, variables should be defined such that the same standard deviations can be reasonably applied to all variables, see also here) and, possibly, the termination criteria (e.g. a function tolerance) need to be set by the user. The most common applications are model calibration (e.g. curve fitting) and shape optimisation.

### Commented Bibliography (chronologically)

1. Hansen et al (1995). On the adaptation of arbitrary normal mutation distributions in evolution strategies: The generating set adaptation. In L. Eshelman (Ed.), Proceedings of the Sixth International Conference on Genetic Algorithms, Pittsburgh, pp. 57-64. Morgan Kaufmann.

The generating set adaptation (GSA) is the first evolution strategy to our knowledge that learns a problem scaling and is invariant to coordinate system rotations. The GSA implements the principle idea of CMA, to increase the likelihood of selected steps, without using the formalism of a covariance matrix. Global step-size control is done by mutative self-adaptation. The paper also provides an algorithm that learns individual step-sizes and one direction (AII). AII has a linear number of strategy parameters and hence learns a less rich model quicker.

2. Hansen and Ostermeier (1996). Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of the 1996 IEEE International Conference on Evolutionary Computation, pp. 312-317.

The first CMA paper, where the covariance matrix adaptation is introduced into the (1,λ)-ES (μ=1). The paper emphasizes on the evolution path and the differences to the generating set adaptation (Hansen et al 1995). Using the formalism of a covariance matrix has three important consequences. First, path length control (cumulative step-size adaptation, CSA) can be used to control the global step-size. This will become particularly relevant when μ is chosen to be greater than 1. Second, the eigensystem, determined by the covariance matrix, can give insights into the underlying problem structure. De facto, it turns out to be particularly useful to monitor the time evolution of eigenvalues of the covariance matrix. Third, the computational expense to generate a new individual reduces from O(n3) to O(n2). Because the eigendecomposition can be postponed for Ω(n) iterations, the overall internal expenses become O(n2).

3. Hansen and Ostermeier (1997). Convergence properties of evolution strategies with the derandomized covariance matrix adaptation: The (μ/μI, λ)-ES. In EUFIT'97, 5th Europ.Congr.on Intelligent Techniques and Soft Computing, Proceedings, pp. 650-654.

Introduction of the "multi-membered" (μ/μ,λ)-CMA-ES with intermediate recombination of all μ parental solutions. Investigation of the effect of μ for small λ on unimodal test functions, some including distorted selection. The (3/3,12)-CMA-ES is recommended as the best compromise for being fast and robust.

4. Hansen and Ostermeier (2001). Completely Derandomized Self-Adaptation in Evolution Strategies. Evolutionary Computation, 9(2), pp. 159-195 (2001).

A comprehensive article, where weighted recombination is introduced in the (μ/μ,λ)-CMA-ES. Objectives and rationales behind the strategy are discussed as well as (default) strategy parameter choices in detail. Comparisons with the so-called correlated mutations and scale-up investigations, up to 320-D, on a considerable number of unimodal test functions are shown. On the multimodal Rastrigin function the myth that (local) adaptation to the topography of the function jeopardizes global search properties is disproved. Local adaptation allows for larger step-sizes and consequently is likely to improve global search performance.

5. Hansen et al (2003). Reducing the Time Complexity of the Derandomized Evolution Strategy with Covariance Matrix Adaptation (CMA-ES). Evolutionary Computation, 11(1), pp. 1-18 (2003).
6. Hansen and Kern (2004). Evaluating the CMA Evolution Strategy on Multimodal Test Functions. In Eighth International Conference on Parallel Problem Solving from Nature PPSN VIII, Proceedings, pp. 282-291.

In this paper 1) the weighted rank-μ update of the covariance matrix is introduced. 2) the step-size adaptation problem discovered in Hansen et al (2003) is addressed by a remarkably improved setting of the step-size control parameters: for large population sizes the backward time horizon of the evolution path is diminished by increasing the parameter cσ and the step-size damping dσ is increased. 3) the influence of the population size on the performance is investigated on a number of multimodal functions, revealing that an increased population size can dramatically improve the global search performance. Comparisons to other optimization algorithms reveal that on additively decomposable (and therefore separable) functions the CMA-ES can be vastly outperformed, e.g., by differential evolution, while it reveals superior performance on non-separable functions.

7. Auger and Hansen (2005). A Restart CMA Evolution Strategy With Increasing Population Size. In IEEE Congress on Evolutionary Computation, CEC 2005, Proceedings, pp. 1769-1776.

The remaining strategy parameter population size, λ, that is most important for the global search performance, is removed by implementing a restart mechanism. Before each restart the population size is increased by a factor of two. Otherwise restarts are conducted independently. Tests on a benchmark function set of 25 functions provided for the session on real-parameter optimization reveal highly competitive performance on local and global optimization problems, compared to the other submitted algorithms.

8. N. Hansen (2006). The CMA Evolution Strategy: A Comparing Review. In J.A. Lozano, P. Larrañaga, I. Inza and E. Bengoetxea (Eds.). Towards a new evolutionary computation. Advances in estimation of distribution algorithms. Springer, pp. 75-102.

The CMA is approached and analyzed from the view point of estimating a search distribution. Important differences to the so-called estimation of distribution algorithms (EDAs) are revealed. 1) in CMA the distribution of the selected steps is estimated, while in EDAs the selected points are modeled. This leads invariably to larger variances in CMA. In particular, a "pure EDA" converges prematurely on a linear function whereas a "pure CMA" (without step-size control) does not. 2) in EDAs the estimation is usually done from scratch, while in CMA the distribution is updated (at least for population size λ < 4n2). Therefore in CMA the estimation can be done reliably independent of the population size and, for a small population size, in a lesser number of function evaluations. 3) in CMA an additional global step-size adaptation is conducted, based on a different adaptation principle (cumulative step-size adaptation, or path length control) that often leads to a considerably larger population spread and prevents premature convergence even more effectively. Major concepts and design principles, namely change rates, invariance, and stationarity are discussed.

9. Jastrebski and Arnold (2006). Improving evolution strategies through active covariance matrix adaptation. In 2006 IEEE World Congress on Computational Intelligence, Proceedings, pp. 9719-9726

Introduces an additional negative rank-μ update of the covariance matrix using the μ worst of the λ individuals. All vectors for both, the original and the negative, rank-μ updates have the same weight. For population sizes up to λ=4n a significant speed-up compared to the original equally weighted rank-μ update can be observed. On the discus (or tablet) function the speed-up exceeds a factor of two for large dimensions and small population sizes. While positive definiteness of the covariance matrix cannot be guaranteed anymore, empirically the covariance matrix remains always positive definite. For a larger population size presumably the learning parameter β needs to be modified.

10. Igel et al (2006). A Computational Efficient Covariance Matrix Update and a (1+1)-CMA for Evolution Strategies. In Genetic and Evolutionary Computation Conference, GECCO 2006, Proceedings, pp.453-460.

This paper introduces 1) the CMA into the (1+1)-ES using an improved variant of a success rule based step-size adaptation in place of the path length control, and 2) an incremental O(n2) Cholesky update of the covariance matrix replacing the original O(n2) covariance matrix update altogether with the iterative O(n3) eigendecomposition of the covariance matrix (the latter is usually postponed until after n/10 generations). The resulting algorithm is an elegant CMA variant and accomplishes what is expected. Nevertheless, its practical relevance is limited. The (1+1)-ES is slightly faster than its ","-counterpart, but more prone to converge to local optima. Applying the Cholesky update prohibits to utilize the evolution path for the covariance matrix update. This is a significant disadvantage as the evolution path considerably improves the performance on many functions and was addressed in Suttorp et.al. (2009).

11. 2007- to be done.
UP
last change \$Date: 2016-12-15 23:47:51 +0100 (Thu, 15 Dec 2016) \$  