This post is more than two years old and may contain outdated information.

Lecture notes on nature-inspired learning algorithms

April 12, 2020

Introduction #

Evolutionary algorithms #

Evolutionary algorithms are search-based global optimization algorithms based on natural phenomenon:

Evolutionary algorithms are very good at approximating solutions, but computational complexity and cost can be a limiting factor in production environments (see Evolved antenna).

Optimization problems #

An optimization problem can be solved numerically by adjusting the input (decision variable), to some function to find minimum or maximum output. The aim is to the find decision variables that gives minimum or maximum output.

A minimization problem is the inverse of maximization, so finding y = MAX[x(10 - x)] is equivalent to y = MIN[-x(10 - x)] (as illustrated below).

y = lambda x : x * (10 - x)
def MAX(fn, res, range_):
    for n in range_:
        res.append((n, fn(n)))
    return print(max(res, key = lambda tuple: tuple[1]))

MAX(y, [], range(100)) # (5, 25)
y = lambda x : -x * (10 - x)
def MIN(fn, res, range_):
    for n in range_:
        res.append((n, fn(n)))
    return print(min(res, key = lambda tuple: tuple[1]))

MIN(y, [], range(100)) # (5, -25)

Optimization problems can be function-based (cost function) or trial-and-error, static or dynamic (subject to change), constrained or unconstrained (most optimization problems are constrained).

To solve a constrained optimization problem: Lagrange multiplier method, MIN[f(x,y,z)] s.b. c = g(x,y,z), where the Lagrange function is L(x,y,z,l) = f(x,y,z) + l(g(x,y,z) - c), and l is the Lagrange multiplier, the stationary point is L(x,y,z,l) = 0

Exhaustive search

An exhaustive search method, or brute-force, is a numerical method that can be used to search a large but finite solution space using combinations of different variables.

Brute-force methods do not get stuck in local minimum with fine sampling and work with both continuous or discontinuous functions (no derivatives). However, brute-force can take long time to find global best, or missed due to under-sampling, and only practical with small number of variables in limited search space.

Nelder-Mead method

The Nelder-Mead method, or downhill simplex method, can be used to find local minimum of function with several variables (a simplex is an elementary shape formed in dimension n + 1, such as 2D triangle, where points are referred to as best, B, good, G, and worst, W.

The process to search for local minimum is repeated until acceptable error and the points BGW are iterated when finding better points.

For example: f(x,y) = x + y with x = 2 and y = 3 gives f(2,3) = 5, and point W = (4,5) gives f(4,5) = 9, so move towards (2,3), which is smaller (minimum)

Line minimization

A line minimization method, or line search, starts at a random point and selects direction to move until cost function changes (increases or decreases depending on goal).

Line search methods are generally robust and fast to converge, but use gradients, so function need to be differentiable, and is not guaranteed to find global minimum:

  1. initialise starting point

  2. compute search direction

  3. compute step length

  4. update, x[k+1] = x[k] - f'(x[k]) / f''(x[k]

  5. check convergence, or repeat from step 2

Random walk

The random walk method use random numbers to move in different directions, which is particularly useful to avoid getting stuck in local minimum or explore different regions to locate additional minimums, concentrate search to smaller region to refine solution:

  1. initialise starting point and best point (this can be the same point)

  2. evaluate starting point

  3. generate random correction

  4. update, x[k+1] = x[k] + D[k] * h[k] where D is random search direction matrix and h is step size vector

  5. evaluate function at point and update best point

  6. return best point, or repeat step 3

Random walk and other methods using random numbers work with discrete and continuous variables, discontinuous functions and non-differential functions (no derivatives), and not as sensitive to initial point selection but more sensitive to control parameters (direction and step size).

Note that solutions found using random numbers are not repeatable, so multiple runs needed to verify.

Genetic algorithms (GA) #

Genetic algorithms describe population-based methods that are inspired by natural selection. Darwin's theory of evolution states that an offspring has many characteristics of its parents, which implies population is stable, and variations in characterisitcs between individuals passed from one generation to the next often is from inherited characteristics, where some percentage of offsprings survive to adulthood.

Binary Genetic Algorithms (BGA) #

Binary genetic algorithms (BGA) work well with continuous and discrete variables and can handle large number of decision variables. The decision variables are evaluated with cost functions. BGA is less likely to get stuck in local minimum.


N[var] number of decision variables of chromosome
N[bits] total number of bits of chromosome
N[pop] population size
N[pop]N[bits] total number of bits in population
X selection rate in step of natural selection
N[keep] = XN[pop] number of chromosomes that are kept for each generation
N[pop] - N[keep] number of chromosomes to be discarded
x[low] lower bound of variable x
x[high] upper bound of variable x
P[n] probability of chromosome n in mating pool N[keep] to be chosen
c[n] cost of chromosome n
C[n] normalised cost of chromosome n
mu mutation rate, which is probability of mutation

Decision variables are represented as chromosomes, such as [v[1], v[2], ..., v[N[var]], where each gene is coded by m-bits, so total number of bits per chromosome is N[bits] = mN[var], cost is evaluated by some cost function, and result is presented as sorted table, or cost table (most fit at top).

A population, N[pop], is a group of chromosomes, each representing a potential solution to function, such as f(x,y), where (x, y) represent some chromosome, such as [1100011, 0011001].

Natural selection

The selection process imitates natural selection, where only best potential solutions are selected. A selection occur each generation, or iteration, and selection rate, X, is fraction of population that survive.

The number of chromosomes that are kept is N[keep], where best are kept and worst are discarded (replaced by offsprings): N[pop] = 8 and X = 0.5, then N[keep] = X * N[pop] = 8 * 0.5 = 4, so keep best four chromosomes.

Thresholding is computationally cheaper than selection rate, where chromosomes with costs lower than threshold survive and higher are discarded (minimization). A new population is generated when no chromsomes are within threshold (threshold can be updated with each iteration).

Binary encoding and decoding

Below is an example of binary encoding and decoding:

Bits required to achieve precision

Below is a method to find bits required to achieve precision of d decimal places given a range:

Selection #

The selection process involves selecting two chromosomes from the mating pool to produce two new offsprings, and the offsprings are either kept or discarded based on parameters:

Weighted random pairing

Weighted random pairing, or roulette wheel weighting, assign probabilities that are inversely proportional to cost to chromosomes in mating pool, where chromosomes with lowest cost have greatest chance to mate:

Tournament selection

Tournament selection is problem independent and work best with larger population size, but sorting is computationally expensive, chromosomes with good quality have higher chance to be selected: randomly pick small subset of chromosomes from mating pool in N[keep], where chromosomes with lowest cost in subset become parent.

Crossover and mutation #

The crossover process create offsprings by exchanging information between parents selected in the selection process:


A mutation is a random alteration of certain bits in cost table with chromosomes, this is to allow the algorithm to explore a wider cost surface by introducing new information:

  1. choose mutation rate, mu, which represent probability of mutation

  2. determine number of bits to be mutated

    • with elitism, mu(N[pop] - 1) * N[bits]
    • without elitism, mu(N[pop]) * N[bits]
  3. flip bits (at random or otherwise)

The selection-crossover-mutation process should continue until the population meets some stopping criteria (convergence), such as finding an acceptable solution, exceeded maximum number of iterations, no changes to chromosomes, no changes to cost, or population statistics on mean and minimum cost.

Continuous Genetic Algorithm (CGA) #

Continuous genetic algorithms (CGA) are similar to a binary genetic algorithm, except that the chromosome is represented by a string of real numbers, and the crossover and mutation process is different. It is more logical to represent machine precision in floating-point and results in faster execution (no encoding and decoding).

CGAs can deal with complex problems with high dimensionality and only limited to internal precision errors in computers.

Crossover and mutation #

Crossover is similar to BGA, but exchanging variables instead of bits, so the crossover point is a random number between first and last variable of parent chromosomes:

In CGA, total number of variables that can be mutated in population is mu * (N[pop] - 1) * N[var] (elitism) and mutate chosen variables P'[n] = P[n] + C * N[n](0, 1), where C is constant and N[n](0, 1) is normal distribution.

Advanced topics in GA #

Handling expensive cost functions #

A complicated cost function can be computationally expensive and time consuming for evaluation. Here are a few approaches to reduce the number of function evaluations:

Multiple-objective optimization #

Multiple-objective optimization refers to optimizing for a set of cost functions where objectives conflict:

Weighted sum approach #

Weighted sum approach is a simple approach to multiple-objective optimization, where cost functions are combined into a single cost function by weighting each cost function, as such g(x) = SUM[w * f(x)], where w > 0 and SUM[w] = 1.

Gray codes #

Gray codes are used to reduce the probability of crossover resulting in extreme offsprings, where only one bit is changed between two consecutive numbers in the code. Hamming distance is the number of bits that are different between two numbers in the code.

Permutation problems #

A permutation problem is a problem where the order of variables is important, such as the traveling salesman problem, and normal crossover and mutation is not suitable.

Partally matched crossover

In partally matched crossover (PMX):

  1. select crossover points
P[1] = [3|46|215]
P[2] = [4|15|326]
  1. swap genes between parents (causing duplicate genes)
O[1] = [3|15|215]
O[2] = [4|46|326]
  1. replace duplicate genes with genes from other parent
O[1] = [3|15|2(4)(6)]
O[2] = [(1)|46|32(5)]

Ordered crossover

In ordered crossover (OX):

  1. select crossover points
P[1] = [34|62|15]
P[2] = [41|53|26]
  1. swap genes between parents (causing duplicate genes)
O[1] = [34|53|15]
O[2] = [41|62|26]
  1. cross out duplicate genes in parent
P[1] = [X4|62|X5]
P[2] = [41|53|XX]

O[1] = [??|53|??]
O[2] = [??|62|??]
  1. copy remaining genes from parent
O[1] = [62|53|14]
O[2] = [53|62|41]

Cycle crossover

In cycle crossover (CX):

  1. start from left-most gene in chromosome
P[1] = [3|46215]
P[2] = [4|15326]
  1. swap genes between parents
O[1] = [4|46215]
O[2] = [3|15326]
  1. move to next duplicate gene and swap at position
O[1] = [4|1|6215]
O[2] = [3|4|5326]
  1. repeat step 3 until all genes are unique
O[1] = [4162|2|5]
O[2] = [3453|1|6]
O[1] = [416|3|25]
O[2] = [345|2|16]

Penalty function #

A penalty function is a function that is added to the cost function to penalize infeasible solutions, such as solutions that violate constraints, such as fP(x) = f(x) + SUM[m](C(x)), where fP(x) is penalty cost function, f(x) unpenalised cost function, m number of constraints, and C(x) penalty function imposed on voilating constraints.

Genetic programming (GP) #

Genetic programming is a method used to evolve computer programs, where a program is represented as a tree structure, and the tree is evolved by crossover and mutation. A terminal set specify all variables and constants and function set contains all functions that can be applied to elements in terminal set. An evolved program is evaluated to measure performance within problem domain (fitness).

In a tree-based representation each chromosome is a program. An adaptive chromosome are variable length and size (tree depth), and shape is branching factor of nodes in tree. The domain-specific grammar includes terminal set (variables and constants, leaf nodes in tree), function set (operators, such as arithmetic, boolean, decision structures), and semantic rules (to create correct programs).

The genetic programming process:

  1. initialize population (random or otherwise)

  2. cost function to measure performance

  3. crossover and mutation to exchange information between parents

    • generate one offspring by selecting random node and replacing corresponding sub-tree of parent with other parent
    • generate two offsprings by selecting random node and swap sub-trees between parents
  4. mutate to explore new information

    • function node mutation
    • terminal node mutation
    • swap mutation
    • grow mutation
    • Gaussian mutation
    • Trunc mutation

Schema theorem #

The schema theorem is a template to represent subset of binary strings using symbols (* is any).

*1100 => { 01100, 111000 }
*110* => { 01100, 01101, 11100, 11101 }

Properties of schema:

S[1] = ***001*110 => o(S[1]) = 6, d(S[1]) = 10 - 4 = 6
S[2] = ****00**0* => o(S[2]) = 3, d(S[2]) = 9 - 5 = 4

Evolution strategies (ES) #

Evolution strategy: 1 + 1 #

The 1 + 1 evolutionary strategy, or (1 + 1)-ES, is one parent and one offspring, so population is only one individual, X = [x_1, ..., x_n]. There is no recombination process (crossover) but mutation can be applied with x'(t) = x(t) + a(t) * N_j(0,1), where a is parameter and N is Gaussian noise.

Evolution strategy: mu + 1 #

The (mu + 1)-ES is mu-parent and one offspring, so population is greater than one (mu > 1). Uniform crossover where two random parents are selected, chromosomes in offspring is randomly selected between the parents.

X(t) = [x_1(t), ..., x_n(t)]

x(t) = { x_1(t) if r <= 0.5, x_2(t) otherwise }

Offspring is muted similar to (1 + 1)-ES and best mu individuals from mu + 1 population survive to next generation.

Evolution strategy: mu + lambda #

The (mu + lambda)-ES is mu-parent and lambda-offspring, where best mu individuals from mu + lambda population survive to next generation. Elitism can be applied for faster convergence and better exploitation.

Evolution strategy: mu, lambda (comma) #

The (mu, lambda)-ES generate lambda offsprings from mu parents, where lambda is greater or equal to mu, which is greater or equal to 1. The best mu individuals from lambda offsprings replace parents. Elitism is not applied (better exploration and larger diversity).

Crossover operators #

A crossover operation can be local or global. A local crossover is two randomly selected parents producing one offspring. A global crossover is more than two randomly selected parents produce one offspring.

Discrete recombination is when actual components of parents are used to generate offspring. Intermediate recombination is weighted average of components of parents. No crossover select parent as offspring.

Example of local, discrete recombination:

x_j(t) = { x_1_j(t) if r_j <= 0.5, x_2_j(t) otherwise }
a_j(t) = { a_1_j(t) if r_j <= 0.5, a_2_j(t) otherwise }

r: random number [0,1]
parent 1: [x_11, x_12, x_13], [a_11, a_12, a_13]
parent 2: [x_21, x_22, x_23], [a_21, a_22, a_23]

offspring: [x_11, x_22, x_13], [a_11, a_22, a_13]

Example of local, intermediate recombination:

x_j(t) = r * x_1_j(t) + (1 - r) * x_2_j(t)
a_j(t) = r * a_1_j(t) + (1 - r) * a_2_j(t)

r: random number [0,1]
parent 1: [x_11, x_12, x_13], [a_11, a_12, a_13]
parent 2: [x_21, x_22, x_23], [a_21, a_22, a_23]

offspring: [r * x_11 + (1 - r) * x_21, r * x_12 + (1 - r) * x_22, r * x_13 + (1 - r) * x_23], [r * a_11 + (1 - r) * a_21, r * a_12 + (1 - r) * a_22, r * a_13 + (1 - r) * a_23]

Swarm intelligence #

A swarm intelligence is some system with collective behaviour by unsophisticated agents interacting locally with environment, causing emergent global patterns.

A swarm provides basis to explore collective or distributed problem solving without centralised control or provision of a global model.

Examples of swarms in nature:

Social colonies

A social colony is flexible enough to respond to internal perturbations and external challenges, robust enough to complete tasks even if some individuals fail, decentralised (no central control in colony), and self-organised (paths to solutions are emergent).

Self organisation

Self-organisation is a set of dynamic mechanisms where structures appear at global level from interactions of lower-level components (such as ants):

Stigmergy and artifical pheromone #

A stigmergy is a mechanism that mediate animal-to-animal interactions and form of indirect communication mediated by modifications of environment. Signs observed by individual triggers specific response or action (reinforce or modify signals to influence actions of others, positive or negative).

There are two forms of stigmergy:

An artificial pheromone can be an artifical stigmergy, such as indirect communication mediated by numeric modifications of environment states locally accessible to agent, or stigmergic variable, such as encapsulate information used by artifical ants to communicate indirectly.

Ant colony optimization #

Ants appeared on earth about 100 million years ago and have an estimated population of 10 ** 16 individuals. They are social, live in colonies of 30 to millions, and show collective behaviour (foraging, division of labour, cooperative support, construction).

Ants are stimulus-response agents where an individual perform simple and basic action based on local information. Simple actions appear to have a large random component.

Example of ant colony optimization applications:

Binary Bridge Experiments

The Binary Bridge Experiment is a simple and elegant experiment to study foraging behaviour of ants:

probability to choose path A: P_A(t + 1) = ((c + n_A(t)) ** a) / (((c + n_A(t)) ** a) + ((c + n_B(t)) ** a)) = 1 - P_B(t + 1)

n_A: number of ants on path A
n_B: number of ants on path B
c: degree of attraction of unexpected branch
a: bias using pheromone deposits in decision process

Example artificial ant decision process algorithm:

  1. generate random number, r[0,1]

  2. choose values, c and a

  3. for each path A, calculate P_A(t + 1)

  4. if r <= P_A, then follow path A

Simple Ant Colony Optimization (SACO)

The simple ant colony optimisation method is a graph ((i, j) is edge from node i to node j), where pi_ij(t) is pheromone concentration associated with edge at generation t (pi_ij(0) is assigned small initial random value) and L_k(t) is length of path from source to destination constructed by ant k.

The stopping criteria is maximum number of iterations, an acceptable solution, or most ants following the same path:

p_k_ij(t) = { pi_a_ij(t) / sum[pi_a_iu(t)] if j in N_k_i(t), 0 otherwise }

N_k_i(t): set of feasible nodes connected to node i with respect to and k
a: constant greater than zero
edge: (i, j)

pi_ij(t) = (1 - rho) * pi_ij(t)

rho: evaporation rate (random number between 0 and 1, not inclusive)
pi_ij(t + 1) = pi_ij(t) + sum[delta(pi_ij(t))]

delta(pi_ij(t)) = { Q / f(x_k(t)) if edge in path x_k(t), 0 otherwise }
x_k(t): solution of ant k
f(x_k(t)): quality of solution
Q: constant greater than 0

Ant Colony System (ACS)

An ant system (AS) is an improvement on SACO which includes heuristic information to transition probability, set of feasible nodes (immidiate neighbours, nodes not visited to prevent loops), different update strategies for pheromone intensity, and elitism.

An ant colony system (ACS) is an improvement on AS which includes different transition rule (favour specific nodes) and different pheromone update rule:

Particle Swarm Optimisation (PSO) #

Particle swarm optimization is a numerical and population-based optimisation technique to discover optimal regions of high-dimensional search space using collective behaviour of individuals:

Particle swarms expand the ant-swarm behaviour model to N-dimensional physchosocial space without constraints of physical laws. The swarm consists of set of particles and each particle represent potential solution for some N-dimensional optimization problem.

The velocity updated as previous velocity + cognitive bias (c_1) + social bias (c_2). The stopping criteria is maximum number of iterations, an acceptable solution, no improvement observed over a period of iterations, or swarm radius close to zero.

Global best

The global best particle swarm method use star topology, whole swarm as neighbours, and particles are updated based on social information from all particles in swarm (best position found by swarm):

global: Y(t) = [Y_1(t), ..., Y_n(t)]
personal: y_i(t) = [y_i_1(t), ..., y_i_n(t)]
v_ij(t + 1) = v_ij(t) + (c_1 * r_1_j(t)) * (y_ij(t) - x_ij(t)) + (c_2 * r_2_j(t)) * (Y_j(t) - x_ij(t))

c_n: acceleration constants
r_n: random numbers [0,1]
x_i(t + 1) = x_i(t) + v_i(t + 1)

x_i(t): [x_i_1, ..., x_i_n]
v_i: [v_i_1, ..., v_i_n]
y_i(t + 1) = { y_i(t) if f(x_i(t + 1)) >= f(y_i(t)), x_i(t + 1) otherwise }
min[f(y_1(t + 1), ..., f(y_n(t + 1)))]

Local best

The local best particle swarm method use ring topology, small number of particles as neighbours, and particles updated based on social information exchanged within neighbourhood of particle (local best position within neighbourhood):

global: Y(t) = [Y_1(t), ..., Y_n(t)]
personal: y_i(t) = [y_i_1(t), ..., y_i_n(t)]
v_ij(t + 1) = v_ij(t) + (c_1 * r_1_j(t)) * (y_ij(t) - x_ij(t)) + (c_2 * r_2_j(t)) * (Y_ij(t) - x_ij(t))

c_n: acceleration constants
r_n: random numbers [0,1]
x_i(t + 1) = x_i(t) + v_i(t + 1)

Different methods for selecting neighbourhoods:

Advanced topics in swarm intelligence #

Velocity clamping #

Velocity clamping is when particle velocity is adjusted before updating particle positions.

v'_ij(t + 1) = { v_ij(t + 1) if v_ij(t + 1) < V_max, V_max otherwise }

V_max: limit velocity to user-defined maximum value

An advantage of velocity clamping is that explotion is controlled, where otherwise velocity values could be updated to quickly to large values (pushing particles to boundaries). The disadvantages are that searching direction of particle is changed and particle search hypercube when velocity reach maximum.

Dynamic velocity approach

The dynamic velocity approach is method to change maximum velocity if global best position does not improve over a period of iterations.

V_max(t + 1) = { D * V_max(t) if f(Y(t)) >= f(Y(t - t')), V_max(t) otherwise }

V_max: limit velocity to user-defined maximum value
D: decrease linearly from 1 to 0.01

Example of exponentially decay in maximum velocity:

V_max(t + 1) = (1 - (t / n) ** a) * V_max(t)

a >= 0
n: maximum number of iterations

Inertia weight #

An inertia weight is an mechanism to control exploration and exploitation abilities of swarm without controlling velocity.

v_ij(t + 1) = w * v_ij(t) + (c_1 * r_1_j(t)) * (y_ij(t) - x_ij(t)) + (c_2 * r_2_j(t)) * (Y_j(t) - x_ij(t))

w: * > 0
c_1: cognitive bias
c_2: social bias

Other methods to adjust weight:

Differential evolution #

A differential evolution method is a stochastic and population-based search strategy where distance and direction information of population is used to guide search (difference vectors).

A difference vector provide information about position and fitness of individuals in landscape. A large distance between individuals means larger step sizes (explore more search space) and small distance between individuals means smaller step sizes (exploit local areas).

[X_1, X_2, ...., X_ns] = [[x_11, ..., x_1nx], ... [x_ns1, ..., x_nsnx]]

ns: size of population
nx: number of decision variables
C(t) = [x_1(t), ..., x_ns(t)]
u_i(t) = x_i_1(t) + B(x_i_2(t) - x_i_3(t))

u_i(t): trial vector
x_i_1: target vector
x_i_2: randomly selected individuals
x_i_3: randomly selected individuals
B: scale factor
x'_ij(t) = { u_ij(t) if j in J, x_ij(t) otherwise }

J: set of element indices to undergo crossover (crossover points)

Differential evolution strategies #

A differential evolution strategy is a variation to basic differential evolution with different ways to select target vectors, different number of difference vectors, and different ways to select crossover points.

Notation (DE/x/y/z)

x method of selecting target vector
y number of difference vectors
z crossover method (binomial, exponential)


A random individual from current population selected as target vector with one difference vector, any crossover method, and trial vector (same as basic differential evolution).

u_i(t) = x_i_1(t) + B(x_i_2(t) - x_i_3(t))


The best individual from current population is selected as target vector with one difference vector, any crossover method, and trial vector.

u_i(t) = x_best(t) + B(x_i_2(t) - x_i_3(t))


Any method to select individual from current population as target vector with nv difference vectors, any crossover method, and trial vector.

u_i(t) = x_i_1(t) + B * sum_nv[x_i_2_n(t) - x_i_3_n(t)]


A combination of random and best strategies with nv difference vectors, any crossover methods, and trial vector.

u_i(t) = A * x_best(t) + (1 - A) * x_i_1(t) + B * sum_nv[x_i_2_n(t) - x_i_3_n(t)]

A: random number [0,1] (1 is more exploitation, 0 is more exploration)


The parent vector selected as target vector with 1 + nv difference vectors (first difference vector formed by parent vector and best individual from current population), any crossover method, and trial vector.

u_i(t) = x_i(t) + B * (x_best(t) - x_i(t)) + B * sum_nv[x_i_2_n(t) - x_i_3_n(t)]