weight_optimizer – Selection of weight optimizers

Description

A weight optimizer is an algorithm that adjusts the synaptic weights in a network during training to minimize the loss function and thus improve the network’s performance on a given task.

This method is an essential part of plasticity rules like e-prop plasticity.

Currently two weight optimizers are implemented: gradient descent and the Adam optimizer.

In gradient descent [1] the weights are optimized via:

\[ W_t = W_{t-1} - \eta g_t \,, \]

where \(\eta\) denotes the learning rate and \(g_t\) the gradient of the current time step \(t\).
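
As a plain illustration of this update rule (not the NEST source code), a single gradient descent step can be written in a few lines of Python; the weight and gradient values below are arbitrary:

    import numpy as np

    def gradient_descent_step(w, g, eta=1e-4):
        """Single gradient descent update: W_t = W_{t-1} - eta * g_t."""
        return w - eta * g

    # Example: one update for a small set of synaptic weights
    w = np.array([10.0, -5.0, 2.5])   # current weights (pA)
    g = np.array([0.3, -0.1, 0.05])   # gradients at time step t
    w = gradient_descent_step(w, g)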

In the Adam scheme [2] the weights are optimized via:

\[\begin{split}
m_0 &= 0, \quad v_0 = 0, \quad t = 1 \,, \\
m_t &= \beta_1 m_{t-1} + \left( 1 - \beta_1 \right) g_t \,, \\
v_t &= \beta_2 v_{t-1} + \left( 1 - \beta_2 \right) g_t^2 \,, \\
\alpha_t &= \eta \frac{ \sqrt{ 1 - \beta_2^t } }{ 1 - \beta_1^t } \,, \\
W_t &= W_{t-1} - \alpha_t \frac{ m_t }{ \sqrt{v_t} + \hat{\epsilon} } \,.
\end{split}\]

Note that the implementation follows the TensorFlow implementation [3] for comparability. The TensorFlow implementation deviates from [2] in that it treats \(\hat{\epsilon} = \epsilon \sqrt{ 1 - \beta_2^t }\) as constant, whereas [2] treats \(\epsilon = \hat{\epsilon} \sqrt{ 1 - \beta_2^t }\) as constant.
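
The following Python sketch mirrors these equations for a single weight, adding \(\epsilon\) directly to \(\sqrt{v_t}\) in the role of the constant \(\hat{\epsilon}\), as in the TensorFlow-style convention above. It illustrates the update rule and is not the NEST implementation itself:

    import numpy as np

    def adam_step(w, g, m, v, t, eta=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-7):
        """One Adam update of weight w for gradient g at step t (t starts at 1)."""
        m = beta_1 * m + (1.0 - beta_1) * g            # first moment estimate
        v = beta_2 * v + (1.0 - beta_2) * g**2         # second moment raw estimate
        alpha_t = eta * np.sqrt(1.0 - beta_2**t) / (1.0 - beta_1**t)
        w = w - alpha_t * m / (np.sqrt(v) + epsilon)   # epsilon plays the role of constant epsilon-hat
        return w, m, v

    # m and v start at zero; t counts the updates, starting at 1
    w, m, v = 10.0, 0.0, 0.0
    for t, g in enumerate([0.3, -0.2, 0.1], start=1):
        w, m, v = adam_step(w, g, m, v, t)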

When optimize_each_step is set to True, the weights are optimized at every time step. If set to False, optimization occurs once per spike, resulting in a significant speed-up. For gradient descent, both settings yield the same results under exact arithmetic; however, small numerical differences may be observed due to floating point precision. For the Adam optimizer, only setting optimize_each_step to True precisely implements the algorithm as described in [2]. The impact of this setting on learning performance may vary depending on the task.
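
For gradient descent, the equivalence of the two settings can be checked with a toy example (ignoring weight clipping to Wmin and Wmax): applying the per-step updates one by one, or applying a single update with the accumulated gradient, yields the same weight up to floating point rounding.

    import numpy as np

    eta = 1e-4
    gradients = np.array([0.3, -0.2, 0.1, 0.05])  # per-time-step gradients between two spikes

    # optimize_each_step = True: update at every time step
    w_each_step = 10.0
    for g in gradients:
        w_each_step -= eta * g

    # optimize_each_step = False: one update when the next spike is processed
    w_per_spike = 10.0 - eta * gradients.sum()

    print(np.isclose(w_each_step, w_per_spike))  # True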

Parameters

The following parameters can be set in the status dictionary.

Common optimizer parameters

Parameter          | Unit | Math equivalent       | Default | Description
batch_size         |      |                       | 1       | Size of batch
eta                |      | \(\eta\)              | 1e-4    | Learning rate
optimize_each_step |      |                       | True    | If True, optimize the weights at every time step; if False, only when a spike is processed
Wmax               | pA   | \(W_{ji}^\text{max}\) | 100.0   | Maximal value for synaptic weight
Wmin               | pA   | \(W_{ji}^\text{min}\) | -100.0  | Minimal value for synaptic weight

Gradient descent parameters (default optimizer)

Parameter | Unit | Math equivalent | Default            | Description
type      |      |                 | "gradient_descent" | Optimizer type
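
Since gradient descent is the default, selecting it explicitly might look like the following sketch. The synapse model name eprop_synapse_bsshslm_2020 is an assumption (the optimizer is configured through the synapse model of the plasticity rule, e.g. e-prop); consult the documentation of the synapse model actually used for the exact interface.

    import nest

    # Assumed synapse model name; the "optimizer" dictionary follows the
    # parameter tables in this section.
    nest.SetDefaults(
        "eprop_synapse_bsshslm_2020",
        {"optimizer": {"type": "gradient_descent", "eta": 1e-4, "batch_size": 1}},
    )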

Adam optimizer parameters

Parameter | Unit | Math equivalent | Default | Description
type      |      |                 | "adam"  | Optimizer type
beta_1    |      | \(\beta_1\)     | 0.9     | Exponential decay rate for first moment estimate
beta_2    |      | \(\beta_2\)     | 0.999   | Exponential decay rate for second moment estimate
epsilon   |      | \(\epsilon\)    | 1e-7    | Small constant for numerical stability
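
Switching to the Adam optimizer and setting its parameters could then look as follows. As above, the synapse model name is an assumption, and the dictionary keys follow the parameter tables in this section.

    import nest

    optimizer_params = {
        "type": "adam",
        "eta": 1e-4,
        "beta_1": 0.9,
        "beta_2": 0.999,
        "epsilon": 1e-7,
        "optimize_each_step": True,  # True: exact Adam as in [2]; False: faster, once per spike
        "Wmin": -100.0,
        "Wmax": 100.0,
    }
    nest.SetDefaults("eprop_synapse_bsshslm_2020", {"optimizer": optimizer_params})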

The following state variables evolve during simulation.

Adam optimizer state variables for individual synapses

State variable | Unit | Math equivalent | Initial value | Description
m              |      | \(m\)           | 0.0           | First moment estimate
v              |      | \(v\)           | 0.0           | Second moment raw estimate

References

See also

Examples using this model