weight_optimizer – Selection of weight optimizers
=================================================

Description
+++++++++++

A weight optimizer is an algorithm that adjusts the synaptic weights of a
network during training in order to minimize the loss function and thus
improve the network's performance on a given task. Weight optimization is an
essential part of plasticity rules such as e-prop plasticity.

Currently, two weight optimizers are implemented: gradient descent and the
Adam optimizer.

In gradient descent [1]_ the weights are optimized via:

.. math::
    W_t = W_{t-1} - \eta g_t \,,

where :math:`\eta` denotes the learning rate and :math:`g_t` the gradient of
the current time step :math:`t`.

In the Adam scheme [2]_ the weights are optimized via:

.. math::
    m_0 &= 0, \quad v_0 = 0, \quad t = 1 \,, \\
    m_t &= \beta_1 m_{t-1} + \left( 1 - \beta_1 \right) g_t \,, \\
    v_t &= \beta_2 v_{t-1} + \left( 1 - \beta_2 \right) g_t^2 \,, \\
    \alpha_t &= \eta \frac{ \sqrt{ 1 - \beta_2^t } }{ 1 - \beta_1^t } \,, \\
    W_t &= W_{t-1} - \alpha_t \frac{ m_t }{ \sqrt{v_t} + \hat{\epsilon} } \,.

Note that the implementation follows the Adam implementation in
TensorFlow [3]_ for comparability. The TensorFlow implementation deviates from
[2]_ in that it assumes :math:`\hat{\epsilon} = \epsilon \sqrt{ 1 - \beta_2^t }`
to be constant, whereas [2]_ assumes
:math:`\epsilon = \hat{\epsilon} / \sqrt{ 1 - \beta_2^t }` to be constant.

When ``optimize_each_step`` is set to ``True``, the weights are optimized at
every time step. If set to ``False``, optimization occurs once per spike,
resulting in a significant speed-up. For gradient descent, both settings yield
the same results under exact arithmetic; however, small numerical differences
may be observed due to floating point precision. For the Adam optimizer, only
setting ``optimize_each_step`` to ``True`` precisely implements the algorithm
described in [2]_. The impact of this setting on learning performance may vary
depending on the task.
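To make the two update rules concrete, here is a minimal NumPy sketch that
applies them to a single scalar weight. It is purely illustrative of the
equations above (using the TensorFlow-style constant :math:`\hat{\epsilon}`)
and is not the C++ implementation used by NEST; the function and variable
names are chosen here only for readability.

.. code-block:: python

    import numpy as np


    def gradient_descent_step(w, g, eta=1e-4):
        """One gradient descent update: W_t = W_{t-1} - eta * g_t."""
        return w - eta * g


    def adam_step(w, g, m, v, t, eta=1e-4, beta_1=0.9, beta_2=0.999, epsilon_hat=1e-7):
        """One Adam update with a constant epsilon_hat (TensorFlow convention)."""
        m = beta_1 * m + (1.0 - beta_1) * g        # first moment estimate m_t
        v = beta_2 * v + (1.0 - beta_2) * g ** 2   # second moment raw estimate v_t
        alpha = eta * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
        return w - alpha * m / (np.sqrt(v) + epsilon_hat), m, v


    # Toy usage: follow a constant gradient for a few steps.
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 6):
        w, m, v = adam_step(w, g=0.5, m=m, v=v, t=t)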
Parameters
++++++++++

The following parameters can be set in the status dictionary.

====================== ==== ========================= ========= =================================
**Common optimizer parameters**
-------------------------------------------------------------------------------------------------
Parameter              Unit Math equivalent           Default   Description
====================== ==== ========================= ========= =================================
``batch_size``                                        1         Size of batch
``eta``                     :math:`\eta`              1e-4      Learning rate
``optimize_each_step``                                ``True``  Optimize weights every time step
``Wmax``               pA   :math:`W_{ji}^\text{max}` 100.0     Maximal value for synaptic weight
``Wmin``               pA   :math:`W_{ji}^\text{min}` -100.0    Minimal value for synaptic weight
====================== ==== ========================= ========= =================================

========= ==== =============== ================== ==============
**Gradient descent parameters (default optimizer)**
----------------------------------------------------------------
Parameter Unit Math equivalent Default            Description
========= ==== =============== ================== ==============
``type``                       "gradient_descent" Optimizer type
========= ==== =============== ================== ==============

=========== ==== ================ ======= =================================================
**Adam optimizer parameters**
-------------------------------------------------------------------------------------------
Parameter   Unit Math equivalent  Default Description
=========== ==== ================ ======= =================================================
``type``                          "adam"  Optimizer type
``beta_1``       :math:`\beta_1`  0.9     Exponential decay rate for first moment estimate
``beta_2``       :math:`\beta_2`  0.999   Exponential decay rate for second moment estimate
``epsilon``      :math:`\epsilon` 1e-7    Small constant for numerical stability
=========== ==== ================ ======= =================================================

The following state variables evolve during simulation.

============== ==== =============== ============= ==========================
**Adam optimizer state variables for individual synapses**
----------------------------------------------------------------------------
State variable Unit Math equivalent Initial value Description
============== ==== =============== ============= ==========================
``m``               :math:`m`       0.0           First moment estimate
``v``               :math:`v`       0.0           Second moment raw estimate
============== ==== =============== ============= ==========================

References
++++++++++

.. [1] Huh D, Sejnowski TJ (2018). Gradient descent for spiking neural
       networks. Advances in Neural Information Processing Systems,
       31:1433-1443.
       https://proceedings.neurips.cc/paper_files/paper/2018/hash/185e65bc40581880c4f2c82958de8cfe-Abstract.html

.. [2] Kingma DP, Ba JL (2015). Adam: A method for stochastic optimization.
       Proceedings of the 3rd International Conference on Learning
       Representations (ICLR). https://doi.org/10.48550/arXiv.1412.6980

.. [3] https://github.com/keras-team/keras/blob/v2.15.0/keras/optimizers/adam.py#L26-L220

See also
++++++++

Examples using this model
+++++++++++++++++++++++++

.. listexamples:: eprop_synapse_bsshslm_2020
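The snippet below sketches how these parameters might be passed from PyNEST.
It assumes that the optimizer is selected through an ``optimizer`` dictionary
in the defaults of the e-prop synapse model; the example scripts listed above
show the authoritative usage.

.. code-block:: python

    import nest

    # Illustrative sketch (assumed configuration path): select the Adam
    # optimizer for subsequently created e-prop synapses, using the
    # parameter names from the tables above.
    nest.SetDefaults(
        "eprop_synapse_bsshslm_2020",
        {
            "optimizer": {
                "type": "adam",
                "batch_size": 1,
                "eta": 1e-4,
                "beta_1": 0.9,
                "beta_2": 0.999,
                "epsilon": 1e-7,
                "Wmin": -100.0,
                "Wmax": 100.0,
            }
        },
    )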