weight_optimizer – Selection of weight optimizers
=================================================

Description
+++++++++++

A weight optimizer is an algorithm that adjusts the synaptic weights of a
network during training in order to minimize the loss function and thus
improve the network's performance on a given task. Weight optimization is an
essential part of plasticity rules such as e-prop plasticity.

Currently, two weight optimizers are implemented: gradient descent and the
Adam optimizer.

In gradient descent [1]_ the weights are optimized via:

.. math::
    W_t = W_{t-1} - \eta g_t \,,

where :math:`\eta` denotes the learning rate and :math:`g_t` the gradient of
the current time step :math:`t`.

In the Adam scheme [2]_ the weights are optimized via:

.. math::
    m_0 &= 0, \quad v_0 = 0, \quad t = 1 \,, \\
    m_t &= \beta_1 m_{t-1} + \left( 1 - \beta_1 \right) g_t \,, \\
    v_t &= \beta_2 v_{t-1} + \left( 1 - \beta_2 \right) g_t^2 \,, \\
    \alpha_t &= \eta \frac{ \sqrt{ 1 - \beta_2^t } }{ 1 - \beta_1^t } \,, \\
    W_t &= W_{t-1} - \alpha_t \frac{ m_t }{ \sqrt{v_t} + \hat{\epsilon} } \,.

Note that the implementation follows the Adam implementation in
TensorFlow [3]_ for comparability. The TensorFlow implementation deviates from
[2]_ in that it assumes :math:`\hat{\epsilon} = \epsilon \sqrt{ 1 - \beta_2^t }`
to be constant, whereas [2]_ assumes
:math:`\epsilon = \hat{\epsilon} / \sqrt{ 1 - \beta_2^t }` to be constant.

When ``optimize_each_step`` is set to ``True``, the weights are optimized at
every time step. If set to ``False``, optimization occurs once per spike,
resulting in a significant speed-up. For gradient descent, both settings yield
the same results under exact arithmetic; however, small numerical differences
may be observed due to floating point precision. For the Adam optimizer, only
setting ``optimize_each_step`` to ``True`` precisely implements the algorithm
described in [2]_. The impact of this setting on learning performance may vary
depending on the task.
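To make the two update rules concrete, here is a minimal NumPy sketch that
applies them to a single scalar weight. It is purely illustrative of the
equations above (using the TensorFlow-style constant :math:`\hat{\epsilon}`)
and is not the C++ implementation used by NEST; the function and variable
names are chosen here only for readability.

.. code-block:: python

    import numpy as np


    def gradient_descent_step(w, g, eta=1e-4):
        """One gradient descent update: W_t = W_{t-1} - eta * g_t."""
        return w - eta * g


    def adam_step(w, g, m, v, t, eta=1e-4, beta_1=0.9, beta_2=0.999, epsilon_hat=1e-7):
        """One Adam update with a constant epsilon_hat (TensorFlow convention)."""
        m = beta_1 * m + (1.0 - beta_1) * g        # first moment estimate m_t
        v = beta_2 * v + (1.0 - beta_2) * g ** 2   # second moment raw estimate v_t
        alpha = eta * np.sqrt(1.0 - beta_2 ** t) / (1.0 - beta_1 ** t)
        return w - alpha * m / (np.sqrt(v) + epsilon_hat), m, v


    # Toy usage: follow a constant gradient for a few steps.
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, 6):
        w, m, v = adam_step(w, g=0.5, m=m, v=v, t=t)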
Parameters
++++++++++

The following parameters can be set in the status dictionary.

====================== ==== ========================= ========= =================================
**Common optimizer parameters**
-------------------------------------------------------------------------------------------------
Parameter              Unit Math equivalent           Default   Description
====================== ==== ========================= ========= =================================
``batch_size``                                        1         Size of batch
``eta``                     :math:`\eta`              1e-4      Learning rate
``optimize_each_step``                                ``True``  Optimize weights every time step
``Wmax``               pA   :math:`W_{ji}^\text{max}` 100.0     Maximal value for synaptic weight
``Wmin``               pA   :math:`W_{ji}^\text{min}` -100.0    Minimal value for synaptic weight
====================== ==== ========================= ========= =================================

========= ==== =============== ================== ==============
**Gradient descent parameters (default optimizer)**
----------------------------------------------------------------
Parameter Unit Math equivalent Default            Description
========= ==== =============== ================== ==============
``type``                       "gradient_descent" Optimizer type
========= ==== =============== ================== ==============

=========== ==== ================ ======= =================================================
**Adam optimizer parameters**
-------------------------------------------------------------------------------------------
Parameter   Unit Math equivalent  Default Description
=========== ==== ================ ======= =================================================
``type``                          "adam"  Optimizer type
``beta_1``       :math:`\beta_1`  0.9     Exponential decay rate for first moment estimate
``beta_2``       :math:`\beta_2`  0.999   Exponential decay rate for second moment estimate
``epsilon``      :math:`\epsilon` 1e-7    Small constant for numerical stability
=========== ==== ================ ======= =================================================

The following state variables evolve during simulation.

============== ==== =============== ============= ==========================
**Adam optimizer state variables for individual synapses**
----------------------------------------------------------------------------
State variable Unit Math equivalent Initial value Description
============== ==== =============== ============= ==========================
``m``               :math:`m`       0.0           First moment estimate
``v``               :math:`v`       0.0           Second moment raw estimate
============== ==== =============== ============= ==========================

References
++++++++++

.. [1] Huh D, Sejnowski TJ (2018). Gradient descent for spiking neural
       networks. Advances in Neural Information Processing Systems,
       31:1433-1443.
       https://proceedings.neurips.cc/paper_files/paper/2018/hash/185e65bc40581880c4f2c82958de8cfe-Abstract.html

.. [2] Kingma DP, Ba JL (2015). Adam: A method for stochastic optimization.
       Proceedings of the 3rd International Conference on Learning
       Representations (ICLR). https://doi.org/10.48550/arXiv.1412.6980

.. [3] https://github.com/keras-team/keras/blob/v2.15.0/keras/optimizers/adam.py#L26-L220

See also
++++++++

Examples using this model
+++++++++++++++++++++++++

.. listexamples:: eprop_synapse_bsshslm_2020
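The snippet below sketches how these parameters might be passed from PyNEST.
It assumes that the optimizer is selected through an ``optimizer`` dictionary
in the defaults of the e-prop synapse model; the example scripts listed above
show the authoritative usage.

.. code-block:: python

    import nest

    # Illustrative sketch (assumed configuration path): select the Adam
    # optimizer for subsequently created e-prop synapses, using the
    # parameter names from the tables above.
    nest.SetDefaults(
        "eprop_synapse_bsshslm_2020",
        {
            "optimizer": {
                "type": "adam",
                "batch_size": 1,
                "eta": 1e-4,
                "beta_1": 0.9,
                "beta_2": 0.999,
                "epsilon": 1e-7,
                "Wmin": -100.0,
                "Wmax": 100.0,
            }
        },
    )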