SGD
class tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.0,
    nesterov=False,
    amsgrad=False,
    weight_decay=None,
    clipnorm=None,
    clipvalue=None,
    global_clipnorm=None,
    use_ema=False,
    ema_momentum=0.99,
    ema_overwrite_frequency=None,
    jit_compile=True,
    name="SGD",
    **kwargs
)
Gradient descent (with momentum) optimizer.
Update rule for parameter w with gradient g when momentum is 0:
    w = w - learning_rate * g
Update rule when momentum is larger than 0:
    velocity = momentum * velocity - learning_rate * g
    w = w + velocity
When nesterov=True, this rule becomes:
    velocity = momentum * velocity - learning_rate * g
    w = w + momentum * velocity - learning_rate * g
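The three update rules above can be traced by hand with a small pure-Python sketch. It is not the Keras implementation: `sgd_step` is a hypothetical helper written for illustration, it works on scalars only, and it assumes the loss `w ** 2 / 2`, whose gradient is `w` itself.

```python
def sgd_step(w, g, velocity, learning_rate=0.01, momentum=0.0, nesterov=False):
    """Apply one update to a scalar parameter; return (new_w, new_velocity)."""
    if momentum == 0.0:
        # Plain gradient descent: no velocity is tracked.
        return w - learning_rate * g, velocity
    velocity = momentum * velocity - learning_rate * g
    if nesterov:
        # Nesterov variant: look ahead along the freshly updated velocity.
        w = w + momentum * velocity - learning_rate * g
    else:
        w = w + velocity
    return w, velocity


# Two momentum steps on loss = w ** 2 / 2, so the gradient g equals w.
w, velocity = 1.0, 0.0
steps = []
for _ in range(2):
    new_w, velocity = sgd_step(w, g=w, velocity=velocity,
                               learning_rate=0.1, momentum=0.9)
    steps.append(w - new_w)  # size of the step just taken
    w = new_w

print(steps)  # the second step is larger than the first because of momentum
```

With these settings the first step is about 0.1 and the second about 0.18, up to floating-point rounding.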
Arguments
learning_rate: A Tensor, floating point value, a schedule that is a
    tf.keras.optimizers.schedules.LearningRateSchedule, or a callable
    that takes no arguments and returns the actual value to use. The
    learning rate. Defaults to 0.01.
ema_momentum: Applies when use_ema=True. This is the momentum to use
    when computing the EMA of the model's weights:
    new_average = ema_momentum * old_average + (1 - ema_momentum) * current_variable_value.
ema_overwrite_frequency: Applies when use_ema=True. Every
    ema_overwrite_frequency steps, the model variable is overwritten by
    its moving average. If None, the optimizer does not overwrite model
    variables in the middle of training, and you need to explicitly
    overwrite the variables at the end of training by calling
    optimizer.finalize_variable_values() (which updates the model
    variables in-place). When using the built-in fit() training loop,
    this happens automatically after the last epoch, and you don't need
    to do anything.
Usage:
>>> opt = tf.keras.optimizers.experimental.SGD(learning_rate=0.1)
>>> var = tf.Variable(1.0)
>>> loss = lambda: (var ** 2)/2.0 # d(loss)/d(var) = var
>>> opt.minimize(loss, [var])
>>> # Step is `- learning_rate * grad`
>>> var.numpy()
0.9
>>> opt = tf.keras.optimizers.experimental.SGD(0.1, momentum=0.9)
>>> var = tf.Variable(1.0)
>>> val0 = var.value()
>>> loss = lambda: (var ** 2)/2.0 # d(loss)/d(var) = var
>>> # First step is `- learning_rate * grad`
>>> opt.minimize(loss, [var])
>>> val1 = var.value()
>>> (val0 - val1).numpy()
0.1
>>> # On later steps, step-size increases because of momentum
>>> opt.minimize(loss, [var])
>>> val2 = var.value()
>>> (val1 - val2).numpy()
0.18
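The ema_momentum and ema_overwrite_frequency arguments listed under Arguments can be illustrated the same way. The sketch below is pure Python, not the Keras code path: `ema_update` and `train` are hypothetical names, the variable is a scalar trained with plain SGD on `w ** 2 / 2`, and starting the moving average at the variable's initial value is an assumption made for this example.

```python
def ema_update(old_average, current_variable_value, ema_momentum=0.99):
    """EMA formula from the ema_momentum description."""
    return ema_momentum * old_average + (1 - ema_momentum) * current_variable_value


def train(num_steps, learning_rate=0.1, ema_momentum=0.99,
          ema_overwrite_frequency=None):
    """Run plain SGD on loss = w ** 2 / 2 while tracking an EMA of w."""
    w = 1.0
    average = w  # assumption: the average starts at the initial variable value
    for step in range(1, num_steps + 1):
        w = w - learning_rate * w  # the gradient of w ** 2 / 2 is w
        average = ema_update(average, w, ema_momentum)
        if ema_overwrite_frequency and step % ema_overwrite_frequency == 0:
            w = average  # overwrite the variable with its moving average
    return w, average


# Without an overwrite frequency the variable and its average drift apart.
w_final, average_final = train(num_steps=3, ema_momentum=0.5)
print(w_final, average_final)

# With ema_overwrite_frequency=1 the variable is reset to the average each step.
w_over, average_over = train(num_steps=1, ema_momentum=0.5,
                             ema_overwrite_frequency=1)
print(w_over, average_over)
```

In the first run the variable is never overwritten, which corresponds to the case where the average would only be copied back at the end of training.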
Reference
For nesterov=True, see Sutskever et al., 2013.