LayerNormalization
class tf_keras.layers.LayerNormalization(
    axis=-1,
    epsilon=0.001,
    center=True,
    scale=True,
    beta_initializer="zeros",
    gamma_initializer="ones",
    beta_regularizer=None,
    gamma_regularizer=None,
    beta_constraint=None,
    gamma_constraint=None,
    **kwargs
)
Layer normalization layer (Ba et al., 2016).
Normalizes the activations of the previous layer for each example in a batch independently, rather than across the batch as Batch Normalization does. That is, it applies a transformation that keeps the mean activation within each example close to 0 and the activation standard deviation close to 1.
Given a tensor inputs, moments are calculated and normalization is performed across the axes specified in axis.
Example
>>> data = tf.constant(np.arange(10).reshape(5, 2) * 10, dtype=tf.float32)
>>> print(data)
tf.Tensor(
[[ 0. 10.]
 [20. 30.]
 [40. 50.]
 [60. 70.]
 [80. 90.]], shape=(5, 2), dtype=float32)
>>> layer = tf.keras.layers.LayerNormalization(axis=1)
>>> output = layer(data)
>>> print(output)
tf.Tensor(
[[-1. 1.]
 [-1. 1.]
 [-1. 1.]
 [-1. 1.]
 [-1. 1.]], shape=(5, 2), dtype=float32)
Notice that with Layer Normalization the normalization happens across the axes within each example, rather than across different examples in the batch.
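To make the contrast concrete, the following NumPy sketch standardizes the same data once per example (along axis 1, as this layer does) and once per feature (along axis 0, as batch-style statistics would). The helper name standardize is hypothetical and not part of the Keras API, and the sketch ignores learned parameters and the moving averages that Batch Normalization maintains.

```python
import numpy as np

# Same 5x2 data as in the example above.
data = (np.arange(10).reshape(5, 2) * 10).astype(np.float32)

def standardize(x, axis, epsilon=1e-3):
    """Hypothetical helper: zero-mean, unit-variance scaling of x along axis."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

# Layer-norm style: statistics are computed within each row (example).
per_example = standardize(data, axis=1)

# Batch-norm style: statistics are computed within each column (feature).
per_feature = standardize(data, axis=0)

print(np.round(per_example, 2))
print(np.round(per_feature, 2))
```

In the per-example result every row is approximately [-1, 1], while in the per-feature result the values vary from row to row because each column is standardized across the five examples.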
If scale or center are enabled, the layer will scale the normalized outputs by broadcasting them with a trainable variable gamma, and center the outputs by broadcasting with a trainable variable beta. gamma will default to a ones tensor and beta will default to a zeros tensor, so that centering and scaling are no-ops before training has begun.
So, with scaling and centering enabled the normalization equations are as follows:
Let the intermediate activations for a mini-batch be the inputs.
For each sample x_i in inputs with k features, we compute the mean and variance of the sample:
mean_i = sum(x_i[j] for j in range(k)) / k
var_i = sum((x_i[j] - mean_i) ** 2 for j in range(k)) / k
and then compute a normalized x_i_normalized, including a small factor epsilon for numerical stability.
x_i_normalized = (x_i - mean_i) / sqrt(var_i + epsilon)
And finally x_i_normalized is linearly transformed by gamma and beta, which are learned parameters:
output_i = x_i_normalized * gamma + beta
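The equations above can be sketched in plain NumPy as follows. This is an illustrative reference for a 2-D input normalized over its last axis, not the layer's actual implementation; the function name layer_norm and the non-default gamma and beta values are assumptions made for the example.

```python
import numpy as np

def layer_norm(inputs, gamma, beta, epsilon=1e-3):
    """Illustrative layer normalization over the last axis of a 2-D array."""
    # Per-sample mean and (population) variance over the k features.
    mean = inputs.mean(axis=-1, keepdims=True)
    var = ((inputs - mean) ** 2).mean(axis=-1, keepdims=True)
    # Normalize, with epsilon added to the variance for numerical stability.
    normalized = (inputs - mean) / np.sqrt(var + epsilon)
    # Scale by gamma and shift by beta (both broadcast over the batch axis).
    return normalized * gamma + beta

inputs = (np.arange(10).reshape(5, 2) * 10).astype(np.float64)
k = inputs.shape[-1]

# Default initial values: gamma = ones, beta = zeros, so scale and shift are no-ops.
out_default = layer_norm(inputs, gamma=np.ones(k), beta=np.zeros(k))

# Arbitrary non-default parameters, chosen only for illustration.
out_scaled = layer_norm(inputs, gamma=np.array([2.0, 2.0]), beta=np.array([0.5, 0.5]))

print(np.round(out_default, 3))
print(np.round(out_scaled, 3))
```

With the default parameters each row comes out as approximately [-1, 1]; the values are not exactly -1 and 1 because epsilon is added to the variance before taking the square root.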
gamma and beta will span the axes of inputs specified in axis, and this part of the inputs' shape must be fully defined.
For example:
>>> layer = tf.keras.layers.LayerNormalization(axis=[1, 2, 3])
>>> layer.build([5, 20, 30, 40])
>>> print(layer.beta.shape)
(20, 30, 40)
>>> print(layer.gamma.shape)
(20, 30, 40)
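The shape relationship can also be checked without TensorFlow. The NumPy sketch below assumes the same input shape and axes as the example above, derives the parameter shape from them, and shows that parameters of that shape broadcast against the full input. The variable names are illustrative only.

```python
import numpy as np

input_shape = (5, 20, 30, 40)
axes = (1, 2, 3)

# gamma and beta take the sizes of the normalized axes.
param_shape = tuple(input_shape[a] for a in axes)
print(param_shape)

rng = np.random.default_rng(0)
x = rng.normal(size=input_shape)
gamma = np.ones(param_shape)
beta = np.zeros(param_shape)

# Moments are reduced over the normalized axes, leaving one value per example.
mean = x.mean(axis=axes, keepdims=True)
var = x.var(axis=axes, keepdims=True)

# (20, 30, 40) parameters broadcast against the (5, 20, 30, 40) input.
y = (x - mean) / np.sqrt(var + 1e-3) * gamma + beta
print(mean.shape, y.shape)
```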
Note that other implementations of layer normalization may choose to define gamma and beta over a separate set of axes from the axes being normalized across. For example, Group Normalization (Wu et al. 2018) with group size of 1 corresponds to a Layer Normalization that normalizes across height, width, and channel and has gamma and beta span only the channel dimension. So, this Layer Normalization implementation will not match a Group Normalization layer with group size set to 1.
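The difference in parameterization can be illustrated with a small NumPy sketch. It normalizes an image-shaped batch across height, width, and channel, then applies either a gamma spanning all normalized axes (as this layer does) or a gamma spanning only the channel dimension (as in the Group Normalization case described above). The shapes and random values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 4, 3))  # (batch, height, width, channels)
axes = (1, 2, 3)

# Identical normalization step in both cases.
mean = x.mean(axis=axes, keepdims=True)
var = x.var(axis=axes, keepdims=True)
x_hat = (x - mean) / np.sqrt(var + 1e-3)

# This layer: one scale per (height, width, channel) position.
gamma_full = rng.normal(size=(4, 4, 3))
# Channel-only parameterization: one scale per channel.
gamma_channel = rng.normal(size=(3,))

out_full = x_hat * gamma_full
out_channel = x_hat * gamma_channel

print(gamma_full.size, gamma_channel.size)
print(np.allclose(out_full, out_channel))
```

The normalized values are the same in both cases, but the two variants have different numbers of trainable parameters, so trained weights cannot in general be transferred one-to-one between them.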
Arguments
axis: The axis or axes to normalize across. -1 is the last dimension in the input. Defaults to -1.
center: If True, add the offset beta to the normalized tensor. If False, beta is ignored. Defaults to True.
scale: If True, multiply by gamma. If False, gamma is not used. When the next layer is linear (also e.g. nn.relu), this can be disabled since the scaling will be done by the next layer. Defaults to True.
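The remark about disabling scale before a linear layer can be checked numerically. The sketch below assumes a per-feature gamma followed by a bias-free linear layer with weights W (both hypothetical, randomly generated) and shows that the scaling can be absorbed into the weights of that layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x_hat = rng.normal(size=(4, 3))  # stand-in for normalized activations
gamma = rng.normal(size=(3,))    # per-feature scale
W = rng.normal(size=(3, 2))      # weights of a following linear layer (no bias)

# Scale by gamma, then apply the linear layer.
scaled_then_linear = (x_hat * gamma) @ W

# Equivalent: fold gamma into the rows of W and skip the explicit scaling.
folded = x_hat @ (gamma[:, None] * W)

print(np.allclose(scaled_then_linear, folded))
```

Because the next layer can learn the folded weights directly, a separate gamma adds no expressive power in this setting.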
Input shape
Arbitrary. Use the keyword argument input_shape (tuple of integers, does not include the samples axis) when using this layer as the first layer in a model.
Output shape
Same shape as input.
Reference