TransformerEncoder class

keras_hub.layers.TransformerEncoder(
    intermediate_dim,
    num_heads,
    dropout=0,
    activation="relu",
    layer_norm_epsilon=1e-05,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    normalize_first=False,
    **kwargs
)
Transformer encoder.
This class follows the architecture of the transformer encoder layer in the paper Attention Is All You Need. Users can stack multiple instances of this class to build an encoder.
This layer will compute an attention mask, prioritizing explicitly provided masks (a padding_mask or a custom attention_mask) over an implicit Keras padding mask (for example, one created by passing mask_zero=True to a keras.layers.Embedding layer). If both a padding_mask and an attention_mask are provided, they will be combined to determine the final mask. See the Masking and Padding guide for more details.
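For instance, an implicit padding mask can come from an Embedding layer with mask_zero=True, while an explicit padding_mask passed to the encoder takes priority. The following is a minimal sketch with arbitrary vocabulary, sequence, and embedding sizes:

import keras
import keras_hub

# Implicit mask: mask_zero=True attaches a Keras padding mask to the
# embedding output, which the encoder picks up automatically.
token_ids = keras.Input(shape=(10,), dtype="int32")
x = keras.layers.Embedding(input_dim=1000, output_dim=64, mask_zero=True)(token_ids)
outputs = keras_hub.layers.TransformerEncoder(intermediate_dim=128, num_heads=4)(x)
model = keras.Model(token_ids, outputs)

# Explicit mask: a boolean padding_mask (True for real tokens, following
# Keras mask semantics) passed directly to the layer takes priority over
# the implicit Keras mask.
padding_mask = keras.ops.not_equal(token_ids, 0)
outputs = keras_hub.layers.TransformerEncoder(intermediate_dim=128, num_heads=4)(
    x, padding_mask=padding_mask)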
Arguments

intermediate_dim: int, the hidden size of the feedforward network.
num_heads: int, the number of heads in the keras.layers.MultiHeadAttention layer.
dropout: float. The dropout value, shared by the keras.layers.MultiHeadAttention and feedforward network. Defaults to 0.
activation: string or keras.activations. The activation function of the feedforward network. Defaults to "relu".
layer_norm_epsilon: float. The epsilon value in layer normalization components. Defaults to 1e-5.
kernel_initializer: string or keras.initializers initializer. The kernel initializer for the dense and multiheaded attention layers. Defaults to "glorot_uniform".
bias_initializer: string or keras.initializers initializer. The bias initializer for the dense and multiheaded attention layers. Defaults to "zeros".
normalize_first: bool. If True, the inputs to the attention and feedforward layers are normalized (similar to GPT-2). If False, the outputs of the attention and feedforward layers are normalized (similar to BERT). Defaults to False.
**kwargs: other keyword arguments passed to keras.layers.Layer, including name, trainable, dtype etc.

Example
import numpy as np
import keras
import keras_hub

# Create a single transformer encoder layer.
encoder = keras_hub.layers.TransformerEncoder(
    intermediate_dim=64, num_heads=8)

# Create a simple model containing the encoder.
input = keras.Input(shape=(10, 64))
output = encoder(input)
model = keras.Model(inputs=input, outputs=output)

# Call encoder on the inputs.
input_data = np.random.uniform(size=(2, 10, 64))
output = model(input_data)
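The example above uses the default post-normalization layout. As an additional sketch (illustrative only, with arbitrary sizes), several encoder layers can be stacked with pre-normalization via normalize_first=True:

import keras
import keras_hub

# Stack three pre-normalization (GPT-2 style) encoder layers.
inputs = keras.Input(shape=(10, 64))
x = inputs
for _ in range(3):
    x = keras_hub.layers.TransformerEncoder(
        intermediate_dim=128, num_heads=8, dropout=0.1, normalize_first=True)(x)
model = keras.Model(inputs=inputs, outputs=x)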
References

Attention Is All You Need (Vaswani et al., 2017)
call method

TransformerEncoder.call(
    inputs,
    padding_mask=None,
    attention_mask=None,
    training=None,
    return_attention_scores=False,
)
Forward pass of the TransformerEncoder.
Arguments

inputs: a Tensor. The input data to TransformerEncoder, should be of shape [batch_size, sequence_length, hidden_dim].
padding_mask: a boolean Tensor. It indicates whether a token should be masked because it was introduced by padding. padding_mask should have shape [batch_size, sequence_length].
attention_mask: a boolean Tensor. A customized mask used to mask out certain tokens. attention_mask should have shape [batch_size, sequence_length, sequence_length].
training: a boolean indicating whether the layer should behave in training mode or in inference mode.
return_attention_scores: a boolean indicating whether the output should be (attention_output, attention_scores) if True or attention_output if False. Defaults to False.

Returns

A Tensor of the same shape as the inputs.
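As an illustration of the call signature, the following sketch (with arbitrary batch, sequence, and hidden sizes) passes an explicit boolean padding mask alongside the inputs and optionally requests the attention scores:

import numpy as np
import keras_hub

encoder = keras_hub.layers.TransformerEncoder(intermediate_dim=64, num_heads=4)

# Batch of 2 sequences of length 10 with hidden size 32; the last 4
# positions of each sequence are treated as padding.
x = np.random.uniform(size=(2, 10, 32)).astype("float32")
padding_mask = np.array([[True] * 6 + [False] * 4] * 2)  # shape [batch_size, sequence_length]

outputs = encoder(x, padding_mask=padding_mask)
outputs, scores = encoder(x, padding_mask=padding_mask, return_attention_scores=True)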