
TransformerEncoder layer


TransformerEncoder class


Transformer encoder.

This class follows the architecture of the transformer encoder layer in the paper Attention Is All You Need. Users can stack multiple instances of this class to build a full encoder.

This layer computes an attention mask from an implicit Keras padding mask (for example, one produced by passing mask_zero=True to a keras.layers.Embedding layer). See the Masking and Padding guide for more details.
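As a rough illustration of the implicit padding mask described above, the NumPy sketch below mimics what mask_zero=True implies: token ID 0 is treated as padding, and the resulting [batch_size, sequence_length] mask is broadcast over the query axis so that no position attends to a padded key. The token IDs are made up for the example, and the broadcasting step is an assumption about how such a mask is typically expanded, not a copy of the layer's internal code.

```python
import numpy as np

# Hypothetical batch of token IDs; 0 is the padding ID, as with mask_zero=True.
token_ids = np.array([
    [5, 8, 3, 0, 0],
    [7, 2, 0, 0, 0],
])

# Implicit padding mask: True for real tokens, False for padding.
padding_mask = token_ids != 0  # shape (2, 5)

# Assumed expansion: add a query axis and broadcast, so every query position
# sees the same set of valid key positions.
batch_size, sequence_length = token_ids.shape
attention_mask = np.broadcast_to(
    padding_mask[:, None, :],
    (batch_size, sequence_length, sequence_length),
)

print(padding_mask.shape)    # (2, 5)
print(attention_mask.shape)  # (2, 5, 5)
```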


Arguments

  • intermediate_dim: int. The hidden size of the feedforward network.
  • num_heads: int. The number of heads in the keras.layers.MultiHeadAttention layer.
  • dropout: float. The dropout value, shared by keras.layers.MultiHeadAttention and the feedforward network. Defaults to 0.0.
  • activation: string or keras.activations function. The activation function of the feedforward network. Defaults to "relu".
  • layer_norm_epsilon: float. The epsilon value in layer normalization components. Defaults to 1e-5.
  • kernel_initializer: string or keras.initializers initializer. The kernel initializer for the dense and multi-head attention layers. Defaults to "glorot_uniform".
  • bias_initializer: string or keras.initializers initializer. The bias initializer for the dense and multi-head attention layers. Defaults to "zeros".
  • normalize_first: bool. If True, the inputs to the attention layer and the intermediate dense layer are normalized (similar to GPT-2). If False, the outputs of the attention layer and the intermediate dense layer are normalized (similar to BERT). Defaults to False.
  • name: string. The name of the layer. Defaults to None.
  • **kwargs: Other keyword arguments.
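The difference between normalize_first=True and normalize_first=False can be sketched in plain NumPy. This is a simplified illustration, not the layer's implementation: layer_norm omits the learned scale and offset, only the feedforward sublayer is shown (the attention sublayer is wrapped the same way), and the helper names residual_block and make_feedforward are hypothetical.

```python
import numpy as np

def layer_norm(x, epsilon=1e-5):
    # Normalize over the feature axis; learned scale/offset omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

def residual_block(x, sublayer, normalize_first=False):
    if normalize_first:
        # Pre-norm (GPT-2 style): normalize the sublayer's input.
        return x + sublayer(layer_norm(x))
    # Post-norm (BERT style): normalize the sublayer's output plus residual.
    return layer_norm(x + sublayer(x))

def make_feedforward(hidden_dim, intermediate_dim, rng):
    # Two dense layers with a ReLU in between, without biases.
    w1 = rng.normal(scale=0.1, size=(hidden_dim, intermediate_dim))
    w2 = rng.normal(scale=0.1, size=(intermediate_dim, hidden_dim))

    def feedforward(x):
        return np.maximum(x @ w1, 0.0) @ w2

    return feedforward

rng = np.random.default_rng(0)
x = rng.uniform(size=(2, 10, 64))
ffn = make_feedforward(hidden_dim=64, intermediate_dim=128, rng=rng)

post_norm = residual_block(x, ffn, normalize_first=False)
pre_norm = residual_block(x, ffn, normalize_first=True)
```

In this sketch the post-norm output is normalized along the feature axis, while the pre-norm output keeps the unnormalized residual path; both have the same shape as the input.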


Example

import keras
import keras_nlp
import numpy as np

# Create a single transformer encoder layer.
encoder = keras_nlp.layers.TransformerEncoder(
    intermediate_dim=64, num_heads=8)

# Create a simple model containing the encoder.
inputs = keras.Input(shape=(10, 64))
outputs = encoder(inputs)
model = keras.Model(inputs=inputs, outputs=outputs)

# Call the model on a batch of 2 random sequences.
input_data = np.random.uniform(size=(2, 10, 64))
output = model(input_data)  # Same shape as the input: (2, 10, 64).



call method

TransformerEncoder.call(inputs, padding_mask=None, attention_mask=None)

Forward pass of the TransformerEncoder.


Arguments

  • inputs: a Tensor. The input data to TransformerEncoder, of shape [batch_size, sequence_length, hidden_dim].
  • padding_mask: a boolean Tensor of shape [batch_size, sequence_length]. Indicates which tokens should be masked because they were introduced by padding.
  • attention_mask: a boolean Tensor of shape [batch_size, sequence_length, sequence_length]. A custom mask used to mask out certain tokens.


Returns

A Tensor of the same shape as the inputs.
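To make the shapes of padding_mask and attention_mask concrete, the NumPy sketch below combines a [batch_size, sequence_length] padding mask with a [batch_size, sequence_length, sequence_length] attention mask and applies the result to attention scores. It rests on several assumptions: True marks a position that may be attended to (the convention used by keras.layers.MultiHeadAttention), the two masks are merged with a logical AND, and the lower-triangular attention_mask is an arbitrary illustrative choice. The helper names combine_masks and masked_softmax are hypothetical.

```python
import numpy as np

def combine_masks(padding_mask=None, attention_mask=None):
    # Returns a [batch, query, key] boolean mask, or None if no mask is given.
    mask = None
    if padding_mask is not None:
        # Add a query axis so the padding mask applies to key positions.
        mask = padding_mask[:, None, :].astype(bool)
    if attention_mask is not None:
        attention_mask = attention_mask.astype(bool)
        mask = attention_mask if mask is None else mask & attention_mask
    return mask

def masked_softmax(scores, mask):
    # Masked positions get a large negative score, so their weight becomes 0.
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

batch_size, sequence_length = 2, 4
padding_mask = np.array([
    [True, True, True, False],
    [True, True, False, False],
])
# Illustrative custom mask: each position attends to itself and earlier ones.
lower_triangular = np.tril(np.ones((sequence_length, sequence_length), dtype=bool))
attention_mask = np.broadcast_to(
    lower_triangular, (batch_size, sequence_length, sequence_length))

combined = combine_masks(padding_mask, attention_mask)
scores = np.zeros((batch_size, sequence_length, sequence_length))
weights = masked_softmax(scores, combined)
print(combined.shape)  # (2, 4, 4)
```

With uniform scores, each row of weights spreads attention evenly over the positions that both masks allow and assigns zero weight to padded keys.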