MLMHead layer

MLMHead class

keras_nlp.layers.MLMHead(
    vocabulary_size=None,
    embedding_weights=None,
    intermediate_activation="relu",
    activation=None,
    layer_norm_epsilon=1e-05,
    kernel_initializer="glorot_uniform",
    bias_initializer="zeros",
    name=None,
    **kwargs
)

Masked Language Model (MLM) head.

This layer takes two inputs:

  • inputs: which should be a tensor of encoded tokens with shape (batch_size, sequence_length, encoding_dim).
  • mask_positions: which should be a tensor of integer positions to predict with shape (batch_size, masks_per_sequence).

The token encodings should usually be the last output of an encoder model, and the mask positions should be the integer positions of the tokens you would like to predict for the MLM task.

The layer will first gather the token encodings at the mask positions. These gathered encodings will be passed through a dense layer with the same size as the encoding dimension, then projected to predictions the same size as the input vocabulary. The layer produces a single output with shape (batch_size, masks_per_sequence, vocabulary_size), which can be used to compute an MLM loss.
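For intuition, the computation can be sketched with plain TensorFlow ops as below. This is only an illustration of the shapes involved (with made-up sizes), not the layer's actual implementation.

import tensorflow as tf

batch_size, seq_length, encoding_dim, vocab_size = 2, 8, 16, 100
masks_per_sequence = 3

encoded_tokens = tf.random.normal([batch_size, seq_length, encoding_dim])
mask_positions = tf.random.uniform(
    [batch_size, masks_per_sequence], maxval=seq_length, dtype="int32"
)

# Gather the encodings at the masked positions.
gathered = tf.gather(encoded_tokens, mask_positions, batch_dims=1)
# Inner dense layer with the same size as the encoding dimension.
hidden = tf.keras.layers.Dense(encoding_dim, activation="relu")(gathered)
hidden = tf.keras.layers.LayerNormalization(epsilon=1e-5)(hidden)
# Project to predictions over the vocabulary.
logits = tf.keras.layers.Dense(vocab_size)(hidden)
print(logits.shape)  # (2, 3, 100)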

This layer is often paired with keras_nlp.layers.MLMMaskGenerator, which will help prepare inputs for the MLM task.
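For example, a preprocessing and training sketch might look roughly like the following. The MLMMaskGenerator arguments and output dictionary keys used here ("tokens", "mask_positions", "mask_ids") are assumptions for illustration; check the MLMMaskGenerator documentation for the exact interface.

import tensorflow as tf
from tensorflow import keras

import keras_nlp

vocab_size = 100
encoding_size = 32

# Randomly select positions to mask in a batch of token ids.
masker = keras_nlp.layers.MLMMaskGenerator(
    vocabulary_size=vocab_size,
    mask_selection_rate=0.15,
    mask_token_id=0,
    mask_selection_length=10,
)
token_ids = tf.random.uniform([32, 50], maxval=vocab_size, dtype="int32")
masked = masker(token_ids)

# A toy "encoder" (a plain embedding) standing in for a real encoder model.
encoded_tokens = keras.layers.Embedding(vocab_size, encoding_size)(
    masked["tokens"]
)
mask_preds = keras_nlp.layers.MLMHead(
    vocabulary_size=vocab_size,
    activation="softmax",
)(encoded_tokens, mask_positions=masked["mask_positions"])
loss = keras.losses.sparse_categorical_crossentropy(
    masked["mask_ids"], mask_preds
)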

Arguments

  • vocabulary_size: The total size of the vocabulary for predictions.
  • embedding_weights: Optional. The weights of the word embedding used to transform input token ids. The transpose of this weight matrix will be used to project a token embedding vector to a prediction over all input words (weight tying); a minimal sketch of this usage appears after this list.
  • intermediate_activation: The activation function of the inner dense layer.
  • activation: The activation function for the outputs of the layer. Usually either None (return logits), or "softmax" (return probabilities).
  • layer_norm_epsilon: float, defaults to 1e-5. The epsilon value in layer normalization components.
  • kernel_initializer: string or keras.initializers initializer, defaults to "glorot_uniform". The kernel initializer for the dense layers.
  • bias_initializer: string or keras.initializers initializer, defaults to "zeros". The bias initializer for the dense layers.
  • name: string, defaults to None. The name of the layer.
  • **kwargs: other keyword arguments.
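
For embedding_weights, one possible pattern (an illustrative sketch, not an official recipe) is to reuse the weight matrix of an existing keras.layers.Embedding layer so the output projection is tied to the token embedding:

from tensorflow import keras

import keras_nlp

vocab_size = 100
encoding_size = 32

# Token embedding whose weight matrix will also serve as the MLM head's
# output projection (weight tying). Sizes are illustrative.
embedding = keras.layers.Embedding(vocab_size, encoding_size)
embedding.build((None,))  # Build the layer so its weight matrix exists.

mlm_head = keras_nlp.layers.MLMHead(
    vocabulary_size=vocab_size,
    embedding_weights=embedding.embeddings,
)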

Examples

import tensorflow as tf
from tensorflow import keras

import keras_nlp

batch_size = 32
vocab_size = 100
encoding_size = 32
seq_length = 50
mask_length = 10

# Generate a random encoding.
encoded_tokens = tf.random.normal([batch_size, seq_length, encoding_size])
# Generate random positions and labels
mask_positions = tf.random.uniform(
    [batch_size, mask_length], maxval=seq_length, dtype="int32"
)
mask_ids = tf.random.uniform(
    [batch_size, mask_length], maxval=vocab_size, dtype="int32"
)

# Predict an output word for each masked input token.
mask_preds = keras_nlp.layers.MLMHead(
    vocabulary_size=vocab_size,
    activation="softmax",
)(encoded_tokens, mask_positions=mask_positions)
# Calculate a loss.
keras.losses.sparse_categorical_crossentropy(mask_ids, mask_preds)
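
Continuing the example, the per-position losses can be averaged into a single scalar training loss. In a full pipeline you would typically also weight them (for example by the mask weights produced by keras_nlp.layers.MLMMaskGenerator) so that padded mask slots are ignored; the plain mean below is only a minimal sketch.

# Reduce the per-position losses to a single scalar training loss.
loss = tf.reduce_mean(
    keras.losses.sparse_categorical_crossentropy(mask_ids, mask_preds)
)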
