MLMMaskGenerator layer

MLMMaskGenerator class

keras_nlp.layers.MLMMaskGenerator(
    vocabulary_size,
    mask_selection_rate,
    mask_token_id,
    mask_selection_length=None,
    unselectable_token_ids=[0],
    mask_token_rate=0.8,
    random_token_rate=0.1,
    **kwargs
)

Layer that applies language model masking.

This layer is useful for preparing inputs for masked language modeling (MLM) tasks. It follows the masking strategy described in the original BERT paper. Given tokenized text, it randomly selects a certain number of tokens for masking. Each selected token is then, with configurable probabilities, replaced by the mask token, replaced by a random token, or left unchanged.
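The strategy described above can be sketched in plain Python. This is only an illustration under simplifying assumptions, not the layer's implementation: the function name bert_style_mask and its rng parameter are hypothetical, and each token is selected independently here, whereas the layer may select tokens differently.

```python
import random


def bert_style_mask(tokens, vocabulary_size, mask_token_id,
                    mask_selection_rate=0.15, mask_token_rate=0.8,
                    random_token_rate=0.1, unselectable_token_ids=(0,),
                    rng=None):
    """Illustrative BERT-style masking on a list of token ids.

    Hypothetical helper, not part of keras_nlp.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    positions, original_ids = [], []
    for i, tok in enumerate(tokens):
        # Padding and other protected ids are never selected.
        if tok in unselectable_token_ids:
            continue
        # Select this token with probability mask_selection_rate.
        if rng.random() >= mask_selection_rate:
            continue
        positions.append(i)
        original_ids.append(tok)
        roll = rng.random()
        if roll < mask_token_rate:
            # Replace with the mask token.
            masked[i] = mask_token_id
        elif roll < mask_token_rate + random_token_rate:
            # Replace with a random token id from the vocabulary.
            masked[i] = rng.randrange(vocabulary_size)
        # Otherwise the selected token is left unchanged.
    return masked, positions, original_ids
```

The returned positions and original ids correspond loosely to the mask_positions and mask_ids outputs documented below.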

Users should use this layer with tf.data to generate masks.

Arguments

  • vocabulary_size: int, the size of the vocabulary.
  • mask_selection_rate: float, the probability that a token is selected for masking.
  • mask_token_id: int, the id of the mask token.
  • mask_selection_length: int, defaults to None. Maximum number of tokens selected for masking in each sequence. If set, the output mask_positions, mask_ids and mask_weights will be padded to dense tensors of length mask_selection_length, otherwise the output will be a RaggedTensor.
  • unselectable_token_ids: A list of token ids that should not be selected for masking, defaults to [0] (the default padding_token_id).
  • mask_token_rate: float, defaults to 0.8. Must be between 0 and 1. Indicates how often the mask token is substituted for tokens selected for masking.
  • random_token_rate: float, defaults to 0.1. Must be between 0 and 1. Indicates how often a random token is substituted for tokens selected for masking. Note: mask_token_rate + random_token_rate <= 1; with probability (1 - mask_token_rate - random_token_rate), a selected token is left unchanged.
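The relationship between the two rates can be checked with a few lines of Python. This is a minimal sketch of the constraint stated in the note above; the helper split_rates is hypothetical, and whether the layer itself raises an error on invalid rates is an assumption not confirmed here.

```python
def split_rates(mask_token_rate=0.8, random_token_rate=0.1):
    """Return the three outcome probabilities for a selected token.

    Hypothetical helper, not part of keras_nlp.
    """
    for name, rate in (("mask_token_rate", mask_token_rate),
                       ("random_token_rate", random_token_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {rate}")
    if mask_token_rate + random_token_rate > 1.0:
        raise ValueError("mask_token_rate + random_token_rate must be <= 1")
    # Whatever probability remains is the chance the token stays unchanged.
    unchanged = max(1.0 - mask_token_rate - random_token_rate, 0.0)
    return {
        "mask_token": mask_token_rate,
        "random_token": random_token_rate,
        "unchanged": unchanged,
    }
```

With the defaults, a selected token becomes the mask token about 80% of the time, a random token about 10% of the time, and stays as it was in the remaining 10% or so.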

Input: A 1D integer tensor of shape [sequence_length] or a 2D integer tensor of shape [batch_size, sequence_length], or a 2D integer RaggedTensor. Represents the sequence to mask.

Returns

  • A Dict with 4 keys:
      • tokens: Tensor or RaggedTensor with the same type and shape as the input. The sequence after masking.
      • mask_positions: Tensor, or RaggedTensor if mask_selection_length is None. The positions of the tokens selected for masking.
      • mask_ids: Tensor, or RaggedTensor if mask_selection_length is None. The original token ids at the masked positions.
      • mask_weights: Tensor, or RaggedTensor if mask_selection_length is None. Has the same shape as mask_positions and mask_ids. Each element is 0 or 1: 1 means the corresponding entry in mask_positions is an actual mask, 0 means it is padding.
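To make the role of mask_weights concrete, the following plain-Python sketch pads one sequence's masked positions and ids to a fixed length, as happens when mask_selection_length is set. The helper pad_mask_outputs is hypothetical, and the use of 0 as the padding value is an assumption made for illustration.

```python
def pad_mask_outputs(positions, ids, mask_selection_length):
    """Pad one sequence's mask outputs to a fixed length.

    Hypothetical helper, not part of keras_nlp. Assumes a pad value of 0.
    """
    # Keep at most mask_selection_length real masks.
    n = min(len(positions), mask_selection_length)
    pad = mask_selection_length - n
    return {
        "mask_positions": list(positions[:n]) + [0] * pad,
        "mask_ids": list(ids[:n]) + [0] * pad,
        # 1 marks a real mask, 0 marks padding.
        "mask_weights": [1] * n + [0] * pad,
    }
```

A loss computed over the masked positions would typically be multiplied by mask_weights so that padded entries contribute nothing.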

Examples

Basic usage.

>>> import tensorflow as tf
>>> import keras_nlp
>>> masker = keras_nlp.layers.MLMMaskGenerator(
...     vocabulary_size=10, mask_selection_rate=0.2, mask_token_id=0,
...     mask_selection_length=5)
>>> masker(tf.constant([1, 2, 3, 4, 5]))

Ragged input.

>>> masker = keras_nlp.layers.MLMMaskGenerator(
...     vocabulary_size=10, mask_selection_rate=0.5, mask_token_id=0,
...     mask_selection_length=5)
>>> masker(tf.ragged.constant([[1, 2], [1, 2, 3, 4]]))