XLMRobertaPreprocessor layer


XLMRobertaPreprocessor class

keras_nlp.models.XLMRobertaPreprocessor(
    tokenizer, sequence_length=512, truncate="round_robin", **kwargs
)

An XLM-RoBERTa preprocessing layer which tokenizes and packs inputs.

This preprocessing layer will do three things:

  1. Tokenize any number of input segments using the tokenizer.
  2. Pack the inputs together using a keras_nlp.layers.MultiSegmentPacker with the appropriate "<s>", "</s>" and "<pad>" tokens, i.e., adding a single "<s>" at the start of the entire sequence, "</s></s>" at the end of each segment except the last, and a single "</s>" at the end of the entire sequence.
  3. Construct a dictionary with keys "token_ids" and "padding_mask" that can be passed directly to an XLM-RoBERTa model.
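
The packing layout in step 2 and the dictionary in step 3 can be sketched in plain Python. This is only an illustration under stated assumptions: pack_segments is a hypothetical helper, not part of KerasNLP; token strings stand in for the integer ids the real layer emits; and the segments are assumed to be already tokenized and truncated to fit.

```python
def pack_segments(segments, sequence_length):
    """Hypothetical helper: lay out segments as the steps above describe."""
    tokens = ["<s>"]  # a single start token for the whole sequence
    for index, segment in enumerate(segments):
        tokens.extend(segment)
        is_last = index == len(segments) - 1
        # "</s></s>" closes every segment except the last, which gets one "</s>".
        tokens.extend(["</s>"] if is_last else ["</s>", "</s>"])
    if len(tokens) > sequence_length:
        raise ValueError("segments must be truncated before packing")
    num_pad = sequence_length - len(tokens)
    return {
        # Strings stand in for integer token ids (an assumption for readability).
        "token_ids": tokens + ["<pad>"] * num_pad,
        # True marks real tokens, False marks padding.
        "padding_mask": [True] * len(tokens) + [False] * num_pad,
    }


packed = pack_segments([["the", "fox"], ["tripped"]], sequence_length=10)
print(packed["token_ids"])
print(packed["padding_mask"])
```

With two segments, four positions go to special tokens, so only sequence_length - 4 positions remain for text.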

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.

Arguments

  • tokenizer: A keras_nlp.tokenizers.XLMRobertaTokenizer instance.
  • sequence_length: The length of the packed inputs.
  • truncate: The algorithm to truncate a list of batched segments to fit within sequence_length. The value can be either "round_robin" or "waterfall":
      - "round_robin": Available space is assigned one token at a time in a round-robin fashion to the inputs that still need some, until the limit is reached.
      - "waterfall": The allocation of the budget is done using a "waterfall" algorithm that allocates quota in a left-to-right manner and fills up the buckets until we run out of budget. It supports an arbitrary number of segments.
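
The two truncate strategies differ only in how they divide the token budget between segments. A minimal sketch in plain Python, assuming the budget is whatever remains of sequence_length once special tokens are reserved. Both function names are hypothetical, and this is not the library's implementation.

```python
def round_robin_budget(lengths, budget):
    """Hand out one token at a time to each segment that still needs one."""
    kept = [0] * len(lengths)
    remaining = budget
    while remaining > 0:
        progressed = False
        for i, length in enumerate(lengths):
            if remaining == 0:
                break
            if kept[i] < length:
                kept[i] += 1
                remaining -= 1
                progressed = True
        if not progressed:  # every segment already fits completely
            break
    return kept


def waterfall_budget(lengths, budget):
    """Fill segments left to right until the budget runs out."""
    kept = []
    remaining = budget
    for length in lengths:
        take = min(length, remaining)
        kept.append(take)
        remaining -= take
    return kept


# Two segments of 7 and 3 tokens competing for 6 positions.
print(round_robin_budget([7, 3], budget=6))
print(waterfall_budget([7, 3], budget=6))
```

Under these assumptions, round-robin keeps part of every segment, while waterfall can spend the whole budget on the first segment and leave later ones empty.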

Call arguments

  • x: A tensor of single string sequences, or a tuple of multiple tensor sequences to be packed together. Inputs may be batched or unbatched. For single sequences, raw python inputs will be converted to tensors. For multiple sequences, pass tensors directly.
  • y: Any label data. Will be passed through unaltered.
  • sample_weight: Any label weight data. Will be passed through unaltered.
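
The pass-through behaviour of y and sample_weight can be pictured with a small stand-in. This is a simplified sketch: preprocess_like is hypothetical, and the placeholder dictionary only marks where the real tokenize-and-pack step would run.

```python
def preprocess_like(x, y=None, sample_weight=None):
    """Hypothetical stand-in: transform x, leave y and sample_weight untouched."""
    features = {"text": x}  # placeholder for the real tokenize-and-pack step
    if y is None:
        return features
    if sample_weight is None:
        return features, y
    return features, y, sample_weight


only_x = preprocess_like("The quick brown fox jumped.")
with_label = preprocess_like("The quick brown fox jumped.", 1)
with_weight = preprocess_like("The quick brown fox jumped.", 1, 0.5)
```

The shape of the return value follows the arguments supplied: features alone, an (x, y) pair, or an (x, y, sample_weight) triple.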


Examples

Directly calling the layer on data.

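import io

import keras_nlp
import sentencepiece
import tensorflow as tf

# Either preset name listed under from_preset can be used here.
preprocessor = keras_nlp.models.XLMRobertaPreprocessor.from_preset(
    "xlm_roberta_base_multi"
)

# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")

# Tokenize a batch of single sentences.
preprocessor(["The quick brown fox jumped.", "اسمي اسماعيل"])

# Preprocess a batch of sentence pairs.
# When handling multiple sequences, always convert to tensors first!
first = tf.constant(["The quick brown fox jumped.", "اسمي اسماعيل"])
second = tf.constant(["The fox tripped.", "الأسد ملك الغابة"])
preprocessor((first, second))

# Custom vocabulary.
def train_sentencepiece(ds, vocab_size):
    # Train a small SentencePiece model in memory and return its serialized
    # proto. Assumed trainer settings: a word-level model with explicit
    # special-token ids.
    bytes_io = io.BytesIO()
    sentencepiece.SentencePieceTrainer.train(
        sentence_iterator=ds.as_numpy_iterator(),
        model_writer=bytes_io,
        vocab_size=vocab_size,
        model_type="WORD",
        unk_id=0,
        bos_id=1,
        eos_id=2,
    )
    return bytes_io.getvalue()

ds = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
proto = train_sentencepiece(ds, vocab_size=10)
tokenizer = keras_nlp.models.XLMRobertaTokenizer(proto=proto)
preprocessor = keras_nlp.models.XLMRobertaPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")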

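Mapping with tf.data.Dataset.map.

import keras_nlp
import tensorflow as tf

preprocessor = keras_nlp.models.XLMRobertaPreprocessor.from_preset(
    "xlm_roberta_base_multi"
)

first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
label = tf.constant([1, 1])

# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((first, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(first)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map labeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices(((first, second), label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices((first, second))
# Watch out for tf.data.Dataset.map's default unpacking of tuples here!
# Best to invoke the `preprocessor` directly in this case.
ds = ds.map(
    lambda first, second: preprocessor(x=(first, second)),
    num_parallel_calls=tf.data.AUTOTUNE,
)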
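
The tuple-unpacking pitfall in the last case can be reproduced without TensorFlow. In this sketch, map_like is a hypothetical function that imitates how a tuple element is spread across positional arguments, and fake_preprocessor is a stand-in that only reports how its arguments were bound.

```python
def fake_preprocessor(x, y=None, sample_weight=None):
    """Hypothetical stand-in that records which argument received what."""
    return {"x": x, "y": y, "sample_weight": sample_weight}


def map_like(fn, element):
    """Imitates tuple unpacking: a tuple element becomes positional arguments."""
    if isinstance(element, tuple):
        return fn(*element)
    return fn(element)


pair = ("The quick brown fox jumped.", "The fox tripped.")

# Passing the layer directly: the second sentence is bound to y, as if it
# were a label.
unpacked = map_like(fake_preprocessor, pair)

# Wrapping in a lambda keeps both sentences together as the input x.
wrapped = map_like(
    lambda first, second: fake_preprocessor(x=(first, second)), pair
)
```

This is why an unlabeled pair needs the explicit x=(first, second) call, while a labeled pair ((first, second), label) unpacks correctly on its own.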


from_preset method


Instantiate XLMRobertaPreprocessor from preset architecture.


Arguments

  • preset: string. Must be one of "xlm_roberta_base_multi", "xlm_roberta_large_multi".


Examples

# Load a preprocessor layer from a preset.
preprocessor = keras_nlp.models.XLMRobertaPreprocessor.from_preset(
    "xlm_roberta_base_multi"
)

Preset name             | Parameters | Description
xlm_roberta_base_multi  | 277.45M    | 12-layer XLM-RoBERTa model where case is maintained. Trained on CommonCrawl in 100 languages.
xlm_roberta_large_multi | 558.84M    | 24-layer XLM-RoBERTa model where case is maintained. Trained on CommonCrawl in 100 languages.

tokenizer property


The tokenizer used to tokenize strings.