XLMRobertaTokenizer class

keras_nlp.models.XLMRobertaTokenizer(proto, **kwargs)

An XLM-RoBERTa tokenizer using SentencePiece subword segmentation.

This tokenizer class tokenizes raw strings into integer sequences and is based on keras_nlp.tokenizers.SentencePieceTokenizer. Unlike the underlying tokenizer, it checks for all special tokens needed by XLM-RoBERTa models and provides a from_preset() method that automatically downloads a matching vocabulary for an XLM-RoBERTa preset.

Note: The original fairseq implementation of XLM-RoBERTa re-maps some token indices from the underlying SentencePiece output. To preserve compatibility, this tokenizer applies the same re-mapping. Keep this in mind if you provide your own custom SentencePiece model, because the ids returned here will not match the raw SentencePiece ids.
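To make the re-mapping concrete, here is a minimal pure-Python sketch. It assumes the id layout commonly described for fairseq XLM-RoBERTa vocabularies (special tokens at ids 0 to 3, regular pieces shifted up by one). The function names are hypothetical and this is not the KerasNLP source; the library's internals may differ in detail.

```python
# Assumed fairseq-style layout for the special tokens.
FAIRSEQ_SPECIAL_IDS = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
FAIRSEQ_OFFSET = 1  # regular pieces are shifted up by one
SPM_UNK_ID = 0      # assumed id of <unk> in the raw SentencePiece model


def spm_to_fairseq(spm_id: int) -> int:
    """Map a raw SentencePiece id to the re-mapped id (hypothetical helper)."""
    if spm_id == SPM_UNK_ID:
        return FAIRSEQ_SPECIAL_IDS["<unk>"]
    return spm_id + FAIRSEQ_OFFSET


def fairseq_to_spm(fairseq_id: int) -> int:
    """Invert the mapping for <unk> and regular pieces (hypothetical helper)."""
    if fairseq_id == FAIRSEQ_SPECIAL_IDS["<unk>"]:
        return SPM_UNK_ID
    return fairseq_id - FAIRSEQ_OFFSET


# Raw ids as a SentencePiece model might emit them for plain text.
# Raw ids 1 and 2 (the model's own bos/eos pieces) are assumed not to occur.
raw_ids = [0, 3, 4, 10]
remapped = [spm_to_fairseq(i) for i in raw_ids]
restored = [fairseq_to_spm(i) for i in remapped]
```

Under these assumptions the unknown token moves from raw id 0 to id 3, and every regular piece moves up by one, which is why ids from a custom SentencePiece model differ from the ids this tokenizer returns.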

If input is a batch of strings (rank > 0), the layer will output a tf.RaggedTensor where the last dimension of the output is ragged.

If input is a scalar string (rank == 0), the layer will output a dense tf.Tensor with static shape [None].


Arguments

  • proto: Either a string path to a SentencePiece proto file or a bytes object with a serialized SentencePiece proto. See the SentencePiece repository for more details on the format.
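The two accepted forms of proto carry the same information: a path points at a file whose contents are the serialized bytes. The sketch below shows that relationship with a hypothetical helper, read_proto_bytes, which is not part of KerasNLP and is only meant as an illustration.

```python
from pathlib import Path
from typing import Union


def read_proto_bytes(proto: Union[str, bytes]) -> bytes:
    """Return serialized SentencePiece proto bytes (hypothetical helper).

    Bytes are passed through unchanged; a string is treated as a file path
    and the file is read in binary mode.
    """
    if isinstance(proto, bytes):
        return proto
    return Path(proto).read_bytes()
```

In practice either form can be passed directly as proto; a helper like this is only useful when other code needs the serialized bytes regardless of how the model was supplied.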


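Examples

import io

import keras_nlp
import sentencepiece
import tensorflow as tf

tokenizer = keras_nlp.models.XLMRobertaTokenizer.from_preset(
    "xlm_roberta_base_multi",
)

# Unbatched inputs.
tokenizer("the quick brown fox")

# Batched inputs.
tokenizer(["the quick brown fox", "الأرض كروية"])

# Detokenization.
tokenizer.detokenize(tokenizer("the quick brown fox"))

# Custom vocabulary: train a small word-level SentencePiece model in memory.
# The trainer settings are illustrative assumptions for a toy corpus.
def train_sentencepiece(ds, vocab_size):
    bytes_io = io.BytesIO()
    sentencepiece.SentencePieceTrainer.train(
        sentence_iterator=ds.as_numpy_iterator(),
        model_writer=bytes_io,
        vocab_size=vocab_size,
        model_type="WORD",
        unk_id=0,
        bos_id=1,
        eos_id=2,
    )
    return bytes_io.getvalue()

ds = tf.data.Dataset.from_tensor_slices(
    ["the quick brown fox", "the earth is round"]
)
proto = train_sentencepiece(ds, vocab_size=10)
tokenizer = keras_nlp.models.XLMRobertaTokenizer(proto=proto)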


from_preset method


Instantiate an XLMRobertaTokenizer from a preset vocabulary.


Arguments

  • preset: string. Must be one of "xlm_roberta_base_multi", "xlm_roberta_large_multi".


Examples

from keras_nlp.models import XLMRobertaTokenizer

# Load a preset tokenizer.
tokenizer = XLMRobertaTokenizer.from_preset("xlm_roberta_base_multi")

# Tokenize some input.
tokenizer("The quick brown fox tripped.")

# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])

Preset name              Parameters  Description
xlm_roberta_base_multi   277.45M     12-layer XLM-RoBERTa model where case is maintained. Trained on CommonCrawl in 100 languages.
xlm_roberta_large_multi  558.84M     24-layer XLM-RoBERTa model where case is maintained. Trained on CommonCrawl in 100 languages.