
RobertaTokenizer


RobertaTokenizer class

keras_nlp.models.RobertaTokenizer(vocabulary=None, merges=None, **kwargs)

A RoBERTa tokenizer using Byte-Pair Encoding subword segmentation.

This tokenizer class will tokenize raw strings into integer sequences and is based on keras_nlp.tokenizers.BytePairTokenizer. Unlike the underlying tokenizer, it will check for all special tokens needed by RoBERTa models and provides a from_preset() method to automatically download a matching vocabulary for a RoBERTa preset.

This tokenizer does not provide truncation or padding of inputs. It can be combined with a keras_nlp.models.RobertaPreprocessor layer for input packing.

If input is a batch of strings (rank > 0), the layer will output a tf.RaggedTensor where the last dimension of the output is ragged.

If input is a scalar string (rank == 0), the layer will output a dense tf.Tensor with static shape [None].

Arguments

  • vocabulary: A dictionary mapping tokens to integer ids, or a file path to a JSON file containing the token-to-id mapping.
  • merges: A list of merge rules or a string file path. If passing a file path, the file should have one merge rule per line, with each rule containing two merge entities separated by a space.
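To illustrate the merge-rule format, here is a minimal pure-Python sketch of greedy byte-pair merging (not the actual keras_nlp implementation, which operates on tensors): each "a b" rule says the adjacent symbols a and b may be fused, and earlier rules take priority.

```python
def bpe_tokenize(word, merges):
    """Greedily apply ordered BPE merge rules to a sequence of symbols.

    `word` is a string of single characters (with 'Ġ' marking a leading
    space); `merges` is an ordered list of "a b" rules, earlier = higher
    priority. Returns the list of merged subword symbols.
    """
    # Rank each rule by its position in the merges list.
    ranks = {tuple(rule.split()): i for i, rule in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(ranks.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        best_rank, i = min(pairs)
        if best_rank == float("inf"):
            break  # No applicable merge rule remains.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
print(bpe_tokenize("Ġquick", merges))  # ['Ġquick']
```

With the rules above, "Ġquick" collapses step by step (Ġ+q, u+i, c+k, ui+ck, Ġq+uick) into the single vocabulary token "Ġquick", matching the custom-vocabulary example below.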

Examples

# Unbatched input.
tokenizer = keras_nlp.models.RobertaTokenizer.from_preset(
    "roberta_base_en",
)
tokenizer("The quick brown fox jumped.")

# Batched input.
tokenizer(["The quick brown fox jumped.", "The fox slept."])

# Detokenization.
tokenizer.detokenize(tokenizer("The quick brown fox jumped."))

# Custom vocabulary.
# Note: 'Ġ' is space
vocab = {"<s>": 0, "<pad>": 1, "</s>": 2, "<mask>": 3}
vocab = {**vocab, "a": 4, "Ġquick": 5, "Ġfox": 6}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]
tokenizer = keras_nlp.models.RobertaTokenizer(
    vocabulary=vocab,
    merges=merges
)
tokenizer(["a quick fox", "a fox quick"])


from_preset method

RobertaTokenizer.from_preset()

Instantiate a RobertaTokenizer from a preset vocabulary.

Arguments

  • preset: string. Must be one of "roberta_base_en", "roberta_large_en".

Examples

# Load a preset tokenizer.
tokenizer = RobertaTokenizer.from_preset("roberta_base_en")

# Tokenize some input.
tokenizer("The quick brown fox tripped.")

# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])
Preset name      | Parameters | Description
-----------------|------------|------------
roberta_base_en  | 124.05M    | 12-layer RoBERTa model where case is maintained. Trained on English Wikipedia, BooksCorpus, CommonCrawl, and OpenWebText.
roberta_large_en | 354.31M    | 24-layer RoBERTa model where case is maintained. Trained on English Wikipedia, BooksCorpus, CommonCrawl, and OpenWebText.