DistilBertTokenizer

DistilBertTokenizer class

keras_nlp.models.DistilBertTokenizer(vocabulary, lowercase=False, **kwargs)

A DistilBERT tokenizer using WordPiece subword segmentation.

This tokenizer class will tokenize raw strings into integer sequences and is based on keras_nlp.tokenizers.WordPieceTokenizer. Unlike the underlying tokenizer, it will check for all special tokens needed by DistilBERT models and provides a from_preset() method to automatically download a matching vocabulary for a DistilBERT preset.

This tokenizer does not provide truncation or padding of inputs. It can be combined with a keras_nlp.models.DistilBertPreprocessor layer for input packing.
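
For example, a minimal sketch of packing inputs with the matching preprocessor (the sequence_length value is an arbitrary choice for illustration):

# Pad and truncate tokenized inputs to a fixed length.
preprocessor = keras_nlp.models.DistilBertPreprocessor.from_preset(
    "distil_bert_base_en_uncased",
    sequence_length=128,
)
# Returns a dict of dense tensors ("token_ids" and "padding_mask").
preprocessor("The quick brown fox jumped.")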

If input is a batch of strings (rank > 0), the layer will output a tf.RaggedTensor where the last dimension of the output is ragged.

If input is a scalar string (rank == 0), the layer will output a dense tf.Tensor with static shape [None].
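
A quick illustration of both cases (a hedged sketch; exact token values depend on the preset vocabulary):

tokenizer = keras_nlp.models.DistilBertTokenizer.from_preset(
    "distil_bert_base_en_uncased",
)
# Batched input -> tf.RaggedTensor with one ragged row per string.
tokenizer(["The quick brown fox jumped.", "The fox slept."])
# Scalar input -> dense 1D tf.Tensor of token ids.
tokenizer("The quick brown fox jumped.")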

Arguments

  • vocabulary: A list of strings or a string filename path. If passing a list, each element of the list should be a single word piece token string. If passing a filename, the file should be a plain text file containing a single word piece token per line (see the sketch after this list).
  • lowercase: If True, the input text will first be lowercased before tokenization. Defaults to False.
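
As a sketch of the file-based option (the file name and vocabulary contents here are hypothetical):

# Write a hypothetical plain text vocabulary, one WordPiece token per line.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "the", "fox"]
with open("vocab.txt", "w") as f:
    f.write("\n".join(vocab))

tokenizer = keras_nlp.models.DistilBertTokenizer(
    vocabulary="vocab.txt",
    lowercase=True,
)
tokenizer("The fox")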

Examples

# Unbatched input.
tokenizer = keras_nlp.models.DistilBertTokenizer.from_preset(
    "distil_bert_base_en_uncased",
)
tokenizer("The quick brown fox jumped.")

# Batched input.
tokenizer(["The quick brown fox jumped.", "The fox slept."])

# Detokenization.
tokenizer.detokenize(tokenizer("The quick brown fox jumped."))

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.DistilBertTokenizer(vocabulary=vocab)
tokenizer("The quick brown fox jumped.")

from_preset method

DistilBertTokenizer.from_preset()

Instantiate a DistilBertTokenizer from a preset vocabulary.

Arguments

  • preset: string. Must be one of "distil_bert_base_en_uncased", "distil_bert_base_en", "distil_bert_base_multi".

Examples

# Load a preset tokenizer.
tokenizer = DistilBertTokenizer.from_preset("distil_bert_base_en_uncased")

# Tokenize some input.
tokenizer("The quick brown fox tripped.")

# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])

| Preset name | Parameters | Description |
|---|---|---|
| distil_bert_base_en_uncased | 66.36M | 6-layer DistilBERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus using BERT as the teacher model. |
| distil_bert_base_en | 65.19M | 6-layer DistilBERT model where case is maintained. Trained on English Wikipedia + BooksCorpus using BERT as the teacher model. |
| distil_bert_base_multi | 134.73M | 6-layer DistilBERT model where case is maintained. Trained on Wikipedias of 104 languages. |