BertTokenizer

BertTokenizer class

keras_nlp.models.BertTokenizer(vocabulary=None, lowercase=False, **kwargs)

A BERT tokenizer using WordPiece subword segmentation.

This tokenizer class will tokenize raw strings into integer sequences and is based on keras_nlp.tokenizers.WordPieceTokenizer. Unlike the underlying tokenizer, it will check for all special tokens needed by BERT models and provides a from_preset() method to automatically download a matching vocabulary for a BERT preset.
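WordPiece segments each whitespace-separated word greedily, longest-match-first: it takes the longest prefix found in the vocabulary, marks continuation pieces with a `##` prefix, and falls back to an unknown token when no piece matches. A minimal pure-Python sketch of this scheme (not the actual keras_nlp implementation; the vocabulary and function name are illustrative):

```python
# Sketch of WordPiece greedy longest-match-first segmentation.
# The vocabulary here is illustrative, not a real BERT vocabulary.

def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest possible substring first, shrinking until a
        # vocabulary entry is found. Non-initial pieces carry a "##" prefix.
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces

vocab = {"the", "qu", "##ick", "brown", "fox", "jump", "##ed", "."}
print(wordpiece_tokenize("jumped", vocab))  # ['jump', '##ed']
print(wordpiece_tokenize("zebra", vocab))   # ['[UNK]']
```

The real tokenizer additionally maps each piece to its integer index in the vocabulary file, which is why the examples below return integer sequences rather than strings.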

This tokenizer does not provide truncation or padding of inputs. It can be combined with a keras_nlp.models.BertPreprocessor layer for input packing.

If input is a batch of strings (rank > 0), the layer will output a tf.RaggedTensor where the last dimension of the output is ragged.

If input is a scalar string (rank == 0), the layer will output a dense tf.Tensor with static shape [None].

Arguments

  • vocabulary: A list of strings or a string filename path. If passing a list, each element of the list should be a single word piece token string. If passing a filename, the file should be a plain text file containing a single word piece token per line.
  • lowercase: If True, the input text will first be lowercased before tokenization.

Examples

# Unbatched input.
tokenizer = keras_nlp.models.BertTokenizer.from_preset(
    "bert_base_en_uncased",
)
tokenizer("The quick brown fox jumped.")

# Batched input.
tokenizer(["The quick brown fox jumped.", "The fox slept."])

# Detokenization.
tokenizer.detokenize(tokenizer("The quick brown fox jumped."))

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.BertTokenizer(vocabulary=vocab)
tokenizer("The quick brown fox jumped.")

from_preset method

BertTokenizer.from_preset()

Instantiate a BertTokenizer from a preset vocabulary.

Arguments

  • preset: string. Must be one of "bert_tiny_en_uncased", "bert_small_en_uncased", "bert_medium_en_uncased", "bert_base_en_uncased", "bert_base_en", "bert_base_zh", "bert_base_multi", "bert_large_en_uncased", "bert_large_en", "bert_tiny_en_uncased_sst2".

Examples

# Load a preset tokenizer.
tokenizer = BertTokenizer.from_preset("bert_tiny_en_uncased")

# Tokenize some input.
tokenizer("The quick brown fox tripped.")

# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])
Preset name                 Parameters  Description
bert_tiny_en_uncased        4.39M       2-layer BERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus.
bert_small_en_uncased       28.76M      4-layer BERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus.
bert_medium_en_uncased      41.37M      8-layer BERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus.
bert_base_en_uncased        109.48M     12-layer BERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus.
bert_base_en                108.31M     12-layer BERT model where case is maintained. Trained on English Wikipedia + BooksCorpus.
bert_base_zh                102.27M     12-layer BERT model. Trained on Chinese Wikipedia.
bert_base_multi             177.85M     12-layer BERT model where case is maintained. Trained on Wikipedias of 104 languages.
bert_large_en_uncased       335.14M     24-layer BERT model where all input is lowercased. Trained on English Wikipedia + BooksCorpus.
bert_large_en               333.58M     24-layer BERT model where case is maintained. Trained on English Wikipedia + BooksCorpus.
bert_tiny_en_uncased_sst2   4.39M       The bert_tiny_en_uncased backbone model fine-tuned on the SST-2 sentiment analysis dataset.