Tokenizer base class

[source]

Tokenizer class

keras_nlp.tokenizers.Tokenizer()

A base class for tokenizer layers.

Tokenizers in the KerasNLP library should all subclass this layer. The class provides two core methods, tokenize() and detokenize(), for converting plain text to token sequences and back. A tokenizer is a subclass of keras.layers.Layer and can be combined into a keras.Model.

Subclassers should always implement the tokenize() method, which is also the default behavior when the layer is called directly on inputs.

Subclassers can optionally implement the detokenize() method if the tokenization is reversible; otherwise it can be skipped.

Subclassers should implement get_vocabulary(), vocabulary_size(), token_to_id() and id_to_token() if applicable. For some simple "vocab free" tokenizers, such as the whitespace splitter shown below, these methods do not apply and can be skipped.

Examples

import keras_nlp
import tensorflow as tf

class WhitespaceSplitterTokenizer(keras_nlp.tokenizers.Tokenizer):
    def tokenize(self, inputs):
        return tf.strings.split(inputs)

    def detokenize(self, inputs):
        return tf.strings.reduce_join(inputs, separator=" ", axis=-1)

tokenizer = WhitespaceSplitterTokenizer()

# Tokenize some inputs.
tokenizer.tokenize("This is a test")

# Shorthand for `tokenize()`.
tokenizer("This is a test")

# Detokenize some outputs.
tokenizer.detokenize(["This", "is", "a", "test"])

[source]

tokenize method

Tokenizer.tokenize(inputs, *args, **kwargs)

Transform input tensors of strings into output tokens.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.
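
As a minimal sketch (reusing the WhitespaceSplitterTokenizer defined above), a scalar string input produces a dense 1-D tensor of tokens, while a batch of strings produces a ragged tensor, since each example may yield a different number of tokens:

tokenizer = WhitespaceSplitterTokenizer()

# A scalar string input yields a 1-D tensor of tokens.
tokenizer.tokenize("the quick brown fox")
# -> [b'the', b'quick', b'brown', b'fox']

# A batch of strings yields a ragged tensor, since each example
# may contain a different number of tokens.
tokenizer.tokenize(["the quick brown fox", "hello world"])
# -> [[b'the', b'quick', b'brown', b'fox'], [b'hello', b'world']]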

[source]

detokenize method

Tokenizer.detokenize(inputs, *args, **kwargs)

Transform tokens back into strings.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.
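
For a reversible tokenizer like the WhitespaceSplitterTokenizer sketched above, detokenize() inverts tokenize(), joining each example's tokens back into a single string:

tokenizer = WhitespaceSplitterTokenizer()

# Round-trip a batch of strings: split into tokens, then join back.
tokens = tokenizer.tokenize(["the quick brown fox", "hello world"])
tokenizer.detokenize(tokens)
# -> [b'the quick brown fox', b'hello world']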

[source]

get_vocabulary method

Tokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.


[source]

vocabulary_size method

Tokenizer.vocabulary_size()

Returns the total size of the token id space.
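
For instance, with a vocabulary-backed subclass such as WordPieceTokenizer (the toy five-term vocabulary below is illustrative, not a trained WordPiece vocabulary):

import keras_nlp

vocab = ["[UNK]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

tokenizer.get_vocabulary()
# -> ['[UNK]', 'the', 'quick', 'brown', 'fox']

tokenizer.vocabulary_size()
# -> 5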


[source]

token_to_id method

Tokenizer.token_to_id(token: str)

Convert a string token to an integer id.


[source]

id_to_token method

Tokenizer.id_to_token(id: int)

Convert an integer id to a string token.
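
As a sketch using the same illustrative vocabulary as above, token_to_id() and id_to_token() are inverse lookups over the vocabulary:

import keras_nlp

vocab = ["[UNK]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

tokenizer.token_to_id("quick")
# -> 2

tokenizer.id_to_token(2)
# -> 'quick'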