Tokenizer base class

[source]

Tokenizer class

keras_nlp.tokenizers.Tokenizer()

A base class for tokenizer layers.

Tokenizers in the KerasNLP library should all subclass this layer. The class provides two core methods, tokenize() and detokenize(), for converting plain text to token sequences and back. A tokenizer is a subclass of keras.layers.Layer and can be combined into a keras.Model.

Subclassers should always implement the tokenize() method, which is also the default behavior when the layer is called directly on inputs.

Subclassers can optionally implement the detokenize() method if the tokenization is reversible; otherwise it can be skipped.

Subclassers should implement get_vocabulary(), vocabulary_size(), token_to_id() and id_to_token() if applicable. For some simple "vocab free" tokenizers, such as the whitespace splitter shown below, these methods do not apply and can be skipped.

Examples

import keras_nlp
import tensorflow as tf

class WhitespaceSplitterTokenizer(keras_nlp.tokenizers.Tokenizer):
    def tokenize(self, inputs):
        return tf.strings.split(inputs)

    def detokenize(self, inputs):
        return tf.strings.reduce_join(inputs, separator=" ", axis=-1)

tokenizer = WhitespaceSplitterTokenizer()

# Tokenize some inputs.
tokenizer.tokenize("This is a test")

# Shorthand for `tokenize()`.
tokenizer("This is a test")

# Detokenize some outputs.
tokenizer.detokenize(["This", "is", "a", "test"])

[source]

tokenize method

Tokenizer.tokenize(inputs, *args, **kwargs)

Transform input tensors of strings into output tokens.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.
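
As a minimal sketch (reusing the WhitespaceSplitterTokenizer defined above), a scalar string input produces a dense 1-D tensor of tokens, while a batch of strings produces a ragged tensor, since each example may yield a different number of tokens:

tokenizer = WhitespaceSplitterTokenizer()

# A scalar string input yields a 1-D tensor of tokens.
tokenizer.tokenize("the quick brown fox")
# -> [b'the', b'quick', b'brown', b'fox']

# A batch of strings yields a ragged tensor, since each example
# may contain a different number of tokens.
tokenizer.tokenize(["the quick brown fox", "hello world"])
# -> [[b'the', b'quick', b'brown', b'fox'], [b'hello', b'world']]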

[source]

detokenize method

Tokenizer.detokenize(inputs, *args, **kwargs)

Transform tokens back into strings.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.
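
For a reversible tokenizer like the WhitespaceSplitterTokenizer sketched above, detokenize() inverts tokenize(), joining each example's tokens back into a single string:

tokenizer = WhitespaceSplitterTokenizer()

# Round-trip a batch of strings: split into tokens, then join back.
tokens = tokenizer.tokenize(["the quick brown fox", "hello world"])
tokenizer.detokenize(tokens)
# -> [b'the quick brown fox', b'hello world']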

[source]

get_vocabulary method

Tokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.


[source]

vocabulary_size method

Tokenizer.vocabulary_size()

Returns the total size of the token id space.
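
For instance, with a vocabulary-backed subclass such as WordPieceTokenizer (the toy five-term vocabulary below is illustrative, not a trained WordPiece vocabulary):

import keras_nlp

vocab = ["[UNK]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

tokenizer.get_vocabulary()
# -> ['[UNK]', 'the', 'quick', 'brown', 'fox']

tokenizer.vocabulary_size()
# -> 5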


[source]

token_to_id method

Tokenizer.token_to_id(token: str)

Convert a string token to an integer id.


[source]

id_to_token method

Tokenizer.id_to_token(id: int)

Convert an integer id to a string token.
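
As a sketch using the same illustrative vocabulary as above, token_to_id() and id_to_token() are inverse lookups over the vocabulary:

import keras_nlp

vocab = ["[UNK]", "the", "quick", "brown", "fox"]
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)

tokenizer.token_to_id("quick")
# -> 2

tokenizer.id_to_token(2)
# -> 'quick'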