Tokenizer
class keras_nlp.tokenizers.Tokenizer()
A base class for tokenizer layers.
Tokenizers in the KerasNLP library should all subclass this layer. The class provides two core methods, tokenize() and detokenize(), for going from plain text to sequences and back. A tokenizer is a subclass of keras.layers.Layer and can be combined into a keras.Model.
Subclassers should always implement the tokenize() method, which will also be the default when calling the layer directly on inputs.
Subclassers can optionally implement the detokenize() method if the tokenization is reversible. Otherwise, this can be skipped.
Subclassers should implement get_vocabulary(), vocabulary_size(), token_to_id() and id_to_token() if applicable. For some simple "vocab free" tokenizers, such as the whitespace splitter shown below, these methods do not apply and can be skipped.
Examples
class WhitespaceSplitterTokenizer(keras_nlp.tokenizers.Tokenizer):
    def tokenize(self, inputs):
        return tf.strings.split(inputs)

    def detokenize(self, inputs):
        return tf.strings.reduce_join(inputs, separator=" ", axis=-1)

tokenizer = WhitespaceSplitterTokenizer()

# Tokenize some inputs.
tokenizer.tokenize("This is a test")

# Shorthand for `tokenize()`.
tokenizer("This is a test")

# Detokenize some outputs.
tokenizer.detokenize(["This", "is", "a", "test"])
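For tokenizers that do carry a vocabulary, the four lookup methods described above typically wrap a simple token-to-id mapping. The following plain-Python sketch (a standalone illustrative class, not a real keras_nlp subclass, with a made-up three-token vocabulary) shows the expected behavior of get_vocabulary(), vocabulary_size(), token_to_id() and id_to_token():

```python
class ToyVocabTokenizer:
    """Illustrative stand-in for a vocabulary-backed tokenizer."""

    def __init__(self, vocabulary):
        # Store tokens in order; a token's id is its position in the list.
        self._vocab = list(vocabulary)
        self._token_to_id = {token: i for i, token in enumerate(self._vocab)}

    def get_vocabulary(self):
        # The vocabulary as a list of string terms.
        return list(self._vocab)

    def vocabulary_size(self):
        # Total size of the token id space.
        return len(self._vocab)

    def token_to_id(self, token):
        # Convert a string token to an integer id.
        return self._token_to_id[token]

    def id_to_token(self, id):
        # Convert an integer id back to a string token.
        return self._vocab[id]


tokenizer = ToyVocabTokenizer(["[UNK]", "hello", "world"])
tokenizer.vocabulary_size()     # → 3
tokenizer.token_to_id("hello")  # → 1
tokenizer.id_to_token(2)        # → "world"
```

A real subclass would implement the same methods with whatever lookup structure its vocabulary format requires.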
tokenize method
Tokenizer.tokenize(inputs)
Transform input tensors of strings into output tokens.
Arguments
inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method
Tokenizer.detokenize(inputs)
Transform tokens back into strings.
Arguments
inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method
Tokenizer.get_vocabulary()
Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method
Tokenizer.vocabulary_size()
Returns the total size of the token id space.
token_to_id method
Tokenizer.token_to_id(token: str)
Convert a string token to an integer id.
id_to_token method
Tokenizer.id_to_token(id: int)
Convert an integer id to a string token.
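When a tokenizer implements both tokenize() and detokenize(), the two should round-trip. Using plain Python string operations as a stand-in for the tf.strings ops in the whitespace example above (an illustrative sketch, not the real layer):

```python
def tokenize(inputs):
    # Split a string on whitespace, like tf.strings.split.
    return inputs.split()

def detokenize(inputs):
    # Join tokens with single spaces, like tf.strings.reduce_join
    # with separator=" ".
    return " ".join(inputs)

tokens = tokenize("This is a test")  # → ["This", "is", "a", "test"]
text = detokenize(tokens)            # → "This is a test"
```

Note the round trip is only exact for reversible tokenizations; tokenizers that normalize or drop characters cannot guarantee detokenize(tokenize(x)) == x.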