ByteTokenizer

ByteTokenizer class

keras_nlp.tokenizers.ByteTokenizer(
    lowercase: bool = True,
    sequence_length: int = None,
    normalization_form: str = None,
    errors: str = "replace",
    replacement_char: int = 65533,
    **kwargs
)

Raw byte tokenizer.

This tokenizer is vocabulary-free: it tokenizes text as raw bytes, so every token id is an integer in [0, 256).
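As a rough illustration of what "raw bytes" means here, the following pure-Python sketch mimics the default behavior (lowercase, then encode) without TensorFlow. The name byte_tokenize is hypothetical and not part of KerasNLP, and the sketch assumes the bytes are the UTF-8 encoding of the text. Python's str.lower() may also differ from the layer's lowercasing for some non-ASCII characters.

```python
from typing import List


def byte_tokenize(text: str, lowercase: bool = True) -> List[int]:
    """Hypothetical stand-in for byte tokenization, for illustration only."""
    if lowercase:
        text = text.lower()
    # Each UTF-8 byte becomes one token id in the range [0, 256).
    return list(text.encode("utf-8"))


ascii_ids = byte_tokenize("Hello")  # one id per ASCII character
accent_ids = byte_tokenize("é")     # a non-ASCII character spans several ids

print(ascii_ids)   # [104, 101, 108, 108, 111]
print(accent_ids)  # [195, 169]
```

Note that the number of tokens equals the number of encoded bytes, not the number of characters.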

Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.

If input is a batch of strings: By default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.

If input is a scalar string: If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length]. Otherwise, the output will be a dense tf.Tensor of shape [None].
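The padding and truncation described above can be sketched in plain Python. This is only an illustration under stated assumptions: pad_or_truncate is a hypothetical helper, not a KerasNLP function, and the pad id of 0 is taken from the dense-output examples on this page.

```python
from typing import List


def pad_or_truncate(ids: List[int], sequence_length: int, pad_id: int = 0) -> List[int]:
    """Hypothetical helper: force a list of ids to exactly sequence_length."""
    ids = ids[:sequence_length]  # truncate if too long
    return ids + [pad_id] * (sequence_length - len(ids))  # pad if too short


# Without sequence_length, rows keep their own lengths (ragged).
ragged = [list(s.encode("utf-8")) for s in ["hello", "hi"]]

# With sequence_length=8, every row is padded to the same length (dense).
dense = [pad_or_truncate(ids, 8) for ids in ragged]

# A sequence_length shorter than the input truncates it.
truncated = pad_or_truncate(ragged[0], 3)

print(dense)      # [[104, 101, 108, 108, 111, 0, 0, 0], [104, 105, 0, 0, 0, 0, 0, 0]]
print(truncated)  # [104, 101, 108]
```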

The output dtype can be controlled via the dtype argument, which should be an integer type (tf.int16, tf.int32, etc.).

Arguments

  • lowercase: boolean. If True, the input text will be converted to lowercase before tokenization.
  • sequence_length: int. If set, the output will be converted to a dense tensor and padded or truncated so that every output has length sequence_length.
  • normalization_form: string. One of the following values: (None, "NFC", "NFKC", "NFD", "NFKD"). If set, every UTF-8 string in the input tensor text will be normalized to the given form before tokenizing.
  • errors: string. One of ("strict", "replace", "ignore"). Defaults to "replace". Specifies the detokenize() behaviour when an invalid byte sequence is encountered (same behaviour as https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode).
  • replacement_char: int. Defaults to 65533. The replacement character to use when an invalid byte sequence is encountered and when errors is set to "replace" (same behaviour as https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode).
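To see why normalization_form matters for a byte-level tokenizer, the sketch below uses Python's standard unicodedata module rather than the layer itself. It assumes the layer's normalization behaves like standard Unicode normalization. The same visible character can produce different byte sequences depending on the form.

```python
import unicodedata

decomposed = "e\u0301"  # "e" followed by a combining acute accent
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# Byte-level token ids before and after NFC normalization.
decomposed_ids = list(decomposed.encode("utf-8"))
composed_ids = list(composed.encode("utf-8"))

print(decomposed_ids)  # [101, 204, 129]
print(composed_ids)    # [195, 169]
```

Choosing one normalization form keeps visually identical inputs from mapping to different token sequences.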

Examples

Basic usage.

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> tokenizer("hello")
<tf.Tensor: shape=(5,), dtype=int32, numpy=
array([104, 101, 108, 108, 111], dtype=int32)>

Ragged outputs.

>>> inputs = tf.constant(["hello", "hi"])
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> tokenizer(inputs)
<tf.RaggedTensor [[104, 101, 108, 108, 111], [104, 105]]>

Dense outputs.

>>> inputs = tf.constant(["hello", "hi"])
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=8)
>>> tokenizer(inputs)
<tf.Tensor: shape=(2, 8), dtype=int32, numpy=
array([[104, 101, 108, 108, 111,   0,   0,   0],
       [104, 105,   0,   0,   0,   0,   0,   0]], dtype=int32)>

Dense outputs from a single-element batch.

>>> inputs = tf.constant(["hello"])
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=8)
>>> tokenizer(inputs)
<tf.Tensor: shape=(1, 8), dtype=int32, numpy=
array([[104, 101, 108, 108, 111,   0,   0,   0]], dtype=int32)>

Tokenize, then batch for ragged outputs.

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>

Batch, then tokenize for ragged outputs.

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[104, 101, 108, 108, 111], [102, 117, 110]]>

Tokenize, then batch for dense outputs (sequence_length provided).

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(2))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
       [102, 117, 110,   0,   0]], dtype=int32)>

Batch, then tokenize for dense outputs (sequence_length provided).

>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(["hello", "fun"])
>>> ds = ds.batch(2).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(2, 5), dtype=int32, numpy=
array([[104, 101, 108, 108, 111],
       [102, 117, 110,   0,   0]], dtype=int32)>

Detokenization.

>>> inputs = tf.constant([104, 101, 108, 108, 111], dtype=tf.int32)
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer()
>>> tokenizer.detokenize(inputs)
<tf.Tensor: shape=(), dtype=string, numpy=b'hello'>

Detokenization with invalid bytes.

>>> # The 255 below is invalid utf-8.
>>> inputs = tf.constant([104, 101, 255, 108, 108, 111], dtype=tf.int32)
>>> tokenizer = keras_nlp.tokenizers.ByteTokenizer(
...     errors="replace", replacement_char=88)
>>> tokenizer.detokenize(inputs).numpy().decode('utf-8')
'heXllo'
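The three errors modes can be approximated in plain Python with bytes.decode. This is a sketch under assumptions, not the layer's implementation: byte_detokenize is a hypothetical name, and Python's decoder may replace multi-byte invalid sequences at a different granularity than tf.strings.unicode_transcode.

```python
from typing import List


def byte_detokenize(
    ids: List[int], errors: str = "replace", replacement_char: int = 65533
) -> str:
    """Hypothetical stand-in for detokenization, for illustration only."""
    # "strict" raises, "ignore" drops invalid bytes, "replace" emits U+FFFD.
    text = bytes(ids).decode("utf-8", errors=errors)
    if errors == "replace" and replacement_char != 0xFFFD:
        # Simplification: this also rewrites any U+FFFD that was validly
        # encoded in the input, which a real implementation would not do.
        text = text.replace("\ufffd", chr(replacement_char))
    return text


ids = [104, 101, 255, 108, 108, 111]  # 255 is never valid in UTF-8

replaced = byte_detokenize(ids, errors="replace", replacement_char=88)
ignored = byte_detokenize(ids, errors="ignore")

print(replaced)  # heXllo
print(ignored)   # hello
```

With errors="strict", the same input raises UnicodeDecodeError in this sketch.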

tokenize method

ByteTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

detokenize method

ByteTokenizer.detokenize(inputs)

Transform tokens back into strings.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

get_vocabulary method

ByteTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.


vocabulary_size method

ByteTokenizer.vocabulary_size()

Get the size of the tokenizer vocabulary.


token_to_id method

ByteTokenizer.token_to_id(token: str)

Convert a string token to an integer id.


id_to_token method

ByteTokenizer.id_to_token(id: int)

Convert an integer id to a string token.
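The two conversion methods are inverses of each other. The sketch below shows the direction of each mapping in plain Python, assuming (this page does not state it) that each id corresponds to the single character with that code point. The standalone functions are hypothetical stand-ins, not the class methods.

```python
def id_to_token(id: int) -> str:
    """Hypothetical stand-in: integer id -> string token."""
    return chr(id)  # assumption: id maps to the character with that code point


def token_to_id(token: str) -> int:
    """Hypothetical stand-in: string token -> integer id."""
    return ord(token)


token = id_to_token(104)
round_trip = token_to_id(token)

print(token)       # h
print(round_trip)  # 104
```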