» Keras API reference / KerasNLP / Tokenizers / UnicodeCharacterTokenizer

UnicodeCharacterTokenizer

[source]

UnicodeCharacterTokenizer class

keras_nlp.tokenizers.UnicodeCharacterTokenizer(
    sequence_length: int = None,
    lowercase: bool = True,
    normalization_form: str = None,
    errors: str = "replace",
    replacement_char: int = 65533,
    input_encoding: str = "UTF-8",
    output_encoding: str = "UTF-8",
    vocabulary_size: int = None,
    **kwargs
)

A unicode character tokenizer layer.

This tokenizer is a vocabulary-free tokenizer which tokenizes text as Unicode character codepoints.

Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.

If input is a batch of strings (rank > 0): By default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.

If input is a scalar string (rank == 0): By default, the layer will output a dense tf.Tensor with static shape [None]. If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length].

The output dtype can be controlled via the dtype argument, which should be an integer type (tf.int16, tf.int32, etc.).
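As a plain-Python sanity check on dtype choice (this reasoning is an assumption about integer capacity, not part of the tokenizer API): the full Unicode range overflows a signed 16-bit integer, so tf.int32 is the safe default when inputs may contain high codepoints.

```python
# The highest valid Unicode codepoint, U+10FFFF, does not fit in a
# signed 16-bit integer, so tf.int16 only suffices for low-codepoint text.
max_codepoint = 0x10FFFF          # 1114111
int16_max = 2**15 - 1             # 32767
print(max_codepoint > int16_max)  # True: prefer tf.int32 for full coverage
print(ord("क") <= int16_max)      # True: Devanagari codepoints still fit in int16
```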

Arguments

  • lowercase: If true, the input text will first be lowercased before tokenization.
  • sequence_length: If set, the output will be converted to a dense tensor and padded/trimmed so all outputs are of sequence_length.
  • normalization_form: One of the following string values (None, 'NFC', 'NFKC', 'NFD', 'NFKD'). If set will normalize unicode to the given form before tokenizing.
  • errors: One of ('replace', 'remove', 'strict'). Specifies the detokenize() behavior when an invalid codepoint is encountered. (same behavior as https://www.tensorflow.org/api_docs/python/tf/strings/unicode_transcode)
  • replacement_char: The unicode codepoint to use in place of invalid codepoints. Defaults to 65533 (U+FFFD).
  • input_encoding: One of ("UTF-8", "UTF-16-BE", or "UTF-32-BE"). The encoding of the input text. Defaults to "UTF-8".
  • output_encoding: One of ("UTF-8", "UTF-16-BE", or "UTF-32-BE"). The encoding of the output text. Defaults to "UTF-8".
  • vocabulary_size: Set the vocabulary size, by clamping all codepoints to the range [0, vocabulary_size). Effectively this will make the id vocabulary_size - 1 the OOV value.
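To see why normalization_form matters for a codepoint tokenizer, here is a stdlib-only sketch (using Python's unicodedata module rather than the tokenizer itself): visually identical strings can have different codepoint sequences until they are normalized.

```python
import unicodedata

composed = "\u00e9"        # "é" as a single codepoint
decomposed = "e\u0301"     # "e" followed by a combining acute accent
print([ord(c) for c in composed])    # [233]
print([ord(c) for c in decomposed])  # [101, 769]

# NFC composes the pair into one codepoint, so after normalization both
# strings would tokenize to the same ids:
normalized = unicodedata.normalize("NFC", decomposed)
print([ord(c) for c in normalized])  # [233]
```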

Examples

Basic Usage.

>>> inputs = "Unicode Tokenizer"
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()
>>> tokenizer(inputs)
<tf.Tensor: shape=(17,), dtype=int32, numpy=
array([117, 110, 105,  99, 111, 100, 101,  32, 116, 111, 107, 101, 110,
    105, 122, 101, 114], dtype=int32)>

Ragged outputs.

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()
>>> tokenizer(inputs)
<tf.RaggedTensor [[98, 111, 111, 107],
    [2346, 2369, 2360, 2381, 2340, 2325],
    [1705, 1578, 1575, 1576]]>

Dense outputs.

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     sequence_length=8)
>>> tokenizer(inputs)
<tf.Tensor: shape=(3, 8), dtype=int32, numpy=
array([[  98,  111,  111,  107,    0,    0,    0,    0],
    [2346, 2369, 2360, 2381, 2340, 2325,    0,    0],
    [1705, 1578, 1575, 1576,    0,    0,    0,    0]], dtype=int32)>

Tokenize, then batch for ragged outputs.

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
    [2346, 2369, 2360, 2381, 2340, 2325],
    [1705, 1578, 1575, 1576]]>

Batch, then tokenize for ragged outputs.

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
    [2346, 2369, 2360, 2381, 2340, 2325],
    [1705, 1578, 1575, 1576]]>

Tokenize, then batch for dense outputs (sequence_length provided).

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[  98,  111,  111,  107,    0],
    [2346, 2369, 2360, 2381, 2340],
    [1705, 1578, 1575, 1576,    0]], dtype=int32)>

Batch, then tokenize for dense outputs (sequence_length provided).

>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[  98,  111,  111,  107,    0],
    [2346, 2369, 2360, 2381, 2340],
    [1705, 1578, 1575, 1576,    0]], dtype=int32)>

Tokenization with truncation.

>>> inputs = ["I Like to Travel a Lot", "मैं किताबें पढ़ना पसंद करता हूं"]
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     sequence_length=5)
>>> tokenizer(inputs)
<tf.Tensor: shape=(2, 5), dtype=int32,
    numpy=array([[ 105,   32,  108,  105,  107],
   [2350, 2376, 2306,   32, 2325]], dtype=int32)>

Tokenization with vocabulary_size.

>>> latin_ext_cutoff = 592
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     vocabulary_size=latin_ext_cutoff)
>>> tokenizer("¿Cómo estás?")
<tf.Tensor: shape=(12,), dtype=int32,
numpy=array([191,  99, 243, 109, 111,  32, 101, 115, 116, 225, 115,  63],
dtype=int32)>
>>> tokenizer("आप कैसे हैं")
<tf.Tensor: shape=(11,), dtype=int32,
numpy=array([591, 591,  32, 591, 591, 591, 591,  32, 591, 591, 591],
dtype=int32)>

Detokenization.

>>> inputs = tf.constant([110, 105, 110, 106,  97], dtype=tf.int32)
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer()
>>> tokenizer.detokenize(inputs)
<tf.Tensor: shape=(), dtype=string, numpy=b'ninja'>

Detokenization with padding.

>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     sequence_length=7)
>>> dataset = tf.data.Dataset.from_tensor_slices(["a b c", "b c", "a"])
>>> dataset = dataset.map(tokenizer)
>>> dataset.take(1).get_single_element()
<tf.Tensor: shape=(7,), dtype=int32,
    numpy=array([97, 32, 98, 32, 99,  0,  0], dtype=int32)>
>>> detokenized = dataset.map(tokenizer.detokenize)
>>> detokenized.take(1).get_single_element()
<tf.Tensor: shape=(), dtype=string, numpy=b'a b c'>

Detokenization with invalid bytes.
>>> inputs = tf.constant([110, 105, 10000000, 110, 106,  97], dtype=tf.int32)
>>> tokenizer = keras_nlp.tokenizers.UnicodeCharacterTokenizer(
...     errors="replace", replacement_char=88)
>>> tokenizer.detokenize(inputs).numpy().decode("utf-8")
'niXnja'


[source]

tokenize method

UnicodeCharacterTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

[source]

detokenize method

UnicodeCharacterTokenizer.detokenize(inputs)

Transform tokens back into strings.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

[source]

get_vocabulary method

UnicodeCharacterTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string terms.


[source]

vocabulary_size method

UnicodeCharacterTokenizer.vocabulary_size()

Get the size of the tokenizer vocabulary. None implies no vocabulary size was provided.


[source]

token_to_id method

UnicodeCharacterTokenizer.token_to_id(token: str)

Convert a string token to an integer id.


[source]

id_to_token method

UnicodeCharacterTokenizer.id_to_token(id: int)

Convert an integer id to a string token.
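Because this tokenizer is vocabulary-free, the two conversions above are effectively codepoint lookups. A plain-Python sketch of the mapping (an assumption about the implementation, mirroring the builtins ord and chr rather than calling the tokenizer):

```python
# For a codepoint tokenizer, token_to_id behaves like ord() and
# id_to_token behaves like chr(): each character maps to its codepoint.
def token_to_id(token: str) -> int:
    return ord(token)

def id_to_token(id: int) -> str:
    return chr(id)

print(token_to_id("a"))   # 97
print(id_to_token(2346))  # 'प'
```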