UnicodeCodepointTokenizer class
keras_hub.tokenizers.UnicodeCodepointTokenizer(
    sequence_length=None,
    lowercase=True,
    normalization_form=None,
    errors="replace",
    replacement_char=65533,
    input_encoding="UTF-8",
    output_encoding="UTF-8",
    vocabulary_size=None,
    dtype="int32",
    **kwargs
)
A Unicode character tokenizer layer.
This is a vocabulary-free tokenizer that tokenizes text into Unicode character codepoints.
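As a rough illustration of what tokenizing into codepoints means, the plain-Python sketch below maps each character to its integer codepoint with the built-in ord. It is not the layer's implementation, which operates on tensors; the helper name codepoints is hypothetical, and str.lower() is only a stand-in for the layer's lowercasing.

```python
def codepoints(text, lowercase=True):
    """Return the Unicode codepoints of `text` as a list of ints.

    Hypothetical helper: a plain-Python stand-in for the layer,
    for illustration only.
    """
    if lowercase:
        # Stand-in for the layer's default lowercase=True behavior.
        text = text.lower()
    return [ord(ch) for ch in text]


ids = codepoints("Book")
print(ids)                           # [98, 111, 111, 107]
print("".join(chr(i) for i in ids))  # book
```

Because every id is the codepoint itself, no vocabulary file is needed to map ids back to characters.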
Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.
If input is a batch of strings (rank > 0):
By default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.
If input is a scalar string (rank == 0):
By default, the layer will output a dense tf.Tensor with static shape [None]. If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length].
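The effect of sequence_length can be sketched in plain Python with lists standing in for tensors. This is an assumption-laden illustration, not the layer's code: pad_or_truncate is a hypothetical helper, and the padding id of 0 is taken from the dense outputs shown in the examples on this page.

```python
def pad_or_truncate(ids, sequence_length=None, pad_id=0):
    """Pad with `pad_id` or truncate so the result has `sequence_length` items.

    Hypothetical helper. With sequence_length=None the row is returned
    unchanged, which corresponds to the ragged (variable-length) case.
    """
    if sequence_length is None:
        return list(ids)
    clipped = list(ids[:sequence_length])
    return clipped + [pad_id] * (sequence_length - len(clipped))


batch = [[98, 111, 111, 107], [2346, 2369, 2360, 2381, 2340, 2325]]

# No sequence_length: rows keep their own lengths (ragged).
ragged = [pad_or_truncate(row) for row in batch]
# sequence_length=5: every row has exactly five ids (dense).
dense = [pad_or_truncate(row, sequence_length=5) for row in batch]

print(ragged)  # [[98, 111, 111, 107], [2346, 2369, 2360, 2381, 2340, 2325]]
print(dense)   # [[98, 111, 111, 107, 0], [2346, 2369, 2360, 2381, 2340]]
```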
The output dtype can be controlled via the dtype argument, which should be an integer type ("int16", "int32", etc.).
Arguments
True
, the input text will be first lowered before
tokenization.detokenize()
behavior when an invalid codepoint is encountered.
The value of 'strict'
will cause the tokenizer to produce a
InvalidArgument
error on any invalid input formatting. A value of
'replace'
will cause the tokenizer to replace any invalid
formatting in the input with the replacement_char codepoint.
A value of 'ignore'
will cause the tokenizer to skip any invalid
formatting in the input and produce no corresponding output
character.65533
. Defaults to 65533
."UTF-8"
."UTF-8"
.vocabulary_size
,
by clamping all codepoints to the range [0, vocabulary_size).
Effectively this will make the vocabulary_size - 1
id the
the OOV value.Examples
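Unicode normalization (plain-Python illustration).
The normalization_form argument matters because the same visible text can be stored as different codepoint sequences. The sketch below uses the standard unicodedata module rather than the layer itself; normalized_codepoints is a hypothetical helper.

```python
import unicodedata


def normalized_codepoints(text, normalization_form=None):
    """Return codepoints of `text`, optionally normalizing it first.

    Hypothetical helper for illustration; not part of keras_hub.
    """
    if normalization_form is not None:
        text = unicodedata.normalize(normalization_form, text)
    return [ord(ch) for ch in text]


composed = "\u00e9"  # "é" as one precomposed character
print(normalized_codepoints(composed))          # [233]
print(normalized_codepoints(composed, "NFD"))   # [101, 769]
print(normalized_codepoints("e\u0301", "NFC"))  # [233]
```

Choosing one form up front means the composed and decomposed spellings of a character produce the same token ids.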
Basic Usage.
>>> inputs = "Unicode Tokenizer"
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([117, 110, 105,  99, 111, 100, 101,  32, 116, 111, 107, 101, 110,
       105, 122, 101, 114], dtype=int32)
Ragged outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325])
>>> np.array(seq2)
array([1705, 1578, 1575, 1576])
Dense outputs.
>>> inputs = ["पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=8)
>>> seq1, seq2 = tokenizer(inputs)
>>> np.array(seq1)
array([2346, 2369, 2360, 2381, 2340, 2325,    0,    0], dtype=int32)
>>> np.array(seq2)
array([1705, 1578, 1575, 1576,    0,    0,    0,    0], dtype=int32)
Tokenize, then batch for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
                  [2346, 2369, 2360, 2381, 2340, 2325],
                  [1705, 1578, 1575, 1576]]>
Batch, then tokenize for ragged outputs.
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.RaggedTensor [[98, 111, 111, 107],
                  [2346, 2369, 2360, 2381, 2340, 2325],
                  [1705, 1578, 1575, 1576]]>
Tokenize, then batch for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.map(tokenizer)
>>> ds = ds.apply(tf.data.experimental.dense_to_ragged_batch(3))
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[  98,  111,  111,  107,    0],
       [2346, 2369, 2360, 2381, 2340],
       [1705, 1578, 1575, 1576,    0]], dtype=int32)>
Batch, then tokenize for dense outputs (sequence_length provided).
>>> inputs = ["Book", "पुस्तक", "کتاب"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=5)
>>> ds = tf.data.Dataset.from_tensor_slices(inputs)
>>> ds = ds.batch(3).map(tokenizer)
>>> ds.take(1).get_single_element()
<tf.Tensor: shape=(3, 5), dtype=int32, numpy=
array([[  98,  111,  111,  107,    0],
       [2346, 2369, 2360, 2381, 2340],
       [1705, 1578, 1575, 1576,    0]], dtype=int32)>
Tokenization with truncation.
>>> inputs = ["I Like to Travel a Lot", "मैं किताबें पढ़ना पसंद करता हूं"]
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=5)
>>> outputs = tokenizer(inputs)
>>> np.array(outputs)
array([[ 105,   32,  108,  105,  107],
       [2350, 2376, 2306,   32, 2325]], dtype=int32)
Tokenization with vocabulary_size.
>>> latin_ext_cutoff = 592
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     vocabulary_size=latin_ext_cutoff)
>>> outputs = tokenizer("¿Cómo estás?")
>>> np.array(outputs)
array([191,  99, 243, 109, 111,  32, 101, 115, 116, 225, 115,  63],
      dtype=int32)
>>> outputs = tokenizer("आप कैसे हैं")
>>> np.array(outputs)
array([591, 591,  32, 591, 591, 591, 591,  32, 591, 591, 591],
      dtype=int32)
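The clamping behind vocabulary_size can be sketched in plain Python: any codepoint at or above vocabulary_size collapses onto the last id, vocabulary_size - 1. This is a simplified stand-in under that assumption, not the layer's tensor code; clamp_codepoints is a hypothetical helper.

```python
def clamp_codepoints(text, vocabulary_size=None, lowercase=True):
    """Tokenize `text` into codepoints and clamp them to [0, vocabulary_size).

    Hypothetical helper for illustration; not part of keras_hub.
    """
    if lowercase:
        text = text.lower()
    ids = [ord(ch) for ch in text]
    if vocabulary_size is not None:
        # Ids at or above the cutoff all map to the last id, which
        # therefore acts as the out-of-vocabulary (OOV) id.
        ids = [min(i, vocabulary_size - 1) for i in ids]
    return ids


latin_ext_cutoff = 592
print(clamp_codepoints("Book", vocabulary_size=latin_ext_cutoff))
# [98, 111, 111, 107]
print(clamp_codepoints("\u0906\u092a", vocabulary_size=latin_ext_cutoff))
# [591, 591]
```

Note that clamping is lossy: once two characters share the OOV id, detokenization cannot tell them apart.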
Detokenization.
>>> inputs = tf.constant([110, 105, 110, 106, 97], dtype="int32")
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer()
>>> tokenizer.detokenize(inputs)
'ninja'
Detokenization with padding.
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     sequence_length=7)
>>> dataset = tf.data.Dataset.from_tensor_slices(["a b c", "b c", "a"])
>>> dataset = dataset.map(tokenizer)
>>> dataset.take(1).get_single_element()
<tf.Tensor: shape=(7,), dtype=int32,
    numpy=array([97, 32, 98, 32, 99,  0,  0], dtype=int32)>
>>> detokunbatched = dataset.map(tokenizer.detokenize)
>>> detokunbatched.take(1).get_single_element()
<tf.Tensor: shape=(), dtype=string, numpy=b'a b c'>
Detokenization with invalid bytes.
>>> inputs = tf.constant([110, 105, 10000000, 110, 106, 97])
>>> tokenizer = keras_hub.tokenizers.UnicodeCodepointTokenizer(
...     errors="replace", replacement_char=88)
>>> tokenizer.detokenize(inputs)
'niXnja'
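The three errors modes, together with the padding removal seen in the previous example, can be approximated in plain Python. This is a sketch under stated assumptions rather than the layer's implementation: detokenize_ids and is_valid are hypothetical helpers, padding ids of 0 are assumed to be dropped, and a ValueError stands in for the InvalidArgument error.

```python
MAX_CODEPOINT = 0x10FFFF  # largest valid Unicode codepoint


def is_valid(cp):
    """True if `cp` is a Unicode scalar value (surrogates are excluded)."""
    return 0 <= cp <= MAX_CODEPOINT and not (0xD800 <= cp <= 0xDFFF)


def detokenize_ids(ids, errors="replace", replacement_char=65533, pad_id=0):
    """Turn codepoint ids back into a string (hypothetical helper)."""
    chars = []
    for cp in ids:
        if cp == pad_id:
            continue  # assumption: padding ids are dropped
        if is_valid(cp):
            chars.append(chr(cp))
        elif errors == "strict":
            raise ValueError(f"invalid codepoint: {cp}")
        elif errors == "replace":
            chars.append(chr(replacement_char))
        elif errors == "ignore":
            continue  # produce no output character
        else:
            raise ValueError(f"unknown errors mode: {errors!r}")
    return "".join(chars)


ids = [110, 105, 10000000, 110, 106, 97]
print(detokenize_ids(ids, errors="replace", replacement_char=88))  # niXnja
print(detokenize_ids(ids, errors="ignore"))                        # ninja
print(detokenize_ids([97, 32, 98, 32, 99, 0, 0]))                  # a b c
```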
tokenize method
UnicodeCodepointTokenizer.tokenize(inputs)
Transform input tensors of strings into output tokens.
Arguments
inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method
UnicodeCodepointTokenizer.detokenize(inputs)
Transform tokens back into strings.
Arguments
inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method
UnicodeCodepointTokenizer.get_vocabulary()
Get the tokenizer vocabulary as a list of string terms.
vocabulary_size method
UnicodeCodepointTokenizer.vocabulary_size()
Get the size of the tokenizer vocabulary. None implies that no vocabulary size was provided.
token_to_id method
UnicodeCodepointTokenizer.token_to_id(token)
Convert a string token to an integer id.
id_to_token method
UnicodeCodepointTokenizer.id_to_token(id)
Convert an integer id to a string token.
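For a codepoint tokenizer, the token/id mapping is conceptually the same as Python's built-in ord and chr. The sketch below assumes that correspondence for single-character tokens and omits any range checks the library may perform; the two functions are hypothetical stand-ins, not the methods themselves.

```python
def token_to_id(token):
    """Map a single-character token to its codepoint (hypothetical stand-in)."""
    return ord(token)


def id_to_token(id_):
    """Map a codepoint id back to its character (hypothetical stand-in)."""
    return chr(id_)


print(token_to_id("a"))                     # 97
print(id_to_token(97))                      # a
print(id_to_token(token_to_id("\u0915")))   # round trip returns the same character
```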