BytePairTokenizer class

keras_hub.tokenizers.BytePairTokenizer(
vocabulary=None,
merges=None,
sequence_length=None,
add_prefix_space=False,
unsplittable_tokens=None,
dtype="int32",
**kwargs
)
Byte-pair encoding tokenizer layer.
This BPE tokenizer provides the same functionality as the official GPT-2
tokenizer. Given the same vocabulary, which maps tokens to ids, and the same
merges, which describe BPE merge rules, it should provide the same output
as the OpenAI implementation (https://github.com/openai/gpt-2/blob/master/src/encoder.py).
Unlike the OpenAI implementation, this implementation is graph-compatible, so you can
use it within a tf.data pipeline.
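For intuition, the core byte-pair merge loop can be sketched in plain Python. This is a simplified illustration of the BPE algorithm, not the library's graph implementation; `bpe_merge` is a hypothetical helper name:

```python
def bpe_merge(word, merges):
    # merges is a list of "a b" strings; earlier rules have higher priority.
    ranks = {tuple(rule.split()): i for i, rule in enumerate(merges)}
    tokens = list(word)  # start from individual characters
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) merge rank;
        # ties break toward the leftmost pair.
        best = min(
            (ranks.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
        )
        rank, i = best
        if rank == float("inf"):
            break  # no applicable merge rule remains
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens

# Using the toy vocabulary and merge rules from the examples below:
vocab = {"butter": 1, "fly": 2}
merges = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
tokens = bpe_merge("butterfly", merges)  # ["butter", "fly"]
ids = [vocab[t] for t in tokens]         # [1, 2]
```

The real layer additionally performs byte-level preprocessing and word-level splitting before applying merges, but the merge loop above is the heart of the algorithm.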
If input is a batch of strings (rank > 0):
By default, the layer will output a tf.RaggedTensor where the last
dimension of the output is ragged. If sequence_length is set, the layer
will output a dense tf.Tensor where all inputs have been padded or
truncated to sequence_length.
If input is a scalar string (rank == 0):
By default, the layer will output a dense tf.Tensor with static shape
[None]. If sequence_length is set, the output will be
a dense tf.Tensor of shape [sequence_length].
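The sequence_length behavior can be sketched in plain Python. This is an illustrative helper (not part of the API), assuming a pad id of 0 as in the examples below:

```python
def pad_or_truncate(ids, sequence_length, pad_id=0):
    # Truncate sequences longer than sequence_length...
    ids = ids[:sequence_length]
    # ...and right-pad shorter sequences with pad_id.
    return ids + [pad_id] * (sequence_length - len(ids))

print(pad_or_truncate([1, 2, 3], 2))  # [1, 2]
print(pad_or_truncate([1], 2))        # [1, 0]
```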
Arguments
vocabulary: string or dict, maps tokens to integer ids. If it is a string, it should be the file path to a vocabulary file. Defaults to None.
merges: string or list, contains the merge rules. If it is a string, it should be the file path to a merge rules file, with one merge rule per line. Defaults to None.
sequence_length: int. If set, the output will be padded or truncated to sequence_length. Defaults to None.
add_prefix_space: bool. Whether to add an initial space to the input. This tokenizer is whitespace-aware, and will tokenize a word with a leading space differently. Adding a prefix space to the first word will cause it to be tokenized equivalently to all subsequent words in the sequence. Defaults to False.
unsplittable_tokens: list. A list of strings that will never be split during the word-level splitting applied before byte-pair encoding. This can be used to ensure special tokens map to unique indices in the vocabulary. Defaults to None.
dtype: the output dtype of the layer. Defaults to "int32".
Examples
Tokenize
>>> vocab = {"butter": 1, "fly": 2}
>>> merge = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
>>> tokenizer = keras_hub.tokenizers.BytePairTokenizer(vocab, merge)
>>> outputs = tokenizer("butterfly")
>>> np.array(outputs)
array([1, 2], dtype=int32)
>>> seq1, seq2 = tokenizer(["butterfly", "butter"])
>>> np.array(seq1)
array([1, 2])
>>> np.array(seq2)
array([1])
>>> tokenizer = keras_hub.tokenizers.BytePairTokenizer(
... vocab, merge, sequence_length=2)
>>> seq1, seq2 = tokenizer(["butterfly", "butter"])
>>> np.array(seq1)
array([1, 2], dtype=int32)
>>> np.array(seq2)
array([1, 0], dtype=int32)
Detokenize
>>> vocab = {"butter": 1, "fly": 2}
>>> merge = ["b u", "t t", "e r", "bu tt", "butt er", "f l", "fl y"]
>>> tokenizer = keras_hub.tokenizers.BytePairTokenizer(vocab, merge)
>>> tokenizer.detokenize([[1, 2]])
['butterfly']
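Conceptually, detokenization maps each id back to its token string and joins the results. A minimal pure-Python sketch under that assumption (the real layer also handles byte-level decoding and tensor inputs; `detokenize_ids` is a hypothetical helper name):

```python
vocab = {"butter": 1, "fly": 2}

# Invert the vocabulary: integer id -> string token.
id_to_token = {i: t for t, i in vocab.items()}

def detokenize_ids(ids):
    # Join the token strings for a single sequence of ids.
    return "".join(id_to_token[i] for i in ids)

print(detokenize_ids([1, 2]))  # butterfly
```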
tokenize method

BytePairTokenizer.tokenize(inputs)

Transform input tensors of strings into output tokens.

Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
detokenize method

BytePairTokenizer.detokenize(inputs)

Transform tokens back into strings.

Arguments

inputs: Input tensor, or dict/list/tuple of input tensors.
get_vocabulary method

BytePairTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string tokens.
vocabulary_size method

BytePairTokenizer.vocabulary_size()

Get the integer size of the tokenizer vocabulary.
token_to_id method

BytePairTokenizer.token_to_id(token)

Convert a string token to an integer id.
id_to_token method

BytePairTokenizer.id_to_token(id)

Convert an integer id to a string token.