
WordPieceTokenizer

WordPieceTokenizer class

keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=None,
    sequence_length: int = None,
    lowercase: bool = True,
    strip_accents: bool = True,
    split: bool = True,
    split_pattern: str = None,
    keep_pattern: str = None,
    suffix_indicator: str = "##",
    oov_token: str = "[UNK]",
    **kwargs
)

A word piece tokenizer layer.

This layer provides an efficient, in-graph implementation of the WordPiece algorithm used by BERT and other models.

To make this layer more useful out of the box, the layer will pre-tokenize the input, which will optionally lower-case, strip accents, and split the input on whitespace and punctuation. These pre-tokenization steps are not reversible: the detokenize method will join words with a space, and will not invert tokenize exactly.

If custom pre-tokenization is desired, the layer can be configured to apply only the strict WordPiece algorithm by passing lowercase=False, strip_accents=False and split=False. In this case, inputs should be pre-split string tensors or ragged tensors.
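The strict WordPiece step itself is a greedy longest-match-first lookup. The following is a plain-Python sketch of that matching loop, not the keras_nlp implementation (which runs in-graph); the function name `wordpiece_split` is illustrative:

```python
def wordpiece_split(word, vocab, suffix_indicator="##", oov_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest substring first, shrinking until a vocab match.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                # Non-initial pieces carry the suffix indicator.
                candidate = suffix_indicator + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches: the whole word maps to the OOV token.
            return [oov_token]
        pieces.append(piece)
        start = end
    return pieces

vocab = {"[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."}
print(wordpiece_split("quick", vocab))   # ['qu', '##ick']
print(wordpiece_split("jumped", vocab))  # ['[UNK]']
```

This mirrors the doctest examples below: "quick" becomes `qu` plus the suffix piece `##ick`, and a word with no matching pieces collapses to the OOV token.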

Tokenizer outputs can either be padded and truncated with a sequence_length argument, or left un-truncated. The exact output will depend on the rank of the input tensors.

If input is a batch of strings (rank > 0): By default, the layer will output a tf.RaggedTensor where the last dimension of the output is ragged. If sequence_length is set, the layer will output a dense tf.Tensor where all inputs have been padded or truncated to sequence_length.

If input is a scalar string (rank == 0): By default, the layer will output a dense tf.Tensor with static shape [None]. If sequence_length is set, the output will be a dense tf.Tensor of shape [sequence_length].
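The effect of sequence_length on each row can be sketched in plain Python (a pad id of 0 is assumed here, matching the dense example below):

```python
def pad_or_trim(row, sequence_length, pad_id=0):
    """Truncate or right-pad a token-id row to exactly sequence_length."""
    row = row[:sequence_length]                           # truncate long rows
    return row + [pad_id] * (sequence_length - len(row))  # pad short rows

print(pad_or_trim([1, 2, 3, 4, 5, 6, 7], 10))  # [1, 2, 3, 4, 5, 6, 7, 0, 0, 0]
print(pad_or_trim([1, 2, 3, 4, 5, 6, 7], 4))   # [1, 2, 3, 4]
```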

The output dtype can be controlled via the dtype argument, which should be either an integer or string type.

Arguments

  • vocabulary: A list of strings or a string filename path. If passing a list, each element of the list should be a single word piece token string. If passing a filename, the file should be a plain text file containing a single word piece token per line.
  • sequence_length: If set, the output will be converted to a dense tensor and padded/trimmed so that all outputs are exactly sequence_length tokens long.
  • lowercase: If true, the input text will be lowercased before tokenization.
  • strip_accents: If true, all accent marks will be removed from text before tokenization.
  • split: If true, input will be split according to split_pattern and keep_pattern. If false, input should be split before calling the layer.
  • split_pattern: A regex pattern to match delimiters to split. By default, all whitespace and punctuation marks will be split on.
  • keep_pattern: A regex pattern matching delimiters from split_pattern that should be kept as independent tokens. By default, all punctuation marks will be kept as tokens.
  • suffix_indicator: The characters prepended to a wordpiece to indicate that it is a suffix to another subword.
  • oov_token: The string value to substitute for an unknown token. It must be included in the vocab.
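When vocabulary is a filename, the expected file format is plain text with one word piece token per line. A sketch of writing and reading such a file (the file name "vocab.txt" is illustrative):

```python
import os
import tempfile

vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]

# Write the vocabulary as plain text, one token per line.
path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("\n".join(vocab))

# Reading it back recovers the same token list, in the same order.
with open(path, encoding="utf-8") as f:
    loaded = [line.rstrip("\n") for line in f]
print(loaded)
```

The tokenizer could then be constructed from the path directly, e.g. `WordPieceTokenizer(vocabulary=path)`, rather than from an in-memory list.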

Examples

Ragged outputs.

>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = ["The quick brown fox."]
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
>>> tokenizer(inputs)
<tf.RaggedTensor [[1, 2, 3, 4, 5, 6, 7]]>

Dense outputs.

>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = ["The quick brown fox."]
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
...     vocabulary=vocab, sequence_length=10)
>>> tokenizer(inputs)
<tf.Tensor: shape=(1, 10), dtype=int32, numpy=
array([[1, 2, 3, 4, 5, 6, 7, 0, 0, 0]], dtype=int32)>

String output.

>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = ["The quick brown fox."]
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
...     vocabulary=vocab, dtype="string")
>>> tokenizer(inputs)
<tf.RaggedTensor [[b'the', b'qu', b'##ick', b'br', b'##own', b'fox', b'.']]>

Detokenization.

>>> vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]
>>> inputs = "The quick brown fox."
>>> tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab)
>>> tokenizer.detokenize(tokenizer.tokenize(inputs)).numpy().decode('utf-8')
'the quick brown fox .'

Custom splitting.

>>> vocab = ["[UNK]", "fox", ","]
>>> inputs = ["fox,,fox,fox"]
>>> keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab,
...     split_pattern=",", keep_pattern=",", dtype='string')(inputs)
<tf.RaggedTensor [[b'fox', b',', b',', b'fox', b',', b'fox']]>
>>> keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=vocab,
...     split_pattern=",", keep_pattern="", dtype='string')(inputs)
<tf.RaggedTensor [[b'fox', b'fox', b'fox']]>

tokenize method

WordPieceTokenizer.tokenize(inputs, *args, **kwargs)

Transform input tensors of strings into output tokens.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

detokenize method

WordPieceTokenizer.detokenize(inputs, *args, **kwargs)

Transform tokens back into strings.

Arguments

  • inputs: Input tensor, or dict/list/tuple of input tensors.
  • *args: Additional positional arguments.
  • **kwargs: Additional keyword arguments.

get_vocabulary method

WordPieceTokenizer.get_vocabulary()

Get the tokenizer vocabulary as a list of string tokens.


vocabulary_size method

WordPieceTokenizer.vocabulary_size()

Get the size of the tokenizer vocabulary.


token_to_id method

WordPieceTokenizer.token_to_id(token: str)

Convert a string token to an integer id.


id_to_token method

WordPieceTokenizer.id_to_token(id: int)

Convert an integer id to a string token.
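The token/id mapping these two methods expose follows from vocabulary order: a token's integer id is its index in the vocabulary list. A plain-Python sketch of that correspondence (assuming, as in the examples above, that ids are vocabulary indices):

```python
vocab = ["[UNK]", "the", "qu", "##ick", "br", "##own", "fox", "."]

# token -> id: position of the token in the vocabulary list.
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# id -> token: plain list indexing.
print(token_to_id["fox"])  # 6
print(vocab[6])            # 'fox'
```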