RWKVTokenizer

[source]

RWKVTokenizer class

keras_hub.tokenizers.RWKVTokenizer(
    vocabulary=None,
    dtype="int32",
    pad_token_id=0,
    start_token_id=None,
    end_token_id=None,
    **kwargs
)

RWKV byte-level tokenizer with longest-match trie search.

This tokenizer maps raw text to a sequence of integer token ids using a fixed vocabulary and a greedy longest-match algorithm.

Arguments

  • vocabulary: list of strings, one per token, each line formatted as "<id> '<token>' <byte_length>".
  • dtype: output dtype for tensor operations. Must be an integer or string type.
  • pad_token_id: int. Token id used for padding. Defaults to 0.
  • start_token_id: int or None. Optional id of the sequence start token.
  • end_token_id: int or None. Optional id of the sequence end token.
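Assuming each vocabulary line follows the `<id> '<token>' <byte_length>` pattern shown in the example entries below, a line can be parsed roughly as follows (a minimal sketch, not the library's own loader; `parse_vocab_line` is a hypothetical helper name):

```python
import ast

def parse_vocab_line(line):
    """Split a "<id> <token_repr> <byte_len>" line into (id, token_bytes)."""
    idx_str, rest = line.split(" ", 1)
    # The token repr may itself contain spaces, so split the length off the right.
    token_repr, _byte_len = rest.rsplit(" ", 1)
    token = ast.literal_eval(token_repr)  # e.g. "'the'" -> 'the', "'\\n'" -> '\n'
    if isinstance(token, str):
        token = token.encode("utf-8")  # byte-level tokenizer: store tokens as bytes
    return int(idx_str), token

# parse_vocab_line("3 'hello' 5") -> (3, b'hello')
```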

Examples

vocab = ["0 ' ' 1", "1 '\n' 1", "2 'the' 3", "3 'hello' 5"]
tok = RWKVTokenizer(vocabulary=vocab)
tok("hello the")

Output: [3, 0, 2]
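The greedy longest-match behavior shown above can be sketched as follows (an illustration only, not the keras_hub implementation; the trie here is a plain nested dict keyed by byte values):

```python
def build_trie(vocab):
    """Map each token's byte sequence to its id in a nested-dict trie."""
    trie = {}
    for token_id, token in vocab.items():
        node = trie
        for byte in token:
            node = node.setdefault(byte, {})
        node["id"] = token_id  # terminal marker holding the token id
    return trie

def tokenize(text, trie):
    """At each position, greedily emit the longest vocabulary match."""
    data = text.encode("utf-8")
    ids, pos = [], 0
    while pos < len(data):
        node, match_id, match_end = trie, None, pos
        for i in range(pos, len(data)):
            if data[i] not in node:
                break
            node = node[data[i]]
            if "id" in node:  # longer match found; remember it
                match_id, match_end = node["id"], i + 1
        if match_id is None:
            pos += 1  # no match: skip byte (a real byte-level vocab covers all bytes)
        else:
            ids.append(match_id)
            pos = match_end
    return ids

trie = build_trie({0: b" ", 2: b"the", 3: b"hello"})
tokenize("hello the", trie)  # -> [3, 0, 2]
```

Because matching is greedy, "hello" is emitted as the single id 3 rather than as shorter overlapping pieces.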


[source]

from_preset method

RWKVTokenizer.from_preset(preset, config_file="tokenizer.json", **kwargs)

Instantiate a keras_hub.models.Tokenizer from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

For any Tokenizer subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

This constructor can be called in one of two ways: either from the base class, like keras_hub.models.Tokenizer.from_preset(), or from a model class, like keras_hub.models.GemmaTokenizer.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • config_file: string. The name of the tokenizer config file inside the preset directory. Defaults to "tokenizer.json".
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

# Load a preset tokenizer.
tokenizer = keras_hub.tokenizers.Tokenizer.from_preset("bert_base_en")

# Tokenize some input.
tokenizer("The quick brown fox tripped.")

# Detokenize some input.
tokenizer.detokenize([5, 6, 7, 8, 9])
| Preset            | Parameters | Description                                                                              |
|-------------------|------------|------------------------------------------------------------------------------------------|
| rwkv7_g1a_0.1b_en | 150.00M    | 150 million parameter RWKV7 model. Optimized for edge devices and mobile deployment.      |
| rwkv7_g1a_0.3b_en | 400.00M    | 400 million parameter RWKV7 model. Small variant balancing speed and instruction following. |