Preprocessor class


Base class for preprocessing layers.

A Preprocessor layer wraps a keras_nlp.tokenizer.Tokenizer to provide a complete preprocessing setup for a given task. For example, a masked language modeling preprocessor will take in raw input strings and output (x, y, sample_weight) tuples, where x contains token id sequences with some tokens masked, y contains the original ids of the masked tokens, and sample_weight marks the masked positions.

This class can be subclassed in the same way as any keras.layers.Layer, by defining build(), call(), and get_config() methods. All subclasses should set the tokenizer property on construction.
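The (x, y, sample_weight) contract above can be sketched in plain Python. This is a toy illustration, not KerasNLP code: the class and method names are hypothetical stand-ins for the real Tokenizer/Preprocessor APIs, and the masking logic is deliberately simplified.

```python
import random

# Toy illustration (not KerasNLP code): a minimal preprocessor-like class
# that mimics the (x, y, sample_weight) contract for masked language
# modeling. All names here are hypothetical stand-ins.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = {tok: i for i, tok in enumerate(vocab)}

    def __call__(self, text):
        return [self.vocab[tok] for tok in text.split()]

class ToyMaskedLMPreprocessor:
    MASK_ID = -1  # stand-in for a real [MASK] token id

    def __init__(self, tokenizer, mask_rate=0.5, seed=0):
        self.tokenizer = tokenizer  # subclasses set `tokenizer` on construction
        self.mask_rate = mask_rate
        self.rng = random.Random(seed)

    def __call__(self, text):
        token_ids = self.tokenizer(text)
        x, y, sample_weight = [], [], []
        for tid in token_ids:
            masked = self.rng.random() < self.mask_rate
            x.append(self.MASK_ID if masked else tid)
            y.append(tid)  # labels are the original token ids
            sample_weight.append(1.0 if masked else 0.0)  # score masked slots only
        return x, y, sample_weight

tokenizer = ToyTokenizer(["the", "cat", "sat"])
preprocessor = ToyMaskedLMPreprocessor(tokenizer, mask_rate=1.0)
x, y, sample_weight = preprocessor("the cat sat")
# With mask_rate=1.0 every position is masked:
# x = [-1, -1, -1], y = [0, 1, 2], sample_weight = [1.0, 1.0, 1.0]
```

A real preprocessor would additionally handle padding, special tokens, and batched tensor inputs, but the tuple shape it returns follows the same pattern.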


from_preset method

Preprocessor.from_preset(preset, **kwargs)

Instantiate a keras_nlp.models.Preprocessor from a model preset.

A preset is a directory of configs, weights, and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
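The four accepted preset forms can be told apart from the string alone. The following helper is hypothetical (it is not KerasNLP's actual resolution logic) and exists only to illustrate the formats listed above:

```python
# Hypothetical helper (not KerasNLP internals): classify a preset string
# into the four accepted forms.
def classify_preset(preset: str) -> str:
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "huggingface"
    if preset.startswith(".") or "/" in preset:
        return "local_directory"
    return "built_in"

classify_preset("bert_base_en")                           # built_in
classify_preset("kaggle://user/bert/keras/bert_base_en")  # kaggle
classify_preset("hf://user/bert_base_en")                 # huggingface
classify_preset("./bert_base_en")                         # local_directory
```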

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().


Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.


Examples

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)


save_to_preset method


Preprocessor.save_to_preset(preset_dir)

Save preprocessor to a preset directory.


Arguments

  • preset_dir: The path to the local model preset directory.
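save_to_preset and from_preset are symmetric: one writes a directory of asset files, the other reads it back. A rough sketch of that round trip, using a plain JSON config in a temporary directory (the file name and layout here are illustrative, not KerasNLP's actual preset format, and the helper names are hypothetical):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of the save/load round trip: a preset is a directory
# of config (and, for real models, weight) files. The file name and config
# contents below are illustrative, not KerasNLP's actual layout.
def save_to_preset(config: dict, preset_dir: str) -> None:
    path = Path(preset_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "preprocessor.json").write_text(json.dumps(config))

def from_preset(preset_dir: str) -> dict:
    return json.loads((Path(preset_dir) / "preprocessor.json").read_text())

with tempfile.TemporaryDirectory() as preset_dir:
    save_to_preset({"sequence_length": 128}, preset_dir)
    restored = from_preset(preset_dir)
    # restored == {"sequence_length": 128}
```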

tokenizer property


keras_nlp.models.Preprocessor.tokenizer

The tokenizer used to tokenize strings.