
OPTPreprocessor layer

OPTPreprocessor class

keras_nlp.models.OPTPreprocessor(
    tokenizer, sequence_length=2048, add_start_token=True, add_end_token=True, **kwargs
)

OPT preprocessing layer which tokenizes and packs inputs.

This preprocessing layer does two things:

  • Tokenize the input using the tokenizer.
  • Construct a dictionary with keys "token_ids", "padding_mask", that can be passed directly to a keras_nlp.models.OPTBackbone.

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.

The call method of this layer accepts three arguments: x, y, and sample_weight. x can be a Python string or tensor representing a single segment, or a list of Python strings representing a batch of single segments. y and sample_weight are both optional, can have any format, and are passed through unaltered.
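The pass-through contract described above can be sketched in plain Python. This is an illustration only, not the layer's implementation: `preprocess` and `fake_features` are hypothetical names, and the return convention (bare x, then (x, y), then (x, y, sample_weight)) is an assumption modeled on the format keras.Model.fit accepts.

```python
def fake_features(text):
    # Hypothetical stand-in for tokenizing and packing; not the real layer.
    return {"text": text.lower()}


def preprocess(x, y=None, sample_weight=None):
    """Transform x only; y and sample_weight are returned untouched."""
    x = fake_features(x)
    if y is None:
        return x
    if sample_weight is None:
        return (x, y)
    return (x, y, sample_weight)


unlabeled = preprocess("Call me Ishmael.")
labeled = preprocess("Call me Ishmael.", 1)
weighted = preprocess("Call me Ishmael.", 1, 0.5)
```

Only the first element changes; labels and weights come back exactly as they went in, which is what lets the layer be mapped over labeled and unlabeled datasets alike.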

OPTPreprocessor forces the input to have only one segment, as OPT is mainly used for generation tasks. For tasks having multi-segment inputs like "glue/mnli", please use a model designed for classification purposes such as BERT or RoBERTa.

Arguments

  • tokenizer: A keras_nlp.models.OPTTokenizer instance.
  • sequence_length: The length of the packed inputs.
  • add_start_token: If True, the preprocessor will prepend the tokenizer start token to each input sequence.
  • add_end_token: If True, the preprocessor will append the tokenizer end token to each input sequence.
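To make the effect of these arguments concrete, here is a minimal pure-Python sketch of packing one already-tokenized segment. It is not the library's code: `pack_single_segment` is a hypothetical helper, the special token IDs (`start_id`, `end_id`, `pad_id`) are placeholder values, and the truncation rule (keep the end token when the input is too long) is an assumption.

```python
def pack_single_segment(
    token_ids,
    sequence_length=2048,
    add_start_token=True,
    add_end_token=True,
    start_id=2,  # placeholder ID, assumed for illustration
    end_id=2,    # placeholder ID, assumed for illustration
    pad_id=1,    # placeholder ID, assumed for illustration
):
    """Illustrative packing of one segment into a fixed-length row."""
    ids = list(token_ids)
    if add_start_token:
        ids = [start_id] + ids
    if add_end_token:
        # Assumption: when truncating, the last slot is kept for the end token.
        ids = ids[: sequence_length - 1] + [end_id]
    ids = ids[:sequence_length]
    n_real = len(ids)
    n_pad = sequence_length - n_real
    return {
        "token_ids": ids + [pad_id] * n_pad,
        "padding_mask": [True] * n_real + [False] * n_pad,
    }


packed = pack_single_segment([10, 11, 12], sequence_length=8)
# packed["token_ids"]    -> [2, 10, 11, 12, 2, 1, 1, 1]
# packed["padding_mask"] -> [True, True, True, True, True, False, False, False]
```

The padding mask marks real tokens (including any start and end tokens) as True and padding as False, so both outputs always have exactly sequence_length entries.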

Call arguments

  • x: A string, tf.Tensor or list of python strings.
  • y: Any label data. Will be passed through unaltered.
  • sample_weight: Any label weight data. Will be passed through unaltered.
  • sequence_length: Pass to override the configured sequence_length of the layer.

Examples

Directly calling the layer on data.

preprocessor = keras_nlp.models.OPTPreprocessor.from_preset("opt_125m_en")

# Tokenize and pack a single sentence.
preprocessor("The quick brown fox jumped.")

# Tokenize a batch of single sentences.
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])

# Custom vocabulary.
features = ["a quick fox.", "a fox quick."]
# OPT uses "</s>" as its start/end token and "<pad>" for padding, so a
# custom vocabulary should contain both.
vocab = {"<pad>": 1, "</s>": 2, "a": 4, "Ġquick": 5, "Ġfox": 6}
merges = ["Ġ q", "u i", "c k", "ui ck", "Ġq uick"]
merges += ["Ġ f", "o x", "Ġf ox"]
tokenizer = keras_nlp.models.OPTTokenizer(
    vocabulary=vocab,
    merges=merges,
)
preprocessor = keras_nlp.models.OPTPreprocessor(tokenizer=tokenizer)
# Preprocess the features defined above with the custom tokenizer.
preprocessor(features)

Mapping with tf.data.Dataset.

preprocessor = keras_nlp.models.OPTPreprocessor.from_preset("opt_125m_en")

text = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
label = tf.constant([1, 1])

# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((text, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(text)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

from_preset method

OPTPreprocessor.from_preset(preset, **kwargs)

Instantiate a keras_nlp.models.Preprocessor from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'
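The four forms above differ only in how the string is written. The sketch below classifies a preset string by its prefix; `describe_preset` is a hypothetical helper written for illustration, and its rules (URI-style prefixes, path-like prefixes, everything else treated as a built-in identifier) are assumptions rather than the library's actual resolution logic.

```python
def describe_preset(preset: str) -> str:
    """Guess which of the four preset forms a string uses (illustrative)."""
    if preset.startswith("kaggle://"):
        return "kaggle_handle"
    if preset.startswith("hf://"):
        return "hugging_face_handle"
    # Assumption: local directories are written as explicit paths.
    if preset.startswith(("./", "../", "/")):
        return "local_path"
    return "built_in"


kinds = [
    describe_preset(p)
    for p in (
        "opt_125m_en",
        "kaggle://user/bert/keras/bert_base_en",
        "hf://user/bert_base_en",
        "./bert_base_en",
    )
]
# kinds -> ["built_in", "kaggle_handle", "hugging_face_handle", "local_path"]
```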

For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.

Examples

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)

Preset name   Parameters   Description
opt_125m_en   125.24M      12-layer OPT model where case is maintained. Trained on BookCorpus, CommonCrawl, Pile, and PushShift.io corpora.
opt_1.3b_en   1.32B        24-layer OPT model where case is maintained. Trained on BookCorpus, CommonCrawl, Pile, and PushShift.io corpora.
opt_2.7b_en   2.70B        32-layer OPT model where case is maintained. Trained on BookCorpus, CommonCrawl, Pile, and PushShift.io corpora.
opt_6.7b_en   6.70B        32-layer OPT model where case is maintained. Trained on BookCorpus, CommonCrawl, Pile, and PushShift.io corpora.

tokenizer property

keras_nlp.models.OPTPreprocessor.tokenizer

The tokenizer used to tokenize strings.