ElectraPreprocessor layer

ElectraPreprocessor class

keras_nlp.models.ElectraPreprocessor(
    tokenizer, sequence_length=512, truncate="round_robin", **kwargs
)

An ELECTRA preprocessing layer which tokenizes and packs inputs.

This preprocessing layer will do three things:

  1. Tokenize any number of input segments using the tokenizer.
  2. Pack the inputs together using a keras_nlp.layers.MultiSegmentPacker with the appropriate "[CLS]", "[SEP]" and "[PAD]" tokens.
  3. Construct a dictionary with keys "token_ids" and "padding_mask" that can be passed directly to an ELECTRA model.

This layer can be used directly with tf.data.Dataset.map to preprocess string data in the (x, y, sample_weight) format used by keras.Model.fit.
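
The three steps listed above can be illustrated with a small pure-Python sketch. This is not the layer's implementation: the whitespace `toy_tokenize` function, the `pack_segments` helper, and the tiny vocabulary are hypothetical stand-ins (the real tokenizer is an ElectraTokenizer and the real packing is done by MultiSegmentPacker), and the sketch assumes the segments already fit within the sequence length.

```python
# Hypothetical vocabulary; ids are simply list positions.
VOCAB = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]", "the", "quick", "brown", "fox"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def toy_tokenize(text):
    """Step 1 (simplified): split on whitespace and look up ids, using [UNK] as fallback."""
    unk_id = TOKEN_TO_ID["[UNK]"]
    return [TOKEN_TO_ID.get(word, unk_id) for word in text.split()]


def pack_segments(segments, sequence_length):
    """Steps 2 and 3 (simplified): [CLS] seg_a [SEP] seg_b [SEP] ..., then pad."""
    token_ids = [TOKEN_TO_ID["[CLS]"]]
    for segment in segments:
        token_ids.extend(segment)
        token_ids.append(TOKEN_TO_ID["[SEP]"])
    # Assumption of this sketch: no truncation, so the input must already fit.
    if len(token_ids) > sequence_length:
        raise ValueError("segments do not fit; truncate them first")
    num_pad = sequence_length - len(token_ids)
    padding_mask = [True] * len(token_ids) + [False] * num_pad
    token_ids = token_ids + [TOKEN_TO_ID["[PAD]"]] * num_pad
    return {"token_ids": token_ids, "padding_mask": padding_mask}


segments = [toy_tokenize("the quick fox"), toy_tokenize("the whale")]
features = pack_segments(segments, sequence_length=10)
print(features["token_ids"])     # [1, 5, 6, 8, 2, 5, 0, 2, 3, 3]
print(features["padding_mask"])  # eight True values followed by two False values
```

The padding mask is True for every real token, including the special "[CLS]" and "[SEP]" tokens, and False only for "[PAD]" positions.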

Arguments

  • tokenizer: A keras_nlp.models.ElectraTokenizer instance.
  • sequence_length: The length of the packed inputs.
  • truncate: string. The algorithm to truncate a list of batched segments to fit within sequence_length. The value can be either "round_robin" or "waterfall":
      - "round_robin": Available space is assigned one token at a time in a round-robin fashion to the inputs that still need some, until the limit is reached.
      - "waterfall": The allocation of the budget is done using a "waterfall" algorithm that allocates quota in a left-to-right manner and fills up the buckets until we run out of budget. It supports an arbitrary number of segments.
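
To make the difference between the two truncation strategies concrete, here is a minimal sketch of how a token budget might be split across segments. The function names `round_robin_budget` and `waterfall_budget` are hypothetical, and the sketch assumes `budget` is the number of slots left after the special tokens have been reserved; it illustrates the description above and is not the library's code.

```python
def round_robin_budget(lengths, budget):
    """Give one token at a time to each segment that still needs some."""
    allocation = [0] * len(lengths)
    remaining = budget
    while remaining > 0 and any(a < n for a, n in zip(allocation, lengths)):
        for i, length in enumerate(lengths):
            if remaining == 0:
                break
            if allocation[i] < length:
                allocation[i] += 1
                remaining -= 1
    return allocation


def waterfall_budget(lengths, budget):
    """Fill segments left to right until the budget runs out."""
    allocation = []
    remaining = budget
    for length in lengths:
        take = min(length, remaining)
        allocation.append(take)
        remaining -= take
    return allocation


lengths = [6, 2, 4]  # token counts of three segments
print(round_robin_budget(lengths, budget=8))  # [3, 2, 3]
print(waterfall_budget(lengths, budget=8))    # [6, 2, 0]
```

With round-robin, every segment keeps some tokens; with waterfall, earlier segments are kept whole and later ones may be dropped entirely.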

Call arguments

  • x: A tensor of single string sequences, or a tuple of multiple tensor sequences to be packed together. Inputs may be batched or unbatched. For single sequences, raw python inputs will be converted to tensors. For multiple sequences, pass tensors directly.
  • y: Any label data. Will be passed through unaltered.
  • sample_weight: Any label weight data. Will be passed through unaltered.
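
The pass-through behavior of y and sample_weight can be sketched as follows. The `pack_outputs` helper is a hypothetical illustration of the (x, y, sample_weight) convention, not the layer's actual code: only the features are transformed, and the label and weight are returned exactly as given.

```python
def pack_outputs(features, y=None, sample_weight=None):
    """Return features alone, or a tuple that carries y and sample_weight unchanged."""
    if y is None:
        return features
    if sample_weight is None:
        return (features, y)
    return (features, y, sample_weight)


# Placeholder features standing in for the preprocessed dictionary.
features = {"token_ids": [1, 5, 2, 3], "padding_mask": [True, True, True, False]}

print(pack_outputs(features))               # just the dictionary
print(pack_outputs(features, y=1))          # (features, 1)
print(pack_outputs(features, 1, 0.5))       # (features, 1, 0.5)
```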

Examples

Directly calling the layer on data.

import keras_nlp

preprocessor = keras_nlp.models.ElectraPreprocessor.from_preset(
    "electra_base_discriminator_uncased_en"
)
preprocessor(["The quick brown fox jumped.", "Call me Ishmael."])

# Custom vocabulary.
vocab = ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]
vocab += ["The", "quick", "brown", "fox", "jumped", "."]
tokenizer = keras_nlp.models.ElectraTokenizer(vocabulary=vocab)
preprocessor = keras_nlp.models.ElectraPreprocessor(tokenizer)
preprocessor("The quick brown fox jumped.")

Mapping with tf.data.Dataset.

import keras_nlp
import tensorflow as tf

preprocessor = keras_nlp.models.ElectraPreprocessor.from_preset(
    "electra_base_discriminator_uncased_en"
)

first = tf.constant(["The quick brown fox jumped.", "Call me Ishmael."])
second = tf.constant(["The fox tripped.", "Oh look, a whale."])
label = tf.constant([1, 1])
# Map labeled single sentences.
ds = tf.data.Dataset.from_tensor_slices((first, label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled single sentences.
ds = tf.data.Dataset.from_tensor_slices(first)
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map labeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices(((first, second), label))
ds = ds.map(preprocessor, num_parallel_calls=tf.data.AUTOTUNE)

# Map unlabeled sentence pairs.
ds = tf.data.Dataset.from_tensor_slices((first, second))
# Watch out for tf.data's default unpacking of tuples here!
# Best to invoke the `preprocessor` directly in this case.
ds = ds.map(
    lambda first, second: preprocessor(x=(first, second)),
    num_parallel_calls=tf.data.AUTOTUNE,
)
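
The tuple-unpacking pitfall in the last example can be reproduced without TensorFlow. In the sketch below, `map_like_tf_data` and `toy_preprocessor` are hypothetical stand-ins: the first mimics how tf.data.Dataset.map passes the components of a tuple element as separate positional arguments, and the second only records which argument received what.

```python
def map_like_tf_data(fn, element):
    """Mimic Dataset.map: a tuple element is unpacked into positional arguments."""
    if isinstance(element, tuple):
        return fn(*element)
    return fn(element)


def toy_preprocessor(x, y=None, sample_weight=None):
    """Stand-in with the same call signature; it just reports its arguments."""
    return {"x": x, "y": y, "sample_weight": sample_weight}


pair = ("The quick brown fox jumped.", "The fox tripped.")

# Mapping the layer directly: the second sentence is mistaken for the label y.
wrong = map_like_tf_data(toy_preprocessor, pair)
print(wrong["x"])  # The quick brown fox jumped.
print(wrong["y"])  # The fox tripped.

# Wrapping in a lambda keeps both sentences together as x.
right = map_like_tf_data(
    lambda first, second: toy_preprocessor(x=(first, second)), pair
)
print(right["x"])  # ('The quick brown fox jumped.', 'The fox tripped.')
print(right["y"])  # None
```

The labeled-pair case does not need the lambda because the element is ((first, second), label), so the unpacked arguments already line up as x and y.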

from_preset method

ElectraPreprocessor.from_preset(preset, **kwargs)

Instantiate a keras_nlp.models.Preprocessor from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

For any Preprocessor subclass, you can run cls.presets.keys() to list all built-in presets available on the class.

As there are usually multiple preprocessing classes for a given model, this method should be called on a specific subclass like keras_nlp.models.BertPreprocessor.from_preset().
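
As a rough illustration of the four preset forms listed above, the following sketch classifies a preset string by its shape. The `describe_preset` helper is hypothetical and is not how the library resolves presets; in particular, treating anything that starts with "./", "../" or "/" as a local path is an assumption made only for this example.

```python
def describe_preset(preset: str) -> str:
    """Classify a preset string into one of the four documented forms."""
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    # Assumption of this sketch: local presets are written as explicit paths.
    if preset.startswith(("./", "../", "/")):
        return "local"
    return "builtin"


for handle in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(handle, "->", describe_preset(handle))
```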

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.

Examples

# Load a preprocessor for Gemma generation.
preprocessor = keras_nlp.models.GemmaCausalLMPreprocessor.from_preset(
    "gemma_2b_en",
)

# Load a preprocessor for Bert classification.
preprocessor = keras_nlp.models.BertPreprocessor.from_preset(
    "bert_base_en",
)

Preset name | Parameters | Description
electra_small_discriminator_uncased_en | 13.55M | 12-layer small ELECTRA discriminator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.
electra_small_generator_uncased_en | 13.55M | 12-layer small ELECTRA generator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.
electra_base_discriminator_uncased_en | 109.48M | 12-layer base ELECTRA discriminator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.
electra_base_generator_uncased_en | 33.58M | 12-layer base ELECTRA generator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.
electra_large_discriminator_uncased_en | 335.14M | 24-layer large ELECTRA discriminator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.
electra_large_generator_uncased_en | 51.07M | 24-layer large ELECTRA generator model. All inputs are lowercased. Trained on English Wikipedia + BooksCorpus.

tokenizer property

keras_nlp.models.ElectraPreprocessor.tokenizer

The tokenizer used to tokenize strings.