
GPT2Backbone model


GPT2Backbone class

keras_nlp.models.GPT2Backbone(
    vocabulary_size,
    num_layers,
    num_heads,
    hidden_dim,
    intermediate_dim,
    dropout=0.1,
    max_sequence_length=1024,
    dtype=None,
    **kwargs
)

GPT-2 core network with hyperparameters.

This network implements a Transformer-based decoder network, Generative Pretrained Transformer-2 (GPT-2), as described in "Language Models are Unsupervised Multitask Learners". It includes the embedding lookups and transformer layers.

The default constructor gives a fully customizable, randomly initialized GPT-2 model with any number of layers, heads, and embedding dimensions. To load preset architectures and weights, use the from_preset constructor.

Disclaimer: Pre-trained models are provided on an "as is" basis, without warranties or conditions of any kind. The underlying model is provided by a third party and subject to a separate license.

Arguments

  • vocabulary_size: int. The size of the token vocabulary.
  • num_layers: int. The number of transformer layers.
  • num_heads: int. The number of attention heads for each transformer layer. The hidden size must be divisible by the number of attention heads.
  • hidden_dim: int. The dimensionality of the embeddings and the transformer hidden states.
  • intermediate_dim: int. The output dimension of the first Dense layer in the two-layer feedforward network of each transformer layer.
  • dropout: float. Dropout probability for the transformer layers.
  • max_sequence_length: int. The maximum sequence length that this decoder can consume. If None, max_sequence_length defaults to the length of the input sequence. This determines the shape of the positional embeddings.
  • dtype: string or keras.mixed_precision.DTypePolicy. The dtype to use for the model's computations and weights. Note that some computations, such as softmax and layer normalization, will always be done in float32 precision regardless of dtype.

Example

import numpy as np

import keras_nlp

input_data = {
    "token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]),
}

# Pretrained GPT-2 decoder.
model = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
model(input_data)

# Randomly initialized GPT-2 decoder with custom config.
model = keras_nlp.models.GPT2Backbone(
    vocabulary_size=50257,
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    intermediate_dim=3072,
    max_sequence_length=1024,
)
model(input_data)
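
The backbone returns a sequence of hidden states for each input token. The sketch below is a hedged example, assuming an output shape of (batch_size, sequence_length, hidden_dim); it builds a small randomly initialized backbone with a reduced-precision dtype and inspects its output:

# Sketch: a small custom backbone with a reduced-precision dtype.
import numpy as np

import keras_nlp

input_data = {
    "token_ids": np.ones(shape=(1, 12), dtype="int32"),
    "padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]),
}

model = keras_nlp.models.GPT2Backbone(
    vocabulary_size=50257,
    num_layers=2,
    num_heads=2,
    hidden_dim=128,
    intermediate_dim=512,
    max_sequence_length=128,
    dtype="bfloat16",
)
hidden_states = model(input_data)
print(hidden_states.shape)  # Expected: (1, 12, 128)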


from_preset method

GPT2Backbone.from_preset(preset, load_weights=True, **kwargs)

Instantiate a GPT2Backbone model from a preset architecture and weights.

Arguments

  • preset: string. Must be one of "gpt2_base_en", "gpt2_medium_en", "gpt2_large_en", "gpt2_extra_large_en", "gpt2_base_en_cnn_dailymail".
  • load_weights: Whether to load pre-trained weights into the model. Defaults to True.

Examples

import keras_nlp

# Load architecture and weights from preset
model = keras_nlp.models.GPT2Backbone.from_preset(
    "gpt2_base_en"
)

# Load randomly initialized model from preset architecture
model = keras_nlp.models.GPT2Backbone.from_preset(
    "gpt2_base_en",
    load_weights=False
)

Preset name                 Parameters  Description
gpt2_base_en                124.44M     12-layer GPT-2 model where case is maintained. Trained on WebText.
gpt2_medium_en              354.82M     24-layer GPT-2 model where case is maintained. Trained on WebText.
gpt2_large_en               774.03M     36-layer GPT-2 model where case is maintained. Trained on WebText.
gpt2_extra_large_en         1.56B       48-layer GPT-2 model where case is maintained. Trained on WebText.
gpt2_base_en_cnn_dailymail  124.44M     12-layer GPT-2 model where case is maintained. Finetuned on the CNN/DailyMail summarization dataset.
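
As a quick sanity check against the parameter counts above, a preset architecture can be loaded without its weights and measured with count_params(), a standard keras.Model method. A minimal sketch:

# Sketch: cross-check a preset's parameter count without loading weights.
import keras_nlp

model = keras_nlp.models.GPT2Backbone.from_preset(
    "gpt2_base_en",
    load_weights=False,
)
print(f"{model.count_params() / 1e6:.2f}M parameters")  # Roughly 124.44M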

token_embedding property

keras_nlp.models.GPT2Backbone.token_embedding

A keras.layers.Embedding instance for embedding token ids.

This layer embeds integer token ids to the hidden dim of the model.
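
For instance, the layer can be called directly to map token ids to embedding vectors outside of the full forward pass. A minimal sketch, assuming the layer behaves like a standard keras.layers.Embedding and that the ids shown are arbitrary placeholders:

# Sketch: embed token ids directly with the backbone's token embedding layer.
import numpy as np

import keras_nlp

model = keras_nlp.models.GPT2Backbone.from_preset("gpt2_base_en")
token_ids = np.array([[464, 3797, 3332]])  # Arbitrary placeholder ids.
embeddings = model.token_embedding(token_ids)
print(embeddings.shape)  # Expected: (1, 3, 768) for gpt2_base_en (hidden_dim=768)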