ViTBackbone model

[source]

ViTBackbone class

keras_hub.models.ViTBackbone(
    image_shape,
    patch_size,
    num_layers,
    num_heads,
    hidden_dim,
    mlp_dim,
    dropout_rate=0.0,
    attention_dropout=0.0,
    layer_norm_epsilon=1e-06,
    use_mha_bias=True,
    use_mlp_bias=True,
    data_format=None,
    dtype=None,
    **kwargs
)

Vision Transformer (ViT) backbone.

This backbone implements the Vision Transformer architecture as described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". It transforms the input image into a sequence of patches, embeds them, and then processes them through a series of Transformer encoder layers.
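For example, with image_shape=(224, 224, 3) and patch_size=16, the image is split into (224 / 16)² = 196 patches, each of which is linearly projected to a hidden_dim-dimensional embedding before entering the encoder stack.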

Arguments

  • image_shape: tuple or list of 3 integers. The shape of the input image (height, width, channels); height and width must be equal.
  • patch_size: int. The size of each image patch; the input image will be divided into patches of shape (patch_size, patch_size).
  • num_layers: int. The number of transformer encoder layers.
  • num_heads: int. The number of attention heads in each Transformer encoder layer.
  • hidden_dim: int. The dimensionality of the hidden representations.
  • mlp_dim: int. The dimensionality of the intermediate MLP layer in each Transformer encoder layer.
  • dropout_rate: float. The dropout rate for the Transformer encoder layers. Defaults to 0.0.
  • attention_dropout: float. The dropout rate for the attention mechanism in each Transformer encoder layer. Defaults to 0.0.
  • layer_norm_epsilon: float. Value used for numerical stability in layer normalization.
  • use_mha_bias: bool. Whether to use bias in the multi-head attention layers.
  • use_mlp_bias: bool. Whether to use bias in the MLP layers.
  • data_format: str. "channels_last" or "channels_first", specifying the data format for the input image. If None, defaults to "channels_last".
  • dtype: The dtype of the layer weights. Defaults to None.
  • **kwargs: Additional keyword arguments to be passed to the parent Backbone class.
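
Example

A minimal usage sketch, assuming a 224x224 RGB input in "channels_last" format; the encoder sizes below are illustrative rather than a standard ViT configuration.

import numpy as np
import keras_hub

# A batch of two 224x224 RGB images (channels_last).
input_data = np.ones((2, 224, 224, 3), dtype="float32")

# Randomly initialized ViT backbone with a small, illustrative configuration.
model = keras_hub.models.ViTBackbone(
    image_shape=(224, 224, 3),
    patch_size=16,
    num_layers=4,
    num_heads=4,
    hidden_dim=128,
    mlp_dim=256,
)
output = model(input_data)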

[source]

from_preset method

ViTBackbone.from_preset(preset, load_weights=True, **kwargs)

Instantiate a keras_hub.models.Backbone from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

This constructor can be called in one of two ways: either from the base class like keras_hub.models.Backbone.from_preset(), or from a model class like keras_hub.models.GemmaBackbone.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.

For any Backbone subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
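For example, to list the built-in ViT presets (the names shown in the table further below):

keras_hub.models.ViTBackbone.presets.keys()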

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

# Load a Gemma backbone with pre-trained weights.
model = keras_hub.models.Backbone.from_preset(
    "gemma_2b_en",
)

# Load a Bert backbone with a pre-trained config and random weights.
model = keras_hub.models.Backbone.from_preset(
    "bert_base_en",
    load_weights=False,
)
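
The same pattern applies to the ViT presets listed in the table below, for example:

# Load a ViT-B16 backbone pre-trained on ImageNet 1k at 224x224 resolution.
model = keras_hub.models.ViTBackbone.from_preset(
    "vit_base_patch16_224_imagenet",
)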
| Preset | Parameters | Description |
| --- | --- | --- |
| vit_base_patch16_224_imagenet | 85.80M | ViT-B16 model pre-trained on the ImageNet 1k dataset with image resolution of 224x224 |
| vit_base_patch16_224_imagenet21k | 85.80M | ViT-B16 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_base_patch16_384_imagenet | 86.09M | ViT-B16 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_base_patch32_224_imagenet21k | 87.46M | ViT-B32 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_base_patch32_384_imagenet | 87.53M | ViT-B32 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_large_patch16_224_imagenet | 303.30M | ViT-L16 model pre-trained on the ImageNet 1k dataset with image resolution of 224x224 |
| vit_large_patch16_224_imagenet21k | 303.30M | ViT-L16 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_large_patch16_384_imagenet | 303.69M | ViT-L16 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_large_patch32_224_imagenet21k | 305.51M | ViT-L32 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_large_patch32_384_imagenet | 305.61M | ViT-L32 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_huge_patch14_224_imagenet21k | 630.76M | ViT-H14 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |