ViTBackbone model

[source]

ViTBackbone class

keras_hub.models.ViTBackbone(
    image_shape,
    patch_size,
    num_layers,
    num_heads,
    hidden_dim,
    mlp_dim,
    dropout_rate=0.0,
    attention_dropout=0.0,
    layer_norm_epsilon=1e-06,
    use_mha_bias=True,
    use_mlp_bias=True,
    data_format=None,
    dtype=None,
    **kwargs
)

Vision Transformer (ViT) backbone.

This backbone implements the Vision Transformer architecture as described in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". It transforms the input image into a sequence of patches, embeds them, and then processes them through a series of Transformer encoder layers.
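For example, with image_shape=(224, 224, 3) and patch_size=16, the image is split into (224 / 16)² = 196 patches, each of which is linearly projected to a hidden_dim-dimensional embedding before entering the encoder stack.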

Arguments

  • image_shape: tuple or list of 3 integers. The shape of the input image (height, width, channels); height and width must be equal.
  • patch_size: int. The size of each image patch; the input image will be divided into patches of shape (patch_size, patch_size).
  • num_layers: int. The number of transformer encoder layers.
  • num_heads: int. The number of attention heads in each Transformer encoder layer.
  • hidden_dim: int. The dimensionality of the hidden representations.
  • mlp_dim: int. The dimensionality of the intermediate MLP layer in each Transformer encoder layer.
  • dropout_rate: float. The dropout rate for the Transformer encoder layers. Defaults to 0.0.
  • attention_dropout: float. The dropout rate for the attention mechanism in each Transformer encoder layer. Defaults to 0.0.
  • layer_norm_epsilon: float. Value used for numerical stability in layer normalization.
  • use_mha_bias: bool. Whether to use bias in the multi-head attention layers.
  • use_mlp_bias: bool. Whether to use bias in the MLP layers.
  • data_format: str. "channels_last" or "channels_first", specifying the data format for the input image. If None, defaults to "channels_last".
  • dtype: The dtype of the layer weights. Defaults to None.
  • **kwargs: Additional keyword arguments to be passed to the parent Backbone class.
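
Example

A minimal usage sketch, assuming a 224x224 RGB input in "channels_last" format; the encoder sizes below are illustrative rather than a standard ViT configuration.

import numpy as np
import keras_hub

# A batch of two 224x224 RGB images (channels_last).
input_data = np.ones((2, 224, 224, 3), dtype="float32")

# Randomly initialized ViT backbone with a small, illustrative configuration.
model = keras_hub.models.ViTBackbone(
    image_shape=(224, 224, 3),
    patch_size=16,
    num_layers=4,
    num_heads=4,
    hidden_dim=128,
    mlp_dim=256,
)
output = model(input_data)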

[source]

from_preset method

ViTBackbone.from_preset(preset, load_weights=True, **kwargs)

Instantiate a keras_hub.models.Backbone from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

This constructor can be called in one of two ways: either from the base class like keras_hub.models.Backbone.from_preset(), or from a model class like keras_hub.models.GemmaBackbone.from_preset(). If calling from the base class, the subclass of the returned object will be inferred from the config in the preset directory.

For any Backbone subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
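For example, to list the built-in ViT presets (the names shown in the table further below):

keras_hub.models.ViTBackbone.presets.keys()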

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

# Load a Gemma backbone with pre-trained weights.
model = keras_hub.models.Backbone.from_preset(
    "gemma_2b_en",
)

# Load a Bert backbone with a pre-trained config and random weights.
model = keras_hub.models.Backbone.from_preset(
    "bert_base_en",
    load_weights=False,
)
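
The same pattern applies to the ViT presets listed in the table below, for example:

# Load a ViT-B16 backbone pre-trained on ImageNet 1k at 224x224 resolution.
model = keras_hub.models.ViTBackbone.from_preset(
    "vit_base_patch16_224_imagenet",
)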
| Preset | Parameters | Description |
| --- | --- | --- |
| vit_base_patch16_224_imagenet | 85.80M | ViT-B16 model pre-trained on the ImageNet 1k dataset with image resolution of 224x224 |
| vit_base_patch16_224_imagenet21k | 85.80M | ViT-B16 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_base_patch16_384_imagenet | 86.09M | ViT-B16 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_base_patch32_224_imagenet21k | 87.46M | ViT-B32 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_base_patch32_384_imagenet | 87.53M | ViT-B32 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_large_patch16_224_imagenet | 303.30M | ViT-L16 model pre-trained on the ImageNet 1k dataset with image resolution of 224x224 |
| vit_large_patch16_224_imagenet21k | 303.30M | ViT-L16 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_large_patch16_384_imagenet | 303.69M | ViT-L16 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_large_patch32_224_imagenet21k | 305.51M | ViT-L32 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |
| vit_large_patch32_384_imagenet | 305.61M | ViT-L32 model pre-trained on the ImageNet 1k dataset with image resolution of 384x384 |
| vit_huge_patch14_224_imagenet21k | 630.76M | ViT-H14 backbone pre-trained on the ImageNet 21k dataset with image resolution of 224x224 |