DINOV2Backbone model

DINOV2Backbone class

keras_hub.models.DINOV2Backbone(
    patch_size,
    num_layers,
    hidden_dim,
    num_heads,
    intermediate_dim,
    layer_scale_init_value=1.0,
    num_register_tokens=0,
    use_mask_token=True,
    use_swiglu_ffn=False,
    dropout_rate=0.0,
    drop_path_rate=0.0,
    image_shape=(224, 224, 3),
    position_embedding_shape=(518, 518, 3),
    antialias_in_interpolation=False,
    data_format=None,
    dtype=None,
    name=None,
    **kwargs
)

DINOV2 core network with hyperparameters.

DINOV2 offers a powerful, generalist visual backbone learned entirely from unlabeled images, as described in "DINOv2: Learning Robust Visual Features without Supervision".

The default constructor gives a fully customizable, randomly initialized DINOV2 model with any number of layers, heads, and embedding dimensions. To load preset architectures and weights, use the from_preset constructor.

Note that this backbone supports interpolation of the position embeddings to the input image shape. This is useful when the input image shape is different from the shape used to train the position embeddings. The position_embedding_shape argument is used to specify the original shape used to train the position embeddings.
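As a rough illustration of why the interpolation is needed, the sketch below computes the patch grid for the two image sizes used on this page. The helper `patch_grid` is hypothetical (it is not part of keras_hub). The sketch assumes channels-last shapes, non-overlapping square patches, and a patch_size of 14 as in the custom-config example.

```python
def patch_grid(image_shape, patch_size):
    """Return (rows, cols) of non-overlapping patches for an (H, W, C) shape."""
    height, width = image_shape[0], image_shape[1]
    return height // patch_size, width // patch_size


# Grid the position embeddings were trained on (position_embedding_shape).
trained_grid = patch_grid((518, 518, 3), patch_size=14)
# Grid required by a smaller input image (image_shape).
input_grid = patch_grid((224, 224, 3), patch_size=14)

print(trained_grid)  # (37, 37)
print(input_grid)    # (16, 16)
```

Under these assumptions, position embeddings trained on a 37 x 37 grid would have to be resized to a 16 x 16 grid for a 224 x 224 input, which is the case the interpolation covers.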

Arguments

  • patch_size: int. The size of each square patch in the input image.
  • num_layers: int. The number of transformer layers.
  • hidden_dim: int. The size of the transformer hidden state at the end of each transformer layer.
  • num_heads: int. The number of attention heads in each transformer layer.
  • intermediate_dim: int. The output dimension of the first Dense layer in the two-layer feedforward network of each transformer layer.
  • layer_scale_init_value: float. The initial value for the layer scale in the transformer layers. Defaults to 1.0.
  • num_register_tokens: int. The number of register tokens to use in the embedding layer. Defaults to 0.
  • use_mask_token: bool. Whether to use a mask token in the embedding layer. Defaults to True.
  • use_swiglu_ffn: bool. Whether to use the SwiGLU activation in the MLP layers. Defaults to False.
  • dropout_rate: float. The dropout rate to use. Defaults to 0.0.
  • drop_path_rate: float. The drop path rate to use. Defaults to 0.0.
  • image_shape: tuple. The input shape without the batch size. Defaults to (224, 224, 3).
  • position_embedding_shape: tuple. The original shape used to train the position embeddings. This is used to interpolate the position embeddings to the actual input shape. Defaults to (518, 518).
  • antialias_in_interpolation: bool. Whether to use antialiasing in the interpolation of the position embeddings. Defaults to False.
  • data_format: None or str. If specified, either "channels_last" or "channels_first". The ordering of the dimensions in the inputs. "channels_last" corresponds to inputs with shape (batch_size, height, width, channels) while "channels_first" corresponds to inputs with shape (batch_size, channels, height, width). It defaults to the image_data_format value found in your Keras config file at ~/.keras/keras.json. If you never set it, then it will be "channels_last".
  • dtype: string or keras.mixed_precision.DTypePolicy. The dtype to use for the model's computations and weights. Note that some computations, such as softmax and layer normalization, will always be done in float32 precision regardless of dtype.
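The two data_format layouts described above can be sketched as a small helper. The function `batched_input_shape` is hypothetical and only mirrors the shape convention stated in the argument description. It does not call keras_hub.

```python
def batched_input_shape(batch_size, height, width, channels,
                        data_format="channels_last"):
    """Return the input tensor shape for the given data_format."""
    if data_format == "channels_last":
        return (batch_size, height, width, channels)
    if data_format == "channels_first":
        return (batch_size, channels, height, width)
    raise ValueError(f"Unknown data_format: {data_format!r}")


print(batched_input_shape(1, 224, 224, 3))
# (1, 224, 224, 3)
print(batched_input_shape(1, 224, 224, 3, data_format="channels_first"))
# (1, 3, 224, 224)
```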

Example

import numpy as np
import keras_hub

# Pretrained DINOV2 model.
input_data = {
    "images": np.ones(shape=(1, 518, 518, 3), dtype="float32"),
}
model = keras_hub.models.DINOV2Backbone.from_preset(
    "dinov2_base"
)
model(input_data)

# Pretrained DINOV2 model with custom image shape.
input_data = {
    "images": np.ones(shape=(1, 224, 224, 3), dtype="float32"),
}
model = keras_hub.models.DINOV2Backbone.from_preset(
    "dinov2_base", image_shape=(224, 224, 3)
)
model(input_data)

# Randomly initialized DINOV2 model with custom config.
model = keras_hub.models.DINOV2Backbone(
    patch_size=14,
    num_layers=2,
    hidden_dim=32,
    num_heads=2,
    intermediate_dim=128,
    image_shape=(224, 224, 3),
    position_embedding_shape=(518, 518),
)
model(input_data)  # Reuses the 224x224 input defined above.

from_preset method

DINOV2Backbone.from_preset(preset, load_weights=True, **kwargs)

Instantiate a keras_hub.models.Backbone from a model preset.

A preset is a directory of configs, weights and other file assets used to save and load a pre-trained model. The preset can be passed as one of:

  1. a built-in preset identifier like 'bert_base_en'
  2. a Kaggle Models handle like 'kaggle://user/bert/keras/bert_base_en'
  3. a Hugging Face handle like 'hf://user/bert_base_en'
  4. a path to a local preset directory like './bert_base_en'

This constructor can be called in one of two ways. Either from the base class like keras_hub.models.Backbone.from_preset(), or from a model class like keras_hub.models.GemmaBackbone.from_preset(). If calling from the base class, the subclass of the returning object will be inferred from the config in the preset directory.

For any Backbone subclass, you can run cls.presets.keys() to list all built-in presets available on the class.
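The four preset forms listed above can be told apart by their prefixes. The sketch below is illustrative only: `preset_kind` is a hypothetical helper, and the way keras_hub actually resolves a preset string may differ (for example, it may check whether a local directory exists).

```python
def preset_kind(preset):
    """Classify a preset string by its prefix (illustrative heuristic)."""
    if preset.startswith("kaggle://"):
        return "kaggle"
    if preset.startswith("hf://"):
        return "hugging_face"
    if preset.startswith(("./", "../", "/")):
        return "local_path"
    # Anything else is treated as a built-in preset identifier.
    return "built_in"


for preset in [
    "bert_base_en",
    "kaggle://user/bert/keras/bert_base_en",
    "hf://user/bert_base_en",
    "./bert_base_en",
]:
    print(preset, "->", preset_kind(preset))
```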

Arguments

  • preset: string. A built-in preset identifier, a Kaggle Models handle, a Hugging Face handle, or a path to a local directory.
  • load_weights: bool. If True, the weights will be loaded into the model architecture. If False, the weights will be randomly initialized.

Examples

import keras_hub

# Load a Gemma backbone with pre-trained weights.
model = keras_hub.models.Backbone.from_preset(
    "gemma_2b_en",
)

# Load a Bert backbone with a pre-trained config and random weights.
model = keras_hub.models.Backbone.from_preset(
    "bert_base_en",
    load_weights=False,
)

Preset                       Parameters  Description
dinov2_small                 22.58M      Vision Transformer (small-sized model) trained using DINOv2.
dinov2_with_registers_small  22.58M      Vision Transformer (small-sized model) trained using DINOv2, with registers.
dinov2_base                  87.63M      Vision Transformer (base-sized model) trained using DINOv2.
dinov2_with_registers_base   87.64M      Vision Transformer (base-sized model) trained using DINOv2, with registers.
dinov2_large                 305.77M     Vision Transformer (large-sized model) trained using DINOv2.
dinov2_with_registers_large  305.78M     Vision Transformer (large-sized model) trained using DINOv2, with registers.
dinov2_giant                 1.14B       Vision Transformer (giant-sized model) trained using DINOv2.
dinov2_with_registers_giant  1.14B       Vision Transformer (giant-sized model) trained using DINOv2, with registers.
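
When choosing among these presets, the parameter counts can be compared against a size budget. The sketch below copies the counts from the preset list. The helpers `param_count` and `largest_within` are hypothetical and do not query keras_hub.

```python
from decimal import Decimal

# Parameter counts as listed for each preset.
PRESET_PARAMS = {
    "dinov2_small": "22.58M",
    "dinov2_with_registers_small": "22.58M",
    "dinov2_base": "87.63M",
    "dinov2_with_registers_base": "87.64M",
    "dinov2_large": "305.77M",
    "dinov2_with_registers_large": "305.78M",
    "dinov2_giant": "1.14B",
    "dinov2_with_registers_giant": "1.14B",
}

_SUFFIX = {"M": 10**6, "B": 10**9}


def param_count(text):
    """Convert a string such as '22.58M' or '1.14B' to an integer count."""
    return int(Decimal(text[:-1]) * _SUFFIX[text[-1]])


def largest_within(budget):
    """Return the preset with the most parameters not exceeding budget."""
    fitting = {
        name: param_count(text)
        for name, text in PRESET_PARAMS.items()
        if param_count(text) <= budget
    }
    return max(fitting, key=fitting.get) if fitting else None


print(largest_within(100_000_000))  # dinov2_with_registers_base
```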