ViTDet backbones

[source]

ViTDetBackbone class

keras_cv.models.ViTDetBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

A ViT image encoder that uses a windowed transformer encoder and relative positional encodings.

Arguments

  • input_shape (tuple[int], optional): The size of the input image in (H, W, C) format. Defaults to (1024, 1024, 3).
  • input_tensor (KerasTensor, optional): Output of keras.layers.Input() to use as image input for the model. Defaults to None.
  • include_rescaling (bool, optional): Whether to rescale the inputs. If set to True, inputs will be passed through a Rescaling(1/255.0) layer. Defaults to False.
  • patch_size (int, optional): the patch size to be supplied to the Patching layer to turn input images into a flattened sequence of patches. Defaults to 16.
  • embed_dim (int, optional): The latent dimensionality of the patch embeddings and of the output of each stacked windowed transformer encoder. Defaults to 768.
  • depth (int, optional): The number of transformer encoder layers to stack in the Vision Transformer. Defaults to 12.
  • mlp_dim (int, optional): The dimensionality of the hidden Dense layer in the transformer MLP head. Defaults to 3072 (768*4).
  • num_heads (int, optional): the number of heads to use in the MultiHeadAttentionWithRelativePE layer of each transformer encoder. Defaults to 12.
  • out_chans (int, optional): The number of channels (features) in the output (image encodings). Defaults to 256.
  • use_bias (bool, optional): Whether to use bias to project the keys, queries, and values in the attention layer. Defaults to True.
  • use_abs_pos (bool, optional): Whether to add absolute positional embeddings to the output patches. Defaults to True.
  • use_rel_pos (bool, optional): Whether to use relative positional encodings in the attention layer. Defaults to True.
  • window_size (int, optional): The size of the window for windowed attention in the transformer encoder blocks. Defaults to 14.
  • global_attention_indices (list, optional): Indices of the transformer encoder blocks that use global (rather than windowed) attention. Defaults to [2, 5, 8, 11].
  • layer_norm_epsilon (float, optional): The epsilon to use in the layer normalization layers of each transformer encoder block. Defaults to 1e-6.
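
The following is a minimal construction sketch, not taken from the upstream docs: it assumes a working keras_cv installation and keeps every argument at the defaults listed above except include_rescaling.

import numpy as np
import keras_cv

# Build the backbone; inputs will be rescaled from [0, 255] to [0, 1].
backbone = keras_cv.models.ViTDetBackbone(include_rescaling=True)

images = np.ones((1, 1024, 1024, 3))  # batch of one 1024x1024 RGB image
features = backbone(images)  # spatial feature map with out_chans (256) channels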

[source]

from_preset method

ViTDetBackbone.from_preset()

Instantiate a ViTDetBackbone model from a preset configuration and weights.

Arguments

  • preset: string. Must be one of "vitdet_base", "vitdet_large", "vitdet_huge", "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b". If looking for a preset with pretrained weights, choose one of "vitdet_base_sa1b", "vitdet_large_sa1b", "vitdet_huge_sa1b".
  • load_weights: Whether to load pre-trained weights into the model. Defaults to None, which follows whether the preset has pretrained weights available.

Examples

# Load architecture and weights from preset
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
)

# Load randomly initialized model from preset architecture, without weights
model = keras_cv.models.ViTDetBackbone.from_preset(
    "vitdet_base_sa1b",
    load_weights=False,
)

Preset name        Parameters  Description
vitdet_base        89.67M      Detectron2 ViT backbone with 12 transformer encoders, embed dim 768, and 12-head attention layers, with global attention on encoders 2, 5, 8, and 11.
vitdet_large       308.28M     Detectron2 ViT backbone with 24 transformer encoders, embed dim 1024, and 16-head attention layers, with global attention on encoders 5, 11, 17, and 23.
vitdet_huge        637.03M     Detectron2 ViT backbone with 32 transformer encoders, embed dim 1280, and 16-head attention layers, with global attention on encoders 7, 15, 23, and 31.
vitdet_base_sa1b   89.67M      A base Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_large_sa1b  308.28M     A large Detectron2 ViT backbone trained on the SA1B dataset.
vitdet_huge_sa1b   637.03M     A huge Detectron2 ViT backbone trained on the SA1B dataset.
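
As a side note, keras_cv backbone classes conventionally expose presets and presets_with_weights mappings; assuming that convention holds here, the names in the table above can also be listed programmatically:

import keras_cv

# All preset configurations for this architecture.
print(list(keras_cv.models.ViTDetBackbone.presets.keys()))

# Only the presets that ship pretrained weights (the *_sa1b entries).
print(list(keras_cv.models.ViTDetBackbone.presets_with_weights.keys()))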

[source]

ViTDetBBackbone class

keras_cv.models.ViTDetBBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetBBackbone model (the base-size ViTDet backbone).

For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.
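
A minimal transfer-learning sketch, assuming a frozen backbone and an illustrative 10-class classification head (the head layers and class count are assumptions, not from this page):

import keras
import keras_cv

backbone = keras_cv.models.ViTDetBBackbone()
backbone.trainable = False  # freeze the backbone weights

inputs = keras.Input(shape=(1024, 1024, 3))
features = backbone(inputs)  # (batch, 64, 64, 256) with the default configuration
x = keras.layers.GlobalAveragePooling2D()(features)
outputs = keras.layers.Dense(10, activation="softmax")(x)  # 10 classes: assumption
classifier = keras.Model(inputs, outputs)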

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetBBackbone()
output = model(input_data)

[source]

ViTDetLBackbone class

keras_cv.models.ViTDetLBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetLBackbone model (the large-size ViTDet backbone).

For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetLBackbone()
output = model(input_data)

[source]

ViTDetHBackbone class

keras_cv.models.ViTDetHBackbone(
    include_rescaling,
    input_shape=(1024, 1024, 3),
    input_tensor=None,
    patch_size=16,
    embed_dim=768,
    depth=12,
    mlp_dim=3072,
    num_heads=12,
    out_chans=256,
    use_bias=True,
    use_abs_pos=True,
    use_rel_pos=True,
    window_size=14,
    global_attention_indices=[2, 5, 8, 11],
    layer_norm_epsilon=1e-06,
    **kwargs
)

ViTDetHBackbone model (the huge-size ViTDet backbone).

For transfer learning use cases, make sure to read the guide to transfer learning & fine-tuning.

Example

import numpy as np
import keras_cv

input_data = np.ones(shape=(1, 1024, 1024, 3))

# Randomly initialized backbone
model = keras_cv.models.ViTDetHBackbone()
output = model(input_data)