KerasCV Models

KerasCV contains end-to-end implementations of popular model architectures. These models can be created in two ways:

  • Through the from_preset() constructor, which instantiates an object with a pre-trained configuration and (optionally) weights. Available preset names are listed on this page.
model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_v2_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)
  • Through custom configuration controlled by the user. To do this, pass the desired configuration parameters to the default constructors of the symbols documented below; a quick smoke test of the resulting model follows these examples.
backbone = keras_cv.models.ResNetBackbone(
    stackwise_filters=[64, 128, 256, 512],
    stackwise_blocks=[2, 2, 2, 2],
    stackwise_strides=[1, 2, 2, 2],
    include_rescaling=False,
)
model = keras_cv.models.RetinaNet(
    backbone=backbone,
    num_classes=20,
    bounding_box_format="xywh",
)
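
Either way, the result is a standard Keras model. As a quick smoke test that the configuration is wired up correctly, you can run a forward pass on dummy data. This is a minimal sketch: the input shape is an illustrative assumption, and the inputs are scaled to [0, 1] because the backbone above was built with include_rescaling=False (see the pixel-range note under "Backbone presets").

import numpy as np

# Illustrative batch of four 512x512 RGB images in the [0, 1] range,
# matching include_rescaling=False on the backbone above.
images = np.random.uniform(0, 1, size=(4, 512, 512, 3)).astype("float32")

# predict() returns decoded detections; for KerasCV detectors this is a
# dictionary with "boxes", "confidence", "classes", and "num_detections".
predictions = model.predict(images)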

Backbone presets

Each of the following preset names corresponds to a configuration and weights for a backbone model.

The names below can be used with the from_preset() constructor for the corresponding backbone model.

backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")

For brevity, we do not include the presets without pretrained weights in the following table.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True or in the range [0, 1] if include_rescaling=False.
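
As an illustration of this contract, here is a sketch assuming (as is the case for the KerasCV ImageNet presets) that the preset was built with include_rescaling=True:

import numpy as np
import keras_cv

# The preset includes a rescaling layer, so raw [0, 255] pixels can be
# passed directly (assumption stated above).
backbone = keras_cv.models.ResNetBackbone.from_preset("resnet50_imagenet")
raw_images = np.random.uniform(0, 255, size=(2, 224, 224, 3)).astype("float32")
features = backbone(raw_images)

# Had the backbone been built with include_rescaling=False, the inputs
# would need rescaling first: backbone(raw_images / 255.0)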

| Preset name | Model | Parameters | Description |
| --- | --- | --- | --- |
| csp_darknet_l_imagenet | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths, where batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny_imagenet | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths, where batch normalization and SiLU activation are applied after the convolution layers. Trained on the ImageNet 2012 classification task. |
| csp_darknet_tiny | CSPDarkNet | 2.38M | CSPDarkNet model with [48, 96, 192, 384] channels and [1, 3, 3, 1] depths, where batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_s | CSPDarkNet | 4.22M | CSPDarkNet model with [64, 128, 256, 512] channels and [1, 3, 3, 1] depths, where batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_m | CSPDarkNet | 12.37M | CSPDarkNet model with [96, 192, 384, 768] channels and [2, 6, 6, 2] depths, where batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_l | CSPDarkNet | 27.11M | CSPDarkNet model with [128, 256, 512, 1024] channels and [3, 9, 9, 3] depths, where batch normalization and SiLU activation are applied after the convolution layers. |
| csp_darknet_xl | CSPDarkNet | 56.84M | CSPDarkNet model with [170, 340, 680, 1360] channels and [4, 12, 12, 4] depths, where batch normalization and SiLU activation are applied after the convolution layers. |
| densenet121_imagenet | DenseNet | Unknown | DenseNet model with 121 layers. Trained on the ImageNet 2012 classification task. |
| densenet169_imagenet | DenseNet | Unknown | DenseNet model with 169 layers. Trained on the ImageNet 2012 classification task. |
| densenet201_imagenet | DenseNet | Unknown | DenseNet model with 201 layers. Trained on the ImageNet 2012 classification task. |
| densenet121 | DenseNet | Unknown | DenseNet model with 121 layers. |
| densenet169 | DenseNet | Unknown | DenseNet model with 169 layers. |
| densenet201 | DenseNet | Unknown | DenseNet model with 201 layers. |
| efficientnetlite_b0 | EfficientNetLite | 3.41M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetlite_b1 | EfficientNetLite | 4.19M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetlite_b2 | EfficientNetLite | 4.87M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetlite_b3 | EfficientNetLite | 6.99M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetlite_b4 | EfficientNetLite | 11.84M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b0 | EfficientNetV1 | 4.05M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv1_b1 | EfficientNetV1 | 6.58M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv1_b2 | EfficientNetV1 | 7.77M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv1_b3 | EfficientNetV1 | 10.79M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| efficientnetv1_b4 | EfficientNetV1 | 17.68M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.4 and depth_coefficient=1.8. |
| efficientnetv1_b5 | EfficientNetV1 | 28.52M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.6 and depth_coefficient=2.2. |
| efficientnetv1_b6 | EfficientNetV1 | 40.97M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.8 and depth_coefficient=2.6. |
| efficientnetv1_b7 | EfficientNetV1 | 64.11M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=2.0 and depth_coefficient=3.1. |
| efficientnetv2_b0_imagenet | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. Weights are initialized to pretrained ImageNet classification weights. Published weights score 77.1% top-1 and 93.3% top-5 accuracy on ImageNet. |
| efficientnetv2_b1_imagenet | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. Weights are initialized to pretrained ImageNet classification weights. Published weights score 79.1% top-1 and 94.4% top-5 accuracy on ImageNet. |
| efficientnetv2_b2_imagenet | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. Weights are initialized to pretrained ImageNet classification weights. Published weights score 80.1% top-1 and 94.9% top-5 accuracy on ImageNet. |
| efficientnetv2_s_imagenet | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. Weights are initialized to pretrained ImageNet classification weights. Published weights score 83.9% top-1 and 96.7% top-5 accuracy on ImageNet. |
| efficientnetv2_s | EfficientNetV2 | 20.33M | EfficientNet architecture with 6 convolutional blocks. |
| efficientnetv2_m | EfficientNetV2 | 53.15M | EfficientNet architecture with 7 convolutional blocks. |
| efficientnetv2_l | EfficientNetV2 | 117.75M | EfficientNet architecture with 7 convolutional blocks, but more filters than efficientnetv2_m. |
| efficientnetv2_b0 | EfficientNetV2 | 5.92M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.0. |
| efficientnetv2_b1 | EfficientNetV2 | 6.93M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.0 and depth_coefficient=1.1. |
| efficientnetv2_b2 | EfficientNetV2 | 8.77M | EfficientNet B-style architecture with 6 convolutional blocks. This B-style model has width_coefficient=1.1 and depth_coefficient=1.2. |
| efficientnetv2_b3 | EfficientNetV2 | 12.93M | EfficientNet B-style architecture with 7 convolutional blocks. This B-style model has width_coefficient=1.2 and depth_coefficient=1.4. |
| mit_b0_imagenet | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. Pre-trained on ImageNet-1K; scores 69% top-1 accuracy on the validation set. |
| mit_b0 | MiT | 3.32M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b1 | MiT | 13.16M | MiT (MixTransformer) model with 8 transformer blocks. |
| mit_b2 | MiT | 24.20M | MiT (MixTransformer) model with 16 transformer blocks. |
| mit_b3 | MiT | 44.08M | MiT (MixTransformer) model with 28 transformer blocks. |
| mit_b4 | MiT | 60.85M | MiT (MixTransformer) model with 41 transformer blocks. |
| mit_b5 | MiT | 81.45M | MiT (MixTransformer) model with 52 transformer blocks. |
| mobilenet_v3_large_imagenet | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers, where batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small_imagenet | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers, where batch normalization and hard-swish activation are applied after the convolution layers. Pre-trained on the ImageNet 2012 classification task. |
| mobilenet_v3_small | MobileNetV3 | 933.50K | MobileNetV3 model with 14 layers, where batch normalization and hard-swish activation are applied after the convolution layers. |
| mobilenet_v3_large | MobileNetV3 | 2.99M | MobileNetV3 model with 28 layers, where batch normalization and hard-swish activation are applied after the convolution layers. |
| resnet50_imagenet | ResNetV1 | 23.56M | ResNet model with 50 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). Trained on the ImageNet 2012 classification task. |
| resnet18 | ResNetV1 | 11.19M | ResNet model with 18 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet34 | ResNetV1 | 21.30M | ResNet model with 34 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50 | ResNetV1 | 23.56M | ResNet model with 50 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet101 | ResNetV1 | 42.61M | ResNet model with 101 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet152 | ResNetV1 | 58.30M | ResNet model with 152 layers, where batch normalization and ReLU activation are applied after the convolution layers (v1 style). |
| resnet50_v2_imagenet | ResNetV2 | 23.56M | ResNet model with 50 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). Trained on the ImageNet 2012 classification task. |
| resnet18_v2 | ResNetV2 | 11.18M | ResNet model with 18 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet34_v2 | ResNetV2 | 21.30M | ResNet model with 34 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet50_v2 | ResNetV2 | 23.56M | ResNet model with 50 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet101_v2 | ResNetV2 | 42.63M | ResNet model with 101 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). |
| resnet152_v2 | ResNetV2 | 58.33M | ResNet model with 152 layers, where batch normalization and ReLU activation precede the convolution layers (v2 style). |
| videoswin_base_kinetics400 | VideoSwinB | 87.64M | A base Video Swin backbone. Pretrained on the ImageNet-1K dataset and fine-tuned on the Kinetics-400 dataset. Published weights score 80.6% top-1 and 94.6% top-5 accuracy on Kinetics-400. |
| videoswin_small_kinetics400 | VideoSwinS | 49.51M | A small Video Swin backbone. Pretrained on the ImageNet-1K dataset and fine-tuned on the Kinetics-400 dataset. Published weights score 80.6% top-1 and 94.5% top-5 accuracy on Kinetics-400. |
| videoswin_tiny_kinetics400 | VideoSwinT | 27.85M | A tiny Video Swin backbone. Pretrained on the ImageNet-1K dataset and fine-tuned on the Kinetics-400 dataset. |
| videoswin_tiny | VideoSwinT | 27.85M | A tiny Video Swin backbone architecture. |
| videoswin_small | VideoSwinS | 49.51M | A small Video Swin backbone architecture. |
| videoswin_base | VideoSwinB | 87.64M | A base Video Swin backbone architecture. |
| videoswin_base_kinetics400_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone. Pretrained on the ImageNet-22K dataset and fine-tuned on the Kinetics-400 dataset. Published weights score 82.7% top-1 and 95.5% top-5 accuracy on Kinetics-400. |
| videoswin_base_kinetics600_imagenet22k | VideoSwinB | 87.64M | A base Video Swin backbone. Pretrained on the ImageNet-22K dataset and fine-tuned on the Kinetics-600 dataset. Published weights score 84.0% top-1 and 96.5% top-5 accuracy on Kinetics-600. |
| videoswin_base_something_something_v2 | VideoSwinB | 87.64M | A base Video Swin backbone. Pretrained on the Kinetics-400 dataset and fine-tuned on the Something-Something V2 dataset. Published weights score 69.6% top-1 and 92.7% top-5 accuracy on Something-Something V2. |
| vitdet_base_sa1b | VitDet | 89.67M | A base Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_huge_sa1b | VitDet | 637.03M | A huge Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_large_sa1b | VitDet | 308.28M | A large Detectron2 ViT backbone trained on the SA1B dataset. |
| vitdet_base | VitDet | 89.67M | Detectron2 ViT backbone with 12 transformer encoders, embed dim 768, and 12-head attention layers, with global attention on encoders 2, 5, 8, and 11. |
| vitdet_large | VitDet | 308.28M | Detectron2 ViT backbone with 24 transformer encoders, embed dim 1024, and 16-head attention layers, with global attention on encoders 5, 11, 17, and 23. |
| vitdet_huge | VitDet | 637.03M | Detectron2 ViT backbone with 32 transformer encoders, embed dim 1280, and 16-head attention layers, with global attention on encoders 7, 15, 23, and 31. |
| yolo_v8_xs_backbone | YOLOV8 | 1.28M | An extra small YOLOV8 backbone. |
| yolo_v8_s_backbone | YOLOV8 | 5.09M | A small YOLOV8 backbone. |
| yolo_v8_m_backbone | YOLOV8 | 11.87M | A medium YOLOV8 backbone. |
| yolo_v8_l_backbone | YOLOV8 | 19.83M | A large YOLOV8 backbone. |
| yolo_v8_xl_backbone | YOLOV8 | 30.97M | An extra large YOLOV8 backbone. |
| yolo_v8_xs_backbone_coco | YOLOV8 | 1.28M | An extra small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_s_backbone_coco | YOLOV8 | 5.09M | A small YOLOV8 backbone pretrained on COCO. |
| yolo_v8_m_backbone_coco | YOLOV8 | 11.87M | A medium YOLOV8 backbone pretrained on COCO. |
| yolo_v8_l_backbone_coco | YOLOV8 | 19.83M | A large YOLOV8 backbone pretrained on COCO. |
| yolo_v8_xl_backbone_coco | YOLOV8 | 30.97M | An extra large YOLOV8 backbone pretrained on COCO. |
| center_pillar_waymo_open_dataset | Unknown | 1.28M | An example CenterPillar backbone for the Waymo Open Dataset. |
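
Any backbone in this table can be paired with a compatible task model. For example, the sketch below feeds a COCO-pretrained YOLOV8 backbone into a YOLOV8Detector; the num_classes and fpn_depth values are illustrative choices, not fixed by the preset:

backbone = keras_cv.models.YOLOV8Backbone.from_preset("yolo_v8_s_backbone_coco")
detector = keras_cv.models.YOLOV8Detector(
    backbone=backbone,
    num_classes=20,  # set to the number of classes in your dataset
    bounding_box_format="xywh",
    fpn_depth=2,  # illustrative choice
)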

Task presets

Each of the following preset names corresponds to a configuration and weights for a task model. These models are application-ready, but can be further fine-tuned if desired.

The names below can be used with the from_preset() constructor for the corresponding task models.

object_detector = keras_cv.models.RetinaNet.from_preset(
    "retinanet_resnet50_pascalvoc",
    bounding_box_format="xywh",
)
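
These task models are ready for inference via predict(). To fine-tune one instead, compile it with detection losses first. A minimal sketch, assuming the KerasCV loss identifiers "focal" and "smoothl1" and an arbitrary optimizer choice:

object_detector.compile(
    classification_loss="focal",  # focal loss for the classification head
    box_loss="smoothl1",          # smooth L1 loss for the box regression head
    optimizer="adam",             # illustrative choice
)
# object_detector.fit(...) then expects images plus bounding boxes in the
# "xywh" format requested above.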

Note that all backbone presets are also applicable to the tasks. For example, you can directly use a ResNetBackbone preset with the RetinaNet. In this case, fine-tuning is necessary since task-specific layers will be randomly initialized.

model = keras_cv.models.RetinaNet.from_preset(
    "resnet50_imagenet",
    num_classes=20,
    bounding_box_format="xywh",
)

For brevity, we do not include the backbone presets in the following table.

Note: All pretrained weights should be used with unnormalized pixel intensities in the range [0, 255] if include_rescaling=True or in the range [0, 1] if include_rescaling=False.

{{task_presets_table}}

API Documentation

Tasks

Backbones