Model Overview

Model Summary

D-FINE is a family of lightweight, real-time object detection models built on the DETR (DEtection TRansformer) architecture. It achieves strong localization precision by redefining bounding box regression as the iterative refinement of probability distributions rather than the prediction of fixed coordinates. Pretrained on large-scale image datasets, D-FINE identifies and localizes objects with high accuracy and speed while remaining computationally efficient, making it suitable both for research and for deployment in real-time applications.

Key Features:

  • Transformer-based Architecture: A modern, efficient design based on the DETR framework for direct set prediction of objects.
  • Open Source Code: Code is publicly available, promoting accessibility and innovation.
  • Strong Performance: Achieves state-of-the-art results on object detection benchmarks like COCO for its size.
  • Multiple Sizes: Comes in various sizes (e.g., Nano, Small, Large, X-Large) to fit different hardware capabilities.
  • Advanced Bounding Box Refinement: Instead of predicting fixed coordinates, it iteratively refines probability distributions for precise object localization using Fine-grained Distribution Refinement (FDR).
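The distribution-based regression behind FDR can be illustrated with a minimal numpy sketch. This is a toy version with hypothetical, evenly spaced bin values; D-FINE's actual weighting function and bin layout differ. Each box edge is predicted as a probability distribution over discrete offset bins, and the final offset is the expectation under that distribution, so refining the distribution refines the localization.

```python
import numpy as np

def expected_offset(logits, bin_values):
    """Convert per-edge logits over discrete offset bins into a scalar offset.

    The offset is the expectation of the bin values under the softmax
    distribution, so sharpening the distribution sharpens the localization.
    """
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float((probs * bin_values).sum())

# Hypothetical bins spanning [-1, 1] (D-FINE uses its own weighting scheme).
bins = np.linspace(-1.0, 1.0, 9)

# A uniform distribution over symmetric bins yields an offset near 0.
print(expected_offset(np.zeros(9), bins))

# A distribution peaked on the last bin yields an offset close to +1.
peaked = np.zeros(9)
peaked[-1] = 8.0
print(expected_offset(peaked, bins))
```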

Training Strategies:

D-FINE is pre-trained on large and diverse datasets like COCO and Objects365. The training process utilizes Global Optimal Localization Self-Distillation (GO-LSD), a bidirectional optimization strategy that transfers localization knowledge from refined distributions in deeper layers to shallower layers. This accelerates convergence and improves the overall performance of the model.
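GO-LSD's layer-to-layer transfer can be sketched as a distillation loss between distributions. This toy numpy version (not KerasHub's implementation) uses a KL divergence that pulls a shallow decoder layer's offset distribution toward the refined distribution of the final layer, which during training would act as a constant teacher signal.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def localization_distillation_loss(shallow_logits, deep_logits):
    """KL(deep || shallow) over offset bins.

    In GO-LSD-style training the deep (refined) distribution is treated as
    the teacher, so gradients would flow only into the shallow layer.
    """
    p = softmax(deep_logits)     # teacher: refined final-layer distribution
    q = softmax(shallow_logits)  # student: shallow-layer distribution
    return float((p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum())

# Identical distributions give zero loss; mismatched ones give positive loss.
print(localization_distillation_loss(np.zeros(5), np.zeros(5)))
print(localization_distillation_loss(np.array([2.0, 0, 0, 0, 0]),
                                     np.array([0, 0, 0, 0, 2.0])))
```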

Weights are released under the Apache 2.0 License.

Links

Installation

Keras and KerasHub can be installed with:

pip install -U -q keras-hub
pip install -U -q keras

JAX, TensorFlow, and Torch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the Keras Getting Started page.
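Keras 3 selects its compute backend from the KERAS_BACKEND environment variable, which must be set before keras is first imported. A minimal example:

```python
import os

# Choose the backend before the first `import keras`; Keras 3 reads this
# variable once, at import time. Valid values include "jax", "tensorflow",
# and "torch".
os.environ["KERAS_BACKEND"] = "jax"

# Any `import keras` that follows will now run on the JAX backend.
```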

Available D-FINE Presets

The following model checkpoints are provided by the Keras team. Full code examples for each are available below.

Preset   Parameters Description
dfine_nano_coco 3.79M D-FINE Nano model, the smallest variant in the family, pretrained on the COCO dataset. Ideal for applications where computational resources are limited.
dfine_small_coco 10.33M D-FINE Small model pretrained on the COCO dataset. Offers a balance between performance and computational efficiency.
dfine_medium_coco 19.62M D-FINE Medium model pretrained on the COCO dataset. A solid baseline with strong performance for general-purpose object detection.
dfine_large_coco 31.34M D-FINE Large model pretrained on the COCO dataset. Provides high accuracy and is suitable for more demanding tasks.
dfine_xlarge_coco 62.83M D-FINE X-Large model, the largest COCO-pretrained variant, designed for state-of-the-art performance where accuracy is the top priority.
dfine_small_obj365 10.62M D-FINE Small model pretrained on the large-scale Objects365 dataset, enhancing its ability to recognize a wider variety of objects.
dfine_medium_obj365 19.99M D-FINE Medium model pretrained on the Objects365 dataset. Benefits from a larger and more diverse pretraining corpus.
dfine_large_obj365 31.86M D-FINE Large model pretrained on the Objects365 dataset for improved generalization and performance on diverse object categories.
dfine_xlarge_obj365 63.35M D-FINE X-Large model pretrained on the Objects365 dataset, offering maximum performance by leveraging a vast number of object categories during pretraining.
dfine_small_obj2coco 10.33M D-FINE Small model first pretrained on Objects365 and then fine-tuned on COCO, combining broad feature learning with benchmark-specific adaptation.
dfine_medium_obj2coco 19.62M D-FINE Medium model using a two-stage training process: pretraining on Objects365 followed by fine-tuning on COCO.
dfine_large_obj2coco_e25 31.34M D-FINE Large model pretrained on Objects365 and then fine-tuned on COCO for 25 epochs. A high-performance model with specialized tuning.
dfine_xlarge_obj2coco 62.83M D-FINE X-Large model, pretrained on Objects365 and fine-tuned on COCO, representing the most powerful model in this series for COCO-style tasks.

Example Usage

Imports

import keras
import keras_hub
import numpy as np
from keras_hub.models import DFineBackbone
from keras_hub.models import DFineObjectDetector
from keras_hub.models import HGNetV2Backbone

Load a Pretrained Model

Use from_preset() to load a D-FINE model with pretrained weights.

object_detector = DFineObjectDetector.from_preset(
    "dfine_small_obj365"
)

Make a Prediction

Call predict() on a batch of images. The images will be automatically preprocessed.

# Create a random image.
image = np.random.uniform(size=(1, 256, 256, 3)).astype("float32")

# Make predictions.
predictions = object_detector.predict(image)

# The output is a dictionary containing boxes, labels, confidence scores,
# and the number of detections.
print(predictions["boxes"].shape)
print(predictions["labels"].shape)
print(predictions["confidence"].shape)
print(predictions["num_detections"])
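The returned arrays are fixed-size and padded up to the maximum number of detections, so downstream code typically trims them using num_detections and a confidence threshold. Below is a small numpy sketch of that post-processing using a mocked predictions dictionary with the layout printed above; the padding values are illustrative assumptions, not the exact values KerasHub emits.

```python
import numpy as np

def trim_detections(predictions, image_index=0, score_threshold=0.5):
    """Keep only the valid, sufficiently confident detections for one image."""
    n = int(predictions["num_detections"][image_index])
    boxes = predictions["boxes"][image_index][:n]
    labels = predictions["labels"][image_index][:n]
    scores = predictions["confidence"][image_index][:n]
    keep = scores >= score_threshold
    return boxes[keep], labels[keep], scores[keep]

# Mocked output: 2 valid detections out of a padded size of 4.
mock = {
    "boxes": np.zeros((1, 4, 4), dtype="float32"),
    "labels": np.array([[3, 7, -1, -1]]),
    "confidence": np.array([[0.9, 0.4, 0.0, 0.0]], dtype="float32"),
    "num_detections": np.array([2]),
}
boxes, labels, scores = trim_detections(mock)
print(labels)  # only the detection scoring >= 0.5 survives
```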

Fine-Tune a Pre-trained Model

You can load a pretrained backbone and attach a new detection head for a different number of classes.

# Load a pretrained backbone.
backbone = DFineBackbone.from_preset(
    "dfine_small_obj365"
)

# Create a new detector with a different number of classes for fine-tuning.
finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)

# The `finetuning_detector` is now ready to be compiled and trained on a new dataset.

Create a Model From Scratch

You can also build a D-FINE detector by first creating its components, such as the underlying HGNetV2Backbone.

# 1. Define a base backbone for feature extraction.
hgnetv2_backbone = HGNetV2Backbone(
    stem_channels=[3, 16, 16],
    stackwise_stage_filters=[
        [16, 16, 64, 1, 3, 3],
        [64, 32, 256, 1, 3, 3],
        [256, 64, 512, 2, 3, 5],
        [512, 128, 1024, 1, 3, 5],
    ],
    apply_downsample=[False, True, True, True],
    use_lightweight_conv_block=[False, False, True, True],
    depths=[1, 1, 2, 1],
    hidden_sizes=[64, 256, 512, 1024],
    embedding_size=16,
    image_shape=(256, 256, 3),
    out_features=["stage3", "stage4"],
)

# 2. Create the D-FINE backbone, which includes the hybrid encoder and decoder.
d_fine_backbone = DFineBackbone(
    backbone=hgnetv2_backbone,
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=0, # Denoising is off
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# 3. Create the final object detector model.
object_detector_scratch = DFineObjectDetector(
    backbone=d_fine_backbone,
    num_classes=80,
    bounding_box_format="yxyx",
)

Train the Model

Call fit() on a batch of images and ground truth bounding boxes. The detector's compute_loss method handles the Hungarian matching between predictions and targets along with the classification and box regression losses.

# Prepare sample training data.
images = np.random.uniform(
    low=0, high=255, size=(2, 256, 256, 3)
).astype("float32")
bounding_boxes = {
    "boxes": [
        np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.8, 0.8]], dtype="float32"),
        np.array([[0.2, 0.2, 0.4, 0.4]], dtype="float32"),
    ],
    "labels": [
        np.array([1, 10], dtype="int32"),
        np.array([20], dtype="int32"),
    ],
}

# Compile the model with the built-in loss function.
object_detector_scratch.compile(
    optimizer="adam",
    loss=object_detector_scratch.compute_loss,
)

# Train the model.
object_detector_scratch.fit(x=images, y=bounding_boxes, epochs=1)

Train with Contrastive Denoising

To enable contrastive denoising for training, provide ground truth labels when initializing the DFineBackbone.

# Sample ground truth labels for initializing the denoising generator.
labels_for_denoising = [
    {
        "boxes": np.array([[0.5, 0.5, 0.2, 0.2]]), "labels": np.array([1])
    },
    {
        "boxes": np.array([[0.6, 0.6, 0.3, 0.3]]), "labels": np.array([2])
    },
]

# Create a D-FINE backbone with denoising enabled.
d_fine_backbone_denoising = DFineBackbone(
    backbone=hgnetv2_backbone, # Using the hgnetv2_backbone from before
    decoder_in_channels=[128, 128],
    encoder_hidden_dim=128,
    num_denoising=100,  # Number of denoising queries
    label_noise_ratio=0.5,
    box_noise_scale=1.0,
    labels=labels_for_denoising, # Provide labels at initialization
    num_labels=80,
    hidden_dim=128,
    learn_initial_query=False,
    num_queries=300,
    anchor_image_size=(256, 256),
    feat_strides=[16, 32],
    num_feature_levels=2,
    encoder_in_channels=[512, 1024],
    encode_proj_layers=[1],
    num_attention_heads=8,
    encoder_ffn_dim=512,
    num_encoder_layers=1,
    hidden_expansion=0.34,
    depth_multiplier=0.5,
    eval_idx=-1,
    num_decoder_layers=3,
    decoder_attention_heads=8,
    decoder_ffn_dim=512,
    decoder_n_points=[6, 6],
    lqe_hidden_dim=64,
    num_lqe_layers=2,
    image_shape=(256, 256, 3),
)

# Create the final detector.
object_detector_denoising = DFineObjectDetector(
    backbone=d_fine_backbone_denoising,
    num_classes=80
)

# This model can now be compiled and trained as shown in the previous example.
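Under the hood, contrastive denoising builds extra decoder queries from noised copies of the ground truth. The following simplified numpy sketch shows the noising step only, with hypothetical noise logic (label flips governed by label_noise_ratio, box jitter scaled by box_noise_scale); it is not KerasHub's implementation.

```python
import numpy as np

def make_noised_queries(boxes, labels, num_labels=80,
                        label_noise_ratio=0.5, box_noise_scale=1.0, seed=0):
    """Simplified sketch: flip some class labels and jitter box coordinates.

    The decoder is then trained to "denoise" these corrupted queries back to
    the original ground truth, which stabilizes and accelerates training.
    """
    rng = np.random.default_rng(seed)
    flip = rng.random(len(labels)) < label_noise_ratio
    random_labels = rng.integers(0, num_labels, size=len(labels))
    noised_labels = np.where(flip, random_labels, labels)
    scale = box_noise_scale * boxes[:, 2:].mean()
    jitter = (rng.random(boxes.shape) * 2.0 - 1.0) * scale
    noised_boxes = np.clip(boxes + jitter, 0.0, 1.0)
    return noised_boxes, noised_labels

gt_boxes = np.array([[0.5, 0.5, 0.2, 0.2]], dtype="float32")
gt_labels = np.array([1])
noised_boxes, noised_labels = make_noised_queries(gt_boxes, gt_labels)
print(noised_boxes.shape, noised_labels.shape)
```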

Example Usage with Hugging Face URI

Presets can also be loaded directly from the Hugging Face Hub by prefixing the preset name with hf://. Aside from the preset string, the workflow (prediction, fine-tuning, building from scratch, and training) is identical to the examples above.

Load a Pretrained Model

object_detector = DFineObjectDetector.from_preset(
    "hf://keras/dfine_small_obj365"
)

Fine-Tune a Pre-trained Model

backbone = DFineBackbone.from_preset(
    "hf://keras/dfine_small_obj365"
)

finetuning_detector = DFineObjectDetector(
    backbone=backbone,
    num_classes=10  # Example: fine-tuning on a new dataset with 10 classes
)