WildDet3D v1.1: DROID-finetuned

A SAM3_3D + LingBot-Depth model, initialized from the WildDet3D Stage 3 checkpoint and finetuned on the DROID robot manipulation dataset (3 views: wrist, ext1, ext2; ~24k frames).

Files

  • wilddet3d_v1.1_droid_finetuned.ckpt: inference-only PyTorch Lightning checkpoint (4.5 GB, optimizer state stripped)

Training Pipeline

WildDet3D Stage 1: Omni3D-only canonical (12 epochs)
  -> Stage 2: All-Data Dense FT (12 epochs)
    -> Stage 3: Omni3D + canonical_neg_pt FT (3 epochs, point prompts)
      -> Stage 4 (this ckpt): + DROID add-on (3 epochs, lr 1e-5 -> 1e-6)

Stage 4 setup:

  • Datasets: Omni3D 70% + DROID 30%
  • Epoch base = DROID (1 epoch ≈ one DROID pass through 22k samples)
  • 8 GPUs × samples_per_gpu=4 × accumulate=4 = effective batch 128
  • Collator: 5mode_neg_pt (text + box + point with negative sampling)
  • LR schedule: step_1=0, step_2=1, base_lr=1e-4 → epoch 0 at 1e-5, epochs 1-2 at 1e-6 (the initial high-LR phase at base_lr is skipped to preserve Stage 3 ability)
  • Started from Stage 3 ckpt: epoch=2-step=7998 (canonical_neg_ft_3e_2node)
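The batch-size and learning-rate figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not the training code: it assumes that step_1 and step_2 are the epochs at which the LR is multiplied by a decay factor, and that the factor is 0.1 (inferred from the 1e-4 → 1e-5 → 1e-6 progression). The function names are hypothetical.

```python
def effective_batch(num_gpus: int = 8, samples_per_gpu: int = 4, accumulate: int = 4) -> int:
    """Samples contributing to one optimizer step."""
    return num_gpus * samples_per_gpu * accumulate


def multistep_lr(epoch: int,
                 base_lr: float = 1e-4,
                 milestones: tuple = (0, 1),   # assumed meaning of step_1=0, step_2=1
                 gamma: float = 0.1) -> float:  # assumed decay factor
    """LR for a given epoch: one decay per milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


if __name__ == "__main__":
    print("effective batch:", effective_batch())
    for epoch in range(3):
        print(f"epoch {epoch}: lr = {multistep_lr(epoch):.0e}")
```

Because the first milestone sits at epoch 0, the schedule never runs at base_lr itself, which is the "skipped" high-LR phase noted in the list.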

Eval results: DROID_val (146 unified categories)

| Mode                                        | Stage 3 (baseline) | Stage 4 (this ckpt) | Δ     |
|---------------------------------------------|--------------------|---------------------|-------|
| 3D AP (bbox), text + mono depth             | 0.0205             | 0.0642              | +213% |
| 3D AP (bbox), text + GT depth               | 0.0582             | 0.0657              | +13%  |
| 3D AP (bbox), oracle 2D box + mono depth    | 0.0119             | 0.0890              | +648% |
| 3D AP (bbox), oracle 2D box + GT depth      | 0.0505             | 0.0972              | +92%  |
| 3D AP (dist), text + mono depth             | 0.0676             | 0.2693              | +298% |
| 3D AP (dist), text + GT depth               | 0.2414             | 0.3198              | +32%  |
| 3D AP (dist), oracle 2D box + mono depth    | 0.0400             | 0.3137              | +684% |
| 3D AP (dist), oracle 2D box + GT depth      | 0.1625             | 0.3812              | +135% |

DROID-specific 3D detection improves in every mode: roughly 3× to 8× in the four mono-depth modes and 1.1× to 2.3× in the four GT-depth modes. General Omni3D performance is not harmed (overall AP held steady at ~0.322 across the 3 epochs).
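The relative changes in the results table follow directly from the two AP columns. A short sketch for recomputing them (the dictionary keys are shorthand labels chosen here, not identifiers from the codebase):

```python
# (Stage 3 AP, Stage 4 AP) per evaluation mode, copied from the results table.
RESULTS = {
    "bbox, text + mono depth":        (0.0205, 0.0642),
    "bbox, text + GT depth":          (0.0582, 0.0657),
    "bbox, oracle box + mono depth":  (0.0119, 0.0890),
    "bbox, oracle box + GT depth":    (0.0505, 0.0972),
    "dist, text + mono depth":        (0.0676, 0.2693),
    "dist, text + GT depth":          (0.2414, 0.3198),
    "dist, oracle box + mono depth":  (0.0400, 0.3137),
    "dist, oracle box + GT depth":    (0.1625, 0.3812),
}


def relative_change(baseline: float, finetuned: float) -> float:
    """Percentage change of the finetuned score relative to the baseline."""
    return (finetuned - baseline) / baseline * 100.0


if __name__ == "__main__":
    for mode, (stage3, stage4) in RESULTS.items():
        ratio = stage4 / stage3
        print(f"{mode:32s} {ratio:5.2f}x  {relative_change(stage3, stage4):+.0f}%")
```

Printing the ratio next to the percentage makes it easy to see which modes gain the most: the mono-depth rows, where the Stage 3 baseline was weakest.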

Usage

# vis4d test path (drop-in replacement for Stage 3 ckpt)
vis4d test \
    --config opendet3d/zoo/sam3_3d/eval/sam3_3d_lingbot_depth_freeze21_droid_canonical.py \
    --gpus 4 \
    --ckpt wilddet3d_v1.1_droid_finetuned.ckpt \
    --config.params.nms=True \
    --config.params.nms_iou_threshold=0.6 \
    --config.params.score_threshold=0.05 \
    --config.model.init_args.use_presence_score=False

Use the matching SAM3_3D config .py file from the WildDet3D codebase. This checkpoint is meant for evaluation only, loaded via vis4d test --ckpt; because the optimizer state is not included, it cannot be used to resume training.
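To confirm what the file contains before running evaluation, the checkpoint can be opened directly with PyTorch. This is a sketch under assumptions: it relies on the standard PyTorch Lightning layout (a "state_dict" entry for weights, an "optimizer_states" entry for optimizer state), and the stripped file may or may not retain "epoch" and "global_step". The helper names are hypothetical. Loading with weights_only=False unpickles arbitrary objects, so do this only with a file you trust.

```python
import os


def summarize_checkpoint(ckpt: dict) -> dict:
    """Report weight count and whether any optimizer state is present."""
    state_dict = ckpt.get("state_dict", {})
    return {
        "num_tensors": len(state_dict),
        # Missing key or empty list both mean "no optimizer state".
        "has_optimizer_state": bool(ckpt.get("optimizer_states")),
        "epoch": ckpt.get("epoch"),
        "global_step": ckpt.get("global_step"),
    }


def load_checkpoint(path: str) -> dict:
    """Load a Lightning checkpoint onto the CPU (torch imported lazily)."""
    import torch
    return torch.load(path, map_location="cpu", weights_only=False)


if __name__ == "__main__":
    path = "wilddet3d_v1.1_droid_finetuned.ckpt"
    if os.path.exists(path):
        print(summarize_checkpoint(load_checkpoint(path)))
    else:
        print(f"{path} not found; download it first")
```

If has_optimizer_state comes back False, the file is consistent with the inference-only description above.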

DROID Conversion

DROID frames are extracted from the droid_pipeline outputs:

  • 3 views per episode (wrist + ext1 + ext2 from stereo MP4 left half)
  • FoundationStereo dense depth (uint16 mm, max 20 m)
  • Per-view gravity normalization in OpenCV camera frame
  • Categories unified at eval: 786 raw → 558 head-noun (via suffix matching)
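The category unification step can be illustrated with a small sketch. The exact matching rule used by the pipeline is not specified here, so the following is an assumption: a raw category name is mapped to the longest whole-word suffix that appears in a head-noun vocabulary, and left unchanged if no suffix matches. The function name and the example vocabulary are hypothetical.

```python
from __future__ import annotations


def unify_category(raw: str, head_nouns: set[str]) -> str:
    """Map a raw category name to its head noun via whole-word suffix matching."""
    words = raw.lower().split()
    # Try the longest suffix first so multi-word head nouns win over shorter ones.
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in head_nouns:
            return candidate
    # No suffix is a known head noun: keep the (lowercased) raw name.
    return raw.lower()


if __name__ == "__main__":
    vocab = {"mug", "drawer", "cutting board"}  # illustrative vocabulary only
    for name in ["blue coffee mug", "Top Drawer", "wooden cutting board", "stapler"]:
        print(f"{name!r} -> {unify_category(name, vocab)!r}")
```

Matching on whole words rather than raw characters avoids accidental merges such as mapping "cupboard" onto "board".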