WildDet3D v1.1: DROID-finetuned
A SAM3_3D + LingBot-Depth model finetuned on the DROID robotics manipulation dataset (3-view: wrist + ext1 + ext2, ~24k frames) on top of the WildDet3D Stage 3 checkpoint.
Files
wilddet3d_v1.1_droid_finetuned.ckpt: inference-only PyTorch Lightning checkpoint (4.5 GB, optimizer state stripped)
Training Pipeline
WildDet3D Stage 1: Omni3D-only canonical (12 epochs)
-> Stage 2: All-Data Dense FT (12 epochs)
-> Stage 3: Omni3D + canonical_neg_pt FT (3 epochs, point prompts)
-> Stage 4 (this ckpt): + DROID add-on (3 epochs, lr 1e-5 -> 1e-6)
Stage 4 setup:
- Datasets: Omni3D 70% + DROID 30%
- Epoch base = DROID (1 epoch = one DROID pass through 22k samples)
- 8 GPUs × samples_per_gpu=4 × accumulate=4 = effective batch 128
- Collator: 5mode_neg_pt (text + box + point with negative sampling)
- LR schedule: step_1=0, step_2=1, base_lr=1e-4, which gives epoch 0 at 1e-5 and epochs 1-2 at 1e-6 (the high-LR phase at base_lr is skipped to preserve Stage 3 ability)
- Started from Stage 3 ckpt: epoch=2-step=7998 (canonical_neg_ft_3e_2node)
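The batch-size arithmetic and the step schedule above can be checked with a short sketch. It assumes that step_1 and step_2 are epoch milestones at which the learning rate is multiplied by 0.1 (a MultiStepLR-style reading). That interpretation reproduces the stated 1e-5 and 1e-6 values but is not confirmed by the config, and the function names are illustrative only.

```python
def effective_batch(num_gpus: int, samples_per_gpu: int, accumulate: int) -> int:
    """Number of samples contributing to one optimizer step."""
    return num_gpus * samples_per_gpu * accumulate


def lr_at_epoch(epoch: int, base_lr: float, milestones: list[int],
                gamma: float = 0.1) -> float:
    """Assumed step schedule: multiply by gamma once per milestone <= epoch."""
    decays = sum(1 for m in milestones if m <= epoch)
    return base_lr * gamma ** decays


print("effective batch:", effective_batch(8, 4, 4))  # 128
for epoch in range(3):
    # Milestones [0, 1] stand in for step_1=0, step_2=1 (assumption).
    print(f"epoch {epoch}: lr = {lr_at_epoch(epoch, 1e-4, [0, 1]):.0e}")
```

With a milestone at epoch 0, the first decay applies before any training step, which is why base_lr=1e-4 itself is never used in this stage.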
Eval results: DROID_val (146 unified categories)
| Mode | Stage 3 (baseline) | Stage 4 (this ckpt) | Δ |
|---|---|---|---|
| 3D AP (bbox), text + mono depth | 0.0205 | 0.0642 | +213% |
| 3D AP (bbox), text + GT depth | 0.0582 | 0.0657 | +13% |
| 3D AP (bbox), oracle 2D box + mono depth | 0.0119 | 0.0890 | +648% |
| 3D AP (bbox), oracle 2D box + GT depth | 0.0505 | 0.0972 | +92% |
| 3D AP (dist), text + mono depth | 0.0676 | 0.2693 | +298% |
| 3D AP (dist), text + GT depth | 0.2414 | 0.3198 | +32% |
| 3D AP (dist), oracle 2D box + mono depth | 0.0400 | 0.3137 | +684% |
| 3D AP (dist), oracle 2D box + GT depth | 0.1625 | 0.3812 | +135% |
DROID-specific 3D detection improved strongly: roughly 3-8× with monocular depth and about 1.1-2.3× with GT depth, without harming general Omni3D performance (overall AP held steady at ~0.322 across the 3 epochs).
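The Δ column is the relative change of Stage 4 over Stage 3. A small sketch that recomputes it from the AP values in the table (the dictionary keys are shortened labels chosen here for readability):

```python
# (Stage 3 baseline, Stage 4) AP pairs copied from the table above.
RESULTS = {
    "bbox / text + mono depth": (0.0205, 0.0642),
    "bbox / text + GT depth": (0.0582, 0.0657),
    "bbox / oracle box + mono depth": (0.0119, 0.0890),
    "bbox / oracle box + GT depth": (0.0505, 0.0972),
    "dist / text + mono depth": (0.0676, 0.2693),
    "dist / text + GT depth": (0.2414, 0.3198),
    "dist / oracle box + mono depth": (0.0400, 0.3137),
    "dist / oracle box + GT depth": (0.1625, 0.3812),
}


def relative_gain_pct(baseline: float, finetuned: float) -> float:
    """Relative change in percent: (new - old) / old * 100."""
    return (finetuned - baseline) / baseline * 100.0


for mode, (base, ft) in RESULTS.items():
    ratio = ft / base
    print(f"{mode:<32} {ratio:4.1f}x  {relative_gain_pct(base, ft):+.0f}%")
```

The ratio column makes the split visible: the monocular-depth modes gain the most, while the GT-depth modes start from a higher baseline and gain less.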
Usage
# vis4d test path (drop-in replacement for Stage 3 ckpt)
vis4d test \
--config opendet3d/zoo/sam3_3d/eval/sam3_3d_lingbot_depth_freeze21_droid_canonical.py \
--gpus 4 \
--ckpt wilddet3d_v1.1_droid_finetuned.ckpt \
--config.params.nms=True \
--config.params.nms_iou_threshold=0.6 \
--config.params.score_threshold=0.05 \
--config.model.init_args.use_presence_score=False
Use the matching SAM3_3D config .py file from the WildDet3D codebase.
This checkpoint loads with vis4d test --ckpt for evaluation only;
optimizer state is not included, so it cannot be used to resume training.
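To check whether a given checkpoint is inference-only before pointing a training job at it, one can inspect its top-level keys. The sketch below assumes the standard PyTorch Lightning checkpoint layout, in which optimizer and scheduler state live under optimizer_states and lr_schedulers. The helper name is illustrative and is not part of vis4d or WildDet3D.

```python
from typing import Any, Mapping


def can_resume_training(ckpt: Mapping[str, Any]) -> bool:
    """True only if the checkpoint still carries non-empty optimizer state.

    Assumes the Lightning convention of storing it under "optimizer_states".
    """
    return bool(ckpt.get("optimizer_states"))


# Toy dictionaries standing in for loaded checkpoints (not real weights).
stripped = {"state_dict": {"w": [0.0]}, "epoch": 2}
full = {**stripped, "optimizer_states": [{"state": {}}], "lr_schedulers": []}

print(can_resume_training(stripped))  # False
print(can_resume_training(full))      # True

# With PyTorch installed, the same check on the real file would look like:
#   import torch
#   ckpt = torch.load("wilddet3d_v1.1_droid_finetuned.ckpt",
#                     map_location="cpu", weights_only=False)
#   print(sorted(ckpt.keys()), can_resume_training(ckpt))
# weights_only=False unpickles arbitrary objects, so use it only on trusted files.
```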
DROID Conversion
DROID frames are extracted from the droid_pipeline outputs:
- 3 views per episode (wrist + ext1 + ext2 from stereo MP4 left half)
- FoundationStereo dense depth (uint16 mm, max 20 m)
- Per-view gravity normalization in OpenCV camera frame
- Categories unified at eval: 786 raw -> 558 head-noun (via suffix matching)
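The head-noun unification in the last bullet could look roughly like the sketch below. The exact matching rule is not specified here, so this assumes that a raw category maps to the longest head noun that forms a whole-word suffix of its name. The function name, the vocabulary, and the example categories are all made up for illustration.

```python
def to_head_noun(raw: str, head_nouns: set[str]) -> str:
    """Map a raw category to the longest head noun that ends its name.

    Matching is done on whole words, so "cup" would not match "hiccup".
    Falls back to the normalized raw name when nothing matches.
    """
    words = raw.lower().split()
    # Try suffixes from longest to shortest:
    # "red coffee mug" -> "coffee mug" -> "mug".
    for start in range(len(words)):
        candidate = " ".join(words[start:])
        if candidate in head_nouns:
            return candidate
    return " ".join(words)


HEAD_NOUNS = {"mug", "drawer", "cutting board"}  # hypothetical vocabulary

for raw in ["red coffee mug", "top drawer", "wooden cutting board", "stapler"]:
    print(raw, "->", to_head_noun(raw, HEAD_NOUNS))
```

Checking the longest suffix first keeps multi-word head nouns such as "cutting board" from collapsing to "board" when both are in the vocabulary.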