Model Card for Finetuned-DINOv2-Chest-CT
This repository hosts the backbone weights for a foundational Vision Transformer (ViT-Large with Registers) fine-tuned on Chest CT scans using the self-supervised DINOv2 algorithm (incorporating iBOT masked image modeling).
Unlike our models trained from scratch using LeJEPA, this model was initialized from pretrained weights and then continually pre-trained on medical-domain data. It loads directly with the Hugging Face transformers library, so it works with standard transformers-based deployment pipelines.
This model was developed by the Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI) at the University of Kentucky to serve as a robust feature extractor for downstream medical imaging tasks, including segmentation, multi-instance learning (MIL), and anomaly detection.
Model Details
- Model Type: Vision Transformer with Registers (Dinov2WithRegistersModel, ViT-Large).
- Developed by: Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI)
- Model Date: 04/2026
- Base Model Architecture: Hugging Face DINOv2 with registers. Initialized from the pretrained facebook/dinov2-with-registers-large weights and fine-tuned.
- Input: 1-channel Grayscale CT Image.
- Output: Class token, register tokens, and patch tokens.
- Embedding Dimension: 1024
- Patch Size: 14
- Register Tokens: 4
- Image Size Compatibility: Native 518x518 resolution. Standard Hugging Face DINOv2 models support variable input sizes, provided the height and width are multiples of the patch size (14).
- License: CC BY-NC-SA 4.0 (Inherited from the CT-RATE dataset terms).
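The patch size, register count, and input size listed above determine how many tokens the model returns per image. The following is a minimal sketch of that arithmetic; `token_layout` is a hypothetical helper written for illustration and is not part of the repository or of transformers.

```python
def token_layout(height, width, patch_size=14, num_registers=4):
    """Return (num_patch_tokens, total_tokens) for one input image."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    num_patches = (height // patch_size) * (width // patch_size)
    # One class token, then the register tokens, then the patch tokens.
    return num_patches, 1 + num_registers + num_patches


print(token_layout(518, 518))  # (1369, 1374)
print(token_layout(504, 504))  # (1296, 1301)
```

At the native 518x518 resolution this gives a 37x37 patch grid, so the output sequence holds 1 class token, 4 register tokens, and 1369 patch tokens.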
Intended Uses
This model is intended for research purposes in the field of medical imaging and radiology.
- Primary Intended Uses:
  - Feature extraction for quantitative analysis of Chest CT scans.
  - Foundational backbone for downstream models predicting organ anomalies, segmentation, or volume-level analysis via MIL.
Training Data
- Dataset(s): The model was trained exclusively on the train split of the CT-RATE dataset.
- Preprocessing: Hounsfield Units (HU) were clipped to [-997.0, 888.0]. These bounds correspond to the 0.5th and 99.5th percentiles of the foreground voxel intensities, calculated on a subset of the CT-RATE dataset. The clipped values were mapped to a [0, 1] range and then Z-score normalized using a dataset mean of -142.39 HU and a standard deviation of 360.97 HU, both rescaled to the same [0, 1] range.
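Because the mean and standard deviation are rescaled together with the intensities, the two-step normalization is algebraically the same as clipping and then Z-scoring directly in HU. The sketch below checks this numerically with NumPy; the function names are hypothetical, and the constants are the ones given in this card.

```python
import numpy as np

CLIP_MIN, CLIP_MAX = -997.0, 888.0   # HU clipping bounds from this card
MEAN_HU, STD_HU = -142.39, 360.97    # dataset statistics from this card


def normalize_two_step(hu):
    """Clip, map to [0, 1], then z-score with statistics rescaled to [0, 1]."""
    hu_range = CLIP_MAX - CLIP_MIN
    scaled = (np.clip(hu, CLIP_MIN, CLIP_MAX) - CLIP_MIN) / hu_range
    norm_mean = (MEAN_HU - CLIP_MIN) / hu_range
    norm_std = STD_HU / hu_range
    return (scaled - norm_mean) / norm_std


def normalize_direct(hu):
    """Clip, then z-score directly in HU (algebraically the same result)."""
    return (np.clip(hu, CLIP_MIN, CLIP_MAX) - MEAN_HU) / STD_HU


hu = np.array([-2000.0, -997.0, -142.39, 0.0, 888.0, 3000.0])
print(np.allclose(normalize_two_step(hu), normalize_direct(hu)))  # True
```

Either form can be used at inference time, as long as the clipping bounds and statistics match the values above.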
Training Procedure
- Training System/Framework: Fully Sharded Data Parallel (FSDP) utilizing bf16 mixed precision.
- Hardware & Scale: The model was fine-tuned for 20,000 iterations. The configuration utilized a batch size of 48 per GPU (DDP across 8 H200 GPUs), a peak learning rate of 2.0e-04 (decaying to 1.0e-05), and a 3,300-step warmup.
- Training Strategy: Global and local crops were sampled from within a 12 mm physical slab rather than a single 2D plane to ensure anatomical awareness.
- Self-Supervised Objectives:
  - DINO Loss: Distillation from a momentum teacher network.
  - iBOT Loss: Masked image modeling with a mask sample probability of 0.5 and a mask ratio between 10% and 50%.
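To make the learning-rate figures above concrete, here is a minimal sketch of a schedule that reaches the stated peak after the warmup and ends at the stated final value. The shape is an assumption: this card gives only the peak, the final value, the warmup length, and the iteration count, so the linear warmup from zero and the cosine decay used here are illustrative choices, not the confirmed training configuration.

```python
import math

PEAK_LR = 2.0e-4       # from this card
FINAL_LR = 1.0e-5      # from this card
WARMUP_STEPS = 3_300   # from this card
TOTAL_STEPS = 20_000   # from this card


def learning_rate(step):
    """Linear warmup to PEAK_LR, then cosine decay to FINAL_LR (shape assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_650, 3_300, 20_000):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the warmup covers 16.5% of training (3,300 of 20,000 steps).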
Data Augmentation Pipeline
A specialized, GPU-accelerated augmentation pipeline generated the multi-crop views required by the DINOv2 training objective.
Spatial Cropping
- Global Crops: 2 global crops generated per volume, sized at 224x224 pixels, with a random scale between 40% and 100% of the original image dimensions.
- Local Crops: 14 local crops generated per volume (8 standard local crops + 6 guided local crops), sized at 144x144 pixels, with a random scale between 20% and 40%.
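The crop counts and scale ranges above can be illustrated with a small sampler. This is a sketch under stated assumptions, not the pipeline used for training: it treats the scale as the crop side relative to the shorter image side, samples square boxes uniformly, and ignores both the guided local crops and the 12 mm slab sampling. `sample_crop_box` is a hypothetical helper.

```python
import random

# Crop settings from this card: (number of crops, output size, scale range).
GLOBAL_CROPS = (2, 224, (0.40, 1.00))
LOCAL_CROPS = (14, 144, (0.20, 0.40))


def sample_crop_box(height, width, scale_range, rng):
    """Sample a square crop box (top, left, side) inside a height x width slice.

    Assumption: scale is the crop side as a fraction of the shorter image side.
    """
    scale = rng.uniform(*scale_range)
    side = max(1, round(scale * min(height, width)))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    return top, left, side


rng = random.Random(0)
views = []
for count, out_size, scale_range in (GLOBAL_CROPS, LOCAL_CROPS):
    for _ in range(count):
        views.append((sample_crop_box(512, 512, scale_range, rng), out_size))

print(len(views))  # 16 views per volume: 2 global + 14 local
```

Each sampled box would then be resized to its output size (224 or 144 pixels) before being passed to the student and teacher networks.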
How to Get Started with the Model
Because CT scans require strict Hounsfield Unit (HU) windowing and normalization to match the training distribution, you must apply the specific preprocessing logic below.
Unlike the timm-based LeJEPA models, this model is natively supported by the Hugging Face transformers library. The repository contains the necessary config.json and model.safetensors files.
import torch
import torch.nn.functional as F
import numpy as np
from transformers import AutoModel


class CTInferenceTransform:
    """
    Applies the exact HU windowing and Z-score normalization used during training.
    """

    def __init__(self):
        self.clip_min = -997.0
        self.clip_max = 888.0
        self.mean_hu = -142.39
        self.std_hu = 360.97
        self.patch_size = 14

        # Rescale the HU mean and std to the [0, 1] intensity range
        range_val = self.clip_max - self.clip_min
        self.norm_mean = (self.mean_hu - self.clip_min) / range_val
        self.norm_std = self.std_hu / range_val

    def __call__(self, volume):
        # Expects a 2D numpy array or torch tensor (H, W) in Hounsfield Units
        if isinstance(volume, np.ndarray):
            volume = torch.from_numpy(volume)
        volume = volume.float()

        if volume.ndim == 2:
            volume = volume.unsqueeze(0)  # Add channel dim: (1, H, W)

        # 1. Clamp HU values and map strictly to [0, 1]
        volume = torch.clamp(volume, self.clip_min, self.clip_max)
        range_val = self.clip_max - self.clip_min
        volume = (volume - self.clip_min) / range_val

        # 2. Z-score standardization
        volume = (volume - self.norm_mean) / self.norm_std

        # 3. Resize down to the nearest multiple of the patch size
        # HF DINOv2 expects dimensions to be multiples of the patch size (14)
        C, H, W = volume.shape
        target_h = (H // self.patch_size) * self.patch_size
        target_w = (W // self.patch_size) * self.patch_size

        if target_h != H or target_w != W:
            volume = volume.unsqueeze(0)  # (1, C, H, W)
            # Nearest-neighbor interpolation avoids blending neighboring intensity values
            volume = F.interpolate(volume, size=(target_h, target_w), mode='nearest')
            volume = volume.squeeze(0)

        # Returns (1, 1, H', W'), where H' and W' are multiples of 14.
        # For batched inference, concatenate these along dim=0 with torch.cat.
        return volume.unsqueeze(0)


def load_finetuned_dinov2_ct(repo_id="IBI-CAAI/Finetuned-DINOv2-Chest-CT"):
    """
    Downloads and initializes the ViT-Large backbone using Hugging Face transformers.
    """
    # The config.json in the HF repo handles architecture setup (1 in_chan, 518 native size)
    model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
    model.eval()
    return model


if __name__ == "__main__":
    # Initialize the transform and the model
    transform = CTInferenceTransform()
    model = load_finetuned_dinov2_ct()

    # Simulate a raw CT slice (Replace this with an actual NIfTI/DICOM load in Hounsfield Units)
    raw_ct_slice = np.random.uniform(-1000, 1000, size=(512, 512))

    # Process the image to ensure correct normalization
    input_tensor = transform(raw_ct_slice)

    # Extract embeddings
    with torch.no_grad():
        outputs = model(pixel_values=input_tensor)

    # DINOv2 returns last_hidden_state, pooler_output, etc.
    # last_hidden_state includes the [CLS] token, register tokens, and spatial patch tokens
    hidden_states = outputs.last_hidden_state

    # [CLS] token is the first token
    cls_token = hidden_states[:, 0, :]

    # Register tokens (4 tokens based on config)
    register_tokens = hidden_states[:, 1:5, :]

    # Dense patch tokens (for fine-grained tasks like Segmentation)
    patch_tokens = hidden_states[:, 5:, :]

    print(f"Input tensor shape: {input_tensor.shape}")
    print(f"Full hidden states shape: {hidden_states.shape}")
    print(f"CLS token shape: {cls_token.shape}")
    print(f"Register tokens shape: {register_tokens.shape}")
    print(f"Dense patch tokens shape: {patch_tokens.shape}")
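For dense tasks such as segmentation, the flat sequence of patch tokens usually needs to be reshaped into a 2D feature map. The sketch below shows one way to do that. `patch_tokens_to_grid` is a hypothetical helper, not part of this repository, and it runs on a random stand-in tensor so that it does not need the model weights.

```python
import torch


def patch_tokens_to_grid(hidden_states, height, width, patch_size=14, num_registers=4):
    """Reshape (B, 1 + R + N, D) transformer output into a (B, D, H/14, W/14) map."""
    grid_h, grid_w = height // patch_size, width // patch_size
    # Skip the class token and the register tokens; keep only patch tokens.
    patch_tokens = hidden_states[:, 1 + num_registers:, :]
    batch, num_patches, dim = patch_tokens.shape
    if num_patches != grid_h * grid_w:
        raise ValueError("token count does not match the given image size")
    # Assumes row-major patch order (left to right, then top to bottom).
    return patch_tokens.reshape(batch, grid_h, grid_w, dim).permute(0, 3, 1, 2)


# Stand-in for outputs.last_hidden_state on a 504x504 input (random values).
dummy_hidden = torch.randn(1, 1 + 4 + 36 * 36, 1024)
feature_map = patch_tokens_to_grid(dummy_hidden, 504, 504)
print(feature_map.shape)  # torch.Size([1, 1024, 36, 36])
```

The height and width passed in must be those of the tensor actually fed to the model (504x504 for a 512x512 slice after the transform above), not those of the raw slice.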
Model tree for IBI-CAAI/Finetuned-DINOv2-Chest-CT
- Base model: facebook/dinov2-with-registers-large