Model Card for Finetuned-DINOv2-Chest-CT

This repository hosts the backbone weights for a foundational Vision Transformer (ViT-Large with Registers) fine-tuned on Chest CT scans using the self-supervised DINOv2 algorithm (incorporating iBOT masked image modeling).

Unlike our models trained from scratch using LeJEPA, this model was initialized from pretrained weights and then continually pre-trained on Chest CT data. It loads through the native Hugging Face transformers API, so it is compatible with standard deployment pipelines.

This model was developed by the Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI) at the University of Kentucky to serve as a robust feature extractor for downstream medical imaging tasks, including segmentation, multi-instance learning (MIL), and anomaly detection.

Model Details

Model Type: Vision Transformer with Registers (Dinov2WithRegistersModel, ViT-Large).
Developed by: Institute for Biomedical Informatics Center for Applied AI (IBI-CAAI)
Model Date: 04/2026
Base Model Architecture: Hugging Face DINOv2. Initialized from pretrained weights and fine-tuned.
Input: 1-channel Grayscale CT Image.
Output: Class token, register tokens, and patch tokens.
Embedding Dimension: 1024
Patch Size: 14
Register Tokens: 4
Image Size Compatibility: Native 518x518 resolution. Standard Hugging Face DINOv2 models support variable input sizes, provided the height and width are multiples of the patch size (14).
License: CC BY-NC-SA 4.0 (Inherited from the CT-RATE dataset terms).
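
As a rough illustration of the Image Size Compatibility entry above, the sequence length of the output follows directly from the patch size and the number of register tokens. The helper below is a hypothetical name introduced here, not part of the repository; it only does the arithmetic and enforces the multiple-of-14 constraint stated in this card.

```python
def dinov2_token_count(height, width, patch_size=14, num_registers=4):
    """Return (grid_h, grid_w, total_tokens) for an input of height x width pixels."""
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of the patch size")
    grid_h, grid_w = height // patch_size, width // patch_size
    # 1 class token + register tokens + one token per image patch
    return grid_h, grid_w, 1 + num_registers + grid_h * grid_w


print(dinov2_token_count(518, 518))  # (37, 37, 1374)
print(dinov2_token_count(504, 504))  # (36, 36, 1301)
```

At the native 518x518 resolution this gives a 37x37 patch grid, so the last hidden state has 1374 tokens of dimension 1024.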

Intended Uses

This model is intended for research purposes in the field of medical imaging and radiology.

Primary Intended Uses:
- Feature extraction for quantitative analysis of Chest CT scans.
- Foundational backbone for downstream models predicting organ anomalies, segmentation, or volume-level analysis via MIL.
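
For volume-level analysis via MIL, one common pattern is to pool per-slice embeddings with learned attention weights. The sketch below is only an illustration of that idea: `attention_mil_pool`, the hidden size of 128, and the random parameters are assumptions made here, and in practice the attention parameters would be trained on a downstream task.

```python
import numpy as np


def attention_mil_pool(slice_embeddings, w, v):
    """Pool (num_slices, dim) embeddings into a single (dim,) volume embedding.

    w has shape (dim, hidden) and v has shape (hidden,). Both are attention
    parameters that would normally be learned; here they are placeholders.
    """
    scores = np.tanh(slice_embeddings @ w) @ v        # one score per slice
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over slices
    return weights @ slice_embeddings, weights


rng = np.random.default_rng(0)
emb = rng.standard_normal((40, 1024))      # stand-in for 40 per-slice CLS tokens
w = rng.standard_normal((1024, 128)) * 0.01
v = rng.standard_normal(128)
volume_embedding, weights = attention_mil_pool(emb, w, v)
print(volume_embedding.shape, weights.shape)
```

The returned weights indicate how much each slice contributes to the volume embedding, which can help when inspecting which slices drive a prediction.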

Training Data

Dataset(s): The model was trained exclusively on the train split of the CT-RATE dataset.
Preprocessing: Hounsfield Units (HU) were clipped to [-997.0, 888.0]. These bounds correspond to the 0.5th and 99.5th percentiles of the foreground voxel intensities, calculated on a subset of the CT-RATE dataset. The clipped values were mapped to the [0, 1] range and then Z-score normalized using a dataset mean of -142.39 HU and a standard deviation of 360.97 HU, both rescaled into the same [0, 1] range.
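
The two-step normalization can be checked with a small worked example. Because the mean and standard deviation are rescaled with the same affine map as the intensities, the [0, 1] mapping cancels out algebraically, and the result equals a direct Z-score of the clipped HU value. The function names below are illustrative and not part of the repository.

```python
CLIP_MIN, CLIP_MAX = -997.0, 888.0
MEAN_HU, STD_HU = -142.39, 360.97


def normalize_two_step(hu):
    """Clip, map to [0, 1], then Z-score with the rescaled mean and std."""
    span = CLIP_MAX - CLIP_MIN
    clipped = min(max(hu, CLIP_MIN), CLIP_MAX)
    unit = (clipped - CLIP_MIN) / span
    norm_mean = (MEAN_HU - CLIP_MIN) / span
    norm_std = STD_HU / span
    return (unit - norm_mean) / norm_std


def normalize_direct(hu):
    """Algebraically equivalent shortcut: Z-score the clipped HU value directly."""
    clipped = min(max(hu, CLIP_MIN), CLIP_MAX)
    return (clipped - MEAN_HU) / STD_HU


# Values below, inside, and above the clipping window all agree.
for hu in (-1200.0, -997.0, 0.0, 40.0, 888.0, 2000.0):
    assert abs(normalize_two_step(hu) - normalize_direct(hu)) < 1e-9
```

Either form can be used at inference time, as long as the clipping bounds and the HU statistics match the values listed above.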

Training Procedure

Training System/Framework: Fully Sharded Data Parallel (FSDP) utilizing bf16 mixed precision.
Hardware & Scale: The model was fine-tuned for 20,000 iterations on 8 H200 GPUs with data-parallel training, using a batch size of 48 per GPU, a peak learning rate of 2.0e-04 (decaying to 1.0e-05), and a 3,300-step warmup.
Training Strategy: Global and local crops were sampled from within a 12mm physical slab rather than a single 2D plane to ensure anatomical awareness.
Self-Supervised Objectives:
- DINO Loss: Distillation from a momentum teacher network.
- iBOT Loss: Masked image modeling with a mask sample probability of 0.5 and a mask ratio between 10% and 50%.
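
The schedule and masking hyperparameters listed above can be sketched as follows. Several details are assumptions made for illustration, since this card does not specify them: the warmup is taken to be linear from zero, the decay is taken to be cosine, the mask ratio is drawn uniformly, and masked patches are chosen uniformly at random (DINOv2-style pipelines often use block-wise masks instead). The function names are hypothetical.

```python
import math
import random

TOTAL_ITERS, WARMUP_ITERS = 20_000, 3_300
PEAK_LR, FINAL_LR = 2.0e-4, 1.0e-5

# Per-GPU batch size x number of GPUs, assuming no gradient accumulation.
effective_batch = 48 * 8


def lr_at(step):
    """Learning rate at a given iteration (ASSUMED linear warmup + cosine decay)."""
    if step < WARMUP_ITERS:
        return PEAK_LR * step / WARMUP_ITERS
    progress = min((step - WARMUP_ITERS) / (TOTAL_ITERS - WARMUP_ITERS), 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


def sample_ibot_mask(num_patches, mask_prob=0.5, ratio_range=(0.1, 0.5), rng=random):
    """Return a list of booleans, True where a patch is masked.

    With probability mask_prob the image is masked at all; the masked fraction
    is then drawn from ratio_range (ASSUMED uniform, random patch positions).
    """
    mask = [False] * num_patches
    if rng.random() >= mask_prob:
        return mask  # this image is left unmasked
    ratio = rng.uniform(*ratio_range)
    num_masked = int(round(ratio * num_patches))
    for idx in rng.sample(range(num_patches), num_masked):
        mask[idx] = True
    return mask


# A 224x224 global crop with patch size 14 has 16 x 16 = 256 patches.
mask = sample_ibot_mask(256, rng=random.Random(0))
print(sum(mask), "of", len(mask), "patches masked")
```

Under these assumptions the learning rate peaks at iteration 3,300 and reaches 1.0e-05 at iteration 20,000.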

Data Augmentation Pipeline

A specialized, GPU-accelerated augmentation pipeline generated the multi-crop views required for DINOv2 training.

Spatial Cropping

Global Crops: 2 global crops generated per volume, sized at 224x224 pixels, with a random scale between 40% and 100% of the original image dimensions.
Local Crops: 14 local crops generated per volume (8 standard local crops + 6 guided local crops), sized at 144x144 pixels, with a random scale between 20% and 40%.
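
The crop settings above can be illustrated with a small sampler that draws the crop boxes for one training sample. This is a sketch under stated assumptions, not the pipeline used for training: "scale" is interpreted as a fraction of the shorter image side, crops are square, and guided local crops are sampled the same way as standard ones because the guidance mechanism is not described in this card. Each box would then be resized to the listed output size (224 or 144 pixels).

```python
import random

NUM_GLOBAL, GLOBAL_SIZE, GLOBAL_SCALE = 2, 224, (0.4, 1.0)
NUM_LOCAL, LOCAL_SIZE, LOCAL_SCALE = 14, 144, (0.2, 0.4)  # 8 standard + 6 guided


def sample_crop_box(height, width, scale_range, rng=random):
    """Return (top, left, side) of a square crop inside a height x width image."""
    # ASSUMPTION: scale is a fraction of the shorter image side.
    scale = rng.uniform(*scale_range)
    side = max(1, round(scale * min(height, width)))
    top = rng.randint(0, height - side)
    left = rng.randint(0, width - side)
    return top, left, side


def sample_view_boxes(height, width, rng=random):
    """Return [(output_size, box), ...]: 2 global views followed by 14 local views."""
    views = [(GLOBAL_SIZE, sample_crop_box(height, width, GLOBAL_SCALE, rng))
             for _ in range(NUM_GLOBAL)]
    views += [(LOCAL_SIZE, sample_crop_box(height, width, LOCAL_SCALE, rng))
              for _ in range(NUM_LOCAL)]
    return views


views = sample_view_boxes(512, 512, rng=random.Random(0))
print(len(views), "views; first global box:", views[0][1])
```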

How to Get Started with the Model

Because CT scans require strict Hounsfield Unit (HU) windowing and normalization to match the training distribution, you must apply the specific preprocessing logic below.

Unlike the timm-based LeJEPA models, this model is natively supported by the Hugging Face transformers library. The repository contains the necessary config.json and model.safetensors files.

import torch
import torch.nn.functional as F
import numpy as np
from transformers import AutoModel

class CTInferenceTransform:
    """ 
    Applies the exact HU windowing and Z-score normalization used during training. 
    """
    def __init__(self):
        self.clip_min = -997.0
        self.clip_max = 888.0
        self.mean_hu = -142.39
        self.std_hu = 360.97
        self.patch_size = 14
        
        # Calculate 0-1 scaled mean and std
        range_val = self.clip_max - self.clip_min
        self.norm_mean = (self.mean_hu - self.clip_min) / range_val
        self.norm_std = self.std_hu / range_val

    def __call__(self, image):
        # Expects one 2D slice (H, W) in Hounsfield Units, as a numpy array or torch tensor
        if isinstance(image, np.ndarray):
            image = torch.from_numpy(image)
        image = image.float()
        if image.ndim == 2:
            image = image.unsqueeze(0)  # Add channel dim: (1, H, W)

        # 1. Clamp HU values and map them to [0, 1]
        image = torch.clamp(image, self.clip_min, self.clip_max)
        image = (image - self.clip_min) / (self.clip_max - self.clip_min)

        # 2. Z-score standardization with the rescaled mean and std
        image = (image - self.norm_mean) / self.norm_std

        # 3. Resize down to the nearest multiple of the patch size (14),
        #    e.g. 512x512 becomes 504x504. No padding is applied.
        C, H, W = image.shape
        target_h = (H // self.patch_size) * self.patch_size
        target_w = (W // self.patch_size) * self.patch_size

        if target_h != H or target_w != W:
            # Nearest-neighbor resampling keeps original intensities instead of blending neighbors
            image = F.interpolate(image.unsqueeze(0), size=(target_h, target_w), mode='nearest')
            image = image.squeeze(0)

        # Returns (1, 1, target_h, target_w). For batched inference, concatenate along dim=0.
        return image.unsqueeze(0)

def load_finetuned_dinov2_ct(repo_id="IBI-CAAI/Finetuned-DINOv2-Chest-CT"):
    """
    Downloads and initializes the ViT-Large backbone using Hugging Face transformers.
    """
    # The config.json in the HF repo handles architecture setup (1 in_chan, 518 native size)
    model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
    model.eval()
    
    return model

if __name__ == "__main__":
    # Initialize the transform and the model
    transform = CTInferenceTransform()
    model = load_finetuned_dinov2_ct()
    
    # Simulate a raw CT slice (Replace this with an actual NIfTI/DICOM load in Hounsfield Units)
    raw_ct_slice = np.random.uniform(-1000, 1000, size=(512, 512)) 
    
    # Process the image to ensure correct normalization
    input_tensor = transform(raw_ct_slice)
    
    # Extract embeddings
    with torch.no_grad():
        outputs = model(pixel_values=input_tensor)
        
        # DINOv2 returns last_hidden_state, pooler_output, etc.
        # last_hidden_state includes the [CLS] token, register tokens, and spatial patch tokens
        hidden_states = outputs.last_hidden_state
        
        # [CLS] token is the first token
        cls_token = hidden_states[:, 0, :]
        
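        # Register tokens follow the [CLS] token (4 in this model's config)
        num_registers = model.config.num_register_tokens
        register_tokens = hidden_states[:, 1:1 + num_registers, :]
        
        # Dense patch tokens (for fine-grained tasks like segmentation)
        patch_tokens = hidden_states[:, 1 + num_registers:, :]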
        
    print(f"Input tensor shape: {input_tensor.shape}")  
    print(f"Full hidden states shape: {hidden_states.shape}")
    print(f"CLS token shape: {cls_token.shape}")
    print(f"Register tokens shape: {register_tokens.shape}")
    print(f"Dense patch tokens shape: {patch_tokens.shape}")
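
For dense tasks such as segmentation, the flat sequence of patch tokens usually needs to be rearranged into a spatial feature map. The sketch below shows one way to do that with NumPy on a dummy array; `patch_tokens_to_grid` is a hypothetical helper, and it assumes the tokens are ordered row by row over the patch grid. A torch tensor from the example above can be converted with `.cpu().numpy()` first, or rearranged with the equivalent `reshape` and `permute` calls.

```python
import numpy as np


def patch_tokens_to_grid(patch_tokens, height, width, patch_size=14):
    """Reshape (B, N, D) patch tokens into a (B, D, H/patch, W/patch) feature map."""
    batch, num_tokens, dim = patch_tokens.shape
    grid_h, grid_w = height // patch_size, width // patch_size
    if num_tokens != grid_h * grid_w:
        raise ValueError(f"expected {grid_h * grid_w} patch tokens, got {num_tokens}")
    # ASSUMPTION: tokens are stored in row-major order over the patch grid.
    return patch_tokens.reshape(batch, grid_h, grid_w, dim).transpose(0, 3, 1, 2)


# Dummy stand-in for the patch tokens of one 504x504 input (36 x 36 patches).
dummy_tokens = np.arange(1 * 1296 * 8, dtype=np.float32).reshape(1, 1296, 8)
feature_map = patch_tokens_to_grid(dummy_tokens, 504, 504)
print(feature_map.shape)  # (1, 8, 36, 36)
```

The height and width passed in should be the resized input dimensions (for example 504 for a 512-pixel slice), not the original slice dimensions.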