Step 3: Extract Latent Features¶
The 3. Extract Latent tab uses visual foundation models to extract feature representations from tracked ROIs. These features encode the animal's posture and movement in each frame as a high-dimensional vector.
Overview¶
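Feature extraction transforms your tracked video into numerical data suitable for clustering. Each frame of the selected ROI is encoded as one latent vector, and the vectors for a video are saved together in a .npz file.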
Configuration¶
When you switch to the Extract Latent tab, the interface shows three columns:
Model & Target Settings (Left Column)¶
| Parameter | Description | Default |
|---|---|---|
| Select Visual Model | Feature extraction backbone | dinov2_vitb14_reg4_pretrain |
| Enter ROI ID | Which tracked ROI to extract features from | 1 |
| Batch size | Frames processed per batch (increase if VRAM allows) | 32 |
| Select Target Video | Specific video or "All" | All |
| Skip existing files | Don't re-extract if output already exists | ✅ Enabled |
Available models:
- dinov2_vitb14_reg4_pretrain: Meta's DINOv2 ViT-B/14 with registers (default, well-tested)
- dinov3_vitb16: DINOv3 ViT-B/16 (newer, potentially better representations)
- dinov3_vitl16: DINOv3 ViT-L/16 (larger model, higher quality but slower)
Preprocessing Settings (Middle Column)¶
These settings control how frames are preprocessed before feature extraction:
| Parameter | Description | Default |
|---|---|---|
| Center ROI | Crop frames centered on a reference ROI | False |
| Center ROI ID | Which ROI to center on | 1 |
| Width / Height | Crop dimensions in pixels | 300 × 300 |
| Rotate based on Tail | Normalize orientation using a tail ROI | False |
| Tail ROI ID | Which ROI defines the tail direction | 2 |
| Remove Background | Mask out pixels outside the ROI | False |
Click Apply First
After changing preprocessing settings, you must click the Apply button before extracting. The preview image shows the result of your preprocessing configuration on the first frame.
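
To make the centering and rotation settings more concrete, here is a minimal sketch of the geometry they imply. It is not CASTLE's implementation: the helper names `centroid`, `crop_box`, and `heading_angle` are hypothetical, ROIs are assumed to be available as lists of (x, y) pixel coordinates, and the angle convention (degrees from the +x axis, tail pointing toward the body) is an assumption.

```python
import math


def centroid(points):
    """Mean (x, y) of a list of ROI pixel coordinates."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def crop_box(center, width=300, height=300):
    """Return (left, top, right, bottom) of a width x height window around center.

    The box may extend past the frame border; a real pipeline would pad
    or clamp it (not shown here).
    """
    cx, cy = center
    left = int(round(cx - width / 2))
    top = int(round(cy - height / 2))
    return left, top, left + width, top + height


def heading_angle(center, tail):
    """Angle in degrees of the tail-to-center vector, measured from the +x axis."""
    dx = center[0] - tail[0]
    dy = center[1] - tail[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical ROI pixels: ROI 1 is the body, ROI 2 is the tail.
body = [(400, 300), (410, 300), (400, 310), (410, 310)]
tail = [(305, 305)]

body_center = centroid(body)
box = crop_box(body_center)  # defaults match the 300 x 300 crop size
angle = heading_angle(body_center, centroid(tail))
print(body_center, box, angle)
```

Rotating each cropped frame by the negative of this angle would bring every frame to a common heading, which is the idea behind normalizing orientation with a tail ROI.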

Preprocessing Recommendations¶
| Scenario | Center ROI | Rotate | Remove BG |
|---|---|---|---|
| General behavior | ✅ On | ❌ Off | ❌ Off |
| Posture analysis | ✅ On | ✅ On | ✅ On |
| Locomotion patterns | ✅ On | ✅ On | ❌ Off |
| No preprocessing | ❌ Off | ❌ Off | ❌ Off |
Extraction Types¶
CASTLE offers three extraction modes, each triggered by a different button:
Extract (Standard Latent Extraction)¶
The primary extraction mode. Runs the selected visual model on preprocessed frames and saves latent vectors.
- Output: `.npz` file in `project/latent/model-name/`
- Filename pattern: `{video}_ROI_{id}_{model}_{tags}.npz`
- Tags: `ctr` (centered), `rmbg` (background removed)
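
The filename pattern can be illustrated with a small helper. This is a sketch, not CASTLE code: `latent_filename` is a hypothetical name, and it assumes that tags are joined with underscores in the order `ctr`, `rmbg` and that the tag segment is dropped when no tag applies (which matches the untagged example filename in Output Format).

```python
def latent_filename(video, roi_id, model, centered=False, remove_bg=False):
    """Build a name following {video}_ROI_{id}_{model}_{tags}.npz."""
    # Tag names come from the documentation: ctr (centered), rmbg (background removed).
    tags = []
    if centered:
        tags.append("ctr")
    if remove_bg:
        tags.append("rmbg")
    # Assumption: tags are underscore-joined and omitted entirely when empty.
    parts = [video, "ROI", str(roi_id), model] + tags
    return "_".join(parts) + ".npz"


print(latent_filename("video", 1, "dinov2_vitb14_reg4_pretrain"))
print(latent_filename("mouse01", 1, "dinov3_vitb16", centered=True, remove_bg=True))
```

Check a real output file in your project before relying on the tag order or separator.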
Extract Crop Video¶
Exports the preprocessed (centered, rotated, cropped) video as an MP4 file. Useful for:
- Visual verification of preprocessing
- Sharing aligned videos with collaborators
- Input to external analysis tools

- Output: `.mp4` file in `project/crop/video-name/`
Extract Rotation Latent¶
Extracts features specifically capturing rotational information. Used when orientation is a key behavioral variable.
- Output: `.npz` file with rotation-specific features
Running Extraction¶
- Configure model, ROI, and preprocessing settings
- Click Apply to confirm preprocessing
- Click Extract (or the appropriate extraction button)
- Monitor progress in the log output area
The log shows:
- Pre-flight check (which videos need processing)
- Per-video progress
- Final summary with success/failure counts
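
The pre-flight check combined with Skip existing files can be pictured roughly as follows. This is a sketch under stated assumptions, not the actual CASTLE logic: `preflight` is a hypothetical function, and it treats any file matching `{video}_ROI_{id}_{model}*.npz` in the output folder as an existing result, regardless of tags.

```python
import tempfile
from pathlib import Path


def preflight(video_names, latent_dir, roi_id, model, skip_existing=True):
    """Split videos into (to_process, skipped) based on existing .npz files."""
    latent_dir = Path(latent_dir)
    to_process, skipped = [], []
    for name in video_names:
        # Assumption: any tag suffix counts as "already extracted".
        pattern = f"{name}_ROI_{roi_id}_{model}*.npz"
        done = latent_dir.is_dir() and any(latent_dir.glob(pattern))
        if skip_existing and done:
            skipped.append(name)
        else:
            to_process.append(name)
    return to_process, skipped


# Demo in a throwaway directory: video "a" already has an output, "b" does not.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "latent" / "dinov3_vitb16"
    out_dir.mkdir(parents=True)
    (out_dir / "a_ROI_1_dinov3_vitb16_ctr.npz").touch()
    todo, skipped = preflight(["a", "b"], out_dir, 1, "dinov3_vitb16")
    print("to process:", todo, "| skipped:", skipped)
```

With `skip_existing=False`, every video lands in the first list and is re-extracted.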

Output Format¶
The standard latent extraction produces .npz files containing:
```python
import numpy as np

# Load the latent vectors saved by the standard extraction
data = np.load('video_ROI_1_dinov2_vitb14_reg4_pretrain.npz')
latent_vectors = data['latent']  # Shape: (n_frames, feature_dim)
```
- Feature dimension: depends on the model (768 for ViT-B, 1024 for ViT-L)
- NaN values: frames where the ROI mask was empty produce NaN vectors
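
Because frames with an empty ROI mask yield NaN vectors, downstream steps such as clustering usually need those rows filtered out first. The following is one possible way to do that with NumPy. `split_valid_frames` is a hypothetical helper, and the array here is synthetic stand-in data rather than a real extraction result.

```python
import numpy as np


def split_valid_frames(latent):
    """Return (valid_vectors, valid_idx), dropping rows that contain NaN."""
    latent = np.asarray(latent, dtype=float)
    valid_mask = ~np.isnan(latent).any(axis=1)
    valid_idx = np.flatnonzero(valid_mask)
    return latent[valid_mask], valid_idx


# Synthetic example: 5 frames of 768-dim features, frame 2 had an empty ROI mask.
latent = np.ones((5, 768))
latent[2] = np.nan

valid, idx = split_valid_frames(latent)
print(valid.shape, idx)  # (4, 768) [0 1 3 4]
```

Keeping `valid_idx` makes it possible to map cluster labels back to the original frame numbers.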
Processing Time¶
Processing time depends on video length, GPU, and model size.
[HUMAN TO CONFIRM: Approximate benchmarks for different configurations]
Speed Tips
- Increase batch size if you have spare VRAM (e.g., 64 or 128)
- Enable Skip existing files when re-running after adding new videos, so only the new videos are processed
- dinov2_vitb14_reg4_pretrain is fastest; dinov3_vitl16 is slowest but potentially highest quality
Next Step¶
Once features are extracted for all videos, proceed to Step 4: Behavior Analysis.