# Workflow Overview

## What CASTLE Does
CASTLE (Clustering Animal behavior with Scalable Training-free Latent Embeddings) takes raw video of animal behavior and produces:
- Segmented and tracked regions of interest (ROIs) — precise masks of animals or body parts across all frames
- Visual feature representations — high-dimensional latent vectors capturing posture and movement
- Behavioral clusters — unsupervised discovery of behavioral syllables
- Visualizations — UMAP embeddings, ethograms, and cluster summaries for exploration and publication
All of this is achieved without any training data — CASTLE leverages pretrained foundation models to work out of the box on any species or experimental setup.
## The Pipeline

### 1. Segmentation (SAM)
The user marks regions of interest on a reference frame — clicking on the animal's body, head, or other features. The Segment Anything Model (SAM) generates precise segmentation masks from these clicks.
- Point-and-click interface: click to add, click to remove
- Multiple ROIs per frame (e.g., body centroid + head + tail)
- Labels are saved as .npz files containing frame and mask data
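As an illustration of that storage format, a label file can be written and read back with NumPy. The array keys used here (`frame`, `mask`) are assumptions for the sketch, not necessarily the exact keys CASTLE writes:

```python
import numpy as np

# Hypothetical layout of a saved label file; the key names are
# illustrative assumptions, not taken from CASTLE's source.
frame_idx = 0
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 150:300] = True  # region selected via point clicks

np.savez("roi_label.npz", frame=frame_idx, mask=mask)

# Reload and inspect: frame index and mask area in pixels
data = np.load("roi_label.npz")
print(int(data["frame"]), int(data["mask"].sum()))
```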
### 2. Tracking (DeAOT)
The initial masks are propagated across all video frames using DeAOT (Decoupling features in Associating Objects with Transformers).
- Two model options: R50 (faster) and SwinB (more accurate)
- Handles occlusion, deformation, and appearance changes
- Real-time progress monitoring with cancel capability
- Iterative refinement: add labels on failure frames and re-track
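The refine-and-re-track loop benefits from a quick way to spot where tracking broke. A minimal heuristic, assuming tracked masks are available as boolean arrays (this helper is illustrative, not part of DeAOT or CASTLE): flag frames where the mask changes abruptly relative to the previous frame.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def flag_failure_frames(masks, threshold=0.5):
    """Indices where the mask jumps between consecutive frames --
    candidates for adding new labels and re-tracking."""
    return [i for i in range(1, len(masks))
            if mask_iou(masks[i - 1], masks[i]) < threshold]

# Example: the tracked mask jumps to a distant location at frame 2
m = np.zeros((4, 8, 8), dtype=bool)
m[:2, 2:6, 2:6] = True   # frames 0-1: stable mask
m[2:, 0:2, 0:2] = True   # frames 2-3: mask elsewhere
print(flag_failure_frames(m))  # [2]
```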
### 3. Video Alignment
Before feature extraction, tracked ROIs are preprocessed:
- Center ROI: crop the video around a reference ROI (e.g., body centroid)
- Rotate: normalize orientation using a secondary ROI (e.g., tail direction)
- Remove background: mask out non-ROI pixels
This normalization ensures that features reflect posture and movement, not position or orientation in the frame.
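The centering and rotation steps can be sketched with NumPy. The helper names below are illustrative assumptions, not CASTLE's API: the body-centroid ROI anchors the crop, and the body-to-tail axis gives the angle by which each frame would be rotated into a common orientation.

```python
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    """(row, col) center of mass of a boolean ROI mask."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def center_crop(frame: np.ndarray, center, size: int) -> np.ndarray:
    """Crop a size x size window around `center`, clamped to the frame."""
    r, c = int(round(center[0])), int(round(center[1]))
    half = size // 2
    r0 = max(0, min(r - half, frame.shape[0] - size))
    c0 = max(0, min(c - half, frame.shape[1] - size))
    return frame[r0:r0 + size, c0:c0 + size]

def orientation_angle(body_c, tail_c) -> float:
    """Angle (degrees) of the body->tail axis; rotating every frame
    by -angle normalizes orientation."""
    dr, dc = tail_c - body_c
    return float(np.degrees(np.arctan2(dr, dc)))

frame = np.arange(100 * 100).reshape(100, 100)
body = np.zeros((100, 100), dtype=bool); body[40:60, 40:60] = True
tail = np.zeros((100, 100), dtype=bool); tail[48:52, 70:80] = True

crop = center_crop(frame, centroid(body), 32)
angle = orientation_angle(centroid(body), centroid(tail))
print(crop.shape, round(angle, 1))  # (32, 32) 0.0
```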
### 4. Feature Extraction (DINOv2 / DINOv3)
Visual foundation models extract latent features from each aligned frame:
- DINOv2 ViT-B/14 — Meta's self-supervised vision transformer (default)
- DINOv3 ViT-B/16 and DINOv3 ViT-L/16 — newer models with improved representations
- Each frame produces a high-dimensional feature vector
- ROI masking ensures only the animal contributes to the representation
- Batch processing with configurable batch size
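The masking and batching logic can be illustrated in NumPy. This is a sketch of the preprocessing idea, not CASTLE's implementation: non-ROI pixels are zeroed so only the animal contributes to the embedding, and frames are grouped into fixed-size batches that would then be passed through a DINOv2/DINOv3 backbone to get one feature vector per frame.

```python
import numpy as np

def mask_and_batch(frames, masks, batch_size=4):
    """Zero out non-ROI pixels, then yield fixed-size batches of frames
    ready to be fed to the feature model."""
    masked = frames * masks[..., None]  # broadcast mask over RGB channels
    for i in range(0, len(masked), batch_size):
        yield masked[i:i + batch_size]

frames = np.random.rand(10, 224, 224, 3).astype(np.float32)
masks = np.ones((10, 224, 224), dtype=np.float32)
masks[:, :, 112:] = 0.0  # pretend the right half is background

batches = list(mask_and_batch(frames, masks, batch_size=4))
print([b.shape[0] for b in batches])  # [4, 4, 2]
```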
### 5. Behavior Analysis (UMAP + DBSCAN)
The high-dimensional features are reduced and clustered to discover behavioral patterns:
- UMAP (Uniform Manifold Approximation and Projection) reduces dimensions for visualization
- DBSCAN clusters the embedding into behavioral syllables
- Hierarchical exploration: three magnification levels (low → intermediate → high) for progressively finer behavioral categories
- Interactive click-to-explore on the UMAP plot
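The clustering stage can be sketched with scikit-learn. The synthetic 2D points below stand in for a UMAP embedding of the frame features (in the real pipeline, umap-learn's `UMAP(n_components=2).fit_transform(features)` would produce this array); DBSCAN then assigns each frame to a behavioral syllable.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2D points standing in for a UMAP embedding of frame features
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(50, 2)),  # syllable A
    rng.normal(loc=(5, 5), scale=0.1, size=(50, 2)),  # syllable B
])

# Smaller eps splits the embedding into finer categories, loosely
# analogous to moving to a higher magnification level.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)
print(sorted(set(labels)))  # two clusters: [0, 1]
```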
## GUI vs Programmatic
CASTLE offers two interfaces:
### Gradio GUI (Recommended)
Interactive web interface at http://localhost:7860 with five tabs that mirror the pipeline steps. Best for most users, and the interface covered in this tutorial series.
### Jupyter Notebooks
The notebooks/ directory contains step-by-step notebooks for programmatic use:
| Notebook | Description |
|---|---|
| step1_image_segment.ipynb | Interactive segmentation |
| step2_video_segment.ipynb | Video tracking |
| step3_video_align.ipynb | Video alignment |
| step4_latent_extraction.ipynb | Feature extraction |
| step5_latent_explore.ipynb | UMAP + clustering |
Best for custom workflows, batch processing, or integration with other analysis pipelines.
## Next Steps
Follow the tutorials in order: