# Workflow Overview

## What CASTLE Does
CASTLE (Clustering Animal behavior with Scalable Training-free Latent Embeddings) takes raw video of animal behavior and produces:
- Segmented and tracked regions of interest (ROIs) — precise masks of animals or body parts across all frames
- Visual feature representations — high-dimensional latent vectors capturing posture and movement
- Behavioral clusters — unsupervised discovery of behavioral syllables
- Visualizations — UMAP embeddings, ethograms, and cluster summaries for exploration and publication
All of this is achieved without any training data — CASTLE leverages pretrained foundation models to work out of the box on any species or experimental setup.
## The Pipeline

### 1. Segmentation (SAM)
The user marks regions of interest on a reference frame — clicking on the animal's body, head, or other features. The Segment Anything Model (SAM) generates precise segmentation masks from these clicks.
- Point-and-click interface: click to add, click to remove
- Multiple ROIs per frame (e.g., body centroid + head + tail)
- Labels are saved as .npz files containing frame and mask data
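As an illustration of that storage format, a label file can be written and read back with NumPy. The array keys used here (`frame`, `mask`) are assumptions for the sketch, not necessarily the exact keys CASTLE writes:

```python
import numpy as np

# Hypothetical layout of a saved label file; the key names are
# illustrative assumptions, not taken from CASTLE's source.
frame_idx = 0
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 150:300] = True  # region selected via point clicks

np.savez("roi_label.npz", frame=frame_idx, mask=mask)

# Reload and inspect: frame index and mask area in pixels
data = np.load("roi_label.npz")
print(int(data["frame"]), int(data["mask"].sum()))
```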
### 2. Tracking (DeAOT)
The initial masks are propagated across all video frames using DeAOT (Decoupling features in Associating Objects with Transformers).
- Two model options: R50 (faster) and SwinB (more accurate)
- Handles occlusion, deformation, and appearance changes
- Real-time progress monitoring with cancel capability
- Iterative refinement: add labels on failure frames and re-track
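The refine-and-re-track loop benefits from a quick way to spot where tracking broke. A minimal heuristic, assuming tracked masks are available as boolean arrays (this helper is illustrative, not part of DeAOT or CASTLE): flag frames where the mask changes abruptly relative to the previous frame.

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def flag_failure_frames(masks, threshold=0.5):
    """Indices where the mask jumps between consecutive frames --
    candidates for adding new labels and re-tracking."""
    return [i for i in range(1, len(masks))
            if mask_iou(masks[i - 1], masks[i]) < threshold]

# Example: the tracked mask jumps to a distant location at frame 2
m = np.zeros((4, 8, 8), dtype=bool)
m[:2, 2:6, 2:6] = True   # frames 0-1: stable mask
m[2:, 0:2, 0:2] = True   # frames 2-3: mask elsewhere
print(flag_failure_frames(m))  # [2]
```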
### 3. Video Alignment
Before feature extraction, tracked ROIs are preprocessed:
- Center ROI: crop the video around a reference ROI (e.g., body centroid)
- Rotate: normalize orientation using a secondary ROI (e.g., tail direction)
- Remove background: mask out non-ROI pixels
This normalization ensures that features reflect posture and movement, not position or orientation in the frame.
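The centering and rotation steps can be sketched with NumPy. The helper names below are illustrative assumptions, not CASTLE's API: the body-centroid ROI anchors the crop, and the body-to-tail axis gives the angle by which each frame would be rotated into a common orientation.

```python
import numpy as np

def centroid(mask: np.ndarray) -> np.ndarray:
    """(row, col) center of mass of a boolean ROI mask."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def center_crop(frame: np.ndarray, center, size: int) -> np.ndarray:
    """Crop a size x size window around `center`, clamped to the frame."""
    r, c = int(round(center[0])), int(round(center[1]))
    half = size // 2
    r0 = max(0, min(r - half, frame.shape[0] - size))
    c0 = max(0, min(c - half, frame.shape[1] - size))
    return frame[r0:r0 + size, c0:c0 + size]

def orientation_angle(body_c, tail_c) -> float:
    """Angle (degrees) of the body->tail axis; rotating every frame
    by -angle normalizes orientation."""
    dr, dc = tail_c - body_c
    return float(np.degrees(np.arctan2(dr, dc)))

frame = np.arange(100 * 100).reshape(100, 100)
body = np.zeros((100, 100), dtype=bool); body[40:60, 40:60] = True
tail = np.zeros((100, 100), dtype=bool); tail[48:52, 70:80] = True

crop = center_crop(frame, centroid(body), 32)
angle = orientation_angle(centroid(body), centroid(tail))
print(crop.shape, round(angle, 1))  # (32, 32) 0.0
```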
### 4. Feature Extraction (DINOv2 / DINOv3)
Visual foundation models extract latent features from each aligned frame:
- DINOv2 ViT-B/14 — Meta's self-supervised vision transformer (default)
- DINOv3 ViT-B/16 and DINOv3 ViT-L/16 — newer models with improved representations
- Each frame produces a high-dimensional feature vector
- ROI masking ensures only the animal contributes to the representation
- Batch processing with configurable batch size
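The masking and batching logic can be illustrated in NumPy. This is a sketch of the preprocessing idea, not CASTLE's implementation: non-ROI pixels are zeroed so only the animal contributes to the embedding, and frames are grouped into fixed-size batches that would then be passed through a DINOv2/DINOv3 backbone to get one feature vector per frame.

```python
import numpy as np

def mask_and_batch(frames, masks, batch_size=4):
    """Zero out non-ROI pixels, then yield fixed-size batches of frames
    ready to be fed to the feature model."""
    masked = frames * masks[..., None]  # broadcast mask over RGB channels
    for i in range(0, len(masked), batch_size):
        yield masked[i:i + batch_size]

frames = np.random.rand(10, 224, 224, 3).astype(np.float32)
masks = np.ones((10, 224, 224), dtype=np.float32)
masks[:, :, 112:] = 0.0  # pretend the right half is background

batches = list(mask_and_batch(frames, masks, batch_size=4))
print([b.shape[0] for b in batches])  # [4, 4, 2]
```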
### 5. Behavior Analysis (UMAP + DBSCAN)
The high-dimensional features are reduced and clustered to discover behavioral patterns:
- UMAP (Uniform Manifold Approximation and Projection) reduces dimensions for visualization
- DBSCAN clusters the embedding into behavioral syllables
- Hierarchical exploration: three magnification levels (low → intermediate → high) for progressively finer behavioral categories
- Interactive click-to-explore on the UMAP plot
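The clustering stage can be sketched with scikit-learn. The synthetic 2D points below stand in for a UMAP embedding of the frame features (in the real pipeline, umap-learn's `UMAP(n_components=2).fit_transform(features)` would produce this array); DBSCAN then assigns each frame to a behavioral syllable.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic 2D points standing in for a UMAP embedding of frame features
rng = np.random.default_rng(0)
embedding = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(50, 2)),  # syllable A
    rng.normal(loc=(5, 5), scale=0.1, size=(50, 2)),  # syllable B
])

# Smaller eps splits the embedding into finer categories, loosely
# analogous to moving to a higher magnification level.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)
print(sorted(set(labels)))  # two clusters: [0, 1]
```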
## GUI vs Programmatic
CASTLE offers two interfaces:
### Gradio GUI (Recommended)
Interactive web interface at http://localhost:7860 with five tabs that mirror the pipeline steps. Best for most users, and the interface covered in this tutorial series.
### Jupyter Notebooks
The notebooks/ directory contains step-by-step notebooks for programmatic use:
| Notebook | Description |
|---|---|
| step1_image_segment.ipynb | Interactive segmentation |
| step2_video_segment.ipynb | Video tracking |
| step3_video_align.ipynb | Video alignment |
| step4_latent_extraction.ipynb | Feature extraction |
| step5_latent_explore.ipynb | UMAP + clustering |
Best for custom workflows, batch processing, or integration with other analysis pipelines.
## Next Steps
Follow the tutorials in order: