Step 2: Track Regions of Interest
The 2. Tracking ROIs tab is the most complex part of CASTLE. It contains two major sections: Single Video Tracking (for building and refining your ROI prompts) and Batch Videos Tracking (for scaling to multiple videos).
Overview
The tracking workflow has two phases:
Phase 1: Single Video Tracking (Build Your ROI Prompts)
- Label ROIs on reference frames using SAM
- Track across all frames using DeAOT
- Review results and fix errors iteratively
Phase 2: Batch Video Tracking (Scale Up)
Once prompts work well on one video, apply them to all videos in the project.
Getting Started
- Switch to the 2. Tracking ROIs tab
- Select a video from the Select Video dropdown
- Click Edit to load the video

Single Video Tracking
Label ROI
The Label ROI sub-tab is where you define what to track using point-and-click segmentation.
- Navigate to the frame you want to label using the frame slider
- Click on the animal (or body part) to create a segmentation mask
- SAM generates a mask from your click
- Click mode defaults to Add — each click refines the mask
- Use Change Mode to switch to Remove mode for excluding regions
- To track multiple ROIs (e.g., body + head), click Label Next ROI to start a new ROI
- Click Save ROIs when satisfied
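The Add and Remove click modes correspond to SAM-style point prompts, where each click is a coordinate plus a label (1 for foreground, 0 for background). The sketch below only illustrates that bookkeeping. The `RoiClicks` class and its method names are hypothetical and are not taken from CASTLE's code.

```python
from dataclasses import dataclass, field

ADD, REMOVE = 1, 0  # SAM-style point labels: 1 = foreground, 0 = background


@dataclass
class RoiClicks:
    """Hypothetical container for the clicks that define one ROI on one frame."""
    roi_id: int
    points: list = field(default_factory=list)  # [(x, y), ...]
    labels: list = field(default_factory=list)  # [ADD or REMOVE, ...]

    def click(self, x, y, mode=ADD):
        """Record one click; mode mirrors the Add/Remove toggle."""
        if mode not in (ADD, REMOVE):
            raise ValueError("mode must be ADD (1) or REMOVE (0)")
        self.points.append((x, y))
        self.labels.append(mode)


# Usage: two ROIs on the same frame (for example body, then head).
body = RoiClicks(roi_id=1)
body.click(120, 85)               # Add: click on the animal
body.click(140, 60, mode=REMOVE)  # Remove: exclude a background region
head = RoiClicks(roi_id=2)        # plays the role of "Label Next ROI"
head.click(150, 90)
```

Keeping one point list per ROI makes it clear which clicks refine which mask when several ROIs are labeled on the same frame.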

What is an ROI?
An ROI (Region of Interest) is a segmented area you want to track. Common examples:
- Body centroid: the entire animal body
- Head: for head direction analysis
- Tail: used as orientation reference for video alignment
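As an illustration of how a segmented ROI turns into a single tracked point, the sketch below computes the centroid of a binary mask. It assumes the mask is available as a 2D grid of 0/1 values; the function name `mask_centroid` is hypothetical and this is not necessarily how CASTLE computes it.

```python
def mask_centroid(mask):
    """Return the (row, col) centroid of a binary mask, or None if it is empty.

    `mask` is a 2D sequence of 0/1 (or False/True) values.
    """
    total = row_sum = col_sum = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            if value:
                total += 1
                row_sum += r
                col_sum += c
    if total == 0:
        return None
    return (row_sum / total, col_sum / total)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
print(mask_centroid(mask))  # (1.5, 1.5)
```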
Label on Multiple Frames
For robust tracking, label ROIs on multiple frames — especially frames where the animal's appearance changes significantly (different postures, partial occlusion, etc.). The more diverse your labels, the more robust the tracking.
ROI Prompts
The ROI Prompts sub-tab displays a gallery of all saved labels across the project. This is where you can review the diversity of your prompt set.
- Labels are loaded from .npz files in the project's label/ directory
- Each entry shows the frame with the segmentation mask overlaid
- Labels from all videos in the project are shown together
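To check what the saved labels contain outside the GUI, you can inspect the files directly. The sketch below relies only on the fact that an .npz file is a zip archive with one .npy member per saved array. It makes no assumption about which array names CASTLE writes, and the function name and the example path are hypothetical.

```python
import zipfile
from pathlib import Path


def list_label_arrays(label_dir):
    """Map each .npz file name in label_dir to the array names stored inside it."""
    contents = {}
    for path in sorted(Path(label_dir).glob("*.npz")):
        # An .npz file is a zip archive with one .npy member per saved array.
        with zipfile.ZipFile(path) as archive:
            contents[path.name] = sorted(
                name[:-4] for name in archive.namelist() if name.endswith(".npy")
            )
    return contents


# Usage (hypothetical project path):
# for file_name, arrays in list_label_arrays("my_project/label").items():
#     print(file_name, arrays)
```

Once you know the array names, `numpy.load` can open the same files to read the actual masks.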
Prompt Guidelines
- Diversity = Stability: varied examples improve tracking robustness
- Fewer = Faster: fewer prompts mean faster execution
- Sweet spot: 5–30 prompts for optimal balance
Tracking
The Tracking sub-tab runs DeAOT to propagate masks across frames.
- Select a tracking model:
- R50 (ResNet-50): faster, good for most cases
- SwinB (Swin Transformer-B): more accurate, better for challenging videos
- Set Start Frame and Stop Frame (defaults to the full video)
- Optionally check Skip existing to avoid re-processing
- Click to start tracking
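The options above can be summarized as a small settings record. This is a sketch for reasoning about valid combinations, not CASTLE's internal configuration: the `TrackingSettings` class, its field names, the 0-based frame indexing, and the inclusive stop frame are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

MODELS = ("R50", "SwinB")  # the two model choices listed above


@dataclass
class TrackingSettings:
    """Hypothetical mirror of the Tracking sub-tab options."""
    total_frames: int
    model: str = "R50"
    start_frame: int = 0               # assumed 0-based
    stop_frame: Optional[int] = None   # None means "through the last frame"
    skip_existing: bool = False

    def __post_init__(self):
        if self.model not in MODELS:
            raise ValueError(f"model must be one of {MODELS}")
        if self.stop_frame is None:
            self.stop_frame = self.total_frames - 1
        if not 0 <= self.start_frame <= self.stop_frame < self.total_frames:
            raise ValueError("need 0 <= start_frame <= stop_frame < total_frames")


# Defaults cover the full video with the faster model.
settings = TrackingSettings(total_frames=1000)
print(settings.start_frame, settings.stop_frame, settings.model)  # 0 999 R50
```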
Monitor in Real-Time
Watch the tracking progress carefully. If you see errors (mask drifting, losing the animal), cancel immediately — don't waste time on bad tracking. Instead:
- Go back to Label ROI
- Label the frame where tracking failed
- Adjust the Start Frame in Tracking settings
- Re-run tracking from that point
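If you have per-frame mask sizes available, a simple heuristic can suggest where tracking started to fail. The sketch below flags the first frame whose mask area is far from the median area. It is an illustrative heuristic and not part of CASTLE: the function name, the `areas` input (mask pixel counts per frame), and the ratio thresholds are all assumptions, and visual inspection remains the reference.

```python
from statistics import median


def first_suspect_frame(areas, min_ratio=0.5, max_ratio=2.0):
    """Return the index of the first frame whose mask area is far from the median.

    Returns None if no frame is flagged or if `areas` is empty.
    """
    valid = [a for a in areas if a > 0]
    if not valid:
        # Every mask is empty: the very first frame is already suspect.
        return 0 if areas else None
    typical = median(valid)
    for i, area in enumerate(areas):
        if area < min_ratio * typical or area > max_ratio * typical:
            return i
    return None


areas = [400, 410, 395, 405, 120, 0, 0]
frame = first_suspect_frame(areas)
print(frame)  # 4
```

The returned index is a candidate for the frame to label next and for the new Start Frame value.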
Display Timing
During real-time monitoring, the displayed frame and mask may appear off by one frame. This is normal due to processing timing. Verify final results in the View tab after tracking completes.
View
The View sub-tab lets you browse tracking results frame by frame. Use the slider to scrub through frames and verify that masks are accurate.
Analysis (Post-Processing)
The Analysis sub-tab provides tools for reviewing and analyzing tracking quality after tracking is complete.
Batch Videos Tracking
Once your ROI prompts successfully track one video, use Batch Videos Tracking to process all videos in the project.
- Switch to the Batch Videos Tracking sub-tab
- The system uses all saved ROI prompts from the project
- Each video is tracked independently using the shared prompt set
- Progress is displayed for each video
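Conceptually, batch tracking is a loop that applies the same prompt set to each video in turn. The sketch below shows that pattern with per-video error handling. `track_video` is a hypothetical callable standing in for whatever actually runs the tracker; it is not a CASTLE function.

```python
def track_all(videos, track_video):
    """Run track_video on each video independently; return (succeeded, failed)."""
    succeeded, failed = [], {}
    for i, video in enumerate(videos, start=1):
        print(f"[{i}/{len(videos)}] tracking {video}")
        try:
            track_video(video)
            succeeded.append(video)
        except Exception as exc:  # one failing video should not stop the batch
            failed[video] = str(exc)
    return succeeded, failed


# Usage with a stand-in tracker that fails on one video.
def fake_tracker(video):
    if video == "b.mp4":
        raise RuntimeError("lost the animal")


done, errors = track_all(["a.mp4", "b.mp4", "c.mp4"], fake_tracker)
print(done)    # ['a.mp4', 'c.mp4']
print(errors)  # {'b.mp4': 'lost the animal'}
```

In this sketch, "failed" only covers videos where the tracker raised an error. Videos that finish with poor-quality masks still need visual review.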

Iterative Refinement
If batch tracking fails on some videos, add labels from those failure frames and re-run. Your prompt set becomes more robust over time.
Tips and Best Practices
- Start with one video: build a solid prompt set before scaling to batch
- Quality over quantity: a few well-chosen prompts across diverse frames work better than many similar ones
- R50 vs SwinB: start with R50 for speed; switch to SwinB if tracking quality is insufficient
- Frame selection: label frames with different postures, lighting conditions, and occlusion scenarios
- Cancel early: if tracking starts failing, cancel and add more labels rather than waiting for it to finish
Next Step
Once all videos are tracked, proceed to Step 3: Extract Features.