AIDASLab/VIRST

VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

CVPR 2026

[Figure: VIRST architecture]

Official implementation of VIRST, a video-instructed reasoning framework for spatiotemporal segmentation.

TODO

  • release model code
  • release checkpoint
  • release data code
  • release utility scripts
  • release eval script
  • release training scripts
  • demo script

Overview

This repository contains the core training and evaluation code for VIRST, including:

  • model definition in model/
  • training entrypoints in train.py and train_stage3.py
  • RVOS evaluation in eval.py
  • dataset handling in data/
  • utility code in utils/

Installation

conda create -n virst python=3.10 -y 
conda activate virst
pip install -r requirements.txt

Checkpoint

Pretrained checkpoint: Google Drive

Dataset

  • Download Ref-DAVIS, Ref-YouTube-VOS, MeViS, and ReVOS.
  • Store them under RVOS_ROOT using the directory layout below. The layout also includes an lvvis folder.
RVOS_ROOT
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
├── lvvis
│   └── train
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Ref-Youtube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       └── JPEGImages
├── davis17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       ├── JPEGImages
│       └── mask_dict.pkl
└── mevis
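
Before training or evaluation, it can help to confirm that the data root matches the tree above. The following is a minimal sketch and is not part of the VIRST codebase: the names `EXPECTED` and `missing_entries` are hypothetical, and the path list is copied from the tree as shown. Only the top-level `mevis` folder is checked, since the tree does not list its contents.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
# "mevis" appears without children because the tree does not show any.
EXPECTED = [
    "ReVOS/JPEGImages",
    "ReVOS/mask_dict.json",
    "ReVOS/mask_dict_foreground.json",
    "ReVOS/meta_expressions_train_.json",
    "ReVOS/meta_expressions_valid_.json",
    "lvvis/train/JPEGImages",
    "lvvis/train/mask_dict.json",
    "lvvis/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/valid/meta_expressions.json",
    "Ref-Youtube-VOS/train/JPEGImages",
    "Ref-Youtube-VOS/train/mask_dict.pkl",
    "Ref-Youtube-VOS/valid/JPEGImages",
    "davis17/meta_expressions/train/meta_expressions.json",
    "davis17/meta_expressions/valid/meta_expressions.json",
    "davis17/train/JPEGImages",
    "davis17/train/mask_dict.pkl",
    "davis17/valid/JPEGImages",
    "davis17/valid/mask_dict.pkl",
    "mevis",
]


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]
```

Calling `missing_entries("/path/to/RVOS_ROOT")` returns an empty list when every listed file and folder is present, and otherwise the relative paths that still need to be downloaded or moved into place.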

Evaluation

To be added

Notes

  • The project page will be updated as the release is polished further.

Acknowledgements

This project builds upon prior work, including VISA, LISA, VideoChat-Flash, and SAM2.

We thank the authors for releasing their code and models.
