Official implementation of VIRST, a video-instructed reasoning framework for spatiotemporal segmentation.
Release checklist:
- Release model code
- Release checkpoint
- Release data code
- Release utility scripts
- Release evaluation script
- Release training scripts
- Release demo script
This repository contains the core training and evaluation code for VIRST, including:
- model definition in `model/`
- training entrypoints in `train.py` and `train_stage3.py`
- RVOS evaluation in `eval.py`
- dataset handling in `data/`
- utility code in `utils/`
```bash
conda create -n virst python=3.10 -y
conda activate virst
pip install -r requirements.txt
```
Pretrained checkpoint: Google Drive
```
RVOS_ROOT
├── ReVOS
│   ├── JPEGImages
│   ├── mask_dict.json
│   ├── mask_dict_foreground.json
│   ├── meta_expressions_train_.json
│   └── meta_expressions_valid_.json
├── lvvis
│   └── train
│       ├── JPEGImages
│       ├── mask_dict.json
│       └── meta_expressions.json
├── Ref-Youtube-VOS
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       └── JPEGImages
├── davis17
│   ├── meta_expressions
│   │   ├── train/meta_expressions.json
│   │   └── valid/meta_expressions.json
│   ├── train
│   │   ├── JPEGImages
│   │   └── mask_dict.pkl
│   └── valid
│       ├── JPEGImages
│       └── mask_dict.pkl
└── mevis          # to be added
```
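Before training or evaluation, it can help to confirm the datasets are laid out as shown above. The following is a minimal sketch (not part of the repository) that checks the expected paths under `RVOS_ROOT`; the `missing_paths` helper and the `RVOS_ROOT` environment variable fallback are assumptions for illustration, and the `mevis` subtree is omitted since its contents are not yet specified.

```python
import os

# Expected paths relative to RVOS_ROOT, mirroring the tree above.
# The mevis subtree is intentionally omitted (contents to be added).
EXPECTED = [
    "ReVOS/JPEGImages",
    "ReVOS/mask_dict.json",
    "ReVOS/mask_dict_foreground.json",
    "ReVOS/meta_expressions_train_.json",
    "ReVOS/meta_expressions_valid_.json",
    "lvvis/train/JPEGImages",
    "lvvis/train/mask_dict.json",
    "lvvis/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/train/meta_expressions.json",
    "Ref-Youtube-VOS/meta_expressions/valid/meta_expressions.json",
    "Ref-Youtube-VOS/train/JPEGImages",
    "Ref-Youtube-VOS/train/mask_dict.pkl",
    "Ref-Youtube-VOS/valid/JPEGImages",
    "davis17/meta_expressions/train/meta_expressions.json",
    "davis17/meta_expressions/valid/meta_expressions.json",
    "davis17/train/JPEGImages",
    "davis17/train/mask_dict.pkl",
    "davis17/valid/JPEGImages",
    "davis17/valid/mask_dict.pkl",
]


def missing_paths(root):
    """Return the expected dataset paths that do not exist under root."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    # RVOS_ROOT is read from the environment; "./RVOS_ROOT" is a fallback guess.
    root = os.environ.get("RVOS_ROOT", "./RVOS_ROOT")
    missing = missing_paths(root)
    if missing:
        print(f"Missing under {root}:")
        for p in missing:
            print(f"  {p}")
    else:
        print("Dataset layout looks complete.")
```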
- The project page will be updated as the release progresses.
This project builds upon prior work, including VISA, LISA, VideoChat-Flash, and SAM2.
We thank the authors for releasing their code and models.