CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [CVPR 2025]

Kiet A. Nguyen, Adheesh Juvekar, Tianjiao Yu, Muntasir Wahed, Ismini Lourentzou

PLAN Lab, University of Illinois Urbana-Champaign

📢 Latest Updates

Jun. 6, 2026: Model, training, and evaluation code released alongside the CALICO checkpoint and Mixed Parts dataset.
Feb. 26, 2025: CALICO was accepted to CVPR 2025.

CALICO Overview

CALICO is a pixel-grounded Large Vision-Language Model for part-focused semantic co-segmentation. Given multiple images, CALICO identifies, labels, and segments common objects, common object parts, and unique object parts. This enables fine-grained visual comparison across images rather than single-image segmentation alone.
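The three output categories can be read as set relations over part labels. The sketch below is only an illustration of the task definition, not CALICO's code: `ObjectParts`, `compare_parts`, and the example labels are hypothetical names, and the real model predicts masks alongside labels rather than comparing label sets that are already known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectParts:
    """Hypothetical container: one object in one image and its part labels."""
    image_id: str
    object_label: str
    parts: frozenset


def compare_parts(a: ObjectParts, b: ObjectParts) -> dict:
    """Split labels into a common object, common parts, and unique parts."""
    return {
        # The object is "common" only when both images show the same class.
        "common_object": a.object_label if a.object_label == b.object_label else None,
        # Parts present on both objects.
        "common_parts": sorted(a.parts & b.parts),
        # Parts present on one object but not the other, keyed by image.
        "unique_parts": {
            a.image_id: sorted(a.parts - b.parts),
            b.image_id: sorted(b.parts - a.parts),
        },
    }


# Illustrative labels only (assumption, not taken from Mixed Parts).
chair = ObjectParts("img_0", "chair", frozenset({"seat", "back", "leg", "arm"}))
stool = ObjectParts("img_1", "stool", frozenset({"seat", "leg"}))
result = compare_parts(chair, stool)
```

In this toy pair, the two objects differ in class but share a seat and legs, while the back and arm belong only to the chair.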

🏆 Contributions

New Task. We introduce part-focused semantic co-segmentation: segmenting and labeling common objects, common parts, and unique parts across images.
CALICO Model. We propose a multi-image LVLM with a Correspondence Extraction Module and Correspondence Adaptation Modules for part-level reasoning.
Efficient Visual Tokens. CALICO uses a Q-Former visual interface to reduce image-token cost while preserving segmentation-grounded reasoning.
Mixed Parts Dataset. We curate Mixed Parts, a large-scale benchmark built from public part segmentation datasets with logically comparable object pairs.
Strong Results. CALICO outperforms adapted LVLM baselines on Mixed Parts while fine-tuning only about 0.3% of model parameters.
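To give a feel for why adapter-style fine-tuning touches so few parameters, here is a minimal arithmetic sketch for LoRA alone. The model size, layer count, projection shapes, and rank are illustrative assumptions, not CALICO's configuration, and CALICO's trainable set may include modules beyond LoRA adapters, so the number it produces is not expected to match the 0.3% figure.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    LoRA factors the weight update as B @ A, with A of shape (rank, d_in)
    and B of shape (d_out, rank), so it adds rank * (d_in + d_out) values.
    """
    return rank * (d_in + d_out)


def trainable_fraction(frozen_params: int, adapted_shapes: list, rank: int) -> float:
    """Share of all parameters (frozen plus adapters) that is trainable."""
    trainable = sum(lora_param_count(d_in, d_out, rank) for d_in, d_out in adapted_shapes)
    return trainable / (frozen_params + trainable)


# Assumed setup: 7B frozen parameters, 32 layers, two adapted 4096x4096
# projections per layer, rank 8. All of these values are hypothetical.
shapes = [(4096, 4096)] * (32 * 2)
added = sum(lora_param_count(d_in, d_out, 8) for d_in, d_out in shapes)
fraction = trainable_fraction(7_000_000_000, shapes, rank=8)
print(f"adapter parameters: {added:,}  trainable share: {fraction:.4%}")
```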

🚀 Dive Deeper: Code, Data, and Checkpoints

Installation: Environment setup, CUDA/PyTorch notes, and a smoke test.
Data Preparation: Instructions for preparing ADE20KPart234, PartImageNet, COCO2017, PACO-LVIS, and the Mixed Parts annotation bundle.
Evaluation: Official Mixed Parts evaluation command, outputs, metrics, and useful flags.
Training: Fine-tuning command, training flags, distributed launch notes, and resume behavior.
Model Checkpoint: Released merged CALICO checkpoint for evaluation and inference.
Mixed Parts Dataset: Released annotation bundle for training and evaluation.

😸 CALICO Architecture

CALICO combines multi-image LVLM reasoning with pixel-level segmentation. Images are encoded through an EVA-CLIP/Q-Former visual interface, text outputs include [SEG] tokens, and the corresponding token embeddings are decoded into segmentation masks by a SAM-based mask decoder. CALICO should be loaded through this repository's local model code using the released checkpoint, rather than through generic AutoModel loading.

Key components:

Q-Former visual interface: queries compact visual tokens from EVA-CLIP image features.
SAM mask decoder: decodes [SEG] token embeddings into segmentation masks.
Correspondence Extraction Module (CEM): extracts semantic correspondences between object parts across images.
Correspondence Adaptation Modules (CAMs): inject correspondence information into selected LLM layers.
LoRA adapters: fine-tune a small subset of the language model parameters.
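The hand-off from text output to mask decoding can be sketched as a gather over [SEG] positions. This is a simplified illustration in plain Python lists, not the repository's implementation: `seg_positions`, `gather_seg_embeddings`, the toy tokens, and the 2-dimensional hidden states are all assumptions, and the real model works on tensors and typically projects the embeddings before they reach the SAM-based decoder.

```python
SEG_TOKEN = "[SEG]"


def seg_positions(tokens: list) -> list:
    """Indices of every [SEG] token in the generated sequence."""
    return [i for i, tok in enumerate(tokens) if tok == SEG_TOKEN]


def gather_seg_embeddings(tokens: list, hidden_states: list) -> list:
    """Pick the hidden state at each [SEG] position, one per predicted mask."""
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    return [hidden_states[i] for i in seg_positions(tokens)]


# Toy generated answer with two segmentation slots (hypothetical text).
tokens = ["The", "common", "parts", "are", "the", "seat", "[SEG]",
          "and", "the", "leg", "[SEG]"]
# Stand-in hidden states: one small vector per token.
hidden_states = [[float(i), float(i) + 0.5] for i in range(len(tokens))]

seg_embeddings = gather_seg_embeddings(tokens, hidden_states)
# Each gathered vector would serve as the prompt for one decoded mask.
```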

🪑 Mixed Parts Dataset

Mixed Parts contains multi-image object-part comparison samples for three subtasks: common object co-segmentation, common part co-segmentation, and unique part segmentation. It is curated from ADE20KPart234, PartImageNet, and PACO-LVIS image assets; PACO-LVIS annotations are defined on COCO 2017 images, which is why data preparation also covers COCO2017. Prepare the dataset by following docs/DATA.md.

⚡ Quick Start

Install CALICO with docs/INSTALL.md, prepare data with docs/DATA.md, then run evaluation on the official Mixed Parts test split:

python evaluate.py \
  --merged_ckpt_path PLAN-Lab/CALICO \
  --dataset_dir ./data \
  --output_save_path ./evaluate_results/calico_mixed_parts \
  --val_dataset "MixedPartsObjectVal|MixedPartsPartVal" \
  --multi_image_filepath_prefix ./data/mixed_parts_data/mixed_parts_test.json \
  --mode test \
  --compute_metrics

For more options, see docs/EVALUATION.md. For fine-tuning, see docs/TRAINING.md.
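When the evaluation is launched from a script, the same command can be assembled in Python. The sketch below reuses only the flags and values shown in the command above; the helper names `build_eval_command` and `missing_inputs` are hypothetical, and placing the annotation file under the dataset directory simply mirrors the paths in that command.

```python
import shlex
from pathlib import Path


def build_eval_command(
    ckpt: str = "PLAN-Lab/CALICO",
    dataset_dir: str = "./data",
    output_dir: str = "./evaluate_results/calico_mixed_parts",
) -> list:
    """Assemble the Quick Start evaluation command as an argument list."""
    # Assumption: the annotation file sits under dataset_dir, as in the README command.
    annotations = f"{dataset_dir}/mixed_parts_data/mixed_parts_test.json"
    return [
        "python", "evaluate.py",
        "--merged_ckpt_path", ckpt,
        "--dataset_dir", dataset_dir,
        "--output_save_path", output_dir,
        "--val_dataset", "MixedPartsObjectVal|MixedPartsPartVal",
        "--multi_image_filepath_prefix", annotations,
        "--mode", "test",
        "--compute_metrics",
    ]


def missing_inputs(cmd: list) -> list:
    """Return local input paths named in the command that do not exist yet."""
    paths = [cmd[cmd.index(flag) + 1]
             for flag in ("--dataset_dir", "--multi_image_filepath_prefix")]
    return [p for p in paths if not Path(p).exists()]


cmd = build_eval_command()
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

Passing `cmd` to `subprocess.run(cmd, check=True)` from the repository root would launch the run; calling `missing_inputs(cmd)` first gives an early warning if data preparation has not been completed.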

📜 Citation

@inproceedings{nguyen2025calico,
  title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
  author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}

🙏 Acknowledgement

We thank LLaVA, GLaMM, LISA, and SAM for releasing models and code that supported this project.
