PLAN Lab, University of Illinois Urbana-Champaign
- Jun. 6, 2026: Model, training, and evaluation code released alongside the CALICO checkpoint and Mixed Parts dataset.
- Feb. 26, 2025: CALICO was accepted to CVPR 2025.
CALICO is a pixel-grounded Large Vision-Language Model for part-focused semantic co-segmentation. Given multiple images, CALICO identifies, labels, and segments common objects, common object parts, and unique object parts. This enables fine-grained visual comparison across images rather than single-image segmentation alone.
- New Task. We introduce part-focused semantic co-segmentation: segmenting and labeling common objects, common parts, and unique parts across images.
- CALICO Model. We propose a multi-image LVLM with a Correspondence Extraction Module and Correspondence Adaptation Modules for part-level reasoning.
- Efficient Visual Tokens. CALICO uses a Q-Former visual interface to reduce image-token cost while preserving segmentation-grounded reasoning.
- Mixed Parts Dataset. We curate Mixed Parts, a large-scale benchmark built from public part segmentation datasets with logically comparable object pairs.
- Strong Results. CALICO outperforms adapted LVLM baselines on Mixed Parts while fine-tuning only about 0.3% of model parameters.
- Installation: Environment setup, CUDA/PyTorch notes, and a smoke test.
- Data Preparation: Instructions for preparing ADE20KPart234, PartImageNet, COCO2017, PACO-LVIS, and the Mixed Parts annotation bundle.
- Evaluation: Official Mixed Parts evaluation command, outputs, metrics, and useful flags.
- Training: Fine-tuning command, training flags, distributed launch notes, and resume behavior.
- Model Checkpoint: Released merged CALICO checkpoint for evaluation and inference.
- Mixed Parts Dataset: Released annotation bundle for training and evaluation.
CALICO combines multi-image LVLM reasoning with pixel-level segmentation. Images are encoded through an EVA-CLIP/Q-Former visual interface, text outputs include [SEG] tokens, and the corresponding token embeddings are decoded into segmentation masks by a SAM-based mask decoder. CALICO should be loaded through this repository's local model code using the released checkpoint, rather than through generic AutoModel loading.
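To make the [SEG] mechanism above concrete, here is a minimal sketch in plain Python. It is not the repository's implementation: `SEG_TOKEN_ID`, `seg_positions`, and `gather_seg_embeddings` are hypothetical names, and hidden states are modeled as nested lists rather than tensors. The idea it illustrates is only that each [SEG] position in the output sequence contributes one embedding, which a mask decoder can then turn into one mask.

```python
from typing import List, Sequence

# Hypothetical id for the [SEG] token; the real id depends on the tokenizer.
SEG_TOKEN_ID = 32000


def seg_positions(token_ids: Sequence[int], seg_token_id: int = SEG_TOKEN_ID) -> List[int]:
    """Return the sequence positions at which the [SEG] token appears."""
    return [i for i, tok in enumerate(token_ids) if tok == seg_token_id]


def gather_seg_embeddings(
    hidden_states: Sequence[Sequence[float]],
    token_ids: Sequence[int],
    seg_token_id: int = SEG_TOKEN_ID,
) -> List[List[float]]:
    """Select the hidden state at every [SEG] position.

    Each selected vector would serve as the prompt for one mask
    in a SAM-style mask decoder (the decoder itself is not shown).
    """
    if len(hidden_states) != len(token_ids):
        raise ValueError("hidden_states and token_ids must have the same length")
    return [list(hidden_states[i]) for i in seg_positions(token_ids, seg_token_id)]


# Toy sequence with two [SEG] tokens and 2-dimensional hidden states.
token_ids = [1, 15, SEG_TOKEN_ID, 27, SEG_TOKEN_ID, 2]
hidden_states = [[float(i), float(i) + 0.5] for i in range(len(token_ids))]

seg_embeddings = gather_seg_embeddings(hidden_states, token_ids)
print(seg_embeddings)  # [[2.0, 2.5], [4.0, 4.5]]
```

In the toy sequence, two [SEG] tokens yield two embeddings, so a decoder would produce two masks.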
Key components:
- Q-Former visual interface: queries compact visual tokens from EVA-CLIP image features.
- SAM mask decoder: decodes [SEG] token embeddings into segmentation masks.
- Correspondence Extraction Module (CEM): extracts semantic correspondences between object parts across images.
- Correspondence Adaptation Modules (CAMs): inject correspondence information into selected LLM layers.
- LoRA adapters: fine-tune a small subset of the language model parameters.
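As a rough illustration of why LoRA keeps the trainable share small, the sketch below counts the parameters a low-rank adapter adds to a weight matrix. All numbers (hidden size, layer count, rank, adapted projections, total model size) are hypothetical and are not CALICO's configuration. The result is not meant to reproduce the reported figure of about 0.3%, which may also include other trainable modules.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter.

    LoRA replaces a full update of a (d_out x d_in) weight with two
    low-rank factors: A of shape (rank x d_in) and B of shape (d_out x rank).
    """
    return rank * (d_in + d_out)


def trainable_fraction(trainable: int, total: int) -> float:
    """Share of parameters that receive gradients."""
    return trainable / total


# Hypothetical setup: 32 transformer layers, hidden size 4096,
# two adapted projections per layer, LoRA rank 8, 7B total parameters.
hidden, layers, adapted_per_layer, rank = 4096, 32, 2, 8
total_params = 7_000_000_000

lora_params = layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)
fraction = trainable_fraction(lora_params, total_params)

print(lora_params)  # 4194304
print(f"{fraction:.4%}")
```

Because the count grows with `rank * (d_in + d_out)` instead of `d_in * d_out`, the adapters stay several orders of magnitude smaller than the frozen weights they modify.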
Mixed Parts contains multi-image object-part comparison samples for three subtasks: common object co-segmentation, common part co-segmentation, and unique part segmentation. It is curated from ADE20KPart234, PartImageNet, and PACO-LVIS image assets. Prepare the dataset by following docs/DATA.md.
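The distinction between common and unique parts can be sketched at the label level with set operations. This is only an illustration of the task logic: the part names are made up, `compare_parts` is a hypothetical helper, and the actual annotation format is the one described in docs/DATA.md. It says nothing about how masks are predicted.

```python
from typing import Dict, Iterable, List


def compare_parts(parts_a: Iterable[str], parts_b: Iterable[str]) -> Dict[str, List[str]]:
    """Split two objects' part labels into common and unique sets."""
    a, b = set(parts_a), set(parts_b)
    return {
        "common": sorted(a & b),    # parts present in both objects
        "unique_a": sorted(a - b),  # parts only the first object has
        "unique_b": sorted(b - a),  # parts only the second object has
    }


# Hypothetical part labels for two comparable objects.
chair = ["seat", "leg", "back", "arm"]
stool = ["seat", "leg"]

result = compare_parts(chair, stool)
print(result)
# {'common': ['leg', 'seat'], 'unique_a': ['arm', 'back'], 'unique_b': []}
```

In the model's setting, each label in these sets would additionally be grounded in a segmentation mask in the corresponding image.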
Install CALICO by following docs/INSTALL.md, prepare the data as described in docs/DATA.md, then run evaluation on the official Mixed Parts test split:
python evaluate.py \
--merged_ckpt_path PLAN-Lab/CALICO \
--dataset_dir ./data \
--output_save_path ./evaluate_results/calico_mixed_parts \
--val_dataset "MixedPartsObjectVal|MixedPartsPartVal" \
--multi_image_filepath_prefix ./data/mixed_parts_data/mixed_parts_test.json \
--mode test \
--compute_metrics

For more options, see docs/EVALUATION.md. For fine-tuning, see docs/TRAINING.md.
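For scripted runs, the evaluation command above can also be assembled in Python. The sketch below uses only the flags shown in that command; `build_eval_command` and its parameters are hypothetical conveniences, not part of the repository. It only builds and prints the command, and the launch line is left commented out.

```python
import shlex
from typing import List, Sequence


def build_eval_command(
    data_dir: str = "./data",
    output_dir: str = "./evaluate_results/calico_mixed_parts",
    val_datasets: Sequence[str] = ("MixedPartsObjectVal", "MixedPartsPartVal"),
) -> List[str]:
    """Build the evaluate.py argument list from the flags documented above."""
    return [
        "python", "evaluate.py",
        "--merged_ckpt_path", "PLAN-Lab/CALICO",
        "--dataset_dir", data_dir,
        "--output_save_path", output_dir,
        # Multiple validation sets are joined with "|" into one argument.
        "--val_dataset", "|".join(val_datasets),
        "--multi_image_filepath_prefix",
        f"{data_dir}/mixed_parts_data/mixed_parts_test.json",
        "--mode", "test",
        "--compute_metrics",
    ]


cmd = build_eval_command()
# shlex.join quotes the "|" so the shell does not treat it as a pipe.
print(shlex.join(cmd))

# To launch from the repository root:
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell interpretation of the `|` separator, which is why the original command quotes the `--val_dataset` value.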
@article{nguyen2025calico,
title={CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models},
author={Nguyen, Kiet A. and Juvekar, Adheesh and Yu, Tianjiao and Wahed, Muntasir and Lourentzou, Ismini},
journal={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2025}
}

We thank LLaVA, GLaMM, LISA, and SAM for releasing models and code that supported this project.


