User: The first chair violinist of an orchestra is a vital musical leader with widely ranging responsibilities, from tuning the orchestra to working closely with the conductor. Who is most likely the first violinist in the image? Please output the segmentation mask.

Image 1A Image 1B

User: Please identify the activity the people in the image are doing, and segment the individuals who demonstrate excellent body strength and stability through professional moves. Lastly, output the mask that includes these individuals.

Image 2A Image 2B

User: I am looking for gym equipment to do weight training on my arm muscles. Which piece of equipment would most likely draw my attention when I walk into a gym? Please output the segmentation mask.

Image 3A Image 3B

User: Which objects are composed of a short bar with a weight on each end, typically used for weight training? Please find the unracked ones and output the segmentation mask.

Image 4A Image 4B

User: Please segment the person who is shooting in this NBA All-Star game.

Image 5A Image 5B

User: Please segment the food with the highest amount of vitamins.

Image 6A Image 6B

User: A person would like to recharge on vitamins with some fruit. However, after a long day of school, they don't want to eat anything that takes too long to prepare. If we rank the fruits from easiest to hardest to prepare, what would be their last choice?

Image 7A Image 7B

Abstract

Reasoning segmentation is an emerging vision-language task that requires generating a segmentation mask from implicit and often ambiguous language queries, enabled by recent advances in Multimodal Large Language Models (MLLMs). However, state-of-the-art training-based approaches often fail in challenging cases that demand higher-level reasoning or external knowledge. In this work, we introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction. Instead of fine-tuning, CoT-Seg leverages the inherent reasoning ability of pre-trained MLLMs (e.g. GPT-4o) to decompose queries into meta-instructions, extract fine-grained semantics from images, and identify target objects even under implicit or complex prompts. Crucially, CoT-Seg incorporates a self-correction stage: the model evaluates its own segmentation against the original query and reasoning trace, identifies mismatches, and iteratively refines the mask. This tight integration of reasoning and correction significantly improves reliability and robustness, especially in ambiguous or error-prone cases. Furthermore, we extend CoT-Seg with retrieval-augmented reasoning, enabling the system to access external knowledge when the input lacks sufficient information, further enhancing segmentation accuracy. Extensive experiments on ReasonSeg and RefCOCO demonstrate that CoT-Seg consistently outperforms existing baselines while remaining training-free. Our results highlight that combining chain-of-thought reasoning, self-correction, and retrieval augmentation offers a powerful paradigm for advancing reasoning-driven segmentation.

overview

Self-refining process

User: Which item of jewelry in the picture has the potential to contain precious gemstones such as emeralds or turquoise?

Image 1A
Input image
Image 1B
First-turn result
Image 1C
Self-refinement result

User: Please segment the crab.

Image 2A
Input image
Image 2B
First-turn result
Image 2C
Self-refinement result

User: Please segment the fish.

Image 3A
Input image
Image 3B
First-turn result
Image 3C
Self-refinement result

User: Please segment the pipefish in this image.

Image 4A
Input image
Image 4B
First-turn result
Image 4C
Self-refinement result

User: A fruit salad is a refreshing and delicious dessert that often consists of a variety of fruits mixed together. What object in the picture could be used to hold and serve such a dessert?

Image 5A
Input image
Image 5B
First-turn result
Image 5C
Self-refinement result

User: What is the object that the person in the picture is holding onto while walking his dog?

Image 6A
Input image
Image 6B
First-turn result
Image 6C
Self-refinement result

User: Please segment the leafy sea dragons in this image.

Image 7A
Input image
Image 7B
First-turn result
Image 7C
Self-refinement result

CoT-Seg introduces a self-correction stage: the model evaluates its own segmentation against the query and reasoning trace, identifies inconsistencies, and refines the output through automatically generated meta-queries. This feedback loop allows the system not only to think through the segmentation process but also to recognize and repair its own mistakes.
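The feedback loop described above can be sketched as a short propose-verify-refine routine. The sketch below is illustrative only: `propose_mask` and `verify` are hypothetical stand-ins for the MLLM reasoning/segmentation call and the MLLM self-check, with toy behavior so the loop is runnable.

```python
# Minimal sketch of CoT-Seg's self-correction loop. Function names and the
# toy behavior inside them are illustrative stand-ins, not the actual API.

def propose_mask(query, image, feedback=None):
    # Stand-in for: the MLLM decomposes the query (plus any feedback from
    # the previous turn) into a target description, and a promptable
    # segmenter produces a candidate mask for that target.
    target = "ring" if feedback else "necklace"  # toy behavior for the demo
    return {"target": target, "mask": f"mask({target})"}

def verify(query, reasoning_target, candidate):
    # Stand-in for: the MLLM checks the mask against the query and its own
    # reasoning trace, returning (ok, meta_query), where meta_query is the
    # automatically generated correction hint fed into the next turn.
    if candidate["target"] != reasoning_target:
        hint = f"The mask covers a {candidate['target']}, not a {reasoning_target}."
        return False, hint
    return True, None

def cot_seg(query, image, reasoning_target, max_turns=3):
    feedback = None
    for _ in range(max_turns):
        candidate = propose_mask(query, image, feedback)
        ok, feedback = verify(query, reasoning_target, candidate)
        if ok:
            return candidate["mask"]
    return candidate["mask"]  # best effort after the final turn
```

In this toy run, the first turn proposes the wrong object ("necklace"), the verifier emits a meta-query describing the mismatch, and the second turn corrects to "ring", mirroring the first-turn vs. self-refinement results shown above.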


Retrieval-augmented reasoning

User: Please segment the Phyllobates samperi. (Note: A recently discovered species.)

Image 1A Image 1B

User: Please segment the flag of the country that hosted the 2025 G7 summit. (Note: GPT-4o has information up to the 2023 G7 summit.)

Image 2A Image 2B

When the query and image lack sufficient information, CoT-Seg calls an external agent to retrieve relevant knowledge from the web, integrating it into the reasoning process. This augmentation further strengthens its ability to tackle ambiguous or knowledge-intensive cases.
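The retrieval fallback can be sketched as follows. Everything here is an assumed interface: `mllm_ground` stands in for the MLLM grounding step (returning `None` when its internal knowledge is insufficient), and `web_retrieve` stands in for the external retrieval agent.

```python
# Sketch of retrieval-augmented reasoning. Function names and return values
# are illustrative stand-ins for CoT-Seg's actual agent calls.

def mllm_ground(query, knowledge=None):
    # Stand-in for the MLLM grounding step: return a concrete target
    # description, or None when internal knowledge is insufficient
    # (e.g. a recently described species the model has never seen).
    if "Phyllobates samperi" in query and knowledge is None:
        return None
    return "target description grounded in the image"

def web_retrieve(query):
    # Stand-in for the external agent that searches the web.
    return f"retrieved background knowledge for: {query}"

def ground_with_retrieval(query):
    target = mllm_ground(query)
    if target is None:  # query + image lack sufficient information
        target = mllm_ground(query, knowledge=web_retrieve(query))
    return target
```

The design point is that retrieval is invoked only on failure of the first grounding attempt, so the common case pays no extra cost.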

Flexible user controls

overview

CoT-Seg supports diverse control types, such as scribbles, bounding boxes, and points, allowing users to interact with the system easily.
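One way the three control types could be normalized into prompts for a promptable segmenter is sketched below; the interface (`normalize_control`, the prompt dictionary keys) is an assumption for illustration, not CoT-Seg's actual code.

```python
# Illustrative sketch: map scribble / box / point controls onto the prompt
# formats a promptable segmenter typically accepts. Interface is assumed.

def normalize_control(control):
    kind, data = control["type"], control["data"]
    if kind == "point":
        return {"points": [data]}        # single click -> one foreground point
    if kind == "box":
        return {"box": data}             # (x0, y0, x1, y1)
    if kind == "scribble":
        # Subsample the stroke into a handful of foreground point prompts.
        step = max(1, len(data) // 4)
        return {"points": data[::step]}
    raise ValueError(f"unsupported control type: {kind}")
```

A scribble carries the most spatial evidence, so it is reduced to several point prompts rather than a single one; a box is passed through unchanged.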


Explore more results in our paper!

Citation

If you find our work useful in your research, please consider citing: