Recently, open-vocabulary semantic segmentation has garnered growing attention. Most current methods leverage vision-language models such as CLIP to recognize unseen categories through their zero-shot capabilities. However, CLIP struggles to capture the spatial dependencies among scene objects because of its holistic pre-training objective, leading to sub-optimal results. In this paper, we propose a DEnoising learning framework based on the Diffusion model for Open-vocabulary semantic Segmentation, called DEDOS, which aims to construct the scene skeleton. Our motivation stems from the fact that diffusion models encode not only the visual appearance of objects but also rich scene spatial priors. The core idea is to view images as labels embedded with "noise" (non-essential details for perceptual tasks) and to disentangle the intrinsic scene prior from the diffusion features during the denoising process. Specifically, to fully harness the scene prior knowledge of the diffusion model, we introduce learnable proxy queries during the denoising process. Meanwhile, we leverage the robustness of CLIP features to texture shifts as supervision, guiding the proxy queries to focus on constructing the scene skeleton while avoiding interference from texture information in the diffusion feature space. Finally, we enhance spatial understanding within CLIP features using the proxy queries, which also serve as an interface for multi-level interaction between the text and visual modalities. Extensive experiments on five standard benchmarks show that DEDOS achieves state-of-the-art performance.
A brief illustration of our proposed framework. Our method introduces learnable proxy queries to progressively construct the scene skeleton during the denoising process of the diffusion model. Next, we leverage the robustness of CLIP’s visual features to texture shifts to disentangle the diffusion feature space. Finally, we enhance the scene perception within CLIP features using optimized proxy queries, which also serve as an interface to facilitate multi-level interactions between textual and visual modalities.
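To make the proxy-query idea concrete, here is a minimal NumPy sketch of the mechanism the framework describes: learnable proxy queries cross-attend over flattened diffusion features, and a CLIP-feature consistency loss would supervise them toward scene structure rather than texture. All names, dimensions, and the single-head attention form are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the proxy-query mechanism (assumed names and
# dimensions; NOT the authors' code). Proxy queries cross-attend to
# diffusion features; a CLIP cosine-consistency loss would guide them.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    """Queries (Q, d) attend over flattened diffusion features (N, d)."""
    d = queries.shape[-1]
    attn = softmax(queries @ features.T / np.sqrt(d))  # (Q, N) weights
    return attn @ features                             # (Q, d) updated queries

# Toy sizes: 16 proxy queries, 64 spatial tokens, 32-dim features.
proxy_queries = rng.standard_normal((16, 32))
diffusion_feats = rng.standard_normal((64, 32))
clip_feats = rng.standard_normal((16, 32))  # stand-in for CLIP supervision

updated = cross_attention(proxy_queries, diffusion_feats)

# Cosine-distance consistency loss: since CLIP features are robust to
# texture shifts, pulling queries toward them would suppress texture cues.
cos = (updated * clip_feats).sum(-1) / (
    np.linalg.norm(updated, axis=-1) * np.linalg.norm(clip_feats, axis=-1))
loss = float((1.0 - cos).mean())
print(updated.shape, loss)
```

In the actual framework this attention would run at multiple denoising steps and the optimized queries would then condition the CLIP visual features; this sketch only shows one such interaction under the stated assumptions.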
Quantitative comparison with state-of-the-art methods on standard benchmarks. A, PC, and PAS denote ADE20K, Pascal Context, and Pascal VOC, respectively. The best results are highlighted in bold.
Quantitative comparison with state-of-the-art methods on MESS. MESS covers diverse domain-specific datasets that pose significant challenges due to their differences from the training data. We report the average score for each domain; see the supplementary material for detailed results. "Random" denotes the lower bound from uniformly distributed predictions, while "Best supervised" represents the per-dataset upper bound. The best results are highlighted in bold.
Qualitative results on the ADE20K validation set. DEDOS demonstrates more accurate category predictions and more complete spatial distributions. The supplementary material contains more visual results.
@InProceedings{li2025images,
author = {Li, Fan and Wang, Xuanbin and Wang, Xuan and Zhang, Zhaoxiang and Xu, Yuelei},
title = {Images as Noisy Labels: Unleashing the Potential of the Diffusion Model for Open-Vocabulary Semantic Segmentation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {24255--24265}
}