Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain so that it generalizes to unseen target domains that share a consistent contextual distribution but vary in visual appearance. Most existing methods rely on domain randomization or data generation, but they struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task. In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features with the optimized agent queries for robust mask predictions. Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model's ability to generalize to extreme domains, such as cubist art styles.
Previous methods use the diffusion model as a data generator, which struggles to cover all variations in the target domain, resulting in limited performance. Our method employs agent queries to learn scene distribution knowledge from the diffusion model, capitalizing on the inherent consistency of this distribution across domains to improve segmentation model generalization.
A brief illustration of our proposed framework. First, we use learnable queries to aggregate hierarchical instance features from the segmentation backbone, progressively merging them to form agent queries. Next, the agent queries learn the scene distribution information embedded in the diffusion features; their semantic representations are optimized through a diffusion consistency loss (DCL), which removes visual-appearance information irrelevant to the perception task from the diffusion features. Finally, the optimized agent queries refine the instance features of the segmentation decoder, which outputs the prediction mask.
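The three steps above can be sketched numerically. The snippet below is a minimal, hypothetical illustration (not the paper's implementation): agent queries are formed by cross-attention over flattened segmentation features, and a cosine-based alignment stands in for the diffusion consistency loss; the shapes, the pooling scheme, and the exact loss form are our assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, feats):
    # queries: (Q, D); feats: (N, D) flattened spatial features.
    # Scaled dot-product attention pools features into the queries.
    attn = softmax(queries @ feats.T / np.sqrt(feats.shape[1]))
    return attn @ feats  # (Q, D)

def diffusion_consistency_loss(agent_q, diff_feats):
    # Pool the diffusion features with the agent queries, then align
    # the queries to the pooled targets via cosine distance
    # (a hypothetical stand-in for the paper's DCL).
    target = cross_attention(agent_q, diff_feats)
    q = agent_q / np.linalg.norm(agent_q, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(q * t, axis=1)))

rng = np.random.default_rng(0)
seg_feats = rng.normal(size=(64, 32))   # flattened segmentation features
diff_feats = rng.normal(size=(64, 32))  # flattened diffusion features
queries = rng.normal(size=(8, 32))      # learnable agent queries

agent_q = cross_attention(queries, seg_feats)            # step 1: aggregate
loss = diffusion_consistency_loss(agent_q, diff_feats)   # step 2: align to diffusion prior
print(round(loss, 4))
```

In the actual framework, step 3 would feed the optimized `agent_q` back to refine the decoder's instance features before mask prediction; in training, the loss would be minimized by gradient descent over the query and attention parameters.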
Performance comparison between the proposed QueryDiff and existing DGSS methods. VFM refers to the vision foundation model.
Normal-to-adverse dataset generalization. Comparison with state-of-the-art methods for DGSS on C → {AF, AN, AR, AS}. Models are tested on the validation set of ACDC. VFM refers to the vision foundation model.
Qualitative comparison under the G → {C, B, M} generalization setting. From left to right: target image, predictions of Rein, predictions of Ours, and Ground Truth. White dashed boxes highlight regions where the predictions differ.
@inproceedings{li2025better,
  title     = {Better to Teach than to Give: Domain Generalized Semantic Segmentation via Agent Queries with Diffusion Model Guidance},
  author    = {Li, Fan and Wang, Xuan and Qi, Min and Zhang, Zhaoxiang and Xu, Yuelei},
  booktitle = {Proceedings of the 42nd International Conference on Machine Learning},
  pages     = {36129--36139},
  year      = {2025},
  volume    = {267},
  series    = {Proceedings of Machine Learning Research},
  month     = {13--19 Jul},
  publisher = {PMLR},
}