No Object Is an Island: Enhancing 3D Semantic Segmentation Generalization with Diffusion Models

NeurIPS 2025

Fan Li, Xuan Wang, Xuanbin Wang, Zhaoxiang Zhang, Yuelei Xu
Northwestern Polytechnical University

Abstract

Enhancing the cross-domain generalization of 3D semantic segmentation is a pivotal task in computer vision that has recently gained increasing attention. Most existing methods, whether based on consistency regularization or cross-modal feature fusion, focus solely on individual objects while overlooking the implicit semantic dependencies among them, resulting in the loss of useful semantic information. Inspired by the ability of diffusion models to flexibly compose diverse objects into high-quality images across varying domains, we seek to harness their capacity for capturing the underlying contextual distributions and spatial arrangements of objects to address the challenging task of cross-domain 3D semantic segmentation. In this paper, we propose XDiff3D, a novel cross-modal learning framework based on diffusion models that enhances the generalization of 3D semantic segmentation. XDiff3D comprises three key ingredients: (1) constructing object agent queries from diffusion features to aggregate instance semantic information; (2) decoupling fine-grained local details from the object agent queries to prevent interference with the 3D semantic representation; (3) leveraging the object agent queries as an interface to enhance the modeling of object semantic dependencies in 3D representations. Extensive experiments validate the effectiveness of our method, which achieves state-of-the-art performance across multiple benchmarks and task settings.
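As an illustration of ingredient (1), the minimal PyTorch sketch below pools frozen diffusion feature tokens into a fixed set of learnable object agent queries via cross-attention. The module name, tensor shapes, and hyperparameters (num_queries, dim, num_heads) are our assumptions for exposition, not the paper's released implementation.

import torch
import torch.nn as nn

class ObjectAgentQueries(nn.Module):
    # Hypothetical reading of ingredient (1): learnable queries that
    # aggregate instance semantics from frozen diffusion features.
    def __init__(self, num_queries: int = 64, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One learnable embedding per object agent query.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        # Queries attend over diffusion feature tokens to pool semantics.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, diff_feats: torch.Tensor) -> torch.Tensor:
        # diff_feats: (B, N, dim) flattened feature tokens from a frozen
        # diffusion backbone, assumed already projected to `dim`.
        B = diff_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)       # (B, Q, dim)
        pooled, _ = self.cross_attn(q, diff_feats, diff_feats)
        return self.norm(q + pooled)                          # (B, Q, dim)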

Paradigm Comparison

Existing methods mainly adopt two paradigms: (a) consistency regularization, which enforces alignment across modalities, and (b) cross-modal feature fusion, which integrates multi-modal information. However, both paradigms concentrate on domain-invariant features of individual objects and overlook the rich semantic dependencies among them. In contrast, our method (c) leverages object agent queries as an interface that injects instance semantic dependencies from diffusion priors into the 3D semantic space, enhancing the cross-domain generalization of 3D semantic segmentation.

Method

A brief illustration of our proposed framework. First, we aggregate the semantic information of objects within the scene from diffusion features to form the object agent queries. Next, a dual-query mechanism removes local visual details from the object agent queries, preventing them from interfering with the 3D semantic representation. Finally, the refined object agent queries serve as an interface that infuses inter-object semantic dependencies into the 3D representation, guiding the model to learn domain-invariant features.
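To make the dual-query and infusion steps concrete, here is a hedged PyTorch sketch of steps (2) and (3): an auxiliary set of detail queries shares one attention pass over the diffusion tokens to soak up local appearance, the agent queries self-attend to model inter-object dependencies, and the 3D point features read those dependencies back through cross-attention. The residual wiring, the shared attention pass, and all names and shapes are illustrative assumptions, not the official implementation.

import torch
import torch.nn as nn

class DualQueryInfusion(nn.Module):
    # Hypothetical sketch of steps (2)-(3): decouple local detail from the
    # agent queries, then infuse inter-object dependencies into 3D features.
    def __init__(self, num_queries: int = 64, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Auxiliary "detail" queries meant to absorb fine-grained appearance,
        # leaving instance-level semantics to the object agent queries.
        self.detail_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.shared_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dep_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.point_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, agent_q, diff_feats, point_feats):
        # agent_q:     (B, Q, dim) object agent queries from the prior stage
        # diff_feats:  (B, N, dim) diffusion feature tokens
        # point_feats: (B, P, dim) per-point 3D features
        B, Q, _ = agent_q.shape
        detail_q = self.detail_queries.unsqueeze(0).expand(B, -1, -1)
        # Both query sets attend to the same diffusion tokens, so the detail
        # branch can compete for (and draw away) local visual content.
        both, _ = self.shared_attn(torch.cat([agent_q, detail_q], dim=1),
                                   diff_feats, diff_feats)
        agent_q = agent_q + both[:, :Q]
        # Self-attention among agents models inter-object dependencies.
        dep, _ = self.dep_attn(agent_q, agent_q, agent_q)
        agent_q = self.norm(agent_q + dep)
        # Agent queries act as the interface: 3D points cross-attend to them.
        fused, _ = self.point_attn(point_feats, agent_q, agent_q)
        return point_feats + fused

In practice, some auxiliary objective on the detail branch (e.g., a reconstruction loss) would be needed to actually drive the decoupling; the sketch only shows the data flow.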

Quantitative Comparison

Performance comparison of domain adaptive and domain generalized 3D semantic segmentation methods in four typical settings. xM denotes the result obtained by averaging the predicted 2D and 3D class probabilities after softmax.
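The xM fusion described above reduces to a few lines: softmax each branch's logits, average, and take the argmax. The variable names in this snippet are illustrative.

import torch
import torch.nn.functional as F

def xm_prediction(logits_2d: torch.Tensor, logits_3d: torch.Tensor) -> torch.Tensor:
    # logits_*: (num_points, num_classes) per-point logits from each branch.
    probs = 0.5 * (F.softmax(logits_2d, dim=-1) + F.softmax(logits_3d, dim=-1))
    return probs.argmax(dim=-1)  # fused per-point class prediction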

Comparison of previous domain generalization methods on the SemanticKITTI→SemanticSTF and SynLiDAR→SemanticSTF benchmarks.

Qualitative Results

Qualitative results of domain generalized 3D semantic segmentation (DG3SS). From left to right: predictions from UniDSeg, our method, and the ground truth. White dashed boxes highlight regions where the predictions differ.

BibTeX

@inproceedings{li2025no,
  title={No Object Is an Island: Enhancing 3D Semantic Segmentation Generalization with Diffusion Models},
  author={Li, Fan and Wang, Xuan and Wang, Xuanbin and Zhang, Zhaoxiang and Xu, Yuelei},
  booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
  year={2025}
}