CycleDiff: Cycle Diffusion Models
for Unpaired Image-to-image Translation

Shilong Zou, Yuhang Huang, Renjiao Yi, Chenyang Zhu*, Kai Xu*
National University of Defense Technology
Under review at IEEE TIP

* Co-corresponding authors
Teaser figure: overview of the CycleDiff framework.

The proposed CycleDiff consists of two domain-specific diffusion models and a cycle translator, and it learns the diffusion and translation processes jointly. The cycle translator comprises two translation networks that perform cycle translation between the two domains: Gϕ: S → T and Fψ: T → S. We employ a cycle-consistency constraint to regularize the forward and backward translation mappings. Using only unpaired images, CycleDiff synthesizes structure-consistent and photo-realistic results across different image modalities.
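For concreteness, the sketch below shows a standard CycleGAN-style cycle-consistency loss of the kind this constraint describes. The function names g_st and f_ts and the choice of an L1 reconstruction penalty are our illustrative assumptions, not necessarily the paper's exact formulation.

import torch.nn.functional as F

def cycle_consistency_loss(g_st, f_ts, x_s, x_t):
    """Cycle-consistency constraint (illustrative sketch).

    g_st: translator S -> T (G_phi); f_ts: translator T -> S (F_psi).
    The L1 reconstruction loss is an assumption.
    """
    # Forward cycle: S -> T -> S should recover the source sample.
    loss_s = F.l1_loss(f_ts(g_st(x_s)), x_s)
    # Backward cycle: T -> S -> T should recover the target sample.
    loss_t = F.l1_loss(g_st(f_ts(x_t)), x_t)
    return loss_s + loss_t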

Overview

We introduce a diffusion-based cross-domain image translator that requires no paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing broader coverage of the data distribution and improving cross-domain translation performance. However, incorporating the translation process into the diffusion process remains challenging because the two processes are not exactly aligned: the diffusion process operates on the noisy signal, while the translation process operates on the clean signal. As a result, recent diffusion-based studies resort to separate training or shallow integration of the two processes, which can trap the translation optimization in local minima and constrain the effectiveness of diffusion models.

To address this problem, we propose a novel joint learning framework that aligns the diffusion and translation processes, thereby improving global optimality. Specifically, we use the diffusion models to extract image components that represent the clean signal, and apply the translation process to these components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, yielding effective translation learning and a significant performance improvement. Benefiting from this joint-learning design, our method optimizes both processes globally, achieving improved fidelity and structural consistency.
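To make the joint learning concrete, here is a minimal, hypothetical PyTorch sketch of one training step under our reading of this description: each diffusion model predicts a clean component x0_hat from its noisy input, and the time-dependent translators operate on those components, with diffusion and cycle-consistency losses backpropagated together. Every name here (add_noise, diff_s, g_st, the toy noise schedule) is an assumption for illustration, not the paper's implementation.

import torch

def add_noise(x0, t, num_steps=1000):
    """DDPM-style forward process with a toy linear alpha-bar schedule (assumed)."""
    alpha_bar = (1.0 - t.float() / num_steps).view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise

def joint_training_step(diff_s, diff_t, g_st, f_ts, x_s, x_t, num_steps=1000):
    """One joint diffusion + translation step (illustrative sketch).

    diff_s / diff_t: domain-specific diffusion models, assumed here to
    predict the clean component x0_hat directly (many models predict
    the noise instead).
    g_st / f_ts: time-dependent translators G_phi: S -> T and F_psi: T -> S.
    """
    t = torch.randint(0, num_steps, (x_s.size(0),), device=x_s.device)

    # Diffusion branch: corrupt clean images, recover clean components.
    x0_hat_s = diff_s(add_noise(x_s, t, num_steps), t)
    x0_hat_t = diff_t(add_noise(x_t, t, num_steps), t)
    diff_loss = ((x0_hat_s - x_s) ** 2).mean() + ((x0_hat_t - x_t) ** 2).mean()

    # Translation branch: translate the extracted components, conditioned on t.
    fake_t = g_st(x0_hat_s, t)
    fake_s = f_ts(x0_hat_t, t)

    # Cycle consistency regularizes both mappings (cf. the loss sketch above).
    cyc_loss = (f_ts(fake_t, t) - x0_hat_s).abs().mean() \
             + (g_st(fake_s, t) - x0_hat_t).abs().mean()

    # One objective, one backward pass: both processes are optimized jointly.
    return diff_loss + cyc_loss

In practice the noise schedule, loss weighting, and translator conditioning would follow the paper's actual configuration; the point of the sketch is only that the diffusion and translation processes share a single objective and a single backward pass.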

We conduct extensive experiments on RGB⇆RGB and cross-modality translation tasks, including RGB⇆Edge, RGB⇆Semantics, and RGB⇆Depth, showing better generative performance than the state of the art. In particular, our method achieves the best FID scores on widely adopted tasks, improving FID over the second-best method by 19.61 on Dog⇆Cat and 19.67 on Dog⇆Wild.

Method and main results

Method


The overall architecture of CycleDiff. CycleDiff comprises two parts: the diffusion models and the cycle translator. The diffusion models are employed to extract image components, which are then fed into the cycle translator for unpaired translation between two domains. The diffusion and translation processes are learned jointly.

Comparison with other methods


Quantitative comparison of unpaired image-to-image translation methods. The best results are shown in bold, and the second-best results are underlined. Note that the KID metric is multiplied by 100.

More Visual Results

Other interesting results

    ☑ Training at 640 × 320 resolution on RGB⇆Depth

BibTeX

@article{zou2025cyclediff,
    title    = {CycleDiff: Cycle Diffusion Models for Unpaired Image-to-image Translation},
    author   = {Zou, Shilong and Huang, Yuhang and Yi, Renjiao and Zhu, Chenyang and Xu, Kai},
    journal  = {arXiv preprint arXiv:2508.06625},
    year     = {2025}
}