Reflection Removal through Efficient Adaptation of Diffusion Transformers

¹ETH Zurich  ²HUAWEI Bayer Lab  *Equal contributors  †Project lead
Teaser image demonstrating WindowSeat reflection removal.

We present WindowSeat, a model and fine-tuning protocol for one-step single-image reflection removal.

Abstract

We introduce a diffusion-transformer (DiT) framework for single-image reflection removal that leverages the generalization strengths of foundation diffusion models in the restoration setting. Rather than relying on task-specific architectures, we repurpose a pre-trained DiT-based foundation model by conditioning it on reflection-contaminated inputs and guiding it toward clean transmission layers. We systematically analyze existing reflection removal data sources for diversity, scalability, and photorealism. To address the shortage of suitable data, we construct a physically based rendering (PBR) pipeline in Blender, built around the Principled BSDF, to synthesize realistic glass materials and reflection effects. Efficient LoRA-based adaptation of the foundation model, combined with the proposed synthetic data, achieves state-of-the-art performance on in-domain and zero-shot benchmarks. These results demonstrate that pretrained diffusion transformers, when paired with physically grounded data synthesis and efficient adaptation, offer a scalable and high-fidelity solution for reflection removal.

Gallery

Qualitative Comparison with Other Recent Methods

Baseline output
WindowSeat output

Our Approach

Physically Based Rendering Pipeline for Data Generation

Our method’s PBR pipeline generates realistic reflection-contaminated training data by simulating physically accurate light–glass interaction inside a lightweight Blender setup. It uses the Principled BSDF to control a surface’s physical properties, such as index of refraction, thickness, and roughness, allowing the system to reproduce ghosting, blur, scattering, and high-intensity highlights that simple alpha blending cannot capture.
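As a rough illustration, such a glass material can be set up through Blender's Python API along the following lines. This is a minimal sketch, not the paper's actual pipeline: the object name "WindowPane" and all parameter values are hypothetical, and the socket names assume Blender 3.x (in Blender 4.x, "Transmission" becomes "Transmission Weight").

import bpy

# Build a glass material driven by the Principled BSDF.
mat = bpy.data.materials.new(name="ReflectiveGlass")
mat.use_nodes = True
bsdf = mat.node_tree.nodes["Principled BSDF"]

bsdf.inputs["Transmission"].default_value = 1.0  # fully transmissive glass
bsdf.inputs["IOR"].default_value = 1.52          # index of refraction of typical window glass
bsdf.inputs["Roughness"].default_value = 0.05    # mild scattering blurs the reflection

# Physical thickness gives the pane two glass-air interfaces, producing ghosting.
pane = bpy.data.objects["WindowPane"]            # hypothetical pane object in the scene
solidify = pane.modifiers.new(name="Thickness", type='SOLIDIFY')
solidify.thickness = 0.006                       # 6 mm pane
pane.data.materials.append(mat)

# Path tracing is required to capture true light-glass interaction.
bpy.context.scene.render.engine = 'CYCLES'

Rendering with a path tracer is what makes the difference here: light traced through a pane with real thickness and IOR produces the secondary ghost reflections and angle-dependent highlights that alpha compositing cannot.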

WindowSeat training scheme

Fine-tuning Protocol for Modern Diffusion Models

Our model repurposes a large diffusion transformer as a feed-forward reflection-removal network by operating directly in the VAE's latent space and training only lightweight LoRA adapters. During training, the network receives the encoded latent of a reflection-contaminated image and predicts a latent-space update that produces a clean transmission result in a single step, avoiding multi-stage diffusion and auxiliary modules. High-quality PBR data is a key ingredient, ensuring that this fine-tuning protocol can be applied to future latent diffusion models (LDMs) without major modifications.

WindowSeat inference scheme
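As a rough sketch, the LoRA-based adaptation and the one-step latent-space objective can be illustrated with the peft library as follows. The tiny attention block, tensor shapes, LoRA rank, and MSE loss are stand-in assumptions, not the paper's exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from peft import LoraConfig, inject_adapter_in_model

# Stand-in for one attention block of a pretrained DiT; in practice the
# full foundation transformer would be loaded from a checkpoint.
class DiTAttention(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
        return self.to_out(attn @ v)

model = DiTAttention()
model.requires_grad_(False)  # freeze the pretrained backbone

# Inject low-rank adapters into the attention projections; only these train.
cfg = LoraConfig(r=8, lora_alpha=8, target_modules=["to_q", "to_k", "to_v", "to_out"])
model = inject_adapter_in_model(cfg, model)

# One-step objective: regress the clean transmission latent from the
# contaminated latent (the paper predicts a latent-space update; direct
# regression on random stand-in tensors is used here for brevity).
z_in = torch.randn(2, 16, 64)   # latent of the reflection-contaminated image
z_gt = torch.randn(2, 16, 64)   # latent of the clean transmission layer
loss = F.mse_loss(model(z_in), z_gt)
loss.backward()

Because only the low-rank adapter matrices receive gradients, the memory and compute cost of fine-tuning stays small while the frozen backbone retains its pretrained generalization.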

Quantitative Comparison with Other Recent Methods

Our model consistently outperforms prior reflection-removal methods on both in-domain datasets and challenging zero-shot benchmarks. It delivers higher PSNR and SSIM scores than existing diffusion-based, transformer-based, and dual-stream approaches, with especially large gains on the SIR² benchmarks, where it improves zero-shot PSNR by more than 1.5 dB and achieves the best MS-SSIM and LPIPS scores. Qualitative comparisons further show that it handles strong, complex, high-frequency reflections with fewer artifacts, whereas other methods often leave reflections partially intact or introduce distortions. Overall, the model sets a new performance level while requiring a simpler architecture and more efficient training.
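For reference, the reported metrics can be computed with torchmetrics roughly as follows. This is a generic sketch with random stand-in tensors, not the authors' evaluation code.

import torch
from torchmetrics.image import (
    PeakSignalNoiseRatio,
    StructuralSimilarityIndexMeasure,
    MultiScaleStructuralSimilarityIndexMeasure,
)
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Random stand-ins for a predicted transmission layer and its ground truth,
# both as (N, C, H, W) tensors with values in [0, 1].
pred = torch.rand(1, 3, 256, 256)
target = torch.rand(1, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

print(f"PSNR:    {psnr(pred, target):.2f} dB")   # higher is better
print(f"SSIM:    {ssim(pred, target):.4f}")      # higher is better
print(f"MS-SSIM: {ms_ssim(pred, target):.4f}")   # higher is better
print(f"LPIPS:   {lpips(pred, target):.4f}")     # lower is better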


Refer to the paper PDF linked above for more details on the qualitative and quantitative comparisons and the ablation studies.

Citation

@misc{zakarin2025reflectionremovalefficientadaptation,
  title        = {Reflection Removal through Efficient Adaptation of Diffusion Transformers},
  author       = {Daniyar Zakarin and Thiemo Wandel and Anton Obukhov and Dengxin Dai},
  year         = {2025},
  eprint       = {2512.05000},
  archivePrefix= {arXiv},
  primaryClass = {cs.CV},
  url          = {https://arxiv.org/abs/2512.05000},
}