Without overlapping views, MultiDiffusion cannot produce coherent panoramas. By averaging overlapping denoising predictions, it produces coherent images, but inference is slow. We introduce an efficient method for high-resolution panorama generation that eliminates the need for overlapping denoising predictions, yielding coherent, sharp images without border artifacts.
Generating high-resolution images with diffusion models pre-trained on large-scale datasets has recently become widely accessible. Techniques such as MultiDiffusion and SyncDiffusion push image generation beyond the training resolution, e.g., from square images to panoramas, by merging multiple overlapping diffusion paths or by employing gradient descent to maintain perceptual coherence. However, these methods suffer from significant computational inefficiency, because producing high-quality, seamless images in practice requires generating and averaging numerous overlapping predictions. This work addresses that limitation and presents a novel approach that eliminates the need to generate and average numerous overlapping denoising predictions. Our method shifts non-overlapping denoising windows over time, ensuring that a seam in one timestep is corrected in the next. This results in coherent, high-resolution images with fewer overall steps. We demonstrate the effectiveness of our approach through qualitative and quantitative evaluations against MultiDiffusion, SyncDiffusion, and StitchDiffusion. Our method offers several key benefits, including improved computational efficiency and faster inference, while producing comparable or better image quality.
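For intuition, the following is a minimal sketch of the overlap-averaging loop that MultiDiffusion-style baselines rely on. The `denoise_step` stub stands in for one reverse-diffusion step of a pre-trained model on a fixed-size window; the function names, window size, and stride are illustrative assumptions, not the actual implementation.

```python
import torch

def denoise_step(view: torch.Tensor, t: int) -> torch.Tensor:
    # Stand-in for one reverse-diffusion step of a pre-trained model on a
    # fixed-size window; identity stub purely for illustration.
    return view

def multidiffusion_step(latent: torch.Tensor, t: int,
                        window: int = 64, stride: int = 16) -> torch.Tensor:
    # One timestep over a wide latent (B, C, H, W): denoise overlapping
    # windows and average their predictions per pixel.
    # Assumes (W - window) is divisible by stride.
    out = torch.zeros_like(latent)
    count = torch.zeros_like(latent)
    for x0 in range(0, latent.shape[-1] - window + 1, stride):
        out[..., x0:x0 + window] += denoise_step(latent[..., x0:x0 + window], t)
        count[..., x0:x0 + window] += 1
    return out / count.clamp(min=1)
```

With these illustrative values (window 64, stride 16), every interior latent column is denoised four times per timestep; this redundancy is exactly what our method removes.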
MultiDiffusion generates coherent panorama images by averaging overlapping denoising predictions, using a stride smaller than the denoising window. Our method eliminates the need for overlapping predictions and instead shifts non-overlapping denoising windows over time, so that seams introduced in one timestep are corrected in the next. This yields fast, seamless, high-resolution images with fewer overall steps; a minimal code sketch of this loop follows.
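The sketch below illustrates the shifting idea under the same assumptions as the previous snippet: each timestep denoises non-overlapping windows at a fresh offset, so window borders land at different positions from step to step. The random offset schedule and the circular shift (natural for wrap-around 360° panoramas) are our illustrative choices; see the paper for the exact shifting schedule.

```python
import torch

def denoise_step(view: torch.Tensor, t: int) -> torch.Tensor:
    # Illustrative identity stub, as in the sketch above.
    return view

def shifted_window_step(latent: torch.Tensor, t: int,
                        window: int = 64) -> torch.Tensor:
    # One timestep with non-overlapping windows at a per-timestep offset.
    # Assumes W is divisible by `window`; torch.roll realizes the shift
    # with wrap-around.
    offset = int(torch.randint(0, window, (1,)))      # new shift each timestep
    shifted = torch.roll(latent, shifts=-offset, dims=-1)
    out = torch.empty_like(shifted)
    for x0 in range(0, shifted.shape[-1], window):    # non-overlapping tiles
        out[..., x0:x0 + window] = denoise_step(shifted[..., x0:x0 + window], t)
    return torch.roll(out, shifts=offset, dims=-1)    # undo the shift
```

Each latent column is now denoised exactly once per timestep, so under the illustrative settings above (window 64, stride 16) a timestep costs roughly a quarter of the overlap-averaging variant.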
Quantitative results on 512×2048 panorama generation. Methods with fast inference using non-overlapping windows are marked in gray. Our method consistently produces high-quality images as measured by FID, with good image-text alignment as measured by CLIPScore and ImageReward. In contrast to the baselines, our method does not require overlapping denoising windows, which significantly reduces the number of required denoising views and the inference time. We reach performance similar to the dense baselines in a fraction of the time.
Left: CLIPScore comparison of the base StableDiffusion model with MultiDiffusion and our method. Right: FID comparison of SyncDiffusion with MultiDiffusion and our method. Our method reaches performance similar to MultiDiffusion while being significantly faster.
@article{frolov2024spotdiffusion,
title={SpotDiffusion: A Fast Approach For Seamless Panorama Generation Over Time},
author={Frolov, Stanislav and Moser, Brian B and Dengel, Andreas},
journal={arXiv preprint arXiv:2407.15507},
year={2024}
}