cs180: proj5
Fun With Diffusion Models
Project 5A: The Power of Diffusion Models
We focus on working with the pre-trained, text-to-image DeepFloyd IF diffusion model, experimenting with inpainting and creating optical illusions.
Part 0: Setup
With a random seed of 180, I tried out 3 prompts, each with num_inference_steps values of 20 and 100. Overall, more steps resulted in higher-quality outputs. With num_inference_steps = 20, the outputs look glossy and don’t quite capture the texture: the man looks airbrushed and the oil painting looks cartoonish. With num_inference_steps = 100, the man looks more realistic, the rocket ship has a detailed background, and the snowy village looks more like an actual oil painting.
Part 1: Sampling Loops
1.1: Implementing the Forward Process
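To take a clean image and add noise to it, I implemented forward(im, t), where im is the original image and t is the timestep; the larger t is, the more noise is added. This process is equivalent to computing the equation below:
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
where $x_0$ is the clean image, $x_t$ is the noisy image at timestep $t$, and $\bar\alpha_t$ is the cumulative noise-schedule coefficient of the pre-trained model.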
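As a rough illustration of the forward process, here is a minimal NumPy sketch assuming the standard DDPM form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. The extra `alphas_cumprod` and `rng` arguments, the linear beta schedule, the timesteps 250 and 750, and the blank stand-in image are all assumptions made for the example; the project itself takes these values from the pre-trained model.

```python
import numpy as np

def forward(im, t, alphas_cumprod, rng):
    """Noise a clean image im to timestep t (standard DDPM forward process)."""
    abar_t = alphas_cumprod[t]
    eps = rng.standard_normal(im.shape)  # eps ~ N(0, I)
    return np.sqrt(abar_t) * im + np.sqrt(1.0 - abar_t) * eps

# Hypothetical linear beta schedule, for illustration only;
# the real coefficients come from the pre-trained model's scheduler.
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(180)
im = np.zeros((3, 64, 64))  # blank stand-in for the test image
noisy_250 = forward(im, 250, alphas_cumprod, rng)
noisy_750 = forward(im, 750, alphas_cumprod, rng)
```

Because abar_t shrinks as t grows, the image term fades and the noise term dominates at larger timesteps.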
Here are my results with the given campanile test image.
1.2: Classical Denoising
Gaussian blur filtering gets rid of the noise only by also getting rid of the “signal” (the original image). As a result, the blurred outputs do not recover the original image well. Here are the side-by-side results.
Top row: Part 1.1’s results; Bottom row: Gaussian blur denoising results
1.3: One-Step Denoising
Using a pre-trained UNet, we estimate the Gaussian noise in the image at timestep t and then remove that noise to (try to) recover the original image. Overall, this process performs better than Part 1.2’s Gaussian blur, but at higher t values the result deviates from the campanile’s actual look. Here are the side-by-side results.
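The noise-removal step can be sketched by inverting the standard DDPM forward equation for x_0. This is only a sketch: the function name `estimate_clean` and the value `abar_t = 0.3` are made up for the example, and the true noise stands in for the UNet’s prediction so the snippet runs without a model.

```python
import numpy as np

def estimate_clean(x_t, eps_hat, abar_t):
    """Solve x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    return (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(3, 8, 8))  # stand-in clean image
eps = rng.standard_normal(x0.shape)
abar_t = 0.3                                 # hypothetical schedule value
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

# With a perfect noise estimate the clean image is recovered exactly;
# a real UNet's estimate is imperfect, and more so at high t.
x0_hat = estimate_clean(x_t, eps, abar_t)
```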
Top row: original image; Middle row: Part 1.1’s results; Bottom row: one-step denoising results
1.4: Iterative Denoising
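To address one-step denoising’s issue with higher t values, we implement iterative denoising. We use strided timesteps with a step size of 30, starting at t = 990 and working our way down to t = 0. Going from timestep $t$ to the next, less noisy timestep $t'$, this process is equivalent to computing the equation below:
$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$
where $x_t$ is the current noisy image, $x_0$ is the current estimate of the clean image, $\bar\alpha_t$ comes from the model’s noise schedule, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, and $v_\sigma$ is random noise added back in.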
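The strided schedule and a single update can be sketched as follows. The update uses the standard DDPM posterior-mean interpolation between the clean-image estimate and the current noisy image, which is an assumption about the exact form used here; `denoise_step` and its arguments are illustrative names, and the added-noise term is left as an optional input.

```python
import numpy as np

# Strided timesteps: 990, 960, ..., 30, 0
strided_timesteps = list(range(990, -1, -30))

def denoise_step(x_t, x0_hat, abar_t, abar_prev, noise=None):
    """One strided step from timestep t (abar_t) to a less noisy t' (abar_prev)."""
    alpha_t = abar_t / abar_prev
    beta_t = 1.0 - alpha_t
    coef_x0 = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    x_prev = coef_x0 * x0_hat + coef_xt * x_t
    if noise is not None:  # optional random noise added back in
        x_prev = x_prev + noise
    return x_prev

# Sanity check: stepping all the way to a noise-free timestep (abar_prev = 1)
# returns the clean-image estimate itself.
x_t = np.ones((3, 4, 4))
x0_hat = np.full((3, 4, 4), 0.25)
final = denoise_step(x_t, x0_hat, abar_t=0.5, abar_prev=1.0)
```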
Here are the results of the process, shown at every 5th loop iteration.
As a recap, here is the original image with the other methods’ results. As we can see, the best, most detailed result is the iteratively denoised image.
1.5: Diffusion Model Sampling
With the iterative_denoise function implemented, I can generate images starting from pure noise. Here are some of my results (not the best quality, independent of the seed).
1.6: Classifier-Free Guidance (CFG)
To improve the results from Part 1.5, we compute both an unconditional and a conditional noise estimate. Using the technique from the Classifier-Free Diffusion Guidance paper, we define our new noise estimate as $\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$, with $\gamma$ controlling the CFG’s strength. Here are 5 images generated with CFG.
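The CFG combination itself is a one-liner. Here is a minimal sketch following the classifier-free guidance formula, with `cfg_noise` and the toy inputs being illustrative names and values rather than anything from the project code.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional estimate toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.full((2, 2), 1.0)  # toy unconditional estimate
eps_c = np.full((2, 2), 2.0)  # toy conditional estimate

same_as_uncond = cfg_noise(eps_u, eps_c, 0.0)  # scale 0: unconditional only
same_as_cond = cfg_noise(eps_u, eps_c, 1.0)    # scale 1: conditional only
pushed_past = cfg_noise(eps_u, eps_c, 3.0)     # scale > 1: beyond the conditional
```

A scale above 1 pushes the estimate past the conditional prediction, which is where the quality gain comes from.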
1.7: Image-to-Image Translation
Using iterative_denoise_cfg, we add noise to the original image and then iteratively denoise it to generate new images. For all images, I used noise levels [1, 3, 5, 7, 10, 20] and the text prompt "a high quality photo". Here are the results for the given campanile_image.png and my chosen dog.png and bay_bridge.png.
From left to right: increasing noise level, ending with the original image
1.7.1: Editing Hand-Drawn and Web Images
Now, let’s run this same process on hand-drawn and non-realistic images. For all images, I used noise levels [1, 3, 5, 7, 10, 20] and the text prompt "a high quality photo". Here are the results for avocado.png and mike.png from the internet, as well as my hand-drawn house.png and turtle.png.
From left to right: increasing noise level, ending with the original image
1.7.2: Inpainting
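Using a similar process, we implement the inpaint function to create a new image with the same content where the mask $m$ is 0 but new content where $m$ is 1. Following the RePaint paper, we run the diffusion denoising loop and, after every step, compute the equation below:
$$x_t \leftarrow m\,x_t + (1 - m)\,\text{forward}(x_{orig}, t)$$
so that everything outside the edit region is reset to the original image with the right amount of noise for timestep $t$.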
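The per-step masking can be sketched as below, assuming a mask that is 1 where new content should be generated and 0 where the original must be kept. `repaint_project` is a hypothetical helper name, and `x_orig_noised` stands for the original image after the forward process has noised it to the current timestep.

```python
import numpy as np

def repaint_project(x_t, x_orig_noised, mask):
    """Keep generated pixels where mask == 1, restore the noised original elsewhere."""
    return mask * x_t + (1.0 - mask) * x_orig_noised

x_t = np.full((1, 4, 4), 9.0)            # toy current sample
x_orig_noised = np.zeros((1, 4, 4))      # toy noised original
mask = np.zeros((1, 4, 4))
mask[:, :2, :] = 1.0                     # edit only the top half

x_t_projected = repaint_project(x_t, x_orig_noised, mask)
```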
Here are the results for the given campanile_image.png and my chosen coffee.png and smiski.png.
From left to right: original, mask, to fill, inpainted
The smiski one is funny :)
1.7.3: Text-Conditional Image-to-Image Translation
Now, we continue the SDEdit method but guide the projection with a text prompt. Our goal is to create images that gradually look more like the original image while still maintaining similarity to the text prompt. For all images, I used noise levels [1, 3, 5, 7, 10, 20]. Here are the results.
From left to right: increasing noise level, ending with the original image
Given: campanile_image.png with "a rocket ship" prompt
Chosen: dog.png with "a photo of a man" prompt
Chosen: bay_bridge.png with "a pencil" prompt
1.8: Visual Anagrams
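Visual anagrams are images that look like one image when viewed upright and a different image when rotated 180 degrees. To generate visual anagrams, I ran the denoiser twice at each step: first on the image with prompt 1, and second on the flipped image with prompt 2, flipping that second noise estimate back afterward. We then combine them by averaging the two noise estimates. This process is equivalent to the algorithm below:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)), \quad \epsilon = (\epsilon_1 + \epsilon_2)/2$$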
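The averaging step can be sketched with stand-in noise predictors. Everything named here is an assumption for illustration: `flip` is implemented as a vertical flip of a (channels, height, width) array (swap in a 180-degree rotation if that is the intended view), and `predict_noise_1` / `predict_noise_2` stand for the UNet conditioned on prompt 1 and prompt 2.

```python
import numpy as np

def flip(x):
    """Vertical flip of a (C, H, W) array; applying it twice is the identity."""
    return x[..., ::-1, :]

def anagram_noise(x_t, predict_noise_1, predict_noise_2):
    """Average the upright estimate with the un-flipped estimate of the flipped image."""
    eps_1 = predict_noise_1(x_t)
    eps_2 = flip(predict_noise_2(flip(x_t)))
    return (eps_1 + eps_2) / 2.0

# Toy predictors: the first returns zeros, the second a fixed pattern.
pattern = np.arange(12.0).reshape(1, 3, 4)
x_t = np.zeros((1, 3, 4))
eps = anagram_noise(x_t, lambda x: np.zeros_like(x), lambda x: pattern)
```

Flipping the second estimate back is what keeps both estimates in the same orientation before they are averaged.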
Given: "an oil painting of people around a campfire" with "an oil painting of an old man"
Chosen: "an oil painting of a snowy mountain village" with "a photo of the amalfi coast"
Chosen: "a lithograph of waterfalls" with "a lithograph of a skull"
1.9: Hybrid Images
Hybrid images are images that look like two different images depending on whether you view them up close or from afar. To generate hybrid images, I denoised the image using 2 prompts. At each step, we combine the two noise estimates by passing one through a low-pass filter and the other through a high-pass filter. To do so, I used a Gaussian blur with kernel_size = 33 and σ = 2. Following the Factorized Diffusion paper, this process is equivalent to the algorithm below:
$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{UNet}(x_t, t, p_2), \quad \epsilon = f_\text{lowpass}(\epsilon_1) + f_\text{highpass}(\epsilon_2)$$
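One way to sketch the frequency split is with a separable Gaussian blur as the low-pass filter and "input minus blur" as the high-pass filter. This is a NumPy approximation written for illustration, not the project’s code: the helper names are made up, reflect padding is an assumption, and only kernel_size = 33 and σ = 2 come from the text above.

```python
import numpy as np

def gaussian_kernel1d(kernel_size=33, sigma=2.0):
    x = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()  # normalize so flat regions are unchanged

def low_pass(img, kernel_size=33, sigma=2.0):
    """Separable Gaussian blur over the last two axes, with reflect padding."""
    k = gaussian_kernel1d(kernel_size, sigma)
    pad = kernel_size // 2
    out = img
    for axis in (-2, -1):
        pad_width = [(0, 0)] * out.ndim
        pad_width[axis] = (pad, pad)
        padded = np.pad(out, pad_width, mode="reflect")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="valid"), axis, padded
        )
    return out

def hybrid_noise(eps_far, eps_close):
    """Low frequencies from the far-away prompt, high frequencies from the close-up one."""
    return low_pass(eps_far) + (eps_close - low_pass(eps_close))

rng = np.random.default_rng(0)
eps_a = rng.standard_normal((3, 64, 64))
eps_b = rng.standard_normal((3, 64, 64))
eps = hybrid_noise(eps_a, eps_b)
```

The low-passed estimate goes with the prompt meant to be seen from far away, since coarse structure is what survives at a distance.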
Given: "skull" from far away, "waterfall" when close up
Chosen: "campfire" from far away, "amalfi coast" when close up
Chosen: "old man" from far away, "snowy village" when close up
Note: showing two because both are pretty cool!
Project 5B: Diffusion Models from Scratch
With all the learning and experimentation with diffusion models in Part A, we trained our own diffusion models on MNIST in Part B.
Part 1: Training a Single-Step Denoising UNet
1.1: Implementing the UNet
We implemented a one-step denoiser using the UNet architecture below.
1.2: Using the UNet to Train a Denoiser
To prepare for training the denoiser, we need to generate noisy images. We do so by adding noise to a clean MNIST image using z = x + σε, where x is a clean MNIST digit, σ is a constant, and ε ~ N(0, I). Our goal is to recover x if given z.
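The noising step can be sketched in a few lines, assuming the usual additive form z = x + σε with ε drawn from a standard normal. The function name, the random stand-in digit, and the σ values below are hypothetical choices for the example.

```python
import numpy as np

def add_noise(x, sigma, rng):
    """Return z = x + sigma * eps with eps ~ N(0, I)."""
    return x + sigma * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(1, 28, 28))  # stand-in for one MNIST digit

# Hypothetical noise levels, chosen only to show the effect of sigma.
z_clean = add_noise(x, 0.0, rng)
z_mild = add_noise(x, 0.2, rng)
z_heavy = add_noise(x, 1.0, rng)
```

The same helper covers out-of-distribution tests: only σ changes between training-time and test-time noise.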
1.2.1: Training
Now, we can train a denoiser on the MNIST dataset at a fixed noise level σ and a fixed hidden dimension. We use an Adam optimizer with a learning rate of 1e-4 and train for 5 epochs. Here is the training process’s loss curve plot.
With training completed, here are the denoised results on the test set. It works decently well, with cleaner results from epoch 5 (best seen with the digit 0 and leftmost digit 3 examples).
From top to bottom: original, noisy, denoised
- 1 Epoch of Training
- 5 Epochs of Training
1.2.2: Out-of-Distribution Testing
We’ve trained our denoiser on digits noised at a single σ. Here are the results on test set digits that are noisier (higher σ) and less noisy (lower σ).
Part 2: Training a Diffusion Model
2.1: Adding Time Conditioning to UNet
We now need to implement a Denoising Diffusion Probabilistic Model (DDPM) to build and train a UNet model that iteratively denoises an image. This UNet follows the architecture below, containing the new operator FCBlock (fully-connected block) to support the conditioning signal.
2.2: Training the UNet
Now, we can train a time-conditioned UNet to predict the noise in a noisy image x_t, given x_t and the timestep t. We do so by selecting a random image and a random timestep t, predicting the noise, and repeating until the model converges. Here is the training process’s loss curve plot.
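The data side of one training iteration can be sketched without a model. This assumes the standard DDPM recipe (noise each image to a random timestep, then regress the noise with a mean-squared error); the number of timesteps, the beta schedule, and the function names are hypothetical, and the UNet and optimizer are left out.

```python
import numpy as np

def make_training_batch(x0, alphas_cumprod, rng):
    """Pick a random timestep per image and noise the image to that timestep."""
    num_timesteps = len(alphas_cumprod)
    t = rng.integers(0, num_timesteps, size=x0.shape[0])
    eps = rng.standard_normal(x0.shape)           # regression target
    abar = alphas_cumprod[t].reshape(-1, 1, 1, 1)  # broadcast over C, H, W
    x_t = np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps
    return x_t, t, eps

def noise_mse(eps_pred, eps):
    """Mean-squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

# Hypothetical schedule and batch, for illustration only.
betas = np.linspace(1e-4, 0.02, 300)
alphas_cumprod = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x0 = rng.uniform(0.0, 1.0, size=(8, 1, 28, 28))  # stand-in MNIST batch

x_t, t, eps = make_training_batch(x0, alphas_cumprod, rng)
perfect_loss = noise_mse(eps, eps)                # a perfect model scores 0
untrained_loss = noise_mse(np.zeros_like(eps), eps)
```

In a real loop, the model’s prediction for (x_t, t) would replace the stand-in predictions, and the loss would be backpropagated each iteration.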
2.3: Sampling from the UNet
With training complete, here are the results for the time-conditioned UNet, focusing on 5 and 20 epochs.
NOTE: Epoch 20 is the fully trained model.
- 5 Epochs of Training
- 20 Epochs of Training
2.4: Adding Class-Conditioning to UNet
We improve the time-conditioned UNet by optionally conditioning it on the digit class (0-9). To do so, we add 2 more FCBlocks and pass the class as a one-hot vector (instead of a single scalar). With this model, we can choose the digit we want to generate. Here is the training process’s loss curve plot.
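The class signal can be sketched as a one-hot encoding with an optional drop to the all-zero vector, which is one common way to make conditioning "optional". The `p_uncond` parameter and the helper name are assumptions for this example, not details taken from the project.

```python
import numpy as np

def one_hot_classes(labels, num_classes=10, p_uncond=0.0, rng=None):
    """One-hot encode digit labels; zero out rows with probability p_uncond."""
    c = np.eye(num_classes)[np.asarray(labels)]
    if p_uncond > 0.0:
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(len(c)) >= p_uncond  # False means drop the class
        c = c * keep[:, None]
    return c

conditioned = one_hot_classes([3, 7])
unconditioned = one_hot_classes([3, 7], p_uncond=1.0)  # every row dropped
```

An all-zero row carries no class information, so the same network can serve as both the conditional and the unconditional model.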
2.5: Sampling from the Class-Conditioned UNet
With training complete, here are the results for the class-conditioned UNet, focusing on 5 and 20 epochs.
NOTE: Epoch 20 is the fully trained model.
- 5 Epochs of Training
- 20 Epochs of Training
Bells and Whistles: Sampling GIFs
GIF 1: time-conditioned model after 20 epochs of training
GIF 2: time-conditioned and class-conditioned model after 20 epochs of training
Reflection & Bloopers
Definitely a challenging project but had fun reading new papers and experimenting! Splitting up the project and ramping up from working with pre-trained models to training our own in Part B made the project more approachable.
Only saved one blooper :’( so here it is!