cs180: proj5

Fun With Diffusion Models

Project 5A: The Power of Diffusion Models

We focus on working with the pre-trained, text-to-image DeepFloyd IF diffusion model, experimenting with inpainting and creating optical illusions.

Part 0: Setup

With the random seed of 180, I tried out 3 prompts, each with num_inference_steps values of 20 and 100. Overall, I observed that more steps resulted in higher-quality outputs. With num_inference_steps = 20, the outputs look glossy and don’t quite capture the texture — the man looks airbrushed and the oil painting looks cartoon-ish. With num_inference_steps = 100, the man looks more realistic, the rocket ship has a detailed background, and the snowy village looks more like an actual oil painting.

a man wearing a hat (20 steps)
a rocket ship (20 steps)
an oil painting of a snowy mountain village (20 steps)
a man wearing a hat (100 steps)
a rocket ship (100 steps)
an oil painting of a snowy mountain village (100 steps)

Part 1: Sampling Loops

1.1: Implementing the Forward Process

To take a clean image and add noise to it, I implemented forward(im, t), which takes the original image im and a timestep t and adds more noise as t increases. This process is equivalent to computing the equation below:
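x_t = √(ᾱ_t) · x_0 + √(1 − ᾱ_t) · ε,  where ε ~ N(0, I)

Here ᾱ_t is the scheduler’s cumulative product of alphas. A minimal sketch of the function, assuming the alphas_cumprod tensor from DeepFloyd IF’s scheduler:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im at timestep t (DDPM forward process)."""
    a_bar = alphas_cumprod[t]       # cumulative alpha at timestep t
    eps = torch.randn_like(im)      # eps ~ N(0, I)
    return torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps
```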

Here are my results with the given campanile test image.

original
t = 250
t = 500
t = 750

1.2: Classical Denoising

With Gaussian blur filtering, attempting to get rid of the noise also gets rid of the “signal” (the original image). As a result, it is still difficult to recover the original image, and the denoised outputs do not work well. Here are the side-by-side results.
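For reference, a minimal sketch of this classical approach using torchvision’s Gaussian blur (the kernel size and sigma here are illustrative choices, not tuned values):

```python
import torchvision.transforms.functional as TF

# Blurring suppresses high-frequency noise, but it also smears away the
# image's own high-frequency detail, so the "denoised" result stays soft.
denoised = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```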

Top row: Part 1.1’s results; Bottom row: Gaussian blur denoising results

t = 250
t = 500
t = 750
t = 250
t = 500
t = 750

1.3: One-Step Denoising

Using a pre-trained UNet, we estimate the Gaussian noise in the image at timestep t and can then remove that noise to (try to) recover the original image. Overall, this process performs better than Part 1.2’s Gaussian blur, but higher t values result in a deviation from the campanile’s actual look. Here are the side-by-side results.
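A hedged sketch of the one-step estimate: the UNet predicts the noise, and we invert the forward-process equation from Part 1.1. The slice assumes DeepFloyd’s stage-1 UNet returns noise and variance channels together:

```python
import torch

with torch.no_grad():
    # Predicted noise (first 3 of the UNet's 6 output channels).
    eps_hat = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]

a_bar = alphas_cumprod[t]
x0_hat = (x_t - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)  # clean estimate
```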

Top row: original image; Middle row: Part 1.1’s results; Bottom row: one-step denoising results

original
t = 250
t = 500
t = 750
t = 250
t = 500
t = 750

1.4: Iterative Denoising

To address one-step denoising’s issue with higher t values, we implement iterative denoising. We use strided timesteps with a step size of 30, starting at t = 990 and working our way down to t = 0. Each step is equivalent to computing the equation below:
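x_{t′} = ( √(ᾱ_{t′}) β_t / (1 − ᾱ_t) ) x_0 + ( √(α_t) (1 − ᾱ_{t′}) / (1 − ᾱ_t) ) x_t + v_σ

where t′ is the next, less-noisy timestep, α_t = ᾱ_t / ᾱ_{t′}, β_t = 1 − α_t, x_0 is the clean-image estimate from Part 1.3, and v_σ is added random noise. A sketch of the loop; predict_noise(x, t) is an assumed wrapper around the UNet estimate from Part 1.3, and the variance term uses a simple assumed form:

```python
import torch

def iterative_denoise(x_t, strided_timesteps, alphas_cumprod, predict_noise, i_start=0):
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev                 # effective alpha over this stride
        beta = 1 - alpha
        eps = predict_noise(x_t, t)                # UNet noise estimate
        x0 = (x_t - torch.sqrt(1 - a_bar) * eps) / torch.sqrt(a_bar)  # clean estimate
        noise = torch.randn_like(x_t) if t_prev > 0 else torch.zeros_like(x_t)
        x_t = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar) * x0
               + torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar) * x_t
               + torch.sqrt(beta) * noise)         # v_sigma noise term
    return x_t
```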

Here are the results of the process (displaying the intermediate result every 5 loops).

t = 690
t = 540
t = 390
t = 240
t = 90

As a recap, here is the original image alongside the other methods’ results. As we can see, the best, most detailed result is the iteratively denoised image.

original
gaussian
one-step
iterative

1.5: Diffusion Model Sampling

With the iterative_denoise function implemented, I can generate images from pure noise, as sketched below. Here are some of my results (not the best quality, and independent of the seed).
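Sampling amounts to running the loop above starting from pure noise (the 64×64 shape assumes DeepFloyd IF’s stage-1 output):

```python
import torch

# Start from x_T ~ N(0, I) and denoise all the way down to t = 0.
x_T = torch.randn(1, 3, 64, 64, device="cuda")
sample = iterative_denoise(x_T, strided_timesteps, alphas_cumprod, predict_noise)
```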

1.6: Classifier-Free Guidance (CFG)

To improve the results from Part 1.5, we compute both an unconditional and a conditional noise estimate. Using the technique from the Classifier-Free Diffusion Guidance paper, we define our new noise estimate with γ controlling the CFG’s strength, as shown below. Here are 5 images with a CFG scale of γ = 7.
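The combined estimate is ε = ε_u + γ(ε_c − ε_u); with γ > 1, the sample is pushed toward the condition. A minimal sketch, extending the assumed predict_noise wrapper to take prompt embeddings:

```python
def cfg_noise(x_t, t, cond_embeds, uncond_embeds, gamma=7.0):
    eps_c = predict_noise(x_t, t, cond_embeds)    # conditional estimate
    eps_u = predict_noise(x_t, t, uncond_embeds)  # unconditional ("" prompt) estimate
    return eps_u + gamma * (eps_c - eps_u)        # CFG combination
```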

1.7: Image-to-Image Translation

Using iterative_denoise_cfg, we add noise to the original image and then iteratively denoise it to generate new images. For all images, I used noise levels [1, 3, 5, 7, 10, 20] and the text prompt "a high quality photo". Here are the results for the given campanile_image.png and my chosen dog.png and bay_bridge.png.
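Here, the noise levels index into the strided timesteps from Part 1.4. A sketch under those assumptions (with the plain iterative_denoise standing in for the CFG variant):

```python
# Noise the original to each starting level, then denoise back down.
for i_start in [1, 3, 5, 7, 10, 20]:
    t = strided_timesteps[i_start]
    x_t = forward(im, t, alphas_cumprod)   # forward process from Part 1.1
    edit = iterative_denoise(x_t, strided_timesteps, alphas_cumprod,
                             predict_noise, i_start=i_start)
```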

From left to right: increasing noise level, ending with the original image

1.7.1: Editing Hand-Drawn and Web Images

Now, let’s run this same process for hand-drawn and non-realistic images. For all images, I used noise levels [1, 3, 5, 7, 10, 20] and text prompt "a high quality photo". Here are the results for the internet’s avocado.png and mike.png as well as my hand-drawn house.png and turtle.png.

From left to right: increasing noise level, ending with the original image

1.7.2: Inpainting

Using a similar process, we implement the inpaint function to create a new image that keeps the original content where m = 0 but generates new content where m = 1. Following the RePaint paper, we run the diffusion denoising loop and, after every step, compute the equation below:
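x_t ← m ⊙ x_t + (1 − m) ⊙ forward(x_orig, t)

That is, everything outside the mask is overwritten with the original image re-noised to the current timestep, so only the masked region is actually generated. A minimal sketch, reusing the forward function from Part 1.1:

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    known = forward(x_orig, t, alphas_cumprod)  # re-noise the original to level t
    return mask * x_t + (1 - mask) * known      # keep generated content where m == 1
```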

Here are the results for the given campanile_image.png and my chosen coffee.png and smiski.png.

From left to right: original, mask, to fill, inpainted

The smiski one is funny :)

1.7.3: Text-Conditional Image-to-Image Translation

Now, we continue with the SDEdit method but guide the projection with a text prompt. Our goal is to create images that gradually look more like the original image while still maintaining similarity to the text prompt. For all images, I used noise levels [1, 3, 5, 7, 10, 20]. Here are the results.

From left to right: increasing noise level, ending with the original image

Given: campanile_image.png with "a rocket ship" prompt

Chosen: dog.png with "a photo of a man" prompt

Chosen: bay_bridge.png with "a pencil" prompt

1.8: Visual Anagrams

Visual anagrams are images that look like two different images when rotated 180 degrees. To generate visual anagrams, I denoised two times at each step: first the image with prompt 1, and second the flipped image with prompt 2. We then un-flip the second noise estimate and combine the two by averaging. This process is equivalent to the algorithm below:
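ε₁ = UNet(x_t, t, p₁)
ε₂ = flip(UNet(flip(x_t), t, p₂))
ε = (ε₁ + ε₂) / 2

A minimal sketch, reusing the assumed cfg_noise helper (flipping both spatial dims gives the 180-degree rotation):

```python
import torch

def anagram_noise(x_t, t, embeds_1, embeds_2, uncond_embeds):
    eps_1 = cfg_noise(x_t, t, embeds_1, uncond_embeds)   # upright estimate
    flipped = torch.flip(x_t, dims=(-2, -1))             # rotate 180 degrees
    eps_2 = cfg_noise(flipped, t, embeds_2, uncond_embeds)
    eps_2 = torch.flip(eps_2, dims=(-2, -1))             # rotate the estimate back
    return (eps_1 + eps_2) / 2                           # average the two estimates
```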

Given: “an oil painting of people around a campfire” with “an oil painting of an old man”

Chosen: “an oil painting of a snowy mountain village” with “a photo of the amalfi coast”

Chosen: “a lithograph of waterfalls” with “a lithograph of a skull”

1.9: Hybrid Images

Hybrid images are images that look like two different images when viewed up close versus from afar. To generate hybrid images, I denoised the image using 2 prompts. At each step, we combine the two noise estimates by passing one through a high-pass filter and the other through a low-pass filter. To do so, I used a Gaussian blur with kernel_size = 33 and σ = 2. Following the Factorized Diffusion paper, this process is equivalent to the algorithm below:
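ε = f_lowpass(ε₁) + f_highpass(ε₂)

where ε₁ comes from the prompt meant to be seen from afar and ε₂ from the prompt meant to be seen up close. A minimal sketch, again assuming the cfg_noise helper:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, embeds_far, embeds_near, uncond_embeds):
    eps_1 = cfg_noise(x_t, t, embeds_far, uncond_embeds)
    eps_2 = cfg_noise(x_t, t, embeds_near, uncond_embeds)
    low = TF.gaussian_blur(eps_1, kernel_size=33, sigma=2)            # low-pass
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=33, sigma=2)   # high-pass
    return low + high
```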

Given: “skull” from far away, “waterfall” when close up

Chosen: “campfire” from far away, “amalfi coast” when close up

Chosen: “old man” from far away, “snowy village” when close up

Note: showing two because both are pretty cool!

Project 5B: Diffusion Models from Scratch

After all the learning and experimentation with pre-trained diffusion models in Part A, we trained our own diffusion models on MNIST in Part B.

Part 1: Training a Single-Step Denoising UNet

1.1: Implementing the UNet

We implemented a one-step denoiser using the UNet architecture below.

1.2: Using the UNet to Train a Denoiser

To prepare for training the denoiser, we need to generate noisy images. We do so by adding noise to a clean MNIST image using z = x + σ·ε, where x is a clean MNIST digit, σ is a constant, and ε ~ N(0, I). Our goal is to recover x given z.
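A one-line sketch of building these training pairs (applied on the fly to each batch):

```python
import torch

def add_noise(x, sigma=0.5):
    return x + sigma * torch.randn_like(x)  # z = x + sigma * eps, eps ~ N(0, I)
```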

1.2.1: Training

Now, we can train a denoiser with σ = 0.5 and hidden dimension D = 128 on the MNIST dataset. We use an Adam optimizer with a learning rate of 1e-4 and train for 5 epochs. Here is the training process’s loss curve plot.
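A hedged sketch of the training loop, assuming unet and train_loader are defined as above (device handling omitted):

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:          # labels are unused by the plain denoiser
        z = add_noise(x, sigma=0.5)    # noisy input from the sketch above
        loss = F.mse_loss(unet(z), x)  # L2 loss against the clean digit
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```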

With training completed, here are the denoised results on the test set. It works decently well, with cleaner results from epoch 5 (best seen with the digit 0 and leftmost digit 3 examples).

From top to bottom: original, noisy (σ = 0.5), denoised

  • 1 Epoch of Training
  • 5 Epochs of Training

1.2.2: Out-of-Distribution Testing

We’ve trained our denoiser on noisy σ = 0.5 digits. Here are the results with noisier (higher σ) and less noisy (lower σ) test set digits.

Part 2: Training a Diffusion Model

2.1: Adding Time Conditioning to UNet

We now implement a Denoising Diffusion Probabilistic Model (DDPM) by building and training a UNet model that iteratively denoises an image. This UNet follows the architecture below, adding the new operator FCBlock (fully-connected block) to support the conditioning signal.
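A minimal sketch of an FCBlock; the Linear, GELU, Linear structure is my assumption of the block’s shape:

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps a conditioning scalar/vector to a feature-sized embedding."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, c):
        return self.net(c)
```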

2.2: Training the UNet

Now, we can train a time-conditioned UNet to predict the noise in a noisy image x_t given x_t and the timestep t. We do so by selecting a random image, sampling a random t, noising the image, predicting the noise, and repeating until the model converges. Here is the training process’s loss curve plot.
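A hedged sketch of one epoch of this procedure (T = 300 and the learning rate follow the usual DDPM recipe and are assumptions here; alphas_cumprod is the precomputed schedule):

```python
import torch
import torch.nn.functional as F

T = 300
optimizer = torch.optim.Adam(unet.parameters(), lr=1e-3)
for x, _ in train_loader:
    t = torch.randint(1, T + 1, (x.shape[0],))                  # random timesteps
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps   # forward process
    loss = F.mse_loss(unet(x_t, (t.float() / T).view(-1, 1)), eps)  # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```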

2.3: Sampling from the UNet

With training complete, here are the results for the time-conditioned UNet, focusing on 5 and 20 epochs.

NOTE: Epoch 20 is the fully trained model.

  • 5 Epochs of Training
  • 20 Epochs of Training

2.4: Adding Class-Conditioning to UNet

We improve the time-conditioned UNet implementation by optionally conditioning it on the digit’s class (0-9), adding 2 more FCBlocks and using a one-hot class vector (instead of a single scalar). With this model, we can choose which digit to generate. Here is the training process’s loss curve plot.
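A sketch of how the two signals might enter a hidden layer; the multiplicative class gate plus additive time signal is an assumption about the exact wiring:

```python
import torch
import torch.nn.functional as F

def condition(h, t_norm, c_onehot, fc_time, fc_class):
    """h: feature map [B, C, H, W]; t_norm: [B, 1]; c_onehot: [B, 10]."""
    c = fc_class(c_onehot).view(h.shape[0], -1, 1, 1)  # class gate (multiplicative)
    t = fc_time(t_norm).view(h.shape[0], -1, 1, 1)     # time signal (additive)
    return c * h + t

# One-hot encoding of the digits we want to generate:
labels = torch.tensor([3, 7])
c_onehot = F.one_hot(labels, num_classes=10).float()   # [B, 10]
```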

2.5: Sampling from the Class-Conditioned UNet

With training complete, here are the results for the class-conditioned UNet, focusing on 5 and 20 epochs.

NOTE: Epoch 20 is the fully trained model.

  • 5 Epochs of Training
  • 20 Epochs of Training

Bells and Whistles: Sampling GIFs

GIF 1: time-conditioned model after 20 epochs of training

GIF 2: time-conditioned and class-conditioned model after 20 epochs of training

Reflection & Bloopers

Definitely a challenging project, but I had fun reading new papers and experimenting! Splitting up the project and ramping up from working with pre-trained models to training our own in Part B made it more approachable.

Only saved one blooper :’( so here it is!