The goal of this project is to understand and implement several state-of-the-art image generation models, including Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), score-based models, and Conditional Flow Matching models.
A diffusion model has two major components: a forward process and a backward/reverse process. In the forward process, the model gradually adds noise to an image until the input signal is completely destroyed; mathematically, the noise being added is Gaussian. This gradual corruption of the input signal over time is called the diffusion process. In the reverse diffusion process, the model takes a completely noisy image and learns to gradually remove the noise that was added in the forward process. Since this reverse process denoises the diffusion process, the model is named a Denoising Diffusion Probabilistic Model.
In the forward diffusion process, the amount of noise added at each timestep is governed by a variance schedule $\{\beta_t\}$, where each $\beta_t \in (0,1)$ specifies how much Gaussian noise is injected at step $t$. From this schedule we define $\alpha_t = 1 - \beta_t$, which represents the proportion of the original signal that remains after a single diffusion step. Since $\beta_t$ is always positive, $\alpha_t$ is always less than one, implying that a small portion of the image is lost at every timestep. Over multiple steps, the cumulative product $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$ expresses how much of the initial image $x_0$ survives after $t$ iterations of the forward process. As $t$ increases, $\bar{\alpha}_t$ monotonically decreases toward zero, indicating that the clean signal becomes progressively overwhelmed by noise. This formulation characterizes the diffusion process: at early timesteps the image remains largely intact, while at later timesteps the representation becomes dominated by Gaussian noise, eventually approaching pure noise at the final diffusion step. This progression establishes the fully corrupted state from which the reverse denoising process must reconstruct the data distribution. Formally, the equation for the forward diffusion process becomes:
$$ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon $$
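The schedule and the closed-form forward step above can be sketched as follows. This is a minimal illustration, assuming a linear $\beta$ schedule with the common endpoints $10^{-4}$ and $0.02$ over $T = 1000$ steps; these particular values are an assumption, not taken from this report.

```python
import numpy as np

# Assumed linear variance schedule {beta_t}; endpoints are illustrative.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas                 # alpha_t = 1 - beta_t
alpha_bars = np.cumprod(alphas)      # alpha_bar_t = prod_{i<=t} alpha_i

def forward_diffuse(x0, t, rng):
    """Sample x_t from q(x_t | x_0) in closed form."""
    eps = rng.standard_normal(x0.shape)   # epsilon ~ N(0, I)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))        # stand-in for an image
x_early, _ = forward_diffuse(x0, 10, rng)   # mostly intact signal
x_late, _ = forward_diffuse(x0, T - 1, rng) # essentially pure noise

# alpha_bar_t decreases monotonically toward zero, as described above.
assert np.all(np.diff(alpha_bars) < 0) and alpha_bars[-1] < 1e-4
```

Note how the closed form lets us jump directly to any timestep $t$ without simulating the intermediate steps one by one.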
I implemented the code for sampling from the given pretrained diffusion model. Mathematically, this denoising diffusion process is defined as:
$$ x_{t-1} = \mu_\theta(x_t,t) + \sigma_t z $$
where $\mu_\theta(x_t,t)$, the approximation of the posterior mean, is defined as:
$$ \mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\epsilon}_\theta(x_t, t)\right) $$
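One reverse step of this sampler can be sketched as below. This is a hedged sketch, not the report's actual implementation: `eps_model` is a hypothetical stand-in for the trained noise predictor $\hat{\epsilon}_\theta$, the linear $\beta$ schedule is an assumption, and $\sigma_t^2 = \beta_t$ is one common choice for the sampling variance.

```python
import numpy as np

# Assumed linear schedule, as in the forward-process sketch.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(xt, t):
    """Hypothetical stand-in for the trained noise predictor eps_theta."""
    return np.zeros_like(xt)

def ddpm_step(xt, t, rng):
    """One reverse step: x_{t-1} = mu_theta(x_t, t) + sigma_t * z."""
    eps_hat = eps_model(xt, t)
    mu = (xt - (1.0 - alphas[t]) / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) \
         / np.sqrt(alphas[t])
    sigma = np.sqrt(betas[t])        # one common choice: sigma_t^2 = beta_t
    z = rng.standard_normal(xt.shape) if t > 0 else 0.0  # no noise at t = 0
    return mu + sigma * z

rng = np.random.default_rng(0)       # fixing the seed fixes z at every step
x = rng.standard_normal((32, 32))    # start from pure Gaussian noise
for t in reversed(range(T)):
    x = ddpm_step(x, t, rng)
```

Because all randomness flows through the seeded generator, rerunning this loop reproduces the same sample, which is exactly the determinism discussed below.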
For the sake of reproducibility, the initial random Gaussian noise is fixed during sampling. For example, if I want to generate 10 images, each image is generated from its own fixed pure Gaussian noise each time I run the code. The noise in the $\sigma_t z$ term added during each denoising step is also fixed by the random seed, which makes the model behave deterministically and generate the same images each time I sample from it. Figure 1 shows 10 images sampled from the pretrained diffusion model, where each image took roughly 30 seconds to generate.

Fig 1: Images sampled from the pretrained diffusion model

Fig 2: Images generated using random z during denoising step
I also experimented with using a random $z$ instead of a fixed $z$ during the denoising process, and the model generated different images each time I sampled from it, as shown in Figure 2. Adding this random $z$ at each timestep makes the denoising diffusion process stochastic, even when it starts from the same pure Gaussian noise. To illustrate how the model gradually removes the pure Gaussian noise, Figure 3 shows the intermediate noisy images during the denoising process.

Fig 3: Denoising process of a diffusion model
I then implemented a function that predicts $x_0$ directly given $x_t$, to understand how the model performs if we remove all of the noise in a single step rather than removing it gradually over $T$ timesteps. Mathematically,
$$ \hat{x}_{0 \mid t} = C_t \, x_t + D_t \, \hat{\epsilon}_\theta(x_t, t) $$
This can be derived from the forward process of DDPM where,
$$ x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon $$
$$ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}} \left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right) $$
$$ \text{where } C_t = \frac{1}{\sqrt{\bar{\alpha}_t}}, \quad D_t = -\frac{\sqrt{1 - \bar{\alpha}_t}}{\sqrt{\bar{\alpha}_t}}, \quad \text{and } \epsilon = \hat{\epsilon}_\theta(x_t, t) $$
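The one-step prediction above can be sketched as follows, again assuming a linear $\beta$ schedule. The sanity check at the end uses the true noise $\epsilon$ in place of the network's estimate, in which case the formula recovers $x_0$ exactly by construction.

```python
import numpy as np

# Assumed linear schedule, as in the earlier sketches.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_x0(xt, t, eps_hat):
    """One-step estimate: x0_hat = C_t * x_t + D_t * eps_hat."""
    C_t = 1.0 / np.sqrt(alpha_bars[t])
    D_t = -np.sqrt(1.0 - alpha_bars[t]) / np.sqrt(alpha_bars[t])
    return C_t * xt + D_t * eps_hat

# Sanity check: with the true epsilon, the prediction inverts the
# forward process and recovers x_0 exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32))
t = 500
eps = rng.standard_normal(x0.shape)
xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
assert np.allclose(predict_x0(xt, t, eps), x0)
```

With a real network, $\hat{\epsilon}_\theta$ is only an estimate, so the one-step prediction is typically blurry at large $t$, which is why the iterative reverse process is still needed.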