Brain Latent Progression: Individual-based spatiotemporal disease progression on 3D Brain MRIs via latent diffusion
Reference: Puglisi, L., Alexander, D. C., Alzheimer’s Disease Neuroimaging Initiative, Australian Imaging Biomarkers and Lifestyle Flagship Study of Ageing, & Ravì, D. (2025). Brain latent progression: Individual-based spatiotemporal disease progression on 3D brain MRIs via latent diffusion. Medical Image Analysis. https://doi.org/10.1016/j.media.2025.103734
Written by Jeremi Levesque.
This article reviews an Alzheimer’s spatiotemporal disease progression model called Brain Latent Progression (BrLP). In short, this is a diffusion model that takes a T1w image and particular covariates from one timepoint and predicts/generates another T1w image for a future timepoint.
The methodology of this article builds almost entirely on two other important methodologies, latent diffusion models and ControlNet, which are somewhat intertwined. The two following sections give the reader a general idea of how these subjects interact with one another before diving into the review of the article itself. For a more detailed look at those two concepts, please refer to the references below.
Latent Diffusion Models (LDM) #
DDPM
- Forward diffusion. Iteratively noise input image \(x_0\) until convergence to pure Gaussian noise:
- \(q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)\), where \(\beta_t\) follows a variance schedule.
- Reverse diffusion. Recover initial image \(x_0\) from pure noise \(x_T\):
- The reverse step has a closed-form Gaussian posterior \(q(x_{t-1} \mid x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)\), but we don’t have \(x_0\)!
- Reparametrize to instead learn the noise \(\epsilon_{\theta}(x_t, t)\), which does not require \(x_0\)
- Loss: \(L_{\epsilon} = \mathbb{E}_{t, x_0, \epsilon \sim \mathcal{N}(0, I)} \left[ \lVert \epsilon - \epsilon_{\theta}(x_t, t) \rVert^2 \right]\)
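As a concrete illustration, here is a minimal PyTorch sketch of the forward diffusion and the \(\epsilon\)-prediction loss above. The schedule values, shapes, and the `model` callable are assumptions for illustration, not code from any of the papers:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear variance schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t = prod_s (1 - beta_s)

def ddpm_loss(model, x0):
    """One training step: noise x0 to a random timestep t, predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)
    eps = torch.randn_like(x0)
    # Closed-form forward diffusion: x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    abar = alphas_bar.to(x0.device)[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps
    return torch.nn.functional.mse_loss(model(x_t, t), eps)
```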
LDM 1
- Extends DDPM, but reduces the computational cost by performing the diffusion process in a compressed latent space \(z\).
- The latent space is learned by an autoencoder, with encoder \(z = \mathcal{E}(x)\) and decoder \(\hat{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))\).
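Since the only change with respect to DDPM is where the diffusion happens, LDM training can be sketched by reusing `ddpm_loss` on the latents. This assumes a pretrained, frozen autoencoder; `encoder` and `unet` are placeholder names:

```python
def ldm_loss(unet, encoder, x0):
    """Same eps-prediction loss as DDPM, but in the autoencoder's latent space."""
    with torch.no_grad():      # the autoencoder is trained first and then frozen
        z0 = encoder(x0)       # z = E(x): diffusion happens on this representation
    return ddpm_loss(unet, z0)
```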

ControlNet #
Intuitively, ControlNet adds a condition to guide (similarly to fine-tuning) the denoising process of an already-trained large diffusion model. To fine-tune a model, one might think to simply continue training the model itself on new data, thereby conditioning the generation. However, doing this can easily lead to overfitting, catastrophic forgetting, or other undesirable scenarios. In a nutshell, ControlNet instead injects a conditioning value into the generation process of the LDM: it freezes the weights of the LDM and learns new weights to inject the right information at the appropriate places.
More formally, since a LDM generally has a U-Net structure composed of several encoding and decoding blocks, let’s say one of those blocks is defined as: $$ y = \mathcal{F}(x; \Theta) $$
\(\mathcal{F}(\cdot; \Theta)\) being a trained LDM block with fixed weights \(\Theta\), taking the input feature map \(x\) and outputting another feature map \(y\). The base ControlNet formalism extends this formulation as follows:
$$ y_c = \mathcal{F}(x;\Theta) + {\color{blue}\mathcal{Z}}({\color{red}\mathcal{F}(}x + {\color{violet}\mathcal{Z}(c;\Theta_{z1})}{\color{red}; \Theta_c)}{\color{blue};\Theta_{z2})} $$
\(\mathcal{Z}\) being a 1×1 convolution with weights \(\color{violet}\Theta_{z1}\) and \(\color{blue}\Theta_{z2}\) initialized to zero, and \(\mathcal{F}(\cdot ; \Theta_c)\) a trainable copy of the corresponding encoding block from the frozen LDM. One might worry that zero-initialized convolution layers cannot learn, but the authors demonstrate this is not the case. In fact, initializing those weights to zero prevents the trainable branch from injecting harmful noise early on, making the initial update process more gradual and stable. Since \(\color{violet}\Theta_{z1}\) and \(\color{blue}\Theta_{z2}\) are initialized at zero, the ControlNet term of this equation is initially zero, producing the output \(y_c = y\) (see the sketch below; Figure 2 illustrates this formalism).
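A minimal PyTorch sketch of one ControlNet-wrapped block, with zero-initialized 1×1 convolutions standing in for \(\mathcal{Z}\). 3D convolutions are assumed here since BrLP works on 3D volumes; the block layout is an illustration, not the official implementation:

```python
import copy
import torch.nn as nn

class ControlledBlock(nn.Module):
    """y_c = F(x; Θ) + Z(F(x + Z(c; Θ_z1); Θ_c); Θ_z2)."""
    def __init__(self, frozen_block: nn.Module, channels: int):
        super().__init__()
        self.trainable = copy.deepcopy(frozen_block)  # trainable copy F(.; Θ_c)
        self.frozen = frozen_block                    # F(.; Θ), weights kept frozen
        for p in self.frozen.parameters():
            p.requires_grad_(False)
        self.zin = nn.Conv3d(channels, channels, kernel_size=1)   # Z(.; Θ_z1)
        self.zout = nn.Conv3d(channels, channels, kernel_size=1)  # Z(.; Θ_z2)
        for conv in (self.zin, self.zout):
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, x, c):
        # zout outputs zeros at init, so initially y_c == y (the frozen output).
        return self.frozen(x) + self.zout(self.trainable(x + self.zin(c)))
```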


As illustrated in Figure 3, the ControlNet structure is applied to each encoder block of the U-Net. Each trainable copy block in the ControlNet architecture has the same architecture and, initially, the same weights as its corresponding block in the LDM (i.e. Stable Diffusion in this case).
As an example, the following figure comes directly from the ControlNet 2 paper (ref. figure 1). The first column on the left shows the input condition images for both use cases. The next two columns (noted as “Default”) are generations from the latent diffusion model with an empty string as the prompt and the condition images as input to the ControlNet. The last three columns show the same setup, except the empty string is replaced by an actual description of the desired image as the input prompt. The input conditions clearly succeed in adding a guiding constraint for image generation. We will see in the following sections how this can be used in the context of disease progression prediction.

Spatiotemporal disease progression on 3D Brain MRIs 3 #
Early approaches in disease progression modelling focused on the analysis of scalar biomarkers, which pointed towards volumetric changes. Here, the authors use diffusion networks to characterize the progression of Alzheimer’s disease spatially and temporally. Spatiotemporal modelling makes it possible to observe shape alterations in some brain regions prior to any detectable volumetric reduction, which scalar biomarkers would not capture.
Their methodology addresses four important challenges in disease progression modelling:
| | Challenge | Description | Solved using |
|---|---|---|---|
| 1 | Individualization | Account for as many individual factors as possible (e.g. demographic & clinical variables). | LDM + ControlNet |
| 2 | Longitudinal data exploitation | When longitudinal data is available, models should exploit it: it offers non-negligible insights such as the individual rate of progression. | Auxiliary model |
| 3 | Spatiotemporal consistency | Predictions should be spatially coherent and smooth across timepoints. | LAS |
| 4 | Memory consumption | The lower the memory footprint, the easier the adoption, since most scenarios don't have a cluster of GPUs. | Latent diffusion models |
Related works #
Population-based Learn an average disease trajectory from longitudinal patient data and map the patient onto this average “curve”. However, mapping a patient to the population average is hard, there is too much individual spatial variation w.r.t. the average, and the subject’s age does not map directly onto the age timeline of the population average.
Individual-based Use the subject’s age as the temporal variable; the spatial modelling is mostly handled by the following deep learning methods.
- GANs
- 4D-DaniNet: embeds disease progression knowledge into the loss, generates 2D slices which are merged into a 3D image using a super-resolution module.
- CounterSynth: model generates diffeomorphic transformations to be applied to the brain image conditioned on the covariates (e.g. age).
- VAEs
- Normalizing flows
- Bidirectional model that can either: 1) MRI -> estimate brain age, 2) brain age -> MRI. Also uses diffeomorphic transformations as they are differentiable and invertible.
- Diffusion models
- SADM (Sequence-Aware DM) allows for generation of longitudinal brain scans with autoregressive sampling informed by sequential MRI data.
Challenges
- Population average is not individualized enough.
- Individual-based methods do not incorporate subject-specific metadata (e.g. demographics).
- Most methods do not exploit the available longitudinal data effectively.
- Predicting deformation fields: can only deform existing anatomical content, cannot generalize to new structures.
- Spatiotemporal consistency has only been explored by early works on GANs and has since received less attention.
- Most methods that try to circumvent the computational burden rely on generating 2D slices and thus fail to capture inter-slice dependencies.
Brain Latent Progression model #
The architecture of the proposed model, the Brain Latent Progression (BrLP) model, is made of 4 principal components: 1) an LDM, 2) a ControlNet, 3) an auxiliary model and 4) a LAS block. Figure 1, taken directly from the paper 3, illustrates the complete workflow of this model, and the following sections briefly detail each of these components.
1. LDM + ControlNet #
The LDM aims to generate 3D brain MRIs in accordance with specific covariates \(c = <s, v>\), where \(s\) is the subject’s metadata (age, sex, cognitive status) and \(v\) contains progression-related metrics (volumes of brain regions).
- Train an autoencoder to produce the latent representation \(z = \mathcal{E}(x)\) and to decode \(\hat{x} = \mathcal{D}(z)\).
- Train a conditional UNet predicting the noise \(\epsilon_{\theta}(z_t, t, c)\) using the same loss as LDMs.
- Incorporate covariates \(c = <s, v>\) using cross-attention.
ControlNet conditions the generation on the “previous frame”. If \(x^A\) and \(x^B\) are images from the same patient at ages A < B, with latent representations \(z^A\) and \(z^B\), ControlNet has to predict \(\epsilon_{\theta, \phi}(z_t^{B}, t, c^{B}, {\color{violet} z^{A}})\), with \(\theta\) being the frozen LDM weights and \(\phi\) the trainable ControlNet weights. This way, the model generates the disease progression from the conditioning image \(A\) instead of randomly generating a brain from scratch based only on the covariates.
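To make this concrete, a hypothetical training step for this ControlNet stage might look like the sketch below, reusing `T`, `betas`, and `alphas_bar` from the DDPM sketch. `controlnet_unet` stands for \(\epsilon_{\theta,\phi}\); all names and shapes are assumptions, not the paper's code:

```python
def brlp_controlnet_loss(controlnet_unet, encoder, xA, xB, cB):
    """Learn eps_{theta,phi}(z_t^B, t, c^B, z^A) from a pair of visits."""
    with torch.no_grad():
        zA, zB = encoder(xA), encoder(xB)   # latents of the earlier/later visits
    b = zB.shape[0]
    t = torch.randint(0, T, (b,), device=zB.device)
    eps = torch.randn_like(zB)
    abar = alphas_bar.to(zB.device)[t].view(b, *([1] * (zB.dim() - 1)))
    zt = abar.sqrt() * zB + (1 - abar).sqrt() * eps    # noise the target latent
    # Only the ControlNet weights phi receive gradients; theta stays frozen.
    return torch.nn.functional.mse_loss(controlnet_unet(zt, t, cB, zA), eps)
```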
2. Auxiliary model #
\(f_{\psi}\) predicts volumetric changes of AD-related regions \(\hat{v}^B\) using the covariates \(c = <s, v>\) from the previous time-step(s).
Two cases:
- First scan at time A Perform a regression using the single available set of covariates \(c^A\) as such: \(\hat{v}^B = f_{\psi}(c^A)\)
- Subject with two or more scans Use Disease Course Mapping (DCM) to provide a more accurate trajectory based on the history of the \(n\) available scans as such: \(\hat{v}^B = f_{\psi}(c^{A_1}, \ldots, c^{A_n})\)
This incorporates prior knowledge of volumetric changes (from DCM or any other disease progression model).
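For the single-scan case, a toy stand-in for \(f_{\psi}\) could be a small regression network from the covariates to the follow-up volumes. The MLP choice and the dimensions below are illustrative assumptions, not the paper's auxiliary model (which uses DCM in the multi-scan case):

```python
import torch.nn as nn

# Hypothetical regressor: covariates c^A (plus target age) -> volumes v^B.
aux_model = nn.Sequential(
    nn.Linear(9, 64),   # e.g. age_A, target age_B, sex, diagnosis, region volumes
    nn.ReLU(),
    nn.Linear(64, 4),   # predicted volumes v^B for the conditioned regions
)
# v_hat_B = aux_model(cA)   # with c^A flattened into a feature vector
```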
3. Latent Average Stabilization #
Inference:
- Use auxiliary model: \(\hat{v}^B = f_{\psi}(c^A)\)
- Concatenate to form the covariate vector: \(c^B = <s^B, \hat{v}^B>\)
- Convert the MRI to latent space: \(z^A = \mathcal{E}(x^A)\)
- Sample random noise: \(z_T \sim \mathcal{N}(0, I)\)
- Reverse diffusion by predicting noise to remove: \(\epsilon_{\theta, \phi}(z_t, t, c^B, z^A)\)
- Decode final reconstruction: \(\hat{x}^B = \mathcal{D}(z_0)\)
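Putting the steps above together, a minimal single-run sampler (before LAS) might look like this; plain DDPM ancestral sampling is assumed, though the paper may use a different sampler, and `unet_cn`, `encoder`, `decoder` are the placeholders from the earlier sketches:

```python
@torch.no_grad()
def reverse_diffusion(unet_cn, zA, cB):
    """One reverse-diffusion run in latent space, from z_T ~ N(0, I) to z_0."""
    z = torch.randn_like(zA)
    for t in reversed(range(T)):
        tt = torch.full((z.shape[0],), t, device=z.device, dtype=torch.long)
        eps = unet_cn(z, tt, cB, zA)    # eps_{theta,phi}(z_t, t, c^B, z^A)
        beta = betas[t].to(z.device)
        abar = alphas_bar[t].to(z.device)
        # DDPM posterior mean; add noise except at the final step
        z = (z - beta / (1 - abar).sqrt() * eps) / (1 - beta).sqrt()
        if t > 0:
            z = z + beta.sqrt() * torch.randn_like(z)
    return z

# Single-run prediction: x_hat^B = D(z_0)
# x_hat_B = decoder(reverse_diffusion(unet_cn, encoder(xA), cB))
```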
Problem: sampling different \(z_T \sim \mathcal{N}(0, I)\) generates slightly different images, which causes artifacts when chaining predictions over several timepoints.
Solution: LAS. Essentially, sample \(n\) times, run the reverse diffusion on each sample, average the resulting \(n\) denoised latents, and then decode the averaged latent. This improves spatiotemporal consistency.
It also allows computing uncertainty (similarly to Epistemic Neural Networks), as sketched below.
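A minimal LAS sketch on top of the sampler above, using the across-run standard deviation as a crude uncertainty proxy (the paper's exact uncertainty computation may differ):

```python
@torch.no_grad()
def las_predict(unet_cn, encoder, decoder, xA, cB, n=8):
    """Latent Average Stabilization: average n denoised latents, decode once."""
    zA = encoder(xA)
    z0s = torch.stack([reverse_diffusion(unet_cn, zA, cB) for _ in range(n)])
    uncertainty = z0s.std(dim=0)     # voxel-wise spread across the n runs
    return decoder(z0s.mean(dim=0)), uncertainty
```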
Results #
They present a small ablation study which essentially concludes that:
- LAS decreases the volumetric differences and improves the image-based metrics as the number of runs \(n\) increases from 2 to 64 (at the expense of computation).
- Using the auxiliary (AUX) model improves the volumetric error significantly for conditioned regions, marginally for unconditioned regions, and makes no difference for the thalamus.
- Combining LAS + AUX generally points towards more accurate results.
They also compare against single-frame and sequence-aware prediction models on an internal and an external dataset (see paper); in essence:
- Internal dataset: significantly beats all other methods across the board.
- External dataset: significantly better everywhere except for volumetric errors on conditioned region volumes for MCI or AD patients in the Hippocampus/Amygdala regions, where it’s on par or slightly worse.
Brief results on uncertainty:
- Global uncertainty increases significantly with prediction distance (i.e. time between scans).
- Higher global uncertainty \(\rightarrow\) decreased structural similarity.
The following figure (Fig. 6) shows the uncertainty map (3rd row) compared with the actual error map (4th row): uncertainty grows as the age gap increases, along with the error in those same regions.
Applications #
In AD-related studies, it is increasingly important to select candidates with a fast-progressing disease when running clinical trials. Identifying such candidates is a challenge, and failing to do so properly might render the study of a treatment inconclusive, which the authors refer to as Type II errors for clinical trials. Even if a treatment is effective, it might not show clear benefits in subjects with a slow-progressing disease, thus causing the rejection of a potentially beneficial treatment. The authors argue that BrLP can be used to assess the disease progression rate at the individual level, thus facilitating the identification of fast-progressor candidates suited for a clinical trial.
Limitations #
- Quantitative metrics are less accurate for AD patients than for healthy aging (AD progression is hard to predict because the data is so heterogeneous/complex), but this applies to all models.
- Sex bias: the model performs slightly better on female subjects for image-based metrics.
- The VAE seems to introduce blur/oversmoothing in all images.
References #
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10684-10695). ↩︎
Zhang, L., Rao, A., & Agrawala, M. (2023). Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 3836-3847). ↩︎
Puglisi, L., Alexander, D. C., Ravì, D., & Alzheimer’s Disease Neuroimaging Initiative. (2025). Brain latent progression: Individual-based spatiotemporal disease progression on 3D brain MRIs via latent diffusion. Medical Image Analysis, 103734. ↩︎ ↩︎