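Diffusion Model Fundamentals and a Guide to Applying Them to Omics Data

This post explains the basic theory of diffusion models mathematically and shows how to apply them to omics data, with PyTorch example code. It is intended to help bioinformatics researchers understand and use generative models.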

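Why Diffusion Models?

A generative model learns the distribution of data in order to generate new samples. Diffusion models, which started in image generation, are now spreading across the life sciences, including single-cell transcriptomics, proteomics, and drug response prediction.

Comparison with earlier generative models:

Model	Strengths	Weaknesses	Omics examples
GAN	Fast generation	Unstable training, mode collapse	scGAN
VAE	Stable training, interpretable latent space	Blurry (over-smoothed) samples	scVI, PRnet
Flow	Exact likelihood computation	Architectural constraints	scFlow
Diffusion	High sample quality, stable training	Slow generation	scDiffusion, GDDM

Diffusion models attract attention in omics for several reasons:

Training stability: no need to balance a generator against a discriminator as in GANs
High-quality generation: captures the fine-grained structure of the data distribution without the over-smoothing typical of VAEs
Flexible conditional generation: generation can be controlled by cell type, tissue, disease state, and other conditions
Theoretical grounding: a solid mathematical framework rooted in non-equilibrium thermodynamics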

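Part 1: Basic Theory of Diffusion Models

Core Idea

The core idea of a diffusion model is remarkably simple:

Forward process: gradually add Gaussian noise to the data until it becomes pure noise
Reverse process: train a neural network to recover the original data from noise

As an analogy, a drop of ink in water gradually diffuses until the color is uniform. The reverse process is like recovering the original ink drop from that uniformly colored water.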

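1.1 Forward Diffusion Process

Starting from the original data x₀, the forward process is a Markov chain that gradually adds Gaussian noise over T steps.

Noise added at each step:

q(xₜ | xₜ₋₁) = N(xₜ; √(1 - βₜ) · xₜ₋₁, βₜ · I)

where:

βₜ : the noise schedule value at timestep t, with 0 < β₁ < β₂ < … < β_T < 1
N(x; μ, σ²I) : a Gaussian distribution over x with mean μ and covariance σ²I
I : the identity matrix

Because a composition of Gaussian steps is again Gaussian, the whole forward process up to step t can be written in closed form:

αₜ = 1 - βₜ
ᾱₜ = α₁ · α₂ · ... · αₜ  (cumulative product)

q(xₜ | x₀) = N(xₜ; √ᾱₜ · x₀, (1 - ᾱₜ) · I)

With the reparameterization trick, a noisy sample at any timestep t can therefore be drawn directly:

xₜ = √ᾱₜ · x₀ + √(1 - ᾱₜ) · ε,  ε ~ N(0, I)

This property is very important. During training there is no need to add noise step by step: the noisy version at any timestep can be obtained from the original data x₀ in a single operation.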
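
As a quick sanity check of the closed form, the sketch below tracks the signal coefficient and the noise variance through the step-by-step Markov chain and compares them with √ᾱₜ and 1 - ᾱₜ. The function names are illustrative, and the linear β values are only an assumed example schedule.

```python
import math

def linear_betas(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linear noise schedule β_1..β_T (assumed example values)."""
    return [beta_1 + (beta_T - beta_1) * i / (T - 1) for i in range(T)]

def stepwise_moments(betas):
    """Track x_t = m_t * x0 + (noise with variance v_t) through the chain.

    One step of q(x_t | x_{t-1}) scales the signal by sqrt(1 - β_t), scales
    the accumulated noise variance by (1 - β_t), and adds fresh variance β_t.
    """
    m, v = 1.0, 0.0
    for beta in betas:
        m = math.sqrt(1.0 - beta) * m
        v = (1.0 - beta) * v + beta
    return m, v

def closed_form_moments(betas):
    """Closed form: m_t = sqrt(ᾱ_t), v_t = 1 - ᾱ_t."""
    alpha_bar = 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

betas = linear_betas()
m_step, v_step = stepwise_moments(betas[:300])        # 300 sequential steps
m_closed, v_closed = closed_form_moments(betas[:300])  # one-shot formula
print(m_step, m_closed)
print(v_step, v_closed)
```

Both pairs of numbers agree up to floating-point error, which is why training can jump straight to an arbitrary timestep.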

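1.2 Noise Schedule

The design of βₜ has a large effect on model performance. Two common schedules:

Linear schedule (Ho et al., 2020):

βₜ = β₁ + (β_T - β₁) · (t - 1) / (T - 1)
β₁ = 0.0001, β_T = 0.02, T = 1000

Cosine schedule (Nichol & Dhariwal, 2021):

ᾱₜ = f(t) / f(0),  f(t) = cos((t/T + s) / (1 + s) · π/2)²

Here s is a small offset (s = 0.008 in the original paper). Compared with the linear schedule, the cosine schedule destroys information more slowly in the early steps, so fewer timesteps are spent on nearly pure noise. Because omics data are sparse, the cosine schedule is often the better fit, though this is worth checking empirically on each dataset.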
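
To make the difference concrete, the following sketch computes ᾱₜ for both schedules with the parameter values quoted above and prints them at a few timesteps. The helper names are illustrative.

```python
import math

def linear_alpha_bar(T=1000, beta_1=1e-4, beta_T=0.02):
    """ᾱ_1..ᾱ_T for the linear schedule."""
    alpha_bar, out = 1.0, []
    for i in range(T):
        beta = beta_1 + (beta_T - beta_1) * i / (T - 1)
        alpha_bar *= 1.0 - beta
        out.append(alpha_bar)
    return out

def cosine_alpha_bar(T=1000, s=0.008):
    """ᾱ_1..ᾱ_T for the cosine schedule, ᾱ_t = f(t) / f(0)."""
    def f(t):
        return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

lin = linear_alpha_bar()
cos_ = cosine_alpha_bar()
for t in (100, 250, 500, 750):
    # Index t - 1 because the lists start at timestep 1.
    print(f"t={t:4d}  linear={lin[t - 1]:.4f}  cosine={cos_[t - 1]:.4f}")
```

At the midpoint t = 500 the cosine schedule still retains roughly half of the signal power, while the linear schedule has already dropped below a tenth.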

1.3 Reverse Process

The reverse process starts from pure noise x_T ~ N(0, I) and recovers the original data x₀:

pθ(xₜ₋₁ | xₜ) = N(xₜ₋₁; μθ(xₜ, t), Σθ(xₜ, t))

The neural network εθ is trained to predict the noise that was added at each timestep. The mean μθ is computed from the predicted noise:

μθ(xₜ, t) = 1/√αₜ · (xₜ - βₜ/√(1 - ᾱₜ) · εθ(xₜ, t))

1.4 Training Objective

Training a DDPM (Denoising Diffusion Probabilistic Model, Ho et al. 2020) reduces to a simple MSE loss on the predicted noise:

L_simple = E[‖ε - εθ(xₜ, t)‖²]
         = E[‖ε - εθ(√ᾱₜ · x₀ + √(1 - ᾱₜ) · ε, t)‖²]

where:

ε ~ N(0, I): the noise that was actually added
εθ(xₜ, t): the noise predicted by the neural network
t ~ Uniform({1, …, T}): a randomly chosen timestep

The training procedure at a glance:

Sample original data x₀ from a batch
Sample a timestep t from the uniform distribution
Sample Gaussian noise ε ~ N(0, I)
Compute the noisy data: xₜ = √ᾱₜ · x₀ + √(1 - ᾱₜ) · ε
Predict the noise with the network: ε̂ = εθ(xₜ, t)
Compute the loss: L = ‖ε - ε̂‖²
Update θ by backpropagation
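
The first six steps can be sketched in a few lines of NumPy. This is only an illustration: `zero_model` is a hypothetical stand-in for a trained network, the linear schedule is an assumed choice, and the loss is averaged over elements (as `F.mse_loss` does) rather than summed.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)    # ᾱ_t, stored at index t - 1

def training_loss(x0, eps_model, rng):
    """One evaluation of L_simple for a batch x0 of shape (batch, dim)."""
    batch = x0.shape[0]
    t = rng.integers(1, T + 1, size=batch)           # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(x0.shape)              # ε ~ N(0, I)
    a = alpha_bar[t - 1][:, None]                    # ᾱ_t for each sample
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps   # closed-form forward step
    eps_hat = eps_model(x_t, t)                      # ε̂ = εθ(x_t, t)
    return np.mean((eps - eps_hat) ** 2)

def zero_model(x_t, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

x0 = rng.standard_normal((512, 16))
loss = training_loss(x0, zero_model, rng)
print(loss)
```

A predictor that always outputs zero gives a loss close to 1, the variance of ε, so a useful denoiser has to push the loss below that baseline.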

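1.5 The Score-Based View

From the viewpoint of the score-based generative models of Song & Ermon (2019), a diffusion model can be interpreted as learning a score function:

Score function: sθ(x) ≈ ∇x log p(x)

The score function is the gradient of the log probability of the data distribution and points in the direction of increasing data density. The noise prediction εθ and the score function are related by:

sθ(xₜ, t) = -εθ(xₜ, t) / √(1 - ᾱₜ)

In other words, predicting the noise is equivalent to learning the score function.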
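
The relation can be verified exactly in a toy case. Assume scalar data x₀ ~ N(0, s²): then pₜ is Gaussian with variance ᾱₜ·s² + (1 - ᾱₜ), its score is known in closed form, and the optimal noise predictor is the conditional mean E[ε | xₜ]. The sketch below, with illustrative function names and arbitrary numbers, checks that the two sides match.

```python
import math

def gaussian_score(x_t, alpha_bar, s2):
    """Exact score of p_t when x0 ~ N(0, s2): p_t = N(0, ᾱ_t * s2 + 1 - ᾱ_t)."""
    var_t = alpha_bar * s2 + (1.0 - alpha_bar)
    return -x_t / var_t

def optimal_eps(x_t, alpha_bar, s2):
    """Optimal noise prediction E[ε | x_t] for the same Gaussian toy data.

    ε and x_t are jointly Gaussian with Cov(ε, x_t) = sqrt(1 - ᾱ_t).
    """
    var_t = alpha_bar * s2 + (1.0 - alpha_bar)
    return math.sqrt(1.0 - alpha_bar) * x_t / var_t

alpha_bar, s2, x_t = 0.6, 4.0, 1.5   # arbitrary example values
score_direct = gaussian_score(x_t, alpha_bar, s2)
score_from_eps = -optimal_eps(x_t, alpha_bar, s2) / math.sqrt(1.0 - alpha_bar)
print(score_direct, score_from_eps)
```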

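1.6 The Unified SDE Framework

Song et al. (ICLR 2021, Outstanding Paper) unified DDPM and score-based models through stochastic differential equations (SDEs). As the number of timesteps goes to infinity, the discrete Markov chain converges to a continuous-time SDE:

Forward SDE:

dx = f(x, t)dt + g(t)dw

Reverse SDE:

dx = [f(x, t) - g(t)² · ∇x log pₜ(x)]dt + g(t)dw̄

Here ∇x log pₜ(x) is exactly the score function, and w̄ is a Wiener process running backward in time. A Probability Flow ODE, which samples deterministically without stochastic noise, can also be derived:

dx = [f(x, t) - ½g(t)² · ∇x log pₜ(x)]dt

This ODE gives a theoretical interpretation of DDIM (Denoising Diffusion Implicit Models), whose deterministic sampler can be viewed as a discretization of it, and it enables fast sampling with fewer steps.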
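
A toy integration of the Probability Flow ODE shows the deterministic mapping at work. The sketch assumes a variance-preserving SDE with f(x, t) = -½β(t)x, g(t) = √β(t), and a linear β(t) on [0, 1]. It also assumes scalar Gaussian data x(0) ~ N(0, s²), so the exact score is available and no network is needed. All names and constants are illustrative.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed VP-SDE schedule β(t) on t in [0, 1]

def beta(t):
    return BETA_MIN + t * (BETA_MAX - BETA_MIN)

def marginal_var(t, s2):
    """Variance of p_t when x(0) ~ N(0, s2) under the VP-SDE."""
    big_b = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t   # integral of β on [0, t]
    return 1.0 + (s2 - 1.0) * math.exp(-big_b)

def probability_flow_sample(x1, s2, n_steps=1000):
    """Integrate the probability flow ODE from t = 1 back to t = 0 (Euler)."""
    x, dt = x1, 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i * dt
        score = -x / marginal_var(t, s2)        # exact score of Gaussian p_t
        drift = -0.5 * beta(t) * x - 0.5 * beta(t) * score
        x = x - drift * dt                      # step backward in time
    return x

s2 = 0.25
x0 = probability_flow_sample(x1=1.0, s2=s2)
# For Gaussian marginals the ODE keeps x / sqrt(var_t) constant along a path.
expected = 1.0 * math.sqrt(s2 / marginal_var(1.0, s2))
print(x0, expected)
```

The Euler result matches the analytic endpoint up to discretization error, and the same starting noise always yields the same sample.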

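Part 2: Diffusion Models for Omics Data

2.1 What Makes Omics Data Different

Gene expression data differ fundamentally from images:

Property	Images	Gene expression
Dimensionality	Fixed resolution (e.g., 256×256×3)	Thousands to tens of thousands of genes
Spatial structure	Strong correlation between neighboring pixels	No meaningful gene order (unordered)
Distribution	Integer intensities in [0, 255]	Non-negative, highly sparse
Zero fraction	Very small	60-90% (dropout)
Networks	CNN, ViT, etc.	MLP, Transformer

Because of these differences, applying an image diffusion model directly to omics data performs poorly. The key adaptation strategies are described in the following sections.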

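2.2 Latent Diffusion Model Approach

Instead of running diffusion directly on high-dimensional gene expression data, an effective strategy is to run it in a latent space:

[gene expression profile] → [encoder] → [latent representation] → [diffusion process] → [decoder] → [generated profile]
   ~20,000 dims                          128-256 dims                                                ~20,000 dims

Advantages:

Dimensionality reduction greatly improves computational efficiency
The encoder filters out dropout noise and learns a cleaner representation
The latent space can be regularized toward a near-Gaussian distribution, which suits the diffusion process

Representative examples:

scDiffusion (Luo et al., 2024): uses the SCimilarity foundation model as the encoder, with a 128-dimensional latent space
GDDM (Gao et al., 2024): VAE-based encoder-decoder with diffusion in the latent space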

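2.3 Choosing a Network Architecture

U-Net and DiT (Diffusion Transformer) are standard for image diffusion models, but gene expression data have no spatial structure, so a different architecture is needed:

MLP-based denoising network (the most common choice):

A multilayer perceptron with skip connections
The timestep embedding is injected into every layer
Since gene order carries no meaning, an MLP is a natural choice

Transformer-based:

Treats genes as tokens and applies self-attention
Can model gene-gene interactions explicitly
Computationally expensive but highly expressive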

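2.4 Conditional Generation Strategies

Conditional generation of omics data is done in two main ways:

Classifier Guidance:

A separate classifier is trained to steer generation
Advantage: it can be trained independently of the denoising network
Disadvantage: an additional model is required
Example: the cell type classifier in scDiffusion

Classifier-Free Guidance (CFG):

Conditional and unconditional generation are learned jointly in a single model
During training, the condition is dropped with some probability (replaced by an empty condition)
At inference, the difference between the conditional and unconditional predictions is amplified
Example: cfDiffusion (Zhang et al., 2025)

ε̂_guided = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)

Here w is the guidance scale. With w = 1 the prediction is purely conditional, and larger values of w make generation follow the condition more closely.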
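
The two ingredients of CFG, condition dropout during training and the guided combination at inference, can be sketched as follows. The null label index, the dropout probability, and the function names are assumptions for illustration and are not taken from cfDiffusion.

```python
import numpy as np

def drop_labels(labels, p_uncond, null_label, rng):
    """Training-time dropout: replace a label by null_label with probability p_uncond."""
    mask = rng.random(labels.shape[0]) < p_uncond
    return np.where(mask, null_label, labels)

def guided_noise(eps_cond, eps_uncond, w):
    """ε̂_guided = ε̂_uncond + w · (ε̂_cond - ε̂_uncond)."""
    return eps_uncond + w * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
labels = np.array([0, 1, 2, 3, 4, 5, 6, 7])
NULL = 10   # hypothetical extra index reserved for "no condition"
train_labels = drop_labels(labels, p_uncond=0.1, null_label=NULL, rng=rng)

eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.5, 1.0])
unconditional = guided_noise(eps_cond, eps_uncond, w=0.0)   # equals eps_uncond
conditional = guided_noise(eps_cond, eps_uncond, w=1.0)     # equals eps_cond
amplified = guided_noise(eps_cond, eps_uncond, w=3.0)       # extrapolates past eps_cond
print(unconditional, conditional, amplified)
```

With this setup the embedding table needs one extra row for the null label. Note that Ho & Salimans write the same idea as (1 + w) · ε̂_cond - w · ε̂_uncond, so their w is shifted by one relative to the formula used here.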

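2.5 Comparison of Major Omics Diffusion Models

Model	Year	Data type	Latent space	Guidance	Key feature
scDiffusion	2024	scRNA-seq	SCimilarity (128d)	Classifier	Generates developmental trajectories via Gradient Interpolation
cfDiffusion	2025	scRNA-seq	Autoencoder	Classifier-free	No separate classifier needed, multi-attribute conditional generation
scVAEDer	2025	scRNA-seq	VAE + Diffusion	Hybrid	Predicts perturbation responses with vector arithmetic
scLDM	2025	scRNA-seq	Transformer VAE	Classifier-free	Guarantees gene exchangeability, Flow Matching loss
DCM	2026	scRNA-seq	Direct (discrete)	Conditional	Models count data directly with discrete diffusion

Recent trends: since 2025, classifier-free guidance has become mainstream (e.g., cfDiffusion), and DCM (2026), which models the discrete nature of gene expression directly, reported a fivefold improvement in MMD² (RBF kernel) over earlier continuous diffusion models.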

Part 3: Implementing an Omics Diffusion Model in PyTorch

Let's turn the theory into code. We implement a simple latent diffusion model for gene expression data.

3.1 Environment Setup

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
from torch.utils.data import DataLoader, TensorDataset
import scanpy as sc
import anndata as ad

3.2 Data Preprocessing

Preprocess the scRNA-seq data into a form suitable for a diffusion model.

def preprocess_scrna(adata, n_top_genes=2000):
    """Preprocessing pipeline for scRNA-seq data."""
    # Basic filtering
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalization and log transform
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Select highly variable genes (HVGs)
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var.highly_variable].copy()

    # Scaling (z-score normalization, clipped at 10)
    sc.pp.scale(adata, max_value=10)

    return adata

# Example: load PBMC data
# adata = sc.read_h5ad("pbmc_data.h5ad")
# adata = preprocess_scrna(adata)
# X = torch.FloatTensor(adata.X)  # (n_cells, n_genes)
# cell_types = adata.obs['cell_type'].cat.codes.values  # integer labels

3.3 Autoencoder

First, train an autoencoder that compresses gene expression data into a low-dimensional latent space.

class GeneAutoencoder(nn.Module):
    """Gene expression autoencoder: compresses high-dimensional data into a latent space."""

    def __init__(self, n_genes, latent_dim=128):
        super().__init__()

        # Encoder: n_genes → 512 → 256 → latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(n_genes, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, latent_dim),
        )

        # Decoder: latent_dim → 256 → 512 → n_genes
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.BatchNorm1d(512),
            nn.Linear(512, n_genes),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        z = self.encode(x)
        x_recon = self.decode(z)
        return x_recon, z

3.4 Noise Schedule and Diffusion Utilities

class DiffusionSchedule:
    """Cosine noise schedule, often a good fit for omics data."""

    def __init__(self, timesteps=1000, s=0.008):
        self.timesteps = timesteps

        # Compute the cosine schedule
        steps = torch.arange(timesteps + 1, dtype=torch.float64)
        f = torch.cos((steps / timesteps + s) / (1 + s) * torch.pi / 2) ** 2
        alphas_cumprod = f / f[0]

        # Derive βₜ from ᾱₜ
        betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
        betas = torch.clamp(betas, 0.0001, 0.999)

        self.betas = betas.float()
        self.alphas = (1.0 - self.betas).float()
        self.alphas_cumprod = torch.cumprod(self.alphas, dim=0)

        # Quantities needed by the reverse process
        self.sqrt_alphas_cumprod = torch.sqrt(self.alphas_cumprod)
        self.sqrt_one_minus_alphas_cumprod = torch.sqrt(1.0 - self.alphas_cumprod)
        self.sqrt_recip_alphas = torch.sqrt(1.0 / self.alphas)

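    def q_sample(self, x0, t, noise=None):
        """Forward process: xₜ = √ᾱₜ · x₀ + √(1-ᾱₜ) · ε"""
        if noise is None:
            noise = torch.randn_like(x0)

        # Index on the device where the schedule tensors live, then move the
        # gathered coefficients to the data's device to avoid a device mismatch.
        t_idx = t.to(self.sqrt_alphas_cumprod.device)
        sqrt_alpha = self.sqrt_alphas_cumprod[t_idx].view(-1, 1).to(x0.device)
        sqrt_one_minus_alpha = self.sqrt_one_minus_alphas_cumprod[t_idx].view(-1, 1).to(x0.device)

        return sqrt_alpha * x0 + sqrt_one_minus_alpha * noise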

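3.5 Denoising Network

We implement an MLP-based denoising network suited to gene expression data. The key ingredients are the timestep embedding and skip connections.

class SinusoidalPosEmb(nn.Module):
    """Sinusoidal position embedding: encodes the timestep as a vector."""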

    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        device = t.device
        half_dim = self.dim // 2
        emb = np.log(10000) / (half_dim - 1)
        emb = torch.exp(torch.arange(half_dim, device=device) * -emb)
        emb = t[:, None].float() * emb[None, :]
        return torch.cat([emb.sin(), emb.cos()], dim=-1)


class ResidualMLPBlock(nn.Module):
    """MLP block with a skip connection."""

    def __init__(self, dim, time_emb_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim),
            nn.SiLU(),
            nn.Linear(dim, dim),
        )
        # Projection for timestep conditioning
        self.time_proj = nn.Sequential(
            nn.SiLU(),
            nn.Linear(time_emb_dim, dim),
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, t_emb):
        # Add the timestep information, pass through the MLP, then apply the skip connection
        h = self.norm(x)
        h = h + self.time_proj(t_emb)
        return x + self.mlp(h)


class OmicsDenoiser(nn.Module):
    """Denoising network for omics data.

    Input: noisy latent representation zₜ and timestep t
    Output: predicted noise ε̂
    """

    def __init__(self, latent_dim=128, hidden_dim=512, time_emb_dim=128,
                 n_layers=6, n_classes=None):
        super().__init__()

        # Timestep embedding
        self.time_mlp = nn.Sequential(
            SinusoidalPosEmb(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.SiLU(),
        )

        # Conditional generation: cell type embedding (optional)
        self.class_emb = None
        if n_classes is not None:
            self.class_emb = nn.Embedding(n_classes, time_emb_dim)

        # Input projection
        self.input_proj = nn.Linear(latent_dim, hidden_dim)

        # Stack of residual MLP blocks
        self.blocks = nn.ModuleList([
            ResidualMLPBlock(hidden_dim, time_emb_dim)
            for _ in range(n_layers)
        ])

        # Output projection (noise prediction)
        self.output_proj = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, z_t, t, class_labels=None):
        """
        Args:
            z_t: noisy latent representation (batch, latent_dim)
            t: timesteps (batch,)
            class_labels: cell type labels (batch,), optional
        Returns:
            noise_pred: predicted noise (batch, latent_dim)
        """
        # Timestep embedding plus condition embedding
        t_emb = self.time_mlp(t)
        if self.class_emb is not None and class_labels is not None:
            t_emb = t_emb + self.class_emb(class_labels)

        # Denoising
        h = self.input_proj(z_t)
        for block in self.blocks:
            h = block(h, t_emb)

        return self.output_proj(h)

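3.6 Training Loop

def train_diffusion(model, autoencoder, dataloader, schedule,
                    epochs=100, lr=1e-4, device='cuda'):
    """Training loop for the latent diffusion model."""

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    # Move the schedule tensors to the training device so that q_sample can
    # index them with timesteps that live on that device.
    for name, value in list(vars(schedule).items()):
        if isinstance(value, torch.Tensor):
            setattr(schedule, name, value.to(device))

    model.train()
    autoencoder.eval()  # the autoencoder stays frozen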

    for epoch in range(epochs):
        total_loss = 0
        for batch_x, batch_labels in dataloader:
            batch_x = batch_x.to(device)
            batch_labels = batch_labels.to(device)

            # Step 1: extract latent representations with the autoencoder
            with torch.no_grad():
                z0 = autoencoder.encode(batch_x)

            # Step 2: sample random timesteps
            t = torch.randint(0, schedule.timesteps, (z0.shape[0],),
                            device=device)

            # Step 3: sample noise and run forward diffusion
            noise = torch.randn_like(z0)
            z_t = schedule.q_sample(z0, t, noise)

            # Step 4: predict the noise
            noise_pred = model(z_t, t, class_labels=batch_labels)

            # Step 5: compute the loss (simple MSE)
            loss = F.mse_loss(noise_pred, noise)

            # Step 6: backpropagate
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(dataloader)
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.6f}")

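3.7 Sampling (Reverse Process)

Generate new gene expression profiles from the trained model.

@torch.no_grad()
def sample(model, autoencoder, schedule, n_samples, latent_dim=128,
           class_labels=None, device='cuda'):
    """DDPM sampling: generate gene expression profiles from noise."""

    model.eval()
    autoencoder.eval()  # BatchNorm layers must use running statistics when decoding

    # Start from pure Gaussian noise
    z = torch.randn(n_samples, latent_dim, device=device)

    # Reverse process: t = T-1, ..., 0 (zero-based timestep indices)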
    for t in reversed(range(schedule.timesteps)):
        t_batch = torch.full((n_samples,), t, device=device, dtype=torch.long)

        # Predict the noise
        noise_pred = model(z, t_batch, class_labels=class_labels)

        # Compute μθ(xₜ, t)
        alpha_t = schedule.alphas[t]
        alpha_bar_t = schedule.alphas_cumprod[t]
        beta_t = schedule.betas[t]

        mean = (1 / torch.sqrt(alpha_t)) * (
            z - (beta_t / torch.sqrt(1 - alpha_bar_t)) * noise_pred
        )

        # Add noise (only when t > 0)
        if t > 0:
            noise = torch.randn_like(z)
            sigma = torch.sqrt(beta_t)
            z = mean + sigma * noise
        else:
            z = mean

    # Decode the latent representation into a gene expression profile
    generated_expression = autoencoder.decode(z)

    return generated_expression
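
The loop above visits all T timesteps. As a rough illustration of how a DDIM-style deterministic update (η = 0) can skip steps, the NumPy sketch below samples over an evenly spaced subsequence of timesteps. It is not part of the pipeline in this post: `ddim_sample` and the `oracle` noise predictor are hypothetical names, and the linear schedule is an assumed example.

```python
import numpy as np

def ddim_sample(eps_model, alpha_bar, shape, n_steps, rng):
    """Deterministic DDIM-style sampling (η = 0) over a subsequence of timesteps.

    alpha_bar[k] holds ᾱ for zero-based timestep k.
    """
    T = len(alpha_bar)
    # Evenly spaced timesteps from T-1 down to 0.
    steps = np.linspace(T - 1, 0, n_steps).round().astype(int)
    x = rng.standard_normal(shape)
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        # ᾱ of the next, less noisy step; 1.0 means "fully denoised".
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps_hat = eps_model(x, t)
        # Predicted clean sample implied by the noise estimate.
        x0_hat = (x - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)
        # Re-noise the estimate to the next noise level using the same ε̂.
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x

# Toy check: if every data point equals `target`, the optimal noise
# prediction is (x_t - √ᾱ_t · target) / √(1 - ᾱ_t).
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
target = np.array([[1.0, -2.0, 0.5]])

def oracle(x, t):
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

out = ddim_sample(oracle, alpha_bar, target.shape, n_steps=20,
                  rng=np.random.default_rng(0))
print(out)
```

With the oracle predictor, 20 steps are enough to land on the target. With a learned network, the achievable step reduction depends on the model and data and has to be checked empirically.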

3.8 Running the Full Pipeline

def run_pipeline():
    """Full pipeline: load data → train → generate."""
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # ===== Data preparation (example) =====
    # In practice, load an h5ad file with scanpy
    n_cells, n_genes, n_classes = 10000, 2000, 10
    X = torch.randn(n_cells, n_genes)  # in practice: adata.X
    labels = torch.randint(0, n_classes, (n_cells,))  # in practice: cell_type codes

    dataset = TensorDataset(X, labels)
    dataloader = DataLoader(dataset, batch_size=256, shuffle=True)

    # ===== Phase 1: train the autoencoder =====
    latent_dim = 128
    autoencoder = GeneAutoencoder(n_genes, latent_dim).to(device)

    ae_optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
    for epoch in range(50):
        for batch_x, _ in dataloader:
            batch_x = batch_x.to(device)
            x_recon, z = autoencoder(batch_x)
            loss = F.mse_loss(x_recon, batch_x)
            ae_optimizer.zero_grad()
            loss.backward()
            ae_optimizer.step()

    # ===== Phase 2: train the diffusion model =====
    schedule = DiffusionSchedule(timesteps=1000)
    model = OmicsDenoiser(
        latent_dim=latent_dim,
        hidden_dim=512,
        n_layers=6,
        n_classes=n_classes
    ).to(device)

    train_diffusion(model, autoencoder, dataloader, schedule,
                    epochs=100, device=device)

    # ===== Phase 3: conditional sample generation =====
    # Generate 100 cells of a specific cell type (e.g., class 3)
    target_labels = torch.full((100,), 3, device=device, dtype=torch.long)
    generated = sample(model, autoencoder, schedule,
                       n_samples=100, class_labels=target_labels,
                       device=device)

    print(f"Generated data shape: {generated.shape}")  # (100, 2000)
    return generated

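Part 4: Evaluation and Visualization

Key metrics for assessing the quality of generated cells:

4.1 Evaluation Metrics

Metric	Description	Ideal value
MMD (Maximum Mean Discrepancy)	Distance between the real and generated data distributions	Closer to 0 is better
SCC (Spearman Correlation Coefficient)	Correlation between real and generated gene expression	Closer to 1 is better
LISI (Local Inverse Simpson's Index)	Degree of mixing between real and generated cells	Higher is better
RF AUC (Random Forest)	How well a classifier separates real from generated cells	Closer to 0.5 is better
Cell Type Accuracy	Classification accuracy of the generated cells' types	Higher is better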
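
As an example of how the first metric can be computed, here is a minimal NumPy sketch of a biased squared-MMD estimate with an RBF kernel. The bandwidth default (gamma = 1 / n_features) is an assumption for illustration. Published benchmarks typically fix or tune the kernel differently, so absolute values are not comparable across papers.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """k(x, y) = exp(-gamma * ||x - y||²) for all pairs of rows in a and b."""
    sq_dists = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mmd2_rbf(real, generated, gamma=None):
    """Biased estimate of squared MMD between two sample sets (rows = cells)."""
    if gamma is None:
        gamma = 1.0 / real.shape[1]   # assumed default bandwidth, not tuned
    k_rr = rbf_kernel(real, real, gamma).mean()
    k_gg = rbf_kernel(generated, generated, gamma).mean()
    k_rg = rbf_kernel(real, generated, gamma).mean()
    return k_rr + k_gg - 2.0 * k_rg

rng = np.random.default_rng(0)
real = rng.standard_normal((200, 50))
same_dist = rng.standard_normal((200, 50))       # same distribution as `real`
shifted = rng.standard_normal((200, 50)) + 1.0   # mean-shifted distribution

mmd_same = mmd2_rbf(real, same_dist)
mmd_shifted = mmd2_rbf(real, shifted)
print(mmd_same, mmd_shifted)
```

The estimate for the shifted set is clearly larger than for the set drawn from the same distribution, which is the behavior the metric relies on.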

4.2 UMAP Visualization Example

def evaluate_generated(real_adata, generated_tensor, cell_type_names):
    """Visual quality check: joint UMAP of real and generated cells."""
    import anndata as ad
    import scanpy as sc

    # Convert the generated data to AnnData
    gen_adata = ad.AnnData(
        X=generated_tensor.detach().cpu().numpy(),
        var=real_adata.var.copy()
    )
    gen_adata.obs['source'] = 'generated'

    real_subset = real_adata.copy()
    real_subset.obs['source'] = 'real'

    # Concatenate, then compute UMAP
    combined = ad.concat([real_subset, gen_adata])
    combined.obs_names_make_unique()  # real and generated indices may collide
    sc.pp.pca(combined)
    sc.pp.neighbors(combined)
    sc.tl.umap(combined)

    # Plot
    sc.pl.umap(combined, color='source', title='Real vs Generated Cells')

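Summary and Outlook

Key Takeaways

A diffusion model generates data through a forward (noise-adding) process and a reverse (noise-removing) process, and it is trained with a simple MSE loss
For omics data, a latent diffusion approach is effective, and because there is no spatial structure an MLP-based denoiser is a natural choice
Conditional generation makes it possible to selectively generate data for a specific cell type, tissue, or disease state

Future Directions

Discrete diffusion models: as DCM (2026) suggests, directly modeling the discrete nature of gene expression counts can yield large gains (a reported fivefold improvement in MMD² over continuous diffusion). It also avoids the information loss incurred when counts are transformed into a continuous space
Multi-omics diffusion models: unified models that jointly generate transcriptome, epigenome, and proteome data. scDiffusion-X is follow-up work in this direction
Spatial transcriptomics diffusion models: generate spatial gene expression patterns in tissue. Stem (2025) demonstrated inferring spatial gene expression from H&E images
Aging research: model changes in cell state over time with Gradient Interpolation and reconstruct the continuous transcriptomic changes of aging
Drug response prediction: predict transcriptomic responses to drugs with perturbation-conditioned diffusion models. scVAEDer (2025) proposed predicting perturbation responses with vector arithmetic in the latent space
Flow Matching: a next-generation approach to the slow sampling of diffusion models. scLDM (2025) adopts a Flow Matching loss based on linear interpolation for more efficient training and generation

Generative models for omics data have advanced rapidly since 2024. The introduction of diffusion models has greatly improved sample quality, and new approaches keep arriving, including combinations with foundation models, discrete diffusion, and Flow Matching. A new chapter of in silico biology that goes beyond the limits of experimental data is opening.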

References

Diffusion Model Fundamentals

Ho, J., Jain, A. & Abbeel, P. Denoising Diffusion Probabilistic Models. NeurIPS (2020). arXiv:2006.11239
Song, Y. & Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS (2019). arXiv:1907.05600
Song, Y. et al. Score-Based Generative Modeling through Stochastic Differential Equations. ICLR (2021). arXiv:2011.13456
Nichol, A. & Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. ICML (2021). arXiv:2102.09672
Rombach, R. et al. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR (2022). arXiv:2112.10752
Ho, J. & Salimans, T. Classifier-Free Diffusion Guidance. NeurIPS Workshop (2022). arXiv:2207.12598

Omics Diffusion Models

Luo, E. et al. scDiffusion: conditional generation of high-quality single-cell data using diffusion model. Bioinformatics 40, btae518 (2024). DOI
Zhang, H. et al. cfDiffusion: Conditional generation of high-quality single-cell data via classifier-free guidance. Brief. Bioinform. 26, bbaf071 (2025). DOI
Gao, H. et al. scVAEDer: Integrating deep diffusion models and variational autoencoders for single-cell transcriptomics analysis. Genome Biol. 26, 60 (2025). DOI
scLDM: A Scalable Latent Diffusion Model for single-cell data. arXiv (2025). arXiv:2511.02986
Guo, Z. et al. Diffusion models in bioinformatics and computational biology. Nat. Rev. Bioeng. 2, 136–154 (2024). DOI
