YouAndMeNet - Diffusion v2

What is Diffusion? - Technical Background

The Core Idea

Diffusion models learn to reverse a gradual noising process. Instead of learning to generate images directly (like GANs), they learn to denoise images step by step.

Forward Process (Adding Noise)

We gradually add Gaussian noise to an image over T timesteps until it becomes pure noise:

x₀ → x₁ → x₂ → ... → x_T (pure noise)

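At any step t (closed form): x_t = √ᾱ_t · x₀ + √(1-ᾱ_t) · ε
where ε ~ N(0, I) is random Gaussian noise and ᾱ_t is the cumulative product of the per-step α values (alphas_cumprod in the sampling code), so x_t can be computed directly from x₀ without looping over the earlier steps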
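The forward process can be tried out on a handful of numbers. The sketch below is a minimal plain-Python illustration, not the project's training code: it assumes the cosine schedule from the Improved DDPM paper (the schedule named in the training parameters), and the names cosine_alpha_bar and q_sample are hypothetical. Here alpha_bar is the cumulative signal level, the quantity the sampling code calls alphas_cumprod.

```python
import math
import random

T = 1000  # number of diffusion timesteps, as in the model overview


def cosine_alpha_bar(t, T=T, s=0.008):
    """Cumulative signal level at step t for the cosine schedule (assumed here)."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def q_sample(x0, t, rng):
    """Jump directly from clean values x0 to their noisy version x_t."""
    a_bar = cosine_alpha_bar(t)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, I)
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]  # three "pixels" scaled to [-1, 1]
for t in (0, 250, 500, 999):
    x_t, _ = q_sample(x0, t, rng)
    print(t, round(cosine_alpha_bar(t), 4), [round(v, 3) for v in x_t])
```

At t=0 the values come back unchanged, and by t=999 almost no signal is left, which is why sampling can start from pure Gaussian noise.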

Reverse Process (Denoising)

The neural network learns to predict the noise that was added, allowing us to reverse the process:

x_T (noise) → x_{T-1} → ... → x₀ (image)

Model predicts: ε_θ(x_t, t, c) ≈ noise added at step t
Given condition c (our 32 features)
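The training signal behind this is a plain mean squared error between the true noise and the predicted noise. The following is a small sketch under stated assumptions rather than the project's training loop: the two toy "models" are hypothetical stand-ins for the U-Net, and the signal level a_bar is just an example value.

```python
import math
import random


def noise_prediction_loss(model, x0, t, a_bar, cond, rng):
    """One sample of the noise-prediction MSE for a single example."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]            # noise actually added
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1 - a_bar) * e
           for x, e in zip(x0, eps)]                   # noisy input
    eps_pred = model(x_t, t, cond)                     # model's guess of the noise
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)


x0 = [0.5, -0.25, 1.0, 0.0]   # toy "image"
cond = [0.5] * 32             # 32 conditioning features
t, a_bar = 500, 0.49          # example timestep and signal level (assumed values)


def zero_model(x_t, t, cond):
    """Untrained stand-in: always predicts zero noise."""
    return [0.0] * len(x_t)


def oracle_model(x_t, t, cond):
    """Cheating stand-in that knows x0, so it can solve for the noise exactly."""
    return [(x - math.sqrt(a_bar) * c) / math.sqrt(1 - a_bar)
            for x, c in zip(x_t, x0)]


loss_zero = noise_prediction_loss(zero_model, x0, t, a_bar, cond, random.Random(0))
loss_oracle = noise_prediction_loss(oracle_model, x0, t, a_bar, cond, random.Random(0))
print(loss_zero, loss_oracle)
```

A real model sits between these two extremes: it cannot see x₀, so it has to infer the noise from x_t, t and the condition alone.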

Why This Works

Simpler learning: Predicting noise is easier than predicting entire images
Stable training: Simple MSE loss, no adversarial min-max game
High quality: Many small denoising steps = fine-grained control
Conditional generation: Just inject condition into each denoising step

Model Overview

Parameters: ~48M
Features: 32
Resolution: 256×256
Timesteps: 1000
DDIM steps: 50

What it does:

Takes 32 feature values describing color, lighting, material, shape, topology
Generates 256×256 images that match those features
Trained ONLY on my own photos (~3000 images from camera roll, glasswork)
No pretrained weights - trained from scratch
No external data - only my photographs

Code Deep Dive MY CODE

1. ResBlock - The Building Block

Every level of the U-Net uses ResBlocks. They process features while injecting time+condition information:

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        """x → GN → SiLU → Conv → (+emb) → GN → SiLU → Dropout → Conv → (+skip)"""

        def __init__(self, in_ch, out_ch, emb_dim, dropout=0.1):
            super().__init__()  # must run before submodules are assigned
            self.conv1 = nn.Sequential(
                nn.GroupNorm(32, in_ch),
                nn.SiLU(),
                nn.Conv2d(in_ch, out_ch, 3, padding=1)
            )
            self.emb_proj = nn.Sequential(
                nn.SiLU(),
                nn.Linear(emb_dim, out_ch)
            )
            self.conv2 = nn.Sequential(
                nn.GroupNorm(32, out_ch),
                nn.SiLU(),
                nn.Dropout(dropout),
                nn.Conv2d(out_ch, out_ch, 3, padding=1)
            )
            # Skip connection: 1x1 conv if channels change, otherwise identity
            self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

        def forward(self, x, emb):
            h = self.conv1(x)                              # [B, out_ch, H, W]
            h = h + self.emb_proj(emb)[:, :, None, None]   # Inject embedding, broadcast over H and W
            h = self.conv2(h)                              # [B, out_ch, H, W]
            return h + self.skip(x)                        # Residual connection
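As a rough size check, the parameter count of one ResBlock can be worked out by hand from the layers listed above. The sketch below assumes PyTorch defaults (bias terms on Conv2d and Linear, a learnable scale and shift per channel in GroupNorm) and an embedding width of 256 as in the architecture diagram; the helper names are hypothetical.

```python
def conv2d_params(in_ch, out_ch, k):
    return in_ch * out_ch * k * k + out_ch   # weights + bias


def linear_params(in_features, out_features):
    return in_features * out_features + out_features


def groupnorm_params(channels):
    return 2 * channels                      # per-channel scale and shift


def resblock_params(in_ch, out_ch, emb_dim):
    total = groupnorm_params(in_ch) + conv2d_params(in_ch, out_ch, 3)     # conv1
    total += linear_params(emb_dim, out_ch)                               # emb_proj
    total += groupnorm_params(out_ch) + conv2d_params(out_ch, out_ch, 3)  # conv2
    if in_ch != out_ch:
        total += conv2d_params(in_ch, out_ch, 1)                          # 1x1 skip
    return total


print(resblock_params(64, 128, 256))    # 263040
print(resblock_params(512, 512, 256))   # 4853248
```

The 512-channel blocks dominate: a single one holds almost 5M weights, so the deep levels account for most of the model's size.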

2. Sinusoidal Time Embedding FROM PAPER

We encode the timestep t (0-999) into a 64-dimensional vector using sine/cosine functions at different frequencies:

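    import math
    import torch
    import torch.nn as nn

    class SinusoidalEmbedding(nn.Module):
        """t (int tensor, [B]) → embedding [B, dim]"""

        def __init__(self, dim=64):
            # Constructor is not shown in the original excerpt; dim=64 follows the text above
            super().__init__()
            self.dim = dim

        def forward(self, t):
            half_dim = self.dim // 2                                   # 32
            scale = math.log(10000) / (half_dim - 1)                   # ~0.30
            freqs = torch.exp(torch.arange(half_dim, device=t.device) * -scale)  # [32] frequencies
            args = t[:, None].float() * freqs[None, :]                 # [B, 32] - t × each freq
            return torch.cat([args.sin(), args.cos()], dim=-1)         # [B, 64]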

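Why this works: The network has to know how noisy its input is, because low-noise steps (t≈0) refine fine details while high-noise steps (t≈999) decide global structure. Mixing fast and slow frequencies gives every timestep a distinct, smoothly varying code: the fast ones separate neighbouring steps, the slow ones separate distant ones.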
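The same computation can be followed in plain Python for a single timestep. This is an illustrative re-statement of the formula above, not the module used in training; the function name is hypothetical.

```python
import math


def sinusoidal_embedding(t, dim=64):
    """Embed one integer timestep as dim/2 sines followed by dim/2 cosines."""
    half = dim // 2
    scale = math.log(10000) / (half - 1)
    freqs = [math.exp(-scale * i) for i in range(half)]   # from 1.0 down to 1e-4
    angles = [t * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


e0 = sinusoidal_embedding(0)
e10 = sinusoidal_embedding(10)
e11 = sinusoidal_embedding(11)
e900 = sinusoidal_embedding(900)


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(len(e0), e0[:3], e0[32:35])
print(distance(e10, e11), distance(e10, e900))
```

At t=0 every sine is 0 and every cosine is 1. Neighbouring timesteps land close together and distant ones far apart, which is the property the following MLP builds on.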

3. Self-Attention MY CODE

Applied at 32×32 and 16×16 resolutions. Lets the model look at all spatial positions when generating each pixel:

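    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, channels, num_heads=4):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = channels // num_heads
            # The excerpt uses self.norm without defining it; GroupNorm(32, ...) is assumed here to match ResBlock
            self.norm = nn.GroupNorm(32, channels)
            self.qkv = nn.Conv2d(channels, channels * 3, 1)
            self.proj = nn.Conv2d(channels, channels, 1)

        def forward(self, x):
            B, C, H, W = x.shape
            residual = x
            # Compute Q, K, V
            qkv = self.qkv(self.norm(x)).reshape(B, 3, self.num_heads, self.head_dim, H * W)
            q, k, v = qkv[:, 0], qkv[:, 1], qkv[:, 2]        # Each [B, heads, head_dim, H*W]
            # Scaled dot-product attention
            attn = torch.matmul(q.transpose(-2, -1), k) * (self.head_dim ** -0.5)
            attn = F.softmax(attn, dim=-1)                   # [B, heads, H*W, H*W]
            # Apply attention to values
            out = torch.matmul(attn, v.transpose(-2, -1))    # [B, heads, H*W, head_dim]
            # Put head_dim back in front of the spatial axis before merging heads into channels;
            # reshaping directly would scramble channels and positions
            out = out.transpose(-2, -1).reshape(B, C, H, W)  # Back to spatial
            return self.proj(out) + residual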
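One reason for restricting attention to the 32×32 and 16×16 levels is the size of the attention matrix, which grows with the square of the number of spatial positions. The arithmetic below is a back-of-the-envelope sketch that assumes 4 heads (the default above), float32 storage, and a single image; it ignores every other activation.

```python
def attention_matrix_entries(height, width, num_heads=4):
    """Entries in the [heads, H*W, H*W] attention tensor for one image."""
    tokens = height * width
    return num_heads * tokens * tokens


for res in (16, 32, 64, 256):
    entries = attention_matrix_entries(res, res)
    mib = entries * 4 / 2 ** 20   # 4 bytes per float32
    print(f"{res}x{res}: {entries} entries, {mib:.0f} MiB")
```

Under these assumptions the matrix takes 1 MiB at 16×16 and 16 MiB at 32×32, but 256 MiB at 64×64 and 64 GiB at full 256×256 resolution.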

4. DDIM Sampling MY IMPLEMENTATION

DDIM (Denoising Diffusion Implicit Models) allows fast sampling in 50 steps instead of 1000:

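    import torch

    @torch.no_grad()
    def ddim_sample(self, cond, ddim_steps=50, eta=0.0):
        # Only the deterministic case (eta=0) is shown; eta > 0 would add a fresh-noise term
        batch_size, device = cond.shape[0], cond.device
        # Start with pure noise
        x = torch.randn((batch_size, 3, 256, 256), device=device)
        # Take larger steps: 1000/50 = every 20th timestep
        timesteps = list(range(0, 1000, 1000 // ddim_steps))[::-1]
        for i, t in enumerate(timesteps):
            # Model predicts the noise that was added
            t_batch = torch.full((batch_size,), t, device=device, dtype=torch.long)
            pred_noise = self.model(x, t_batch, cond)
            # DDIM formula to predict original image (alphas_cumprod is assumed to be a tensor)
            alpha_t = self.alphas_cumprod[t]
            pred_x0 = (x - torch.sqrt(1 - alpha_t) * pred_noise) / torch.sqrt(alpha_t)
            pred_x0 = pred_x0.clamp(-1, 1)
            # Move to next (less noisy) timestep; after the last step no noise remains (alpha_prev = 1)
            if i + 1 < len(timesteps):
                alpha_prev = self.alphas_cumprod[timesteps[i + 1]]
            else:
                alpha_prev = torch.ones_like(alpha_t)
            x = torch.sqrt(alpha_prev) * pred_x0 + torch.sqrt(1 - alpha_prev) * pred_noise
        return x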
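The update rule is easy to check on a single scalar "pixel". The sketch below covers only the deterministic case (eta = 0) and uses made-up signal levels; ddim_timesteps and ddim_step are hypothetical helper names, not functions from the project.

```python
import math


def ddim_timesteps(num_train_steps=1000, ddim_steps=50):
    """Evenly strided timesteps, from most noisy to least noisy."""
    stride = num_train_steps // ddim_steps
    return list(range(0, num_train_steps, stride))[::-1]


def ddim_step(x_t, eps_pred, a_t, a_prev):
    """One deterministic DDIM update from signal level a_t to a_prev."""
    x0_pred = (x_t - math.sqrt(1 - a_t) * eps_pred) / math.sqrt(a_t)
    x0_pred = max(-1.0, min(1.0, x0_pred))             # same clamp as above
    return math.sqrt(a_prev) * x0_pred + math.sqrt(1 - a_prev) * eps_pred


ts = ddim_timesteps()
print(len(ts), ts[:3], ts[-3:])   # 50 [980, 960, 940] [40, 20, 0]

# Build a noisy value from a known x0 and eps, then step with a perfect noise estimate.
x0, eps, a_t = 0.5, -0.3, 0.4
x_t = math.sqrt(a_t) * x0 + math.sqrt(1 - a_t) * eps
print(ddim_step(x_t, eps, a_t, a_prev=0.7))   # same x0 and eps, re-mixed at the higher signal level
print(ddim_step(x_t, eps, a_t, a_prev=1.0))   # final step: recovers x0 up to rounding
```

With a perfect noise estimate every step lands exactly on the less noisy version of the same image, which is why DDIM can skip 19 out of every 20 training timesteps.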

Architecture Diagram 100% MY CODE

YouAndMeNet - U-Net Architecture (~48M parameters)
════════════════════════════════════════════════════════════════════════════

INPUTS:
  x    [B, 4, 256, 256]   Noisy image (3 RGB) + mask (1 channel)
  t    [B]                Timestep (0-999)
  cond [B, 32]            Conditioning features

EMBEDDING PIPELINE
  t    ──→ SinusoidalEmb(64) ──→ MLP ──┐
                                     (add) ──→ emb [B, 256]
  cond ──→ MLP(32→256) ────────────────┘
  emb injected at every ResBlock

ENCODER
  Level 0: 256×256    64 ch   [ResBlock ×2]                      ── skip ──→
     ↓ Downsample (stride 2 conv)
  Level 1: 128×128   128 ch   [ResBlock ×2]                      ── skip ──→
     ↓ Downsample
  Level 2: 64×64     256 ch   [ResBlock ×2]                      ── skip ──→
     ↓ Downsample
  Level 3: 32×32     512 ch   [ResBlock ×2] + [SelfAttention]    ── skip ──→
     ↓ Downsample
  Level 4: 16×16     512 ch   [ResBlock ×2] + [SelfAttention]

BOTTLENECK (8×8, 512 ch)
  [ResBlock] + [SelfAttention] + [ResBlock]

DECODER
  Level 4: 16×16     ← concat skip   [ResBlock ×2] + [SelfAttention]
     ↑ Upsample (nearest + conv)
  Level 3: 32×32     ← concat skip   [ResBlock ×2] + [SelfAttention]
     ↑ Upsample
  Level 2: 64×64     ← concat skip   [ResBlock ×2]
     ↑ Upsample
  Level 1: 128×128   ← concat skip   [ResBlock ×2]
     ↑ Upsample
  Level 0: 256×256   ← concat skip   [ResBlock ×2]

FINAL LAYER
  [GroupNorm] → [SiLU] → [Conv 3×3]
     ↓
OUTPUT: [B, 3, 256, 256] (predicted noise ε)
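The per-level resolutions and channel widths in the diagram follow mechanically from the model config quoted elsewhere on this page (base_channels=64, channel_mult=[1,2,4,8,8], attention at 32×32 and 16×16). A small sketch of that bookkeeping, with a hypothetical helper name:

```python
def unet_levels(image_size=256, base_channels=64,
                channel_mult=(1, 2, 4, 8, 8), attn_resolutions=(32, 16)):
    """List resolution, width and attention flag for each encoder level."""
    levels = []
    resolution = image_size
    for index, mult in enumerate(channel_mult):
        levels.append({
            "level": index,
            "resolution": resolution,
            "channels": base_channels * mult,
            "attention": resolution in attn_resolutions,
        })
        resolution //= 2   # each downsample halves height and width
    return levels


for level in unet_levels():
    print(level)
```

Changing channel_mult or attn_resolutions in this sketch shows quickly how a different config would reshape the encoder before any model code is touched.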

Why Non-Semantic Features? MY DESIGN DECISION

The Problem with Semantic Descriptions

Most modern image AI uses text prompts: "a photo of a blue glass vase in soft lighting". But text descriptions are:

Subjective - "soft lighting" means different things to different people
Incomplete - you can't describe every visual detail in words
Lossy - information is lost when converting visuals to language
Biased - trained on internet captions with their cultural assumptions

My Approach: Numerical Image Tokens

I got the idea by trying to reverse-engineer images: what if instead of describing an image with words, I could describe it with numbers that capture visual properties directly?

Traditional Text Conditioning:
"blue vase" → CLIP encoder → [0.2, -0.1, 0.8, ...] (768-dim semantic embedding)

My Non-Semantic Conditioning:
Image → Feature extractors → [hue=0.6, sat=0.7, circularity=0.4, ...] (32-dim visual properties)

This is like building a token vocabulary for images - but instead of discrete word tokens, I use continuous visual measurements. Each feature is:

Objective - computed directly from pixels, not interpreted
Complete - captures aspects words can't express
Reproducible - the same feature values regenerate the same visual style (not the exact pixels, since 32 numbers cannot encode a full image)
Language-independent - works without any text training data

Why 32 Features?

I experimented with different feature counts (16, 32, 64, 128). 32 features provide a good balance:

Enough dimensions to capture distinct visual properties
Small enough that the model can learn their meaning
Each feature is interpretable and controllable

Think of it as a 32-dimensional visual space where each axis represents a measurable property.
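A conditioning vector in that 32-dimensional space can be assembled by name. The sketch below uses the feature names from the feature list on this page; the ordering, the neutral default of 0.5, and the clamp to [0, 1] are assumptions for illustration, and make_condition is a hypothetical helper rather than part of the project.

```python
FEATURE_GROUPS = {
    "color": ["dominant_hue", "dominant_sat", "dominant_val", "color_variety",
              "temperature", "sat_variation", "color_contrast", "high_sat_ratio"],
    "lighting": ["specular_highlights", "shadow_intensity", "shadow_ratio",
                 "brightness_var", "local_contrast", "obj_bg_contrast"],
    "material": ["reflectivity", "transparency", "roughness", "metallic",
                 "color_uniformity"],
    "shape": ["circularity", "solidity", "spectral_entropy", "graph_density",
              "mean_curvature", "curv_variance", "affine_curvature"],
    "topology": ["euler_char", "genus", "num_holes", "complexity",
                 "spectral_gap", "betti_1"],
}
FEATURE_NAMES = [name for group in FEATURE_GROUPS.values() for name in group]


def make_condition(default=0.5, **overrides):
    """Build a 32-value conditioning list; unspecified features get `default`."""
    unknown = set(overrides) - set(FEATURE_NAMES)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    values = {name: default for name in FEATURE_NAMES}
    values.update(overrides)
    # Assumes every feature has already been normalized to the range [0, 1]
    return [min(1.0, max(0.0, float(values[name]))) for name in FEATURE_NAMES]


cond = make_condition(dominant_hue=0.6, dominant_sat=0.7, circularity=0.4)
print(len(cond), cond[:3])   # 32 [0.6, 0.7, 0.5]
```

Keeping the names in one ordered list means a typo in a feature name fails loudly instead of silently leaving an axis at its default.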

32 Conditioning Features MY EXTRACTION CODE

Each image in the dataset has 32 features extracted. During generation, you specify these features to control the output:

COLOR (8)

dominant_hue - Main color (0-1, maps to 0-360°)
dominant_sat - Saturation of main color
dominant_val - Brightness of main color
color_variety - How many distinct colors
temperature - Warm (red/yellow) vs cool (blue)
sat_variation - Saturation range
color_contrast - Difference between colors
high_sat_ratio - % of vivid colors
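To make the color group concrete, here is one way a few of these values could be computed from raw pixels. It is a hedged sketch, not the project's extraction code: the 36-bin hue histogram and the 0.5 saturation threshold are assumptions, and color_features is a hypothetical name. Only the standard-library colorsys module is used.

```python
import colorsys


def color_features(pixels, hue_bins=36, sat_threshold=0.5):
    """pixels: list of (r, g, b) tuples with values 0-255."""
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]
    bins = [min(int(h * hue_bins), hue_bins - 1) for h, _, _ in hsv]
    counts = [bins.count(i) for i in range(hue_bins)]
    best = counts.index(max(counts))                 # most populated hue bin
    members = [p for p, b in zip(hsv, bins) if b == best]
    return {
        "dominant_hue": (best + 0.5) / hue_bins,     # bin centre, 0-1 maps to 0-360°
        "dominant_sat": sum(s for _, s, _ in members) / len(members),
        "dominant_val": sum(v for _, _, v in members) / len(members),
        "high_sat_ratio": sum(1 for _, s, _ in hsv if s > sat_threshold) / len(hsv),
    }


# Three pure-blue pixels and one white pixel
features = color_features([(0, 0, 255)] * 3 + [(255, 255, 255)])
print(features)
```

Blue sits at 240°, so the dominant hue comes out near 0.67, matching the 0-1 convention used for dominant_hue above.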

LIGHTING (6)

specular_highlights - Shiny spots
shadow_intensity - How dark shadows are
shadow_ratio - % of image in shadow
brightness_var - Light variation
local_contrast - Edge sharpness
obj_bg_contrast - Object vs background

MATERIAL (5)

reflectivity - Mirror-like quality
transparency - See-through quality
roughness - Texture graininess
metallic - Metal-like appearance
color_uniformity - Solid vs varied

SHAPE (7)

circularity - How round (0-1)
solidity - Filled vs hollow
spectral_entropy - Frequency complexity
graph_density - Edge connectivity
mean_curvature - Average bendiness
curv_variance - Curvature variation
affine_curvature - Scale-invariant curve
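Circularity is commonly defined as 4π · area / perimeter², which equals 1 for a perfect circle and drops toward 0 for elongated or ragged outlines. Whether the project uses exactly this definition is an assumption; the sketch below applies it to simple polygon outlines with hypothetical helper names.

```python
import math


def polygon_area_perimeter(points):
    """Shoelace area and perimeter of a closed polygon given as (x, y) vertices."""
    twice_area = 0.0
    perimeter = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        twice_area += x1 * y2 - x2 * y1
        perimeter += math.hypot(x2 - x1, y2 - y1)
    return abs(twice_area) / 2, perimeter


def circularity(points):
    area, perimeter = polygon_area_perimeter(points)
    return 4 * math.pi * area / perimeter ** 2


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
near_circle = [(math.cos(2 * math.pi * k / 360), math.sin(2 * math.pi * k / 360))
               for k in range(360)]
thin_strip = [(0, 0), (10, 0), (10, 0.1), (0, 0.1)]

print(round(circularity(square), 3))       # 0.785
print(round(circularity(near_circle), 3))  # 1.0
print(round(circularity(thin_strip), 3))   # 0.031
```

On real images the outline would come from a contour extracted from the object mask, and pixel staircase effects push the measured value slightly below the ideal.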

TOPOLOGY (6)

euler_char - Euler characteristic (V - E + F)
genus - Number of "handles"
num_holes - Holes in object
complexity - Overall intricacy
spectral_gap - Graph eigenvalue gap
betti_1 - First Betti number (count of independent loops)
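For a flat binary object mask these quantities are closely related: the number of connected pieces is the zeroth Betti number, the number of holes is the first Betti number, and the Euler characteristic is their difference. The sketch below counts them with a flood fill; it is an illustration under the assumption of a 2D mask (8-connected foreground, 4-connected background), not the project's extraction code, and the function names are hypothetical.

```python
N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]
N8 = N4 + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, target, neighbors):
    """Count connected regions of cells equal to `target`."""
    rows, cols = len(grid), len(grid[0])
    seen, count = set(), 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != target or (r, c) in seen:
                continue
            count += 1
            seen.add((r, c))
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and grid[ny][nx] == target and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return count


def topology_features(mask):
    """mask: list of rows of 0/1 values. Pads with background so the outside is one region."""
    width = len(mask[0]) + 2
    padded = [[0] * width] + [[0] + list(row) + [0] for row in mask] + [[0] * width]
    pieces = count_components(padded, 1, N8)
    holes = count_components(padded, 0, N4) - 1   # subtract the outer background
    return {"betti_0": pieces, "betti_1": holes, "num_holes": holes,
            "euler_char": pieces - holes}


ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
solid = [[1, 1], [1, 1]]
print(topology_features(ring))    # one piece, one hole, Euler characteristic 0
print(topology_features(solid))   # one piece, no holes, Euler characteristic 1
```

A glass object with a handle would show up here as a mask with one hole, which is the kind of structure these topology features are meant to capture.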

Training Progression Samples

Transparency: How This Was Made

Background

This project was developed over several months while studying programming at vocational school (1.5 years). The code was written in dialogue with AI assistance (Claude) - I would write code, discuss problems, understand solutions, and then implement them in my own way.

What I Created

The core concept: Using non-semantic numerical features instead of text prompts - my own idea from trying to "reverse-engineer" how to describe images without words
Architecture decisions: Channel progression (64→128→256→512→512), attention placement at 32×32 and 16×16, number of ResBlocks per level
The 32 feature system: Selecting which visual properties matter (color temperature, topology, curvature, etc.) and how to normalize them
Project structure: Organizing files, naming conventions, splitting code into modules
All experiments: 18 failed GAN versions before switching to diffusion, each with different approaches
Dataset: ~7000 of my own photographs, manually curated

Examples of Code I Wrote

What	Example of my decisions
Training parameters	Choosing batch_size=4, lr=1e-4, 1000 timesteps, cosine schedule
Model config	base_channels=64, channel_mult=[1,2,4,8,8], num_res_blocks=2
Feature selection	Deciding to use Euler characteristic and Betti numbers for topology
File organization	Splitting into diffusion_model.py, train_diffusion.py, build_features.py
Logging & checkpoints	Saving samples every N epochs, tensorboard integration
Data augmentation	Choosing which transforms to apply (flips, color jitter, etc.)
Debugging	Finding why training diverged, fixing NaN losses, memory optimization

What AI Helped With

I used both ChatGPT and Claude Code throughout the project. They brought different perspectives and strengths, which helped me understand problems from multiple angles.

Area	Type of help
Monotonous calculations	Writing repetitive code patterns, boilerplate, tensor reshaping operations
Math implementation	Translating paper equations into PyTorch (alpha schedules, noise sampling, loss functions)
Concrete examples	Showing how a concept looks in actual code, not just theory
Implementing principles	Turning design ideas ("I want attention here") into working code
Finding links & references	Locating relevant papers, documentation, and tutorials
Debugging	Explaining error messages, finding tensor shape mismatches, memory issues
Library usage	Correct syntax for OpenCV, scikit-image, PyTorch, torchvision functions
Concepts	Explaining attention mechanisms, normalization techniques, loss landscapes
Code structure	Suggesting how to organize classes, split files, write docstrings
Different perspectives	ChatGPT and Claude often suggested different approaches - comparing them helped me learn

Libraries Used

PyTorch - Neural network framework
torchvision - Image transforms and utilities
OpenCV - Computer vision operations
scikit-image - Image processing algorithms
NumPy - Numerical computing
tqdm - Progress bars
TensorBoard - Training visualization

Papers Referenced

DDPM - Ho, Jain, Abbeel: "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
DDIM - Song, Meng, Ermon: "Denoising Diffusion Implicit Models" (ICLR 2021)
Improved DDPM - Nichol, Dhariwal: "Improved Denoising Diffusion Probabilistic Models" (2021)
U-Net - Ronneberger et al.: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (2015)
Attention - Vaswani et al.: "Attention Is All You Need" (NeurIPS 2017)
Group Norm - Wu, He: "Group Normalization" (ECCV 2018)

IN SUMMARY
This project was developed in dialogue with AI, combining my 1.5 years of programming studies with AI assistance for implementation details. The ideas, design decisions, experiments, and dataset are my own work. No pretrained weights were used - the model was trained from scratch on ~7000 of my own photos.