YouAndMeNet - Diffusion v2 Technical Documentation

YouAndMeNet

Conditional Diffusion Model for Art Object Generation

DIFFUSION v2 ACTIVE
What is Diffusion? - Technical Background

The Core Idea

Diffusion models learn to reverse a gradual noising process. Instead of learning to generate images directly (like GANs), they learn to denoise images step by step.

Forward Process (Adding Noise)

We gradually add Gaussian noise to an image over T timesteps until it becomes pure noise:

x0 → x1 → x2 → ... → xT (pure noise)

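At any step t, directly from x0: xt = √αt · x0 + √(1-αt) · ε
where ε ~ N(0, I) is random Gaussian noise and αt is the cumulative noise-schedule product (alphas_cumprod[t] in the sampling code), which falls from ≈1 at t=0 to ≈0 at t=T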
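As a rough illustration of the formula above, here is a minimal plain-Python sketch of a noise schedule and the forward noising step. It assumes that the "cosine schedule" listed among the training parameters refers to the noise schedule from the Improved DDPM paper; the offset `s=0.008` and the names `cosine_alphas_cumprod` and `q_sample` are illustrative and not taken from the project code.

```python
import math
import random

def cosine_alphas_cumprod(num_steps=1000, s=0.008):
    """Cumulative signal fraction for t = 0..num_steps-1 (cosine schedule, assumed)."""
    def f(u):
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    f0 = f(0.0)
    # Evaluate at (t + 1) / T so that step 0 already contains a little noise
    return [f((t + 1) / num_steps) / f0 for t in range(num_steps)]

def q_sample(x0, t, alphas_cumprod, rng):
    """Noise a flat list of pixel values (scaled to [-1, 1]) to timestep t."""
    a = alphas_cumprod[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

alphas = cosine_alphas_cumprod()
x0 = [0.5, -0.25, 0.0, 1.0]
xt_early, _ = q_sample(x0, 0, alphas, random.Random(0))           # almost the clean image
xt_late, eps_late = q_sample(x0, 999, alphas, random.Random(0))   # almost pure noise
```

Practical implementations usually clip the per-step noise so the final schedule values never reach exactly zero; that detail is left out here.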

Reverse Process (Denoising)

The neural network learns to predict the noise that was added, allowing us to reverse the process:

xT (noise) → xT-1 → ... → x0 (image)

Model predicts: εθ(xt, t, c) ≈ noise added at step t
Given condition c (our 32 features)

Why This Works

  • Simpler learning: Predicting noise is easier than predicting entire images
  • Stable training: Simple MSE loss, no adversarial min-max game
  • High quality: Many small denoising steps = fine-grained control
  • Conditional generation: Just inject condition into each denoising step
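The "simple MSE loss" point can be made concrete with a small sketch. It is plain Python on a flat list of pixel values, and `predict_noise` is a hypothetical stand-in for the network εθ(xt, t, c); the two toy predictors below exist only to show how the loss behaves.

```python
import math
import random

def noise_prediction_loss(x0, cond, t, alpha_t, predict_noise, rng):
    """MSE between the true noise and the predicted noise for one sample."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
          for x, e in zip(x0, eps)]
    eps_pred = predict_noise(xt, t, cond)
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

x0 = [0.2, -0.5, 0.9, 0.0]
cond = [0.0] * 32          # placeholder for the 32 conditioning features
alpha_t = 0.5              # illustrative schedule value

def zero_model(xt, t, cond):
    # Toy predictor: always claims no noise was added
    return [0.0] * len(xt)

def oracle_model(xt, t, cond):
    # Toy predictor that cheats by knowing x0, so it recovers the noise exactly
    return [(x - math.sqrt(alpha_t) * clean) / math.sqrt(1.0 - alpha_t)
            for x, clean in zip(xt, x0)]

loss_zero = noise_prediction_loss(x0, cond, 500, alpha_t, zero_model, random.Random(0))
loss_oracle = noise_prediction_loss(x0, cond, 500, alpha_t, oracle_model, random.Random(0))
```

A trained network sits between these two extremes: the closer its prediction is to the sampled noise, the closer the loss gets to zero.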
Model Overview
  • Parameters: ~48M
  • Conditioning features: 32
  • Resolution: 256×256
  • Training timesteps: 1000
  • DDIM sampling steps: 50

What it does:

  • Takes 32 feature values describing color, lighting, material, shape, topology
  • Generates 256×256 images that match those features
  • Trained ONLY on my own photos (~3000 images from camera roll, glasswork)
  • No pretrained weights - trained from scratch
  • No external data - only my photographs
Code Deep Dive MY CODE

1. ResBlock - The Building Block

Every level of the U-Net uses ResBlocks. They process features while injecting time+condition information:

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        """x → GN → SiLU → Conv → (+emb) → GN → SiLU → Dropout → Conv → (+skip)"""

        def __init__(self, in_ch, out_ch, emb_dim, dropout=0.1):
            super().__init__()  # must run before submodules are registered
            # GroupNorm(32, ch) requires ch to be divisible by 32
            self.conv1 = nn.Sequential(
                nn.GroupNorm(32, in_ch),
                nn.SiLU(),
                nn.Conv2d(in_ch, out_ch, 3, padding=1),
            )
            self.emb_proj = nn.Sequential(
                nn.SiLU(),
                nn.Linear(emb_dim, out_ch),
            )
            self.conv2 = nn.Sequential(
                nn.GroupNorm(32, out_ch),
                nn.SiLU(),
                nn.Dropout(dropout),
                nn.Conv2d(out_ch, out_ch, 3, padding=1),
            )
            # Skip connection: 1x1 conv if channels change, otherwise identity
            self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

        def forward(self, x, emb):
            h = self.conv1(x)                             # [B, out_ch, H, W]
            h = h + self.emb_proj(emb)[:, :, None, None]  # Inject embedding, broadcast over H and W
            h = self.conv2(h)                             # [B, out_ch, H, W]
            return h + self.skip(x)                       # Residual connection
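To get a feel for where the parameters go, the ResBlock above can be counted by hand. The sketch below is plain arithmetic over the layers shown in the class; `emb_dim=256` is taken from the architecture diagram, and the helper names are illustrative. It does not reproduce the ~48M total, which also depends on layers not shown in this excerpt.

```python
def conv2d_params(in_ch, out_ch, kernel):
    # weights + one bias per output channel
    return in_ch * out_ch * kernel * kernel + out_ch

def resblock_params(in_ch, out_ch, emb_dim):
    # GroupNorm(32, ch) only works when ch is a multiple of 32
    assert in_ch % 32 == 0 and out_ch % 32 == 0
    total = 2 * in_ch                            # first GroupNorm: scale + shift
    total += conv2d_params(in_ch, out_ch, 3)     # conv1
    total += emb_dim * out_ch + out_ch           # emb_proj Linear
    total += 2 * out_ch                          # second GroupNorm
    total += conv2d_params(out_ch, out_ch, 3)    # conv2
    if in_ch != out_ch:
        total += conv2d_params(in_ch, out_ch, 1) # 1x1 skip conv
    return total

print(resblock_params(64, 128, 256))    # 263040
print(resblock_params(512, 512, 256))   # 4853248
```

The 512-channel blocks are roughly 18 times larger than the 64→128 block, which is why most of the parameter budget sits in the low-resolution levels.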

2. Sinusoidal Time Embedding FROM PAPER

We encode the timestep t (0-999) into a 64-dimensional vector using sine/cosine functions at different frequencies:

    import math
    import torch
    import torch.nn as nn

    class SinusoidalEmbedding(nn.Module):
        """t [B] (int) → embedding [B, dim]"""

        def __init__(self, dim=64):
            super().__init__()
            self.dim = dim

        def forward(self, t):
            half_dim = self.dim // 2                                         # 32
            emb = math.log(10000) / (half_dim - 1)                           # ~0.30
            emb = torch.exp(torch.arange(half_dim, device=t.device) * -emb)  # [32] frequencies
            emb = t[:, None].float() * emb[None, :]                          # [B, 32] - t × each freq
            return torch.cat([emb.sin(), emb.cos()], dim=-1)                 # [B, 64]

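Why this works: The network needs to know how noisy its input is. At low-noise timesteps (t≈0) it refines fine details; at high-noise timesteps (t≈999) it has to establish global structure. The low frequencies change slowly with t and separate distant timesteps, while the high frequencies distinguish neighbouring ones, so the embedding gives the network both a coarse and a fine reading of t.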
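The same embedding can be written in plain Python for a single timestep, which makes the frequency range easy to inspect. This is a sketch that mirrors the class above; the function name is illustrative.

```python
import math

def sinusoidal_embedding(t, dim=64):
    """Return the dim-length embedding of one integer timestep t."""
    half_dim = dim // 2
    scale = math.log(10000) / (half_dim - 1)
    # Frequencies fall geometrically from 1 down to 1/10000
    freqs = [math.exp(-scale * i) for i in range(half_dim)]
    angles = [t * f for f in freqs]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

e0 = sinusoidal_embedding(0)      # all sines are 0, all cosines are 1
e1 = sinusoidal_embedding(1)
e999 = sinusoidal_embedding(999)
```

The first sine entry oscillates once every ~6.3 timesteps, while the last one barely moves across the whole 0-999 range.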

3. Self-Attention MY CODE

Applied at 32×32 and 16×16 resolutions. Lets the model look at all spatial positions when generating each pixel:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttention(nn.Module):
        def __init__(self, channels, num_heads=4):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = channels // num_heads
            # The excerpt uses self.norm without defining it; GroupNorm is assumed here
            self.norm = nn.GroupNorm(32, channels)
            self.qkv = nn.Conv2d(channels, channels * 3, 1)
            self.proj = nn.Conv2d(channels, channels, 1)

        def forward(self, x):
            B, C, H, W = x.shape
            residual = x
            # Compute Q, K, V
            qkv = self.qkv(self.norm(x)).reshape(B, 3, self.num_heads, self.head_dim, H * W)
            q, k, v = qkv[:, 0], qkv[:, 1], qkv[:, 2]       # Each [B, heads, head_dim, H*W]
            # Scaled dot-product attention
            attn = torch.matmul(q.transpose(-2, -1), k) * (self.head_dim ** -0.5)
            attn = F.softmax(attn, dim=-1)                  # [B, heads, H*W, H*W]
            # Apply attention to values
            out = torch.matmul(attn, v.transpose(-2, -1))   # [B, heads, H*W, head_dim]
            # Put head_dim back in front of the spatial axis before merging heads into channels
            out = out.transpose(-2, -1).reshape(B, C, H, W) # Back to spatial
            return self.proj(out) + residual
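One likely reason for restricting attention to the low-resolution levels is memory: the attention matrix grows with the square of the number of spatial positions. The sketch below is plain arithmetic, assuming 4 heads and float32 storage; whether memory was the deciding factor in this project is an assumption.

```python
def attention_matrix_entries(height, width, num_heads=4):
    """Number of entries in the [heads, H*W, H*W] attention matrix for one sample."""
    tokens = height * width
    return num_heads * tokens * tokens

for side in (256, 64, 32, 16, 8):
    entries = attention_matrix_entries(side, side)
    mib = entries * 4 / 2**20   # 4 bytes per float32 entry
    print(f"{side}×{side}: {side * side} positions, {entries:,} entries, {mib:,.1f} MiB")
```

At 32×32 the matrix takes about 16 MiB per sample, whereas at the full 256×256 resolution it would need 64 GiB, so attention at the top level is not practical with this formulation.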

4. DDIM Sampling MY IMPLEMENTATION

DDIM (Denoising Diffusion Implicit Models) allows fast sampling in 50 steps instead of 1000:

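    import torch

    @torch.no_grad()
    def ddim_sample(self, cond, ddim_steps=50, eta=0.0):
        # Only the deterministic case (eta=0) is implemented; eta is kept in the signature
        batch_size, device = cond.shape[0], cond.device
        # Start with pure noise
        # (the architecture diagram lists a 4-channel input, RGB + mask; mask handling is not shown here)
        x = torch.randn((batch_size, 3, 256, 256), device=device)
        # Take larger steps: 1000/50 = every 20th timestep → 980, 960, ..., 0
        timesteps = list(range(0, 1000, 1000 // ddim_steps))[::-1]
        for i, t in enumerate(timesteps):
            t_batch = torch.full((batch_size,), t, device=device, dtype=torch.long)
            # Model predicts the noise that was added
            pred_noise = self.model(x, t_batch, cond)
            # DDIM formula to predict original image
            alpha_t = self.alphas_cumprod[t]
            pred_x0 = (x - torch.sqrt(1 - alpha_t) * pred_noise) / torch.sqrt(alpha_t)
            pred_x0 = pred_x0.clamp(-1, 1)
            # Move to next (less noisy) timestep; after the last step no noise is added back
            if i + 1 < len(timesteps):
                alpha_prev = self.alphas_cumprod[timesteps[i + 1]]
            else:
                alpha_prev = torch.ones_like(alpha_t)
            x = torch.sqrt(alpha_prev) * pred_x0 + torch.sqrt(1 - alpha_prev) * pred_noise
        return x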
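The step arithmetic in the sampler can be checked without a model. The sketch below is plain Python on a single scalar "pixel"; the helper names are illustrative, and `-1` is used here as a hypothetical marker for "fully denoised" after the last step.

```python
import math

def ddim_timestep_pairs(num_train_steps=1000, ddim_steps=50):
    """Pairs (t, t_prev) visited by the sampler, from noisiest to cleanest."""
    stride = num_train_steps // ddim_steps
    timesteps = list(range(0, num_train_steps, stride))[::-1]
    previous = timesteps[1:] + [-1]   # -1: nothing left to denoise
    return list(zip(timesteps, previous))

def ddim_step(x_t, pred_noise, alpha_t, alpha_prev):
    """One deterministic (eta = 0) DDIM update for a scalar value."""
    pred_x0 = (x_t - math.sqrt(1 - alpha_t) * pred_noise) / math.sqrt(alpha_t)
    pred_x0 = max(-1.0, min(1.0, pred_x0))
    return math.sqrt(alpha_prev) * pred_x0 + math.sqrt(1 - alpha_prev) * pred_noise

pairs = ddim_timestep_pairs()

# If the noise prediction is exact, the final step (alpha_prev = 1) returns x0
x0, eps, alpha_t = 0.3, -1.2, 0.5
x_t = math.sqrt(alpha_t) * x0 + math.sqrt(1 - alpha_t) * eps
recovered = ddim_step(x_t, eps, alpha_t, alpha_prev=1.0)
```

With the default settings the sampler visits 980, 960, ..., 20, 0, so timestep 999 itself is never evaluated.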
Architecture Diagram 100% MY CODE
YouAndMeNet - U-Net Architecture (~48M parameters)
══════════════════════════════════════════════════════════════════════════

INPUTS:
  x     [B, 4, 256, 256]   Noisy image (3 RGB) + mask (1 channel)
  t     [B]                Timestep (0-999)
  cond  [B, 32]            Conditioning features

┌───────────────────────────────────────┐
│          EMBEDDING PIPELINE           │
│                                       │
│  t ──→ SinusoidalEmb(64) ──→ MLP ──┐  │
│                                    │  │
│  cond ──→ MLP(32→256) ─────────────┼──┼──→ emb [B, 256]
│                                    │  │
│                               (add)┘  │
└───────────────────────────────────────┘
                    │
                    │  emb injected at every ResBlock
                    ↓
┌────────────────────────────────────────────────────────────────────────┐
│                                ENCODER                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Level 0: 256×256   64 ch  [ResBlock ×2] ────────────────────→ skip    │
│      ↓ Downsample (stride 2 conv)                                      │
│  Level 1: 128×128  128 ch  [ResBlock ×2] ────────────────────→ skip    │
│      ↓ Downsample                                                      │
│  Level 2: 64×64    256 ch  [ResBlock ×2] ────────────────────→ skip    │
│      ↓ Downsample                                                      │
│  Level 3: 32×32    512 ch  [ResBlock ×2] + [SelfAttention] ──→ skip    │
│      ↓ Downsample                                                      │
│  Level 4: 16×16    512 ch  [ResBlock ×2] + [SelfAttention]             │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
                                    │
┌────────────────────────────────────────────────────────────────────────┐
│                        BOTTLENECK (8×8, 512 ch)                        │
│                                                                        │
│               [ResBlock] + [SelfAttention] + [ResBlock]                │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
                                    │
┌────────────────────────────────────────────────────────────────────────┐
│                                DECODER                                 │
├────────────────────────────────────────────────────────────────────────┤
│                                                                        │
│  Level 4: 16×16    ← concat skip                                       │
│      [ResBlock ×2] + [SelfAttention]                                   │
│      ↑ Upsample (nearest + conv)                                       │
│  Level 3: 32×32    ← concat skip                                       │
│      [ResBlock ×2] + [SelfAttention]                                   │
│      ↑ Upsample                                                        │
│  Level 2: 64×64    ← concat skip                                       │
│      [ResBlock ×2]                                                     │
│      ↑ Upsample                                                        │
│  Level 1: 128×128  ← concat skip                                       │
│      [ResBlock ×2]                                                     │
│      ↑ Upsample                                                        │
│  Level 0: 256×256  ← concat skip                                       │
│      [ResBlock ×2]                                                     │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
                                    │
                ┌───────────────────────────────────────┐
                │              FINAL LAYER              │
                │                                       │
                │   [GroupNorm] → [SiLU] → [Conv 3×3]   │
                │                                       │
                └───────────────────────────────────────┘
                                    │
                                    ↓
OUTPUT: [B, 3, 256, 256] (predicted noise ε)
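The level layout in the diagram follows from the model config listed in this document (base_channels=64, channel_mult=[1,2,4,8,8]). A small plain-Python sketch of that derivation, with an illustrative function name and the attention resolutions passed in as a parameter:

```python
def unet_levels(image_size=256, base_channels=64,
                channel_mult=(1, 2, 4, 8, 8), attention_sizes=(32, 16)):
    """Return per-level (size, channels, attention) and the bottleneck size."""
    levels = []
    size = image_size
    for level, mult in enumerate(channel_mult):
        levels.append({
            "level": level,
            "size": size,
            "channels": base_channels * mult,
            "attention": size in attention_sizes,
        })
        size //= 2   # each level is followed by a stride-2 downsample
    return levels, size

levels, bottleneck_size = unet_levels()
for lv in levels:
    print(lv)
print("bottleneck:", bottleneck_size)
```

Changing `channel_mult` or `image_size` here shows directly how the spatial sizes and channel widths of every level would shift.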
Why Non-Semantic Features? MY DESIGN DECISION

The Problem with Semantic Descriptions

Most modern image AI uses text prompts: "a photo of a blue glass vase in soft lighting". But text descriptions are:

  • Subjective - "soft lighting" means different things to different people
  • Incomplete - you can't describe every visual detail in words
  • Lossy - information is lost when converting visuals to language
  • Biased - trained on internet captions with their cultural assumptions

My Approach: Numerical Image Tokens

I got the idea by trying to reverse-engineer images: what if instead of describing an image with words, I could describe it with numbers that capture visual properties directly?

Traditional Text Conditioning:
"blue vase" → CLIP encoder → [0.2, -0.1, 0.8, ...] (768-dim semantic embedding)

My Non-Semantic Conditioning:
Image → Feature extractors → [hue=0.6, sat=0.7, circularity=0.4, ...] (32-dim visual properties)

This is like building a token vocabulary for images - but instead of discrete word tokens, I use continuous visual measurements. Each feature is:

  • Objective - computed directly from pixels, not interpreted
  • Expressive - captures measurable aspects that are hard to put into words
  • Reproducible - the same feature vector steers generation toward the same visual style
  • Language-independent - works without any text training data

Why 32 Features?

I experimented with different feature counts (16, 32, 64, 128). 32 features provide a good balance:

  • Enough dimensions to capture distinct visual properties
  • Small enough that the model can learn their meaning
  • Each feature is interpretable and controllable

Think of it as a 32-dimensional visual space where each axis represents a measurable property.
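One way to picture this space is as a fixed-order vector built from named features. The sketch below uses the feature names from this document, but the ordering, the default value of 0.5 for unspecified features, and the function name `build_condition` are assumptions for illustration, not the project's actual interface.

```python
FEATURE_GROUPS = {
    "color": ["dominant_hue", "dominant_sat", "dominant_val", "color_variety",
              "temperature", "sat_variation", "color_contrast", "high_sat_ratio"],
    "lighting": ["specular_highlights", "shadow_intensity", "shadow_ratio",
                 "brightness_var", "local_contrast", "obj_bg_contrast"],
    "material": ["reflectivity", "transparency", "roughness", "metallic",
                 "color_uniformity"],
    "shape": ["circularity", "solidity", "spectral_entropy", "graph_density",
              "mean_curvature", "curv_variance", "affine_curvature"],
    "topology": ["euler_char", "genus", "num_holes", "complexity",
                 "spectral_gap", "betti_1"],
}
# Flatten in group order (assumed ordering)
FEATURE_NAMES = [name for group in FEATURE_GROUPS.values() for name in group]

def build_condition(values, default=0.5):
    """Turn a partial {name: value} dict into a full 32-element vector."""
    unknown = set(values) - set(FEATURE_NAMES)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return [float(values.get(name, default)) for name in FEATURE_NAMES]

cond = build_condition({"dominant_hue": 0.6, "dominant_sat": 0.7, "circularity": 0.4})
```

Moving along one axis then means changing a single entry of `cond` while leaving the other 31 fixed.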

32 Conditioning Features MY EXTRACTION CODE

Each image in the dataset has 32 features extracted. During generation, you specify these features to control the output:

COLOR (8)

  • dominant_hue - Main color (0-1, maps to 0-360°)
  • dominant_sat - Saturation of main color
  • dominant_val - Brightness of main color
  • color_variety - How many distinct colors
  • temperature - Warm (red/yellow) vs cool (blue)
  • sat_variation - Saturation range
  • color_contrast - Difference between colors
  • high_sat_ratio - % of vivid colors
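As an idea of how such colour features could be computed from pixels, here is a minimal sketch using only the standard-library `colorsys` module. The bin count, the saturation-weighted histogram, and the 0.6 "vivid" threshold are assumptions for illustration; the project's actual extraction code (which uses OpenCV and scikit-image) may differ.

```python
import colorsys

def color_features(pixels, hue_bins=36, vivid_threshold=0.6):
    """pixels: non-empty list of (r, g, b) tuples with values in [0, 1]."""
    hsv = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels]

    def bin_of(h):
        return min(int(h * hue_bins), hue_bins - 1)

    # Hue histogram weighted by saturation, so grey pixels do not vote
    hist = [0.0] * hue_bins
    for h, s, v in hsv:
        hist[bin_of(h)] += s
    best = max(range(hue_bins), key=hist.__getitem__)
    in_bin = [(s, v) for h, s, v in hsv if bin_of(h) == best]
    return {
        "dominant_hue": (best + 0.5) / hue_bins,                  # bin centre, 0-1
        "dominant_sat": sum(s for s, _ in in_bin) / len(in_bin),
        "dominant_val": sum(v for _, v in in_bin) / len(in_bin),
        "high_sat_ratio": sum(1 for _, s, _ in hsv if s > vivid_threshold) / len(hsv),
    }

# Three pure-blue pixels and one grey pixel
features = color_features([(0.0, 0.0, 1.0)] * 3 + [(0.5, 0.5, 0.5)])
```

For this toy input the dominant hue lands near 2/3 (blue on the 0-1 hue scale) and three of the four pixels count as vivid.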

LIGHTING (6)

  • specular_highlights - Shiny spots
  • shadow_intensity - How dark shadows are
  • shadow_ratio - % of image in shadow
  • brightness_var - Light variation
  • local_contrast - Edge sharpness
  • obj_bg_contrast - Object vs background

MATERIAL (5)

  • reflectivity - Mirror-like quality
  • transparency - See-through quality
  • roughness - Texture graininess
  • metallic - Metal-like appearance
  • color_uniformity - Solid vs varied

SHAPE (7)

  • circularity - How round (0-1)
  • solidity - Filled vs hollow
  • spectral_entropy - Frequency complexity
  • graph_density - Edge connectivity
  • mean_curvature - Average bendiness
  • curv_variance - Curvature variation
  • affine_curvature - Scale-invariant curve

TOPOLOGY (6)

  • euler_char - V - E + F formula
  • genus - Number of "handles"
  • num_holes - Holes in object
  • complexity - Overall intricacy
  • spectral_gap - Graph eigenvalue gap
  • betti_1 - First Betti number (count of independent loops)
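For a flat binary object mask, some of the topology features can be derived from each other: the Euler characteristic is V - E + F over the pixel squares, and the number of holes (the first Betti number in 2D) equals the number of connected components minus the Euler characteristic. The sketch below is plain Python under those assumptions, with pixels treated as 8-connected; it is not the project's extraction code.

```python
def euler_characteristic(mask):
    """V - E + F over the unit squares of a binary mask (list of rows of 0/1)."""
    vertices, edges, faces = set(), set(), 0
    for y, row in enumerate(mask):
        for x, filled in enumerate(row):
            if not filled:
                continue
            faces += 1
            vertices.update([(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)])
            edges.update([
                ((x, y), (x + 1, y)),          # top
                ((x, y + 1), (x + 1, y + 1)),  # bottom
                ((x, y), (x, y + 1)),          # left
                ((x + 1, y), (x + 1, y + 1)),  # right
            ])
    return len(vertices) - len(edges) + faces

def count_components(mask):
    """Number of 8-connected foreground components (iterative flood fill)."""
    seen, count, h = set(), 0, len(mask)
    for y in range(h):
        for x in range(len(mask[y])):
            if not mask[y][x] or (x, y) in seen:
                continue
            count += 1
            stack = [(x, y)]
            seen.add((x, y))
            while stack:
                cx, cy = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= ny < h and 0 <= nx < len(mask[ny])
                                and mask[ny][nx] and (nx, ny) not in seen):
                            seen.add((nx, ny))
                            stack.append((nx, ny))
    return count

def topology_features(mask):
    chi = euler_characteristic(mask)
    holes = count_components(mask) - chi
    return {"euler_char": chi, "num_holes": holes, "betti_1": holes}

ring = [[1, 1, 1],
        [1, 0, 1],
        [1, 1, 1]]
ring_features = topology_features(ring)
```

The 3×3 ring has 16 vertices, 24 edges and 8 faces, giving an Euler characteristic of 0 and exactly one hole.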
Training Progression Samples
Transparency: How This Was Made

Background

This project was developed over several months while studying programming at vocational school (1.5 years). The code was written in dialogue with AI assistance (Claude) - I would write code, discuss problems, understand solutions, and then implement them in my own way.

What I Created

  • The core concept: Using non-semantic numerical features instead of text prompts - my own idea from trying to "reverse-engineer" how to describe images without words
  • Architecture decisions: Channel progression (64→128→256→512→512), attention placement at 32×32 and 16×16, number of ResBlocks per level
  • The 32 feature system: Selecting which visual properties matter (color temperature, topology, curvature, etc.) and how to normalize them
  • Project structure: Organizing files, naming conventions, splitting code into modules
  • All experiments: 18 failed GAN versions before switching to diffusion, each with different approaches
  • Dataset: ~7000 of my own photographs, manually curated

Examples of Code I Wrote

What | Example of my decisions
Training parameters | Choosing batch_size=4, lr=1e-4, 1000 timesteps, cosine schedule
Model config | base_channels=64, channel_mult=[1,2,4,8,8], num_res_blocks=2
Feature selection | Deciding to use Euler characteristic and Betti numbers for topology
File organization | Splitting into diffusion_model.py, train_diffusion.py, build_features.py
Logging & checkpoints | Saving samples every N epochs, tensorboard integration
Data augmentation | Choosing which transforms to apply (flips, color jitter, etc.)
Debugging | Finding why training diverged, fixing NaN losses, memory optimization

What AI Helped With

I used both ChatGPT and Claude Code throughout the project. They brought different perspectives and strengths, which helped me understand problems from multiple angles.

Area | Type of help
Monotonous calculations | Writing repetitive code patterns, boilerplate, tensor reshaping operations
Math implementation | Translating paper equations into PyTorch (alpha schedules, noise sampling, loss functions)
Concrete examples | Showing how a concept looks in actual code, not just theory
Implementing principles | Turning design ideas ("I want attention here") into working code
Finding links & references | Locating relevant papers, documentation, and tutorials
Debugging | Explaining error messages, finding tensor shape mismatches, memory issues
Library usage | Correct syntax for OpenCV, scikit-image, PyTorch, torchvision functions
Concepts | Explaining attention mechanisms, normalization techniques, loss landscapes
Code structure | Suggesting how to organize classes, split files, write docstrings
Different perspectives | ChatGPT and Claude often suggested different approaches - comparing them helped me learn

Libraries Used

Papers Referenced

  • DDPM - Ho, Jain, Abbeel: "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
  • DDIM - Song, Meng, Ermon: "Denoising Diffusion Implicit Models" (ICLR 2021)
  • Improved DDPM - Nichol, Dhariwal: "Improved Denoising Diffusion Probabilistic Models" (2021)
  • U-Net - Ronneberger et al.: "U-Net: Convolutional Networks for Biomedical Image Segmentation" (2015)
  • Attention - Vaswani et al.: "Attention Is All You Need" (NeurIPS 2017)
  • Group Norm - Wu, He: "Group Normalization" (ECCV 2018)
IN SUMMARY
This project was developed in dialogue with AI, combining my 1.5 years of programming studies with AI assistance for implementation details. The ideas, design decisions, experiments, and dataset are my own work. No pretrained weights were used - the model was trained from scratch on ~7000 of my own photos.