C:\min_bild_ai\archive.exe

min_bild_ai

Complete archive of a from-scratch sculpture generator
Artist: Nadja Barenje Lindgren
Project: MFA thesis work, Konstfack University of Arts, Crafts and Design
Period: November 2025 - April 2026
Experiments: 46 documented runs across 8 phases
Total GPU time: ~400 hours on a single consumer graphics card
Hardware: NVIDIA GTX 1660 Super (6 GB VRAM, ~6.5 TFLOPS FP32)
Dataset: 7,843 photographs from the artist's camera roll (256x256 px)
Principle: Everything trained from scratch. No pretrained models. No external data.
System freezes survived: 2 (requiring hard power-off and checkpoint recovery)
"A sculpture is anything centered in the image,
with defined edges and figure-ground separation."

This is a STRUCTURAL definition, not a semantic one.
The machine must learn form, not meaning.

No pretrained models. No shortcuts. No compromises.
After 400 hours and 46 experiments, the answer is not an architecture.
The answer is a curriculum.
Timeline: 8 months, 400 hours, 46 experiments
I: StyleGAN2 x17 -> CRASH -> II: Diffusion x5 -> III: Progressive x4 -> IV: Cond x9 -> V: VAE x3 -> VI: Token x6 -> VII: Return

Phase | Architecture | Experiments | GPU Time | Key Result
I | StyleGAN2 | 17 | ~150h | All mode-collapsed. 30M params too many for 7K images.
II | Diffusion (pixel, latent, edge, slot) | 5 | ~80h | MSE averages to dataset mean. Only textures, no objects.
III | Progressive GAN | 4 | ~15h | BREAKTHROUGH. Forms at 128x128. Figure-ground separation.
IV | Conditional GAN / AC-GAN | 9 | ~35h | Categories melt together. Discriminator dominance.
V | VQ-VAE | 3 | ~10h | Working tokenizer (loss 0.0009). Near-perfect reconstruction.
VI | Token Transformer / MaskGIT | 6 | ~80h | Token imbalance: 4 of 512 codes = 76.5% of data. Mode collapse.
VII | Progressive GAN (return) | 2 | ~5h | Confirmed best. Continued to epoch 100.

After 400 hours across every major generative architecture family (GAN, Diffusion, VAE, Transformer, Flow Matching), the central finding: "Architecture does NOT solve data problems." 7,000 heterogeneous images + from-scratch training + unconditional generation = a mathematically underdetermined system. The only architecture that succeeded was the one that decomposed the problem into a curriculum: progressive growing from 4x4 to 128x128, learning coarse structure before fine detail.

What worked

  • Progressive growing (coarse-to-fine curriculum)
  • VQ-VAE encoder/decoder (0.0009 total loss)
  • Small models (~4.5M params) on small data
  • WGAN-GP loss (more stable than vanilla GAN)
  • Watchdog scripts for crash recovery

What failed

  • StyleGAN2 (30M params = overfitting)
  • Pixel-space diffusion (MSE averages everything)
  • Unconditional token generation (underdetermined)
  • Multi-class conditioning (too few per class)
  • Edge-first approaches (too sparse for GAN)

Principles maintained throughout

  • NO pretrained models (CLIP, Inception, SD, LoRA)
  • NO "mood generator" compromises accepted
  • NO repeated experiments (each must be novel)
  • Everything learned from the artist's data alone

Hardware constraints

  • GTX 1660 Super: 6 GB VRAM, ~6.5 TFLOPS
  • Max batch size: 12 (larger causes OOM freeze)
  • GPU temp limit: 80C (watchdog kills process)
  • 2 system freezes survived via checkpoint recovery
The Dataset: 7,843 photographs from an artist's life

The entire project is built on a single dataset: 7,843 photographs extracted from the artist's camera roll, resized to 256x256 pixels. This is the only training data. No external images, no augmentation from other sources, no synthetic data.

The dataset is radically heterogeneous. It contains everything an artist photographs over years of practice: glass sculptures in various stages of completion, ceramic works, metal castings, studio environments, gallery installations, exhibition documentation, close-ups of material surfaces, text documents, screenshots, personal photographs, nature, objects found on the street, reference images, and everyday life.

This heterogeneity is both the defining challenge and the philosophical core of the project. A pretrained model (Stable Diffusion, DALL-E) would impose its own visual vocabulary learned from billions of internet images. Training from scratch on this specific collection means the model can only learn from this particular visual world. Every texture it generates, every color it chooses, every form it produces comes exclusively from Nadja's photographic practice.

Dataset variants used across experiments

Dataset variant | Images | Size | Purpose
dataset_v15_actual (primary) | 7,843 | 2.0 GB | Full original camera roll extraction
dataset_no_gray | 6,806 | 1.9 GB | Removed grayscale/B&W images
dataset_isolated | 4,133 | 471 MB | Images with isolated objects on backgrounds
dataset_strict | 1,444 | 176 MB | Most strictly curated: centered objects only
dataset_curated_v2 | 750 | 390 MB | Hand-picked best sculpture images
dataset_curated | 917 | 120 MB | First curation pass
curated_sculptures | 300 | 90 MB | Top 300 sculpture-like images
dataset_edges | 7,501 | 124 MB | Sobel edge maps of all images
dataset_glass_only | 67 | 23 MB | Glass objects only (too few for training)
dataset_glass_v2 | 116 | 43 MB | Expanded glass set (still too few)

Preprocessing pipeline

Beyond the raw images, extensive feature extraction was performed to support conditional generation and analysis:

  • Feature vectors (6.6 GB): 32-dimensional descriptors per image including color statistics, edge density, symmetry, and texture measures
  • SAM segmentation masks (35 MB): Automatic object segmentation using Segment Anything (applied to data only, not to model training)
  • Edge maps (124 MB): Sobel-filtered versions for edge-aware training experiments
  • VQ-VAE tokens (246 MB): Pre-computed discrete token representations for transformer experiments

The fundamental data problem

Modern generative models are typically trained on tens of thousands to billions of images (FFHQ: 70K faces, ImageNet: 14M images, LAION: 5B image-text pairs). This project uses 7,843 heterogeneous images. Successful from-scratch training at this scale requires at least one of: (a) a very homogeneous dataset (all similar images), (b) a very small model, or (c) a training curriculum that decomposes the problem. The project discovered option (c).

The dataset defines the ceiling. No architecture can generate objects the data doesn't contain enough examples of. Glass-only training (67-116 images) immediately mode-collapses. The full heterogeneous dataset produces diverse but unstructured outputs. Only strict curation + progressive training produced sculptural forms.

Phase I: StyleGAN2 (EXP-001 to EXP-017) [17 EXPERIMENTS FAILED]

Period: November 2025 - February 2026 | Duration: ~150 hours | Experiments: 17

The project began with StyleGAN2, the dominant GAN architecture for high-quality image generation. The reasoning: StyleGAN2 generates photorealistic 1024x1024 faces from 70,000 aligned images (FFHQ). Surely it can generate 256x256 sculptures from 7,843 images.

It could not. All 17 variants suffered the same three problems:

  1. Mode collapse: The generator converges on producing the same image repeatedly, or a small cluster of near-identical outputs. The discriminator learns too quickly and the generator cannot recover.
  2. Discriminator dominance: With few training images, the discriminator memorizes the dataset and rejects everything the generator produces. The adversarial game becomes unwinnable.
  3. Unstable training: Loss values oscillate wildly. Checkpoints from adjacent epochs produce radically different quality levels. No reliable convergence trajectory.

What was tested across 17 variants

Variant | Modification | Result
Standard StyleGAN2 | Default hyperparameters | Mode collapse by epoch 20
ADA (Adaptive Discriminator Augmentation) | Data augmentation in discriminator | Delayed collapse, still fails
Reduced model | Fewer channels, smaller mapping network | Same collapse with worse quality
Various learning rates | 1e-3 to 1e-5 for G and D | Slower collapse, same outcome
Various batch sizes | 2, 4, 8, 12 | Batch 12 = VRAM limit. No improvement.
R1 regularization | Different gamma values | Marginal stability improvement
Path length regularization | Smoother latent space | No visible effect

StyleGAN2 has approximately 30 million parameters. With 7,843 training images, that is ~3,800 parameters per training image. For comparison, FFHQ training has ~430 parameters per image. The model has too many degrees of freedom relative to the constraints provided by the data.
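The ratio argument above is plain arithmetic and can be checked with a few lines of Python. This is a minimal sketch; the parameter count is the approximate figure quoted in the text, and the function and constant names are illustrative only.

```python
# Approximate figures taken from the text; names are illustrative.
STYLEGAN2_PARAMS = 30_000_000
PROJECT_IMAGES = 7_843
FFHQ_IMAGES = 70_000


def params_per_image(n_params: int, n_images: int) -> float:
    """Trainable parameters available per training image."""
    return n_params / n_images


project_ratio = params_per_image(STYLEGAN2_PARAMS, PROJECT_IMAGES)
ffhq_ratio = params_per_image(STYLEGAN2_PARAMS, FFHQ_IMAGES)

print(f"this project: {project_ratio:,.0f} params/image")
print(f"FFHQ:         {ffhq_ratio:,.0f} params/image")
print(f"gap:          {project_ratio / ffhq_ratio:.1f}x")
```

The gap depends only on the two dataset sizes (70,000 / 7,843), so it holds for any model trained on both.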

The 150 hours spent here were not wasted. They established a critical negative result: standard GAN architectures designed for large, homogeneous datasets cannot be scaled down to small, heterogeneous ones by simply adjusting hyperparameters. The fundamental architecture must change.

STYLEGAN2 ARCHITECTURE

  Latent z (512-dim) --> Mapping Network (8 FC layers) --> w (512-dim)
                                                            |
          +-------- Style modulation (AdaIN) ---------------+
          |
  Synthesis: 4x4 --> 8x8 --> 16x16 --> 32x32 --> 64x64 --> 128x128 --> 256x256
             [Noise injection at every layer]
             [Skip connections + progressive growing]

  Total parameters:  ~30,000,000
  Training images:   7,843
  Params per image:  ~3,800 (FFHQ baseline: ~430)

  Result: MODE COLLAPSE in all 17 configurations

"GAN architectures have a fundamental problem with discriminator dominance on small datasets. Need to switch to a non-adversarial approach."
-- EXP-001-017, conclusion, February 2026
Mode collapse across all 17 variants. The generator converges on producing the dataset mean (a blurry brownish-gray average). The discriminator memorizes the small dataset and rejects all generated samples. No hyperparameter configuration can compensate for a parameter-to-data ratio roughly 9x worse than in standard StyleGAN2 training (~3,800 vs ~430 parameters per image).
A 30M-parameter model cannot learn meaningful structure from 7,843 heterogeneous images. The model-to-data ratio is more important than architectural sophistication. This lesson guided all subsequent experiments toward smaller models.

The Rescue: System Freeze and Honest Assessment (February 7, 2026) [SYSTEM CRASH]

Date: February 7, 2026, 23:45 | Event: Complete system freeze during training

During a latent diffusion training run, the system ran out of memory and froze completely. The screen went black. No keyboard response. The only option: hold the power button for 5 seconds and hope the last checkpoint was saved.

This was not the first freeze. The previous day (February 6) the system had also frozen at epoch 8 of a latent diffusion experiment, losing 1.2 hours of training. The cause: contour-aware loss combined with a large batch size consumed all 6 GB of VRAM simultaneously, triggering an out-of-memory condition that the Linux kernel could not recover from.

After reboot, a comprehensive rescue session was conducted. Every checkpoint was examined. Every component was honestly evaluated. The resulting document, RADDNINGSRAPPORT.md, is the most honest assessment in the project's history.

Honest assessment at the time of rescue

Component | Status | Honest evaluation
VAE v3 (Vaeana) | Working | MSE 0.0128, 11M params, latent 8x32x32. Good structure, separation ratio 7.79.
NadjaEMBED | Overtrained | Best at epoch 20 (val_loss 2.88), got worse by epoch 100 (3.50). Only 14% recall.
Latent Diffusion | Plateaued | Epoch 46/150, val_loss 0.3491, no improvement since epoch 40.

What the model could do

  • Generate textures and material-like surfaces
  • Respond to material prompts: "glass" produced blue-transparent tones, "metal" produced metallic surfaces
  • Vary output intensity via CFG scale

What the model could NOT do

  • Generate objects or sculptures (no contours, no form)
  • Produce figure-ground separation (everything was texture-on-texture)
  • Avoid text artifacts at certain prompt configurations
  • Improve beyond epoch 40 (fundamental plateau)
"The project has achieved technical progress but not the goal."
-- RADDNINGSRAPPORT.md, February 7, 2026, 23:45

The rescue report offered four possible paths forward:

  1. Accept as texture/mood generator -- immediately rejected (violates DEC-002)
  2. Curate dataset to 500-1000 focused images -- estimated 2-4 days
  3. Add pretrained backbone -- rejected (violates DEC-001, from-scratch principle)
  4. New architecture (Slot Attention) -- estimated 1-2 weeks

The freeze also led to new safety infrastructure: watchdog scripts that monitor GPU temperature, VRAM usage, and system RAM every 10 seconds, automatically killing the training process before the system can freeze. Maximum 5 automatic restarts. Pause functionality via a touch file. These scripts remained in use for every subsequent experiment.
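The watchdog behaviour described above can be sketched roughly as follows. This is not the project's actual script: the VRAM threshold, the pause-file name and the function names are assumptions, and system RAM monitoring is left out to keep the sketch dependency-free. Only the 80C limit, the 10-second interval and the 5-restart cap come from the text. GPU readings are taken from `nvidia-smi`.

```python
import subprocess
import time
from pathlib import Path

# Hypothetical settings; only the 80 C limit, the 10-second interval and the
# 5-restart cap come from the text.
TEMP_LIMIT_C = 80
VRAM_LIMIT_MB = 5800          # assumed safety margin below 6 GB
POLL_SECONDS = 10
MAX_RESTARTS = 5
PAUSE_FILE = Path("PAUSE")    # assumed name of the touch file


def read_gpu():
    """Return (temperature in C, VRAM used in MB) for the first GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    temp, vram = out.strip().splitlines()[0].split(",")
    return int(temp), int(vram)


def unsafe(temp_c, vram_mb):
    """True when either reading has reached its limit."""
    return temp_c >= TEMP_LIMIT_C or vram_mb >= VRAM_LIMIT_MB


def watch(cmd, read=read_gpu, sleep=time.sleep):
    """Run cmd under supervision; return how many restarts were needed."""
    restarts = 0
    while True:
        while PAUSE_FILE.exists():        # pause requested via touch file
            sleep(POLL_SECONDS)
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            if unsafe(*read()):
                proc.kill()               # stop training before the system freezes
                proc.wait()
                break
            sleep(POLL_SECONDS)
        if proc.returncode == 0:
            return restarts               # training finished cleanly
        if restarts == MAX_RESTARTS:
            raise RuntimeError("restart limit reached")
        restarts += 1

# Example call (train.py is a placeholder name):
# watch(["python", "train.py", "--resume"])
```

Passing `read` and `sleep` as arguments is a design choice for this sketch: it lets the kill-and-restart logic be exercised without a GPU.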

System crashes are not just technical inconveniences. They force honest assessment. The rescue report was the first document that admitted "the project has achieved technical progress but not the goal." This honesty redirected the entire research trajectory toward progressive growing one week later.

Phase II: Diffusion Models (EXP-018 to EXP-022) [FAILED]

Period: February 6-13, 2026 | Duration: ~80 hours | Experiments: 5

After the GAN failures, the project turned to diffusion models, which learn to generate images by reversing a noise-adding process. Unlike GANs, diffusion models do not require adversarial training, which should make them more stable on small datasets.

EXP-018 Pixel Diffusion: A U-Net trained to predict noise in pixel space (256x256x3 = 196,608 dimensions). The model minimizes mean squared error between predicted and actual noise. Over 300 epochs, the model learned to predict the average noise, producing images that converge on the dataset mean: a washed-out brownish-gray blur with no spatial structure. This is mathematically expected. MSE minimization on heterogeneous data produces the statistical average.
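The averaging effect can be illustrated with a toy example. The sketch below fits one shared prediction to three dissimilar targets under MSE and ends up at their mean. It is only an illustration with made-up values: a real diffusion U-Net conditions its prediction on the noisy input and the timestep, so this shows the limiting case where that conditioning carries little usable signal.

```python
import numpy as np

# Three deliberately dissimilar 4-pixel "images" (made-up values).
targets = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])


def mse(pred, targets):
    """Mean over images of the squared distance to one shared prediction."""
    return float(np.mean(np.sum((targets - pred) ** 2, axis=1)))


# Gradient descent on a single prediction that must serve every target.
rng = np.random.default_rng(0)
pred = rng.normal(size=targets.shape[1])
for _ in range(500):
    grad = 2.0 * (pred - targets.mean(axis=0))   # gradient of mse w.r.t. pred
    pred -= 0.1 * grad

print("learned prediction:", np.round(pred, 3))
print("dataset mean:      ", np.round(targets.mean(axis=0), 3))
```

The learned prediction scores a lower MSE than any individual target would, which is why the optimizer prefers the blur over any real image.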

EXP-019 CLIP Diffusion: Attempted to use CLIP embeddings for generation guidance. Immediately abandoned. CLIP is pretrained on 400 million internet image-text pairs. Using it would inject visual knowledge from outside Nadja's dataset, fundamentally violating the from-scratch principle. Even using pretrained networks for evaluation only was rejected (FID scores, for example, require a pretrained Inception network).

EXP-020 NadjaEMBED: A custom material classifier (11.8M params) trained from scratch on the dataset itself, then used to guide diffusion. The embedding learned weak material associations (glass vs metal: cosine similarity 0.028, correctly low; glass vs crystal: 0.318, correctly similar) but with only 14% recall. Too noisy to guide generation meaningfully. The outputs were still blurry textures without object structure.

EXP-021 SlotDiffusion: Slot attention decomposes a scene into separate object "slots" and reconstructs each independently. Combined with diffusion, it should learn object-level generation. However, slot attention was designed for synthetic datasets (CLEVR, MOVi) with clear, isolated objects on uniform backgrounds. Nadja's photographs are complex, multi-layered, and have no clear figure-ground boundary in most images. The model could not decompose anything meaningful.

EXP-022 Edge-First: Extract Sobel edge maps from all images, train a GAN on the edge maps (which are simpler), then colorize. The edge maps proved too sparse for GAN training. Most images have weak or diffuse edges, and the edge maps collapsed to near-empty images. A GAN trained on mostly-empty images produces empty images.
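How sparse an edge map is can be measured directly. The following is a minimal NumPy sketch of a Sobel filter plus an edge-density score, meaning the fraction of pixels with a strong gradient. The threshold value and function names are assumptions, not the project's preprocessing code.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
KY = KX.T


def filter3x3(img, kernel):
    """Valid-mode 3x3 cross-correlation of a 2D array."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out


def sobel_magnitude(gray):
    """Gradient magnitude of a grayscale image."""
    return np.hypot(filter3x3(gray, KX), filter3x3(gray, KY))


def edge_density(gray, threshold=0.5):
    """Fraction of pixels whose gradient magnitude exceeds the threshold."""
    return float(np.mean(sobel_magnitude(gray) > threshold))


flat = np.full((8, 8), 0.3)          # uniform image: no edges at all
step = np.zeros((8, 8))
step[:, 4:] = 1.0                    # one sharp vertical boundary

print(edge_density(flat), edge_density(step))
```

An image with weak or diffuse edges behaves like `flat`: nearly every pixel of its edge map is empty, which leaves a GAN very little to learn from.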

PIXEL DIFFUSION vs LATENT DIFFUSION

PIXEL SPACE (EXP-018):
  Image (256x256x3) --[add noise]--> Noisy image --[U-Net]--> Predicted noise
  Output dimensions: 196,608
  Problem: MSE over 196K dims with 7K images = learns the AVERAGE

LATENT SPACE (attempted later):
  Image --[VAE Encoder]--> Latent (8x32x32) --[add noise]--> Noisy latent --[U-Net]--> Predicted noise
  Output dimensions: 8,192 (24x fewer)
  Still produces textures, not objects (the data problem persists)

EDGE-FIRST (EXP-022):
  Image --[Sobel filter]--> Edge map --[GAN]--> Generated edges --[Colorize]--> Image
  Problem: Edge maps are mostly empty (sparse data for GAN training)

Pixel-space diffusion with MSE loss on heterogeneous data mathematically converges on the dataset mean. The model learns to predict average noise, producing average images. With 196,608 output dimensions and 7,843 training examples, the system is wildly underdetermined. Reducing to 8,192 latent dimensions helps computationally but does not solve the semantic problem: the data is still too heterogeneous for unconditional generation.
Diffusion models are more stable than GANs but face the same fundamental data problem. Stability does not equal quality when the data cannot constrain the output space. Moving to latent space (fewer dimensions) improves speed but not semantic learning. The problem is not the number of dimensions but the lack of structure in the training signal.

Phase III: Progressive GAN (EXP-023 to EXP-026) [BREAKTHROUGH]

Period: February 14-17, 2026 | Duration: ~15 hours | Experiments: 4

After 22 failed experiments and 230 hours of GPU time, a fundamentally different approach. Instead of generating 256x256 images in one shot, the model starts at 4x4 resolution and gradually increases: 4x4 -> 8x8 -> 16x16 -> 32x32 -> 64x64 -> 128x128. Each level trains until stable before the next is introduced.

This changes the learning problem entirely. At 4x4 (16 pixels), the model only needs to learn rough color distributions. At 8x8, basic spatial layout. At 16x16, coarse shapes. Each resolution level inherits stable features from the previous level and only needs to learn the additional detail at its own scale. The model never faces the full 256x256 generation problem at once.
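The coarse-to-fine schedule can be sketched as a resolution ladder plus a matching downsampled copy of the real images for the discriminator at each level. This is an illustrative outline only: the function names are hypothetical, the random batch stands in for the dataset, and the actual generator and discriminator growth is omitted.

```python
import numpy as np


def resolution(level: int) -> int:
    """Side length at a given level: 4, 8, 16, ... (level 5 gives 128)."""
    return 4 * 2 ** level


def downsample(images: np.ndarray, size: int) -> np.ndarray:
    """Average-pool a batch (N, H, W, C) of square images to size x size."""
    n, h, w, c = images.shape
    f = h // size                      # pooling factor; assumes h == w and h % size == 0
    return images.reshape(n, size, f, size, f, c).mean(axis=(2, 4))


# Stand-in batch of 256x256 RGB images (random values, not real data).
batch = np.random.default_rng(0).random((2, 256, 256, 3))

for level in range(6):
    size = resolution(level)
    real = downsample(batch, size)
    # A training loop would run here until this level is stable,
    # then add the next pair of generator/discriminator blocks.
    print(f"level {level}: train at {size}x{size}, real batch shape {real.shape}")
```

At level 0 each real image is reduced to 16 averaged pixels, so only coarse color layout is left to learn, which matches the description above.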

EXP-023 Progressive L0-4 (up to 64x64): The first time the model generated recognizable FORMS. At level 4 (64x64), blobs with defined edges and spatial structure appeared. Not sculptures yet, but objects with figure-ground separation. After 22 experiments of nothing but noise, textures, and mode collapse, seeing a centered form with a defined boundary was a breakthrough moment.

EXP-025 Progressive L5 (128x128): THE breakthrough. After ~4 hours of training (50 epochs at level 5, WGAN-GP loss), the model produced outputs with seven identifiable properties:

  1. Document-like patterns: text-like structures (learned from the screenshots and documents in the dataset)
  2. Crystalline forms: blue-green organic structures resembling glass
  3. Figure-ground separation: IT WORKS. Objects against backgrounds.
  4. Color richness: Nadja's full palette (not collapsed to gray)
  5. Sharpness: dramatically better than 64x64
  6. Compositional variation: different layouts per sample
  7. Organic forms: biological/natural-looking structures
"THESE ARE SCULPTURES according to Nadja's definition!"
-- EXP-025-progressive-L5.md, February 15, 2026

The Progressive GAN uses approximately 4.5 million parameters (vs 30M for StyleGAN2). That is ~574 parameters per training image, a 6.6x improvement in the parameter-to-data ratio compared to StyleGAN2.

Training metrics showed healthy GAN dynamics: G-loss improved from 49.6 to 40.6 (18% reduction), D-loss stabilized at -5.9 (WGAN-GP target range). No mode collapse. No discriminator dominance.
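For reference, the WGAN-GP objective in its standard published form is written out below. The project's exact penalty weight and implementation details are not stated in the archive, so the symbols here follow the usual convention: the critic D, real distribution P_r, generator distribution P_g, and penalty weight lambda.

```latex
\begin{aligned}
L_D &= \mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big]
     - \mathbb{E}_{x\sim P_r}\big[D(x)\big]
     + \lambda\,\mathbb{E}_{\hat{x}\sim P_{\hat{x}}}
       \Big[\big(\lVert\nabla_{\hat{x}} D(\hat{x})\rVert_2 - 1\big)^2\Big] \\
L_G &= -\,\mathbb{E}_{\tilde{x}\sim P_g}\big[D(\tilde{x})\big] \\
\hat{x} &= \epsilon\,x + (1-\epsilon)\,\tilde{x}, \qquad \epsilon\sim U[0,1]
\end{aligned}
```

Under this convention a negative critic loss is normal: the first two terms estimate the negative Wasserstein distance between real and generated samples, so a stable negative value indicates a critic that still separates the two without running away.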

EXP-026 Progressive L6 (256x256): Attempted to scale beyond 128x128. Result: NaN losses required layer freezing to stabilize. The output at 256x256 was "abstract blobs" -- the additional resolution did not add meaningful structure. Higher resolution requires more data or more targeted conditioning. The conclusion: 128x128 is the resolution ceiling for this dataset with unconditional generation.

PROGRESSIVE GROWING ARCHITECTURE

  Level 0: z (512) --> [Dense 4x4x512] --> 4x4      (16 pixels, ~3 min)
  Level 1: ...     --> [UpConv + Conv] --> 8x8      (64 pixels, ~3 min)
  Level 2: ...     --> [UpConv + Conv] --> 16x16    (256 pixels, ~3 min)
  Level 3: ...     --> [UpConv + Conv] --> 32x32    (1,024 pixels, ~5 min)
  Level 4: ...     --> [UpConv + Conv] --> 64x64    (4,096 pixels, ~20 min)
                                           ^--- first forms appear here
  Level 5: ...     --> [UpConv + Conv] --> 128x128  (16,384 pixels, ~4 hours)
                                           ^--- BREAKTHROUGH: sculpture-like output

  Loss:         WGAN-GP (Wasserstein + gradient penalty)
  Parameters:   ~4,500,000 (6.6x less than StyleGAN2)
  Params/image: ~574 (vs 3,800 for StyleGAN2)

  KEY: Each level trains to stability BEFORE the next is added.
       The model never faces the full generation problem at once.

Progressive growing succeeds because it transforms an intractable problem (generate 128x128 from noise) into a sequence of tractable sub-problems (generate 4x4, then add detail to 8x8, then to 16x16...). This is structurally analogous to biological vision development and to how human artists build images: rough composition first, then refinement. The curriculum IS the architecture.

Phase IV: Conditional GAN / AC-GAN (EXP-027 to EXP-035) [FAILED]

Period: February 18-22, 2026 | Duration: ~35 hours | Experiments: 9

With progressive GAN producing forms, the next question: can we steer the output? Can we ask for "glass" or "sculpture" specifically?

Two main approaches were tried: class-conditional GAN (add category labels to both generator and discriminator) and AC-GAN (Auxiliary Classifier GAN, where the discriminator also classifies images by material category).

EXP-027-030 Conditional GAN (4 variants): Added material category labels as conditioning input. The model received both a noise vector and a class embedding. Result: categories melt together. With only 67-523 images per material class, there are insufficient examples to define clear class boundaries. The generator produces compromises between classes, blending glass-like transparency with ceramic-like opacity into formless intermediate states.

EXP-031 Conditional from scratch: Simplified to binary conditioning (is_sculpture: 0/1). Partial success -- the model responds to the conditioning signal, but the binary label is too coarse to produce meaningfully different outputs.

EXP-032-035 AC-GAN (4 variants): The discriminator has a dual task: (1) classify real vs fake, and (2) classify material category. This provides a classification gradient to the generator. Result: discriminator dominance. The discriminator learns to classify perfectly (both real/fake and category) but the classification head creates conflicting gradients for the generator. The generator cannot simultaneously maximize realism AND match specific material categories with so few examples per class.
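The dual task can be written out using the standard AC-GAN objective from the literature. The project's variants may have combined the classification term with a different adversarial loss, which the archive does not specify. Here S is the real/fake source, C the material class, and c the label attached to a sample.

```latex
\begin{aligned}
L_S &= \mathbb{E}\big[\log P(S=\text{real}\mid X_{\text{real}})\big]
     + \mathbb{E}\big[\log P(S=\text{fake}\mid X_{\text{fake}})\big] \\
L_C &= \mathbb{E}\big[\log P(C=c\mid X_{\text{real}})\big]
     + \mathbb{E}\big[\log P(C=c\mid X_{\text{fake}})\big] \\
D &:\ \max\; L_S + L_C \qquad\qquad G :\ \max\; L_C - L_S
\end{aligned}
```

Both networks push L_C upward, but they pull L_S in opposite directions. With very few images per class, the discriminator can drive both terms to their optimum on the training set, and the generator then receives a class gradient defined by a memorized handful of examples.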

A 32-dimensional feature vector (color statistics, edge density, symmetry, texture measures) was also tested as a conditioning signal. Maximum measured effect on output: 0.45 on a 0-1 scale. Not enough to steer the generation meaningfully. The features do not capture glass-semantics.

Conditional generation on small datasets is counterproductive. With 67-523 images per class, the model has insufficient examples to learn class-specific features. The discriminator memorizes the per-class statistics and dominates training. The generator produces inter-class compromises that satisfy neither the adversarial loss nor the classification loss.
Conditioning requires enough examples per condition to define clear decision boundaries. A binary signal (sculpture / not-sculpture) is better than fine-grained multi-class labels on a small dataset. Complex 32-dimensional features are worse than simple binary labels because they introduce 32 weakly-correlated dimensions of noise into the conditioning signal.

Phase V: VQ-VAE (EXP-036 to EXP-038) [SUCCESS]

Period: February 22-24, 2026 | Duration: ~10 hours | Experiments: 3

A strategic pivot: instead of trying to generate images directly, first learn a discrete vocabulary of visual tokens. A Vector Quantized Variational Autoencoder (VQ-VAE) compresses images into a grid of discrete codes from a learned codebook. This separates the representation problem (how to encode images) from the generation problem (how to produce new ones).
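The quantization step at the heart of a VQ-VAE is a nearest-neighbour lookup into the codebook. The sketch below shows that step alone in NumPy, using the codebook dimensions quoted for this project (512 entries, 256 dimensions). The encoder, decoder and training loop are omitted, the feature grid is a small random stand-in, and the function name is hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Replace each feature vector by its nearest codebook entry.

    features: (H, W, D) encoder output, codebook: (K, D).
    Returns token indices (H, W) and quantized features (H, W, D).
    """
    flat = features.reshape(-1, features.shape[-1])              # (H*W, D)
    # Squared Euclidean distances via ||a||^2 - 2 a.b + ||b||^2.
    d2 = ((flat ** 2).sum(axis=1, keepdims=True)
          - 2.0 * flat @ codebook.T
          + (codebook ** 2).sum(axis=1))
    tokens = d2.argmin(axis=1)
    quantized = codebook[tokens]
    return tokens.reshape(features.shape[:-1]), quantized.reshape(features.shape)


rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 256))                 # random stand-in codebook
true_ids = rng.integers(0, 512, size=(8, 8))           # small grid for the demo
features = codebook[true_ids] + 0.01 * rng.normal(size=(8, 8, 256))

tokens, quantized = quantize(features, codebook)
print(tokens.shape, quantized.shape)
```

Only the integer grid `tokens` needs to be stored or modelled afterwards; the decoder reconstructs the image from `codebook[tokens]`.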

EXP-036 VQ-VAE 32x32 latent: Input 128x128, compressed to 32x32 grid of codebook indices. 512-entry codebook, each entry 256-dimensional. Result: works, but loses fine detail (4x spatial downsampling).

EXP-037 VQ-VAE 64x64 latent: Same architecture but 2x downsampling only. 4,096 tokens per image. Better reconstruction, preserves more detail.

EXP-038 Full dataset + augmentation: Trained on all 7,843 images with RandomCrop, Flip, ColorJitter, and Rotation. 1,000 epochs. Result: a richer, more diverse codebook. Better "visual vocabulary."

Across two major training runs (vqvae and vqvae_v2), the model accumulated 348 checkpoints and converged to a total loss of 0.0009:

  • Reconstruction loss: 0.0007 (near-pixel-perfect reconstruction)
  • VQ commitment loss: 0.0002 (encoder outputs sit close to their assigned codebook entries)

The VQ-VAE produces near-perfect reconstructions (in the comparison image, the top row is the original and the bottom row is the compressed-then-decompressed result). The model successfully learned a visual vocabulary for Nadja's images.

VQ-VAE ARCHITECTURE

  ENCODE:   Image (256x256x3) --> Encoder CNN --> Continuous features (64x64x256)
  QUANTIZE: Continuous --> Find nearest codebook entry --> Discrete tokens (64x64 grid)
  DECODE:   Discrete tokens --> Lookup embeddings --> Decoder CNN --> Reconstructed image

  Codebook:    512 entries, each 256-dimensional
  Compression: 196,608 continuous values --> 4,096 discrete token IDs (48x reduction)

  Training loss = reconstruction_loss + beta * commitment_loss
  Final: 0.0007 + 0.0002 = 0.0009

  KEY: No adversarial training. No mode collapse. Stable convergence.
       The encoder/decoder are DETERMINISTIC once trained.
       Generation is deferred to a separate model (Phase VI).

VQ-VAE succeeds because it decomposes representation learning from generation. The encoder/decoder learn a bidirectional mapping between pixels and discrete tokens using only reconstruction loss (no adversarial component). This is a well-posed optimization problem with a clear global minimum. The generation problem is deferred: once you have a good tokenizer, you "only" need to model token distributions. Unfortunately, as Phase VI showed, this "only" hides a new set of problems.

Phase VI: Token Transformer / MaskGIT (EXP-039 to EXP-044) [FAILED (~80 HOURS)]

Period: March 6-24, 2026 | Duration: ~80 hours | Experiments: 6

With a working VQ-VAE tokenizer (Phase V), the plan was elegant: train a transformer to generate token sequences, then decode them back to images with the VQ-VAE decoder. This is the architecture behind DALL-E 1, Parti, and MaskGIT. The two-stage approach (tokenize, then model tokens) is the dominant paradigm in modern image generation.

Token Transformer (EXP-039, EXP-040)

EXP-039 Token Transformer V1: An autoregressive transformer (98 checkpoints trained) that predicts the next token in a raster-scan sequence (left-to-right, top-to-bottom). Established a stable baseline. The model learned token statistics, but generated images showed raster bias: horizontal stripe patterns from the left-to-right generation order. Every row starts independently.

EXP-040 Token Transformer V2: Attempted to fix raster bias with 2D positional encoding and modified attention. The bias remained, now expressed as a grid pattern. The fundamental issue: autoregressive generation imposes an ordering that images do not have. Pixels (and tokens) have spatial relationships, not sequential ones.

MaskGIT (EXP-041 to EXP-044)

MaskGIT solves the ordering problem by generating all tokens simultaneously: mask random tokens, predict them all at once (like BERT for images), keep the most confident predictions, re-mask the rest, and iterate. No raster scan, no ordering bias.
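The mask-predict-keep loop can be sketched as follows. This is a simplified illustration, not the project's code: it picks the highest-probability token greedily instead of sampling, uses a cosine unmasking schedule as an assumption, and takes the transformer as an arbitrary `predict` callable that returns per-position token probabilities.

```python
import numpy as np

MASK = -1  # sentinel for "not yet decided"


def maskgit_decode(predict, n_tokens, steps=8):
    """Iteratively fill a token sequence, most confident positions first.

    predict(tokens) must return an array of shape (n_tokens, vocab_size)
    holding token probabilities for every position.
    """
    tokens = np.full(n_tokens, MASK)
    for step in range(steps):
        masked = tokens == MASK
        if not masked.any():
            break
        probs = predict(tokens)
        best = probs.argmax(axis=1)            # greedy choice per position
        conf = probs.max(axis=1)
        conf[~masked] = -np.inf                # decided positions stay fixed
        # Cosine schedule: fraction of tokens that should remain masked.
        ratio = np.cos(np.pi / 2 * (step + 1) / steps)
        keep_masked = int(np.floor(ratio * n_tokens))
        n_unmask = max(1, int(masked.sum()) - keep_masked)
        chosen = np.argsort(-conf)[:n_unmask]  # most confident masked positions
        tokens[chosen] = best[chosen]
    return tokens


def biased_predict(tokens, vocab=8):
    """Toy predictor that always favours token 3, whatever the context."""
    probs = np.full((tokens.size, vocab), 0.05)
    probs[:, 3] = 0.65
    return probs


print(maskgit_decode(biased_predict, n_tokens=16, steps=4))
```

The toy predictor ignores context and always prefers one code, so decoding from a fully masked grid returns that code everywhere.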

Four MaskGIT variants were tested over ~40 hours:

ID | Variant | Tokens | Loss | Result
EXP-041 | Standard MaskGIT (3.7M params) | 512 | 3.4 | Scattered white dots on white
EXP-042 | MaskGIT 128x128 | 512 | 3.4 | Same dots. But inpainting works!
EXP-043 | CTF (Clustered Token Flow, 32 clusters) | 32 | 0.85 | Mode collapse to gray blocks
EXP-044 | CTF balanced sampling | 32 | 1.06 | Light/dark stripe patterns

Root cause analysis

A comprehensive post-mortem (ANALYS_MASKGIT.md, March 24, 2026) identified four root causes:

  1. Catastrophically imbalanced token distribution: The VQ-VAE codebook has 512 entries, but just 4 entries account for 76.5% of all tokens. These 4 codes represent background/empty space/neutral tones. When the transformer predicts "what token goes here?", the statistically correct answer is almost always one of these 4 background codes. Unconditional generation therefore fills the entire grid with background.
  2. Too many tokens (16,384 per image): At 128x128 with codebook size 512, each image becomes a 128x128 grid. Window attention (required for memory) cannot see global context. Global attention would require 268M attention entries -- OOM on 6 GB VRAM.
  3. Too little data (6,806 images): With 6,806 heterogeneous images, the transformer has insufficient examples to learn meaningful spatial token co-occurrence patterns.
  4. Unconditional generation is mathematically underdetermined: Without any conditioning signal (text prompt, class label, partial image), the model must generate "a plausible image" from nothing. With heterogeneous data, the most plausible output is the most common output: background.
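Two of the figures above are easy to reproduce: the share of all tokens held by the most frequent codes, and the size of a global attention matrix over a 128x128 token grid. The sketch below uses a synthetic token array built to mimic a 76.5% concentration; it is not the project's token data, and the function names are illustrative.

```python
import numpy as np


def top_k_share(tokens, vocab_size, k=4):
    """Fraction of all tokens covered by the k most frequent codes."""
    counts = np.bincount(tokens.ravel(), minlength=vocab_size)
    top = np.sort(counts)[::-1][:k]
    return float(top.sum() / counts.sum())


def attention_entries(grid_side):
    """Entries in one full self-attention matrix over a square token grid."""
    n = grid_side * grid_side
    return n * n


# Synthetic example: 4 dominant codes hold 765 of 1,000 tokens,
# the remaining 235 tokens are spread over 235 rare codes.
dominant = np.repeat([0, 1, 2, 3], [300, 200, 165, 100])
rare = np.arange(4, 239)
tokens = np.concatenate([dominant, rare])

print(top_k_share(tokens, vocab_size=512))
print(attention_entries(128))
```

A check like `top_k_share` on the VQ-VAE output is cheap to run before training a token model, and would flag this kind of imbalance early.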

Critical discovery: In EXP-042, MaskGIT inpainting worked. When given a real image with some tokens masked, the model filled in the masked regions coherently. This indicates that the model learned meaningful token relationships. But without any initial context (unconditional), it defaults to the majority class.

MASKGIT TOKEN DISTRIBUTION PROBLEM

Codebook usage across 6,806 training images:

  Token ID   Usage
  47         ████████████████████████████  22.1%   \
  198        ██████████████████████        18.3%   |  76.5% of ALL tokens
  12         ██████████████████            16.8%   |  = 4 background codes
  304        █████████████████             15.3%   /
  ...        ▏                              0.1%   -- 500+ rare codes
  ...        ▏                              0.05%

  When the model asks "what is the most likely token here?"
  The answer is ALWAYS one of these 4 codes.
  Unconditional generation = fill everything with background.

"End the MaskGIT track. Return to Progressive GAN."
-- ANALYS_MASKGIT.md, March 24, 2026, after 40 hours of MaskGIT experiments
"After 41 experiments and 333 hours: Architecture does NOT solve data problems. 7,000 heterogeneous images + from scratch + unconditional = mathematically underdetermined."
-- EXP-041, the central insight of the entire project
Token-based generation requires balanced token distributions. When 4 of 512 codebook entries dominate 76.5% of all tokens, the transformer's maximum-likelihood prediction is always background. A working encoder does not guarantee a working generator. Inpainting works (context provides signal) but unconditional generation fails (no signal to bootstrap from).
The two-stage paradigm (tokenize then model) works at internet scale (DALL-E: 250M images, Parti: 1B images) where token distributions are diverse and roughly balanced. At the scale of 7,000 heterogeneous images with an unbalanced codebook, the token distribution is too skewed for a transformer to learn meaningful unconditional generation. The codebook must be rebalanced, or generation must be conditioned.

Phase VII: Return to Progressive GAN (EXP-045 to EXP-046) [CONFIRMED BEST]

Period: March 24-25, 2026 | Duration: ~5 hours | Experiments: 2

After the MaskGIT failure analysis, a deliberate return to the architecture that actually worked. The question: were the Phase III results reproducible, or a lucky fluke from a stochastic process?

EXP-045 Progressive GAN (return): Re-ran the progressive GAN pipeline from scratch. Within approximately 10 minutes of training, the run confirmed that progressive growing still produces the best results of any architecture tested. The result was not a fluke. The architecture reliably generates structured forms on this dataset.

EXP-046 Progressive L5 continuation: Continued training from the Level 5 checkpoint (epoch 50) through epoch 100. Training ran from 22:10 on March 24 to 02:43 on March 25 (4.5 hours). GPU temperature peaked at 71C. VRAM stayed between 1.6-2.5 GB. Zero crashes.

The discriminator loss stabilized around -10 from epoch 65 onward, indicating the adversarial game had reached equilibrium. The 50 additional epochs produced visible improvements:

  • More defined object boundaries and contours
  • Richer surface textures (metallic, organic, glass-like)
  • Better figure-ground separation with darker/neutral backgrounds
  • Greater compositional diversity across samples
  • Some images contain recognizable sculptural elements: vessels, cast forms, assembled objects

However, diminishing returns were clear from epoch 65. The root problem (heterogeneous dataset, unconditional generation) cannot be solved with more training. The model has learned what it can learn from this data at this resolution.

The progressive GAN is not merely the best architecture tested. It is the only architecture that produces outputs matching the structural definition of sculpture: centered forms with defined edges and figure-ground separation. This is because progressive growing is the only approach that decomposes the generation problem into a learnable curriculum at this data scale. The curriculum IS the solution.
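The curriculum can be written down as a resolution schedule. The sketch below assumes that level numbers map to resolutions by doubling from 4x4, which is consistent with the log (L0-4 ending at 64x64, L5 at 128x128, L6 at 256x256). The linear fade-in of each new layer is the standard progressive-growing technique; whether this project used it is not stated, so `fade_in_alpha` and its parameters are assumptions.

```python
def resolution_for_level(level, base=4):
    """Side length at a given level: level 0 is base x base, each level doubles it."""
    return base * (2 ** level)

def fade_in_alpha(step, fade_steps):
    """Blend weight for a newly added layer: ramps linearly from 0 to 1, then stays at 1."""
    if fade_steps <= 0:
        return 1.0
    return min(1.0, step / fade_steps)

def curriculum(max_level):
    """List of (level, resolution) pairs from coarse to fine."""
    return [(level, resolution_for_level(level)) for level in range(max_level + 1)]

for level, res in curriculum(5):
    print(f"L{level}: {res}x{res}")
```

Each level only has to learn the detail that the previous level could not represent, which is what makes the problem tractable step by step.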
Phase VIII: Parallel Explorations | MIXED RESULTS

Throughout the project, several side experiments explored alternative approaches. These were not part of the main trajectory but provided additional insights and, in some cases, the most visually compelling outputs of the entire project.

Lightweight / Turbo variants

Speed-optimized progressive GANs with aggressively reduced parameter counts. The "turbo" variant produced some of the most striking outputs: biomorphic, organic forms with glass-like translucency, clear figure-ground separation on neutral gray backgrounds, and sculptural presence. These forms resemble biological specimens, cast glass objects, and assembled sculptures. The reduced parameter count may have acted as regularization, preventing the model from memorizing dataset-specific textures and instead learning structural primitives.

PixelGPT

Autoregressive pixel-by-pixel generation at 64x64. The model predicts each pixel value conditioned on all previous pixels (in raster order). Result: learns color distributions but not spatial structure. Each row begins a new color field, creating horizontal stripe patterns. A color palette analysis showed the model learned Nadja's color vocabulary (amber, teal, gray, black, white) without learning how to arrange it spatially.
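One plausible reading of the stripe patterns is the geometry of raster order itself. In the flattened sequence, the pixel directly above is a full row width back, while the element immediately preceding the start of a row is the last pixel of the previous row, on the opposite side of the image. The sketch below only illustrates that indexing; the helper names are hypothetical and it does not reproduce the PixelGPT model.

```python
def raster_index(row, col, width):
    """Position of pixel (row, col) in the flattened raster-order sequence."""
    return row * width + col

def raster_position(index, width):
    """Inverse of raster_index: sequence position back to (row, col)."""
    return divmod(index, width)

def previous_pixel(row, col, width):
    """Pixel the model sees immediately before (row, col), or None for the first pixel."""
    index = raster_index(row, col, width)
    if index == 0:
        return None
    return raster_position(index - 1, width)

WIDTH = 64  # matches the 64x64 resolution mentioned above

# At the start of row 10, the most recent context is the far right end of row 9.
print(previous_pixel(10, 0, WIDTH))  # (9, 63)

# The pixel directly above is a full row (64 steps) back in the sequence.
print(raster_index(10, 5, WIDTH) - raster_index(9, 5, WIDTH))  # 64
```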

Structured Autoencoder

Autoencoder with structural constraints (edge consistency, contour preservation). Produces smooth interpolations between training images in latent space. Useful for understanding the latent manifold but not for novel generation. The interpolations traverse a continuous space of training-image-like textures.

Flow Matching

A modern alternative to diffusion that learns direct deterministic trajectories from noise to data (instead of stochastic diffusion/denoising). At epoch 20, produces ghostly, semi-transparent compositions with layered spatial structure. The outputs have an ethereal quality distinct from any other architecture. Undertrained (only 20 epochs) but promising for future exploration.
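The idea of a direct trajectory from noise to data can be sketched in a few lines. This assumes the common straight-line formulation, in which the training target is the constant velocity `data - noise`; the source does not say which flow matching variant was used. The `oracle` function stands in for a trained network and is purely illustrative.

```python
import random

def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]

def target_velocity(noise, data):
    """Velocity along the straight path; constant in t."""
    return [d - n for n, d in zip(noise, data)]

def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t."""
    x_t = interpolate(noise, data, t)
    target = target_velocity(noise, data)
    predicted = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)

def euler_sample(predict_velocity, noise, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-size Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
data = [0.2, 0.8, 0.5]                       # toy "image" with three values
noise = [rng.gauss(0.0, 1.0) for _ in data]  # Gaussian starting point

def oracle(x_t, t):
    """Stand-in for a trained network: returns the true velocity."""
    return target_velocity(noise, data)

sample = euler_sample(oracle, noise, steps=10)
print(sample)
```

Sampling is deterministic once the starting noise is fixed: the same noise always follows the same path, in contrast to the stochastic denoising steps of a diffusion sampler.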

Progressive Isolated / Curated

Progressive GAN trained on curated subsets: isolated objects (4,133 images) and strictly curated sculptures (300-917 images). The isolated dataset produced cleaner figure-ground separation. The curated datasets produced more coherent objects but with reduced diversity. Dataset curation directly controls the trade-off between diversity and structural coherence.

Sometimes the most visually interesting outputs come from architectures that were not the "best" by any metric. The Lightweight Turbo model, with fewer parameters and faster training, produced outputs with more sculptural presence than the larger models. Constraints (fewer parameters, less capacity) can force a model to learn structural primitives instead of memorizing surface textures.
The Seven Decisions: principles that guided 400 hours of research

Throughout the project, seven formal decisions were documented. Each was written as a response to a specific crisis or insight, and each remained formally active for the remainder of the project, although DEC-004 was superseded in practice. Together they define the philosophical framework within which all experiments operated.

DEC-001 | February 15, 2026 | ACTIVE
No pretrained bias. Everything from scratch.
Forbidden: Fine-tuning Stable Diffusion. LoRA/DreamBooth. Pretrained encoders (CLIP, DINO, Inception). Transfer learning from ImageNet. Any model trained on external datasets. Even FID evaluation (requires pretrained Inception) was rejected. The model must learn exclusively from Nadja's 7,843 images. This constraint increases difficulty enormously but ensures that every generated texture, color, and form comes from the artist's own visual world.
DEC-002 | February 15, 2026 | ACTIVE | NON-NEGOTIABLE
Never accept "mood generator" as an answer.
The most absolute rule. Formalized after the rescue session (February 7) where the honest assessment was that the model produces textures, not sculptures. The decision: it does not matter how many experiments fail. It does not matter how "impossible" it seems. Continue research until it works. Forbidden suggestions: "Accept the limitation", "Use as texture generator", "Lower expectations", "It's impossible." This decision was tested repeatedly (after 17 GAN failures, after 4 MaskGIT failures) and held each time.
DEC-003 | February 22, 2026 | ACTIVE
Never repeat an experiment.
Before any new experiment: read the full experiment index, verify the approach has not been tried, and identify what is fundamentally new. "Same architecture with lower learning rate" (already tested 5+ times) is not acceptable. "Same architecture with more epochs" provides no new insights. Time is the most valuable resource. Every experiment must test a genuinely novel hypothesis.
DEC-004 | February 22, 2026 | ACTIVE (superseded in practice by EXP-045)
Token Transformer is the main research track.
After testing Simple AE, VAE, GAN, and Diffusion, the Token Transformer was identified as the most promising direction: Tokenizer (VQ-VAE) -> Transformer (predict tokens) -> Detokenizer (VQ-VAE decoder). This decision was formally maintained but in practice superseded after the MaskGIT failures (Phase VI) led to the return to Progressive GAN (Phase VII).
DEC-005 | March 6, 2026 | ACTIVE
Two-phase training strategy.
Phase 1 (epochs 0-50): Frozen encoder, only decoder trains, lower learning rate, focus on reconstruction quality. Phase 2 (epochs 50+): Unfrozen encoder, both train simultaneously, gradually increasing complexity, focus on generation. This prevents the encoder from being destabilized by early noisy generator outputs.
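The two-phase schedule can be expressed as a small lookup that a training loop consults each epoch. This is a sketch under stated assumptions: the switch at epoch 50 comes from the text, but treating epoch 50 itself as the first Phase 2 epoch is an interpretation, and the learning-rate values are placeholders rather than the project's settings.

```python
PHASE_SWITCH_EPOCH = 50  # from DEC-005

def training_phase(epoch, switch_epoch=PHASE_SWITCH_EPOCH):
    """Return 1 while the encoder is frozen, 2 once it is unfrozen.

    Assumption: epoch 50 itself is the first Phase 2 epoch.
    """
    return 1 if epoch < switch_epoch else 2

def trainable_modules(epoch, switch_epoch=PHASE_SWITCH_EPOCH):
    """Which modules receive gradient updates at this epoch."""
    return {"encoder": training_phase(epoch, switch_epoch) == 2, "decoder": True}

def learning_rate(epoch, phase1_lr=1e-4, phase2_lr=2e-4,
                  switch_epoch=PHASE_SWITCH_EPOCH):
    """Lower rate in Phase 1, as the decision describes; both values are hypothetical."""
    return phase1_lr if training_phase(epoch, switch_epoch) == 1 else phase2_lr

for epoch in (0, 49, 50, 75):
    print(epoch, training_phase(epoch), trainable_modules(epoch), learning_rate(epoch))
```

In a PyTorch training loop, the flags would typically be applied by calling `requires_grad_(flag)` on each parameter of the corresponding module at the start of the epoch.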
DEC-006 | March 15, 2026 | ACTIVE | CRITICAL REDEFINITION
Sculpture is defined structurally, not semantically.
The previous approach tried to classify materials (glass, metal, ceramic) -- a semantic definition. Nadja's own definition is structural: a sculpture is (1) a centered object, (2) with defined edges, (3) with figure-ground separation. This redefinition meant that EXP-025 (Progressive GAN L5) was retroactively recognized as a SUCCESS: it produces centered forms with edges and figure-ground separation, regardless of material classification. Dataset curation should focus on structure (centered objects), not semantics (material type).
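A structural definition lends itself to a simple automated check for dataset curation. The heuristic below is hypothetical and is not the project's curation code: it treats the image border as background, marks pixels that differ strongly from it as figure, and accepts an image when the figure is roughly centered and does not touch the border. It covers centering and figure-ground separation only; edge definition is not measured. The thresholds are invented for illustration.

```python
def border_pixels(grid):
    """Values on the outermost rows and columns of a 2D grid."""
    h, w = len(grid), len(grid[0])
    return [grid[r][c] for r in range(h) for c in range(w)
            if r in (0, h - 1) or c in (0, w - 1)]

def figure_mask(img, threshold):
    """True where a pixel differs from the mean border value by more than threshold."""
    border = border_pixels(img)
    background = sum(border) / len(border)
    return [[abs(v - background) > threshold for v in row] for row in img]

def center_offset(mask):
    """Normalized distance of the figure centroid from the image center, or None if empty."""
    h, w = len(mask), len(mask[0])
    points = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    if not points:
        return None
    cy = sum(r for r, _ in points) / len(points)
    cx = sum(c for _, c in points) / len(points)
    return max(abs(cy - (h - 1) / 2) / h, abs(cx - (w - 1) / 2) / w)

def looks_structural(img, threshold=0.3, max_offset=0.15):
    """Heuristic: a figure exists, is roughly centered, and stays clear of the border."""
    mask = figure_mask(img, threshold)
    offset = center_offset(mask)
    touches_border = any(border_pixels(mask))
    return offset is not None and offset <= max_offset and not touches_border

def square_image(size, top, left, side, fg=0.9, bg=0.1):
    """Grayscale test image: a bright square on a dark background."""
    return [[fg if top <= r < top + side and left <= c < left + side else bg
             for c in range(size)] for r in range(size)]

centered = square_image(8, top=2, left=2, side=4)
corner = square_image(8, top=0, left=0, side=4)
flat = square_image(8, top=0, left=0, side=0)

print(looks_structural(centered), looks_structural(corner), looks_structural(flat))
```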
DEC-007 | March 15, 2026 | ACTIVE | THE CENTRAL INSIGHT
Architecture does NOT solve data problems.
After 41 experiments across all major architecture families: 7,000 heterogeneous images + from-scratch training + unconditional generation = mathematically underdetermined. Switching architectures cannot fix this. The prescription: (1) improve dataset quality through curation, (2) add structural conditioning (figure-ground signals), (3) accept resolution limits (128x128 works, 256x256 does not with this data). Stop treating architecture changes as the solution to data problems.
Complete Experiment Log: 46 experiments, 400 hours
ID | Date | Experiment | Architecture | Time | Status | Key finding
001-017 | Nov '25 - Feb '26 | StyleGAN2 variants | StyleGAN2 | ~150h | FAILED | All mode-collapsed. 30M params too many.
018 | 2026-02-06 | Pixel Diffusion | U-Net Diffusion | ~28h | FAILED | MSE averages to dataset mean
019 | 2026-02-07 | CLIP Diffusion | CLIP + Diffusion | ~1h | ABANDONED | Violates from-scratch principle
020 | 2026-02-08 | NadjaEMBED | Custom Embed + Diffusion | ~20h | FAILED | 14% recall, still blurry
021 | 2026-02-08 | SlotDiffusion | Slot Attention + Diffusion | ~15h | FAILED | Designed for synthetic data
022 | 2026-02-09 | Edge-First | Edge GAN + Colorize | ~16h | FAILED | Edge maps too sparse for GAN
023 | 2026-02-14 | Progressive L0-4 | Progressive GAN | ~3h | SUCCESS | First forms at 64x64!
024 | 2026-02-15 | Progressive + ADA | Progressive + ADA | ~2h | ABANDONED | ADA augmentation not needed
025 | 2026-02-15 | Progressive L5 | Progressive GAN 128x128 | ~4h | BEST RESULT | BREAKTHROUGH: sculpture forms!
026 | 2026-02-16 | Progressive L6 | Progressive GAN 256x256 | ~6h | PARTIAL | NaN + abstract blobs. 256x256 too hard.
027-030 | Feb 18-21 | Conditional GAN | cGAN | ~20h | FAILED | Categories melt together
031 | 2026-02-21 | Conditional scratch | Binary cGAN | ~5h | PARTIAL | Responds to signal but weak effect
032-035 | Feb 21-22 | AC-GAN variants | AC-GAN | ~10h | FAILED | Discriminator dominance
036-038 | Feb 22-24 | VQ-VAE | VQ-VAE | ~10h | SUCCESS | Loss 0.0009. Working tokenizer.
039 | 2026-03-06 | Token Transformer V1 | Autoregressive Transformer | ~30h | SUCCESS | Stable baseline (but raster bias)
040 | 2026-03-09 | Token Transformer V2 | 2D Transformer | ~10h | PARTIAL | Grid patterns (raster bias persists)
041 | Mar 11-15 | MaskGIT | MaskGIT (3.7M) | ~15h | FAILED | Scattered dots (token imbalance)
042 | 2026-03-24 | MaskGIT 128x128 | MaskGIT | ~8h | FAILED | Inpainting OK, unconditional = dots
043 | 2026-03-24 | CTF MaskGIT | MaskGIT + 32 clusters | ~10h | FAILED | Mode collapse to gray
044 | 2026-03-24 | CTF balanced | MaskGIT + balanced | ~7h | PARTIAL | Light/dark patterns
045 | 2026-03-24 | Progressive (return) | Progressive GAN | ~10min | CONFIRMED | Still the best after 400 hours
046 | Mar 24-25 | Progressive L5 cont. | Progressive GAN | ~4.5h | SUCCESS | Epoch 100. Diminishing returns from 65.

Architecture families tested

ARCHITECTURE FAMILY TREE

ADVERSARIAL (GAN-based):
+-- StyleGAN2 (17 variants) --> ALL FAILED (mode collapse)
+-- Progressive GAN --> BREAKTHROUGH (curriculum learning)
+-- Conditional GAN (4 variants) --> FAILED (categories melt)
+-- AC-GAN (4 variants) --> FAILED (D dominance)

NON-ADVERSARIAL:
+-- Pixel Diffusion (U-Net) --> FAILED (MSE = average)
+-- Latent Diffusion (VAE + U-Net) --> FAILED (textures only)
+-- Edge Diffusion (Sobel + U-Net) --> FAILED (sparse data)
+-- SlotDiffusion (Slot Attention + Diff) --> FAILED (synthetic data assumption)
+-- Flow Matching --> PROMISING (undertrained)

TWO-STAGE (Tokenize + Generate):
+-- VQ-VAE encoder/decoder --> SUCCESS (tokenizer works)
+-- Token Transformer (autoregressive) --> PARTIAL (raster bias)
+-- MaskGIT (masked prediction) --> FAILED (token imbalance)

AUTOENCODER (reconstruction-based):
+-- Simple AE --> FAILED (blurry)
+-- Structured AE --> PARTIAL (interpolation only)
+-- VAE v3 (Vaeana) --> SUCCESS (but unused)

HYBRID:
+-- NadjaEMBED + Diffusion --> FAILED (14% recall)
+-- PixelGPT (autoregressive pixels) --> FAILED (no spatial structure)
+-- Lightweight / Turbo variants --> PARTIAL (visually striking)
46 experiments | ~400 hours | 7,843 images | 1 breakthrough | GTX 1660 Super (6 GB) | From scratch