I've been building AI image generation workflows for a while now. Training LoRAs, comparing checkpoints, wiring up ComfyUI nodes.
And the whole time, I kept needing to look up the same terms. What's the difference between CFG and Denoise? Why does everyone say "safetensors"? What even is a VAE?
So I wrote the dictionary I wish existed when I started.
This isn't a tutorial. It's a reference. Bookmark it. Come back when you hit a term you don't recognize. Everything is organized in layers — from "I just heard about Stable Diffusion" to "I'm training my own models."
Layer 1: The Absolute Basics
Stable Diffusion (SD)
An open-source AI image generator. You type a description of what you want, and it creates an image.
Similar to MidJourney or DALL-E, but with one massive difference: you run it on your own machine. No subscription. No content filters. No one else's rules.
Prompt
The text instruction you give the AI. The more detailed, the better.
- Prompt: "a Japanese woman in a white dress standing under cherry blossoms, golden hour lighting"
- Negative prompt: what you don't want — "blurry, deformed hands, low quality"
Think of it as giving directions to a painter. Vague directions get vague results.
ComfyUI
The interface you use to operate Stable Diffusion. It's a node-based visual workflow — you connect function blocks together like a flowchart.
The other common interface is WebUI (A1111), which is simpler but less flexible.
ComfyUI is where the SD community is heading. Most new tools and workflows are built for it.
Layer 2: The Model Zoo
This is where it gets interesting. Open CivitAI and you'll see dozens of model types. Here's what each one actually does.
Checkpoint (Base Model)
The single most important thing in Stable Diffusion.
A checkpoint is the AI's trained "brain." It determines the fundamental style and capability of every image you generate.
- Files are big — typically 6–12 GB
- Examples: Z-Image Turbo (ZIT), Flux, SDXL
- Different checkpoints = completely different art styles
Analogy: A checkpoint is a fully trained painter. One painter does photorealism. Another does anime. You pick the painter first, then give them instructions.
How Checkpoints Are Made
| Method | What it means |
|---|---|
| Trained | Built from scratch or fine-tuned from a base model. Requires serious GPU time |
| Merged | Two or more checkpoints mathematically blended together. No GPU training needed |
Merged checkpoints are surprisingly effective. For example, Moody Real Mix blends multiple ZIT models to get a more photorealistic look — no training required.
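To make "mathematically blended" concrete, here is a minimal sketch of the simplest merge method, a weighted average of matching weights. The function name, the toy weights, and the 0.25 ratio are all made up for illustration. Real merge tools operate on tensors and offer fancier methods, and this is not necessarily how any particular merge such as Moody Real Mix was built.

```python
def merge_checkpoints(weights_a, weights_b, ratio=0.5):
    """Blend two checkpoints key by key: (1 - ratio) * A + ratio * B."""
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share the same architecture (same keys)")
    return {
        name: [(1 - ratio) * a + ratio * b
               for a, b in zip(weights_a[name], weights_b[name])]
        for name in weights_a
    }

# Toy "checkpoints": one layer with three weights each (hypothetical values).
painter_a = {"layer.weight": [0.0, 1.0, 2.0]}
painter_b = {"layer.weight": [1.0, 1.0, 0.0]}

merged = merge_checkpoints(painter_a, painter_b, ratio=0.25)
print(merged)
```

No gradient steps happen anywhere in this, which is why merging needs no GPU training.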
LoRA (Low-Rank Adaptation)
A small plugin that teaches an existing checkpoint something new.
Want the AI to generate a consistent character? Train a LoRA on that character's face. Want a specific art style? There's probably a LoRA for that.
- Files are small — tens to hundreds of MB
- Swap them in and out freely
- Must match the base checkpoint (a ZIT LoRA won't work on Flux)
Analogy: If the checkpoint is the painter, the LoRA is a stack of reference photos you hand them. "This is what this person looks like. Now paint them in different scenes."
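Why are LoRA files so small? Instead of storing a full replacement for a weight matrix, a LoRA stores two thin matrices whose product is added on top of the original. A rough parameter count shows the saving. The layer size (4096×4096) and rank (16) below are illustrative assumptions, not values from any specific model.

```python
def full_matrix_params(d_out, d_in):
    """Parameters in one full weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Parameters in the two low-rank factors: B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

full = full_matrix_params(4096, 4096)
small = lora_params(4096, 4096, rank=16)
print(full, small, full // small)
```

For this hypothetical layer the LoRA stores 128 times fewer numbers than the matrix it modifies.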
LyCORIS / LoKR / DoRA
Variants of LoRA that use different math under the hood.
| Name | What it is |
|---|---|
| LoKR | Uses Kronecker products. Sometimes better for specific use cases |
| DoRA | Newer variant. Potentially higher quality, but slower |
| LyCORIS | A framework that includes LoHa, LoKR, and other variants |
Practical advice: Standard LoRA works great for most people. Don't worry about these unless you're deep into training experiments.
VAE (Variational Autoencoder)
The AI doesn't paint pixels directly. It works in a compressed mathematical space, then the VAE translates that math back into a visible image.
Different VAEs affect color and brightness. Most checkpoints come with a built-in VAE, but you can swap them.
Analogy: The checkpoint paints the picture. The VAE develops the film.
Embedding (Textual Inversion)
A tiny file that teaches the AI a specific concept.
Smaller and simpler than LoRA, but also more limited. The most common use is negative embeddings — like a "bad-hands" embedding that tells the AI what ugly hands look like so it avoids them.
Analogy: If LoRA is a photo album, an embedding is a sticky note.
Hypernetwork
An older method for fine-tuning models. Mostly replaced by LoRA. You'll see it mentioned in old tutorials, but you can safely ignore it.
Aesthetic Gradient
Another older technique for teaching the AI what "good-looking" means. Rarely used in 2026.
ControlNet
Gives you precise control over composition.
Draw a stick figure pose → the AI generates a realistic person in that exact pose. Feed it an edge map → the AI fills it in with detail.
| ControlNet Type | What it controls |
|---|---|
| OpenPose | Body pose and hand position |
| Canny | Edge lines and outlines |
| Depth | 3D depth of the scene |
| Tile | Preserves existing detail while upscaling |
Analogy: The checkpoint is the painter. ControlNet is you physically posing the model before the painter starts.
Upscaler
Takes a small AI-generated image and enlarges it with added detail.
AI typically generates at 1024×1024. Upscalers can push that to 2048 or beyond — and they actually add detail rather than just stretching pixels.
Motion Models
For video generation. AnimateDiff turns static images into short animations. WAN 2.1 is a dedicated video generation model.
This space is evolving fast.
Wildcards
Random substitution in prompts.
{red|blue|green} dress → randomly picks a color each time. Great for batch-generating variations without manually changing the prompt.
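The substitution itself is simple enough to sketch in a few lines. This is a minimal illustration of the `{a|b|c}` syntax shown above, not the code of any particular wildcard extension; `expand_wildcards` is a hypothetical name, and nested braces are not handled.

```python
import random
import re

# Matches one {option|option|...} group with no nested braces.
_CHOICE = re.compile(r"\{([^{}]*)\}")

def expand_wildcards(prompt, rng=None):
    """Replace every {a|b|c} group with one randomly chosen option."""
    if rng is None:
        rng = random.Random()
    return _CHOICE.sub(lambda m: rng.choice(m.group(1).split("|")), prompt)

for _ in range(3):
    print(expand_wildcards("a woman in a {red|blue|green} dress"))
```

Passing a seeded `random.Random(123)` as `rng` makes a batch of variations reproducible.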
Poses
Pre-made body position data for ControlNet. Think of them as pose presets.
Detection Models
Models that find specific things in images. Example: face_yolov8m.pt detects face positions.
Used for inpainting (redrawing just the face) or face swap workflows.
Workflows
ComfyUI's saved configurations (JSON files). Someone shares a workflow, you import it, and their entire node setup appears in your editor.
This is one of ComfyUI's killer features — you can share and reproduce exact generation pipelines.
Layer 3: File Formats
Not all model files are created equal.
| Format | Extension | Safety | Status |
|---|---|---|---|
| SafeTensors | .safetensors | Safe: cannot contain executable code | Current standard. Use this |
| PickleTensor | .pt, .ckpt | Unsafe: can contain malicious Python code | Legacy. Avoid if possible |
| GGUF | .gguf | Safe | New. Smart compression from the LLM world |
| Diffusers | (folder) | Safe | HuggingFace's format. Multiple small files |
| Core ML | .mlmodelc | Safe | Apple Silicon only |
| ONNX | .onnx | Safe | Cross-platform but rare in SD |
The Important Ones
SafeTensors is the only format you should download. Period. If a model is only available as .ckpt, make sure you trust the source.
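Part of why .safetensors is considered safe is how boring the file layout is: an 8-byte length, a JSON header describing each tensor, then raw bytes. Nothing in it gets executed on load. The sketch below reads that header using only the standard library. The helper names and the tiny demo file are made up for illustration; in practice you would use the `safetensors` library rather than parsing files by hand.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # First 8 bytes: header length as an unsigned little-endian 64-bit int.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))

def write_tiny_safetensors(path):
    """Write a one-tensor demo file (two FP16 values), for illustration only."""
    header = {"demo.weight": {"dtype": "F16", "shape": [2], "data_offsets": [0, 4]}}
    header_bytes = json.dumps(header).encode("utf-8")
    data = struct.pack("<2e", 1.0, -0.5)  # two half-precision floats = 4 bytes
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(header_bytes)))
        f.write(header_bytes)
        f.write(data)
```

By contrast, loading a pickle-based .ckpt can run arbitrary Python during unpickling, which is the whole problem.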
GGUF is worth watching. It brings quantization schemes from the LLM world that keep important weights at higher precision while compressing less critical ones more aggressively. This lets lower-VRAM GPUs run bigger models with less quality loss than squeezing every weight down by the same amount.
Layer 4: Precision (What FP16 / FP8 / BF16 Mean)
Every number in a model is stored at a certain precision — how many bits represent each value. More bits = more accurate = bigger file = more VRAM.
| Precision | Bits | File Size | Quality | Who uses it |
|---|---|---|---|---|
| FP32 | 32 | Huge | Perfect | Almost nobody. Way too large |
| FP16 | 16 | Half of FP32 | Excellent | The standard for inference |
| BF16 | 16 | Same as FP16 | Excellent | Better for training |
| FP8 | 8 | Half of FP16 | Very good | For VRAM-limited GPUs (e.g., 24GB cards) |
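The "File Size" column is just arithmetic: parameter count times bits per weight. Here is a back-of-the-envelope sketch. The 6-billion-parameter figure is a hypothetical example, and the estimate covers the weights only, not the extra VRAM needed while generating.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "BF16": 16, "FP8": 8}

def weights_size_gb(num_params, precision):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_WEIGHT[precision] / 8 / 1e9

# Hypothetical 6B-parameter model at each precision.
for precision in BITS_PER_WEIGHT:
    print(precision, weights_size_gb(6e9, precision), "GB")
```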
Quantization
The process of converting a model from higher precision to lower precision.
FP16 → FP8 = quantization. You save VRAM at the cost of slightly reduced quality.
Analogy: FP16 is the original photo. FP8 is a high-quality JPEG. You can tell the difference if you zoom in, but for most purposes it's fine.
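You can watch precision loss happen with nothing but the standard library. Python has no built-in FP8 type, so this sketch squeezes a regular 64-bit Python float into FP16 and back instead. The principle is the same one step further up the ladder: fewer bits, slightly different number.

```python
import struct

def to_fp16_and_back(x):
    """Store x as a 16-bit float, then read it back as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

original = 0.1
stored = to_fp16_and_back(original)
error = abs(stored - original)

print(original, stored, error)
```

The stored value is close to 0.1 but not equal to it. Multiply that tiny error across billions of weights and you get the "slightly reduced quality" described above.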
Layer 5: Generation Settings
These are the numbers you tweak every time you generate an image. Understanding them is the difference between random results and intentional ones.
CFG (Classifier-Free Guidance)
Controls how strictly the AI follows your prompt.
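The AI makes two predictions at every step: one guided by your prompt, and one unguided (or guided by the negative prompt, if you gave one). CFG sets how hard the result is pushed away from the unguided prediction and toward the prompted one.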
| CFG Value | Behavior |
|---|---|
| 1 | No extra push. The prompt is applied as-is, and the negative prompt has no effect |
| 5–7 | Balanced. Follows your direction but has creative freedom |
| 10–12 | Strict. Closely matches your prompt |
| 15+ | Over-constrained. Images become oversaturated and distorted |
Why do some models use CFG 1? Distilled models (like ZIT Turbo) already "know" what to paint. They don't need guidance — the knowledge is baked in. Regular models need CFG 5–12 to be pushed in the right direction.
Analogy: CFG is the volume of your voice when talking to the painter. Whisper (low CFG) = "paint whatever you feel." Shout (high CFG) = "exactly what I said, nothing else."
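Under the hood, the standard classifier-free guidance formula is one line: start from the unguided prediction and move toward the prompted one, scaled by CFG. The sketch below uses short lists of made-up numbers in place of the real noise predictions, which are large tensors.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

unguided = [0.0, 1.0]   # prediction without the prompt (toy values)
prompted = [1.0, 3.0]   # prediction with the prompt (toy values)

print(apply_cfg(unguided, prompted, 1.0))  # same as the prompted prediction
print(apply_cfg(unguided, prompted, 7.0))  # pushed well past it
```

At scale 1 the unguided term cancels out entirely. At high scales the result overshoots the prompted prediction, which is where the oversaturated, distorted look comes from.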
Steps
The AI doesn't generate an image in one shot. It starts with pure noise and removes a little noise each step.
More steps = more refinement. But there's a ceiling — after a certain point, more steps don't help.
| Model type | Typical steps |
|---|---|
| ZIT Turbo (distilled) | 8 steps |
| Regular models | 20–50 steps |
| Flux Klein | ~28 steps |
Sampler
The algorithm that decides how to remove noise at each step.
| Sampler | Personality |
|---|---|
| Euler | Simple, fast, reliable. Great starting point |
| Euler a | Euler + randomness. More variety between generations |
| DPM++ 2M | Looks back at previous steps to adjust. Stable and high quality |
| DPM++ 2M Karras | DPM++ with Karras scheduling. Better detail. Popular with SDXL |
| DPM++ SDE | Adds random perturbation. More artistic feel |
| res_multistep | Designed for distilled/turbo models. Multiple micro-adjustments per step |
Don't overthink it. DPM++ 2M Karras is a safe default for most models.
Scheduler
If the sampler decides how to walk, the scheduler decides how far to walk each step.
You need to reduce noise from 100% to 0% over N steps. How do you distribute that?
| Scheduler | Strategy |
|---|---|
| Simple/Linear | Equal steps: 100 → 97 → 94 → 91... Even and predictable |
| Karras | Big steps early, tiny steps late. Rough in fast, then refine detail |
| Beta | Like Karras but more aggressive. Biggest steps in the middle |
| Exponential | Exponential decay |
Why does ZIT use Simple? With only 8 steps, there's no room for fancy scheduling. Just split it evenly and go.
Karras is the default for most non-distilled models — the "sketch first, refine later" approach makes sense when you have 30+ steps to work with.
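Here is a minimal sketch of the difference between an evenly spaced schedule and the Karras one. The Karras formula (with its usual rho = 7) comes from the paper the scheduler is named after. The noise range of 14.6 down to 0.03 and the 10-step count are illustrative assumptions, and real UIs may differ in details such as appending a final zero.

```python
def linear_sigmas(n, sigma_max, sigma_min):
    """Evenly spaced noise levels from sigma_max down to sigma_min."""
    return [sigma_max + i / (n - 1) * (sigma_min - sigma_max) for i in range(n)]

def karras_sigmas(n, sigma_max, sigma_min, rho=7.0):
    """Karras schedule: even spacing in sigma**(1/rho), then raised back to rho."""
    hi = sigma_max ** (1 / rho)
    lo = sigma_min ** (1 / rho)
    return [(hi + i / (n - 1) * (lo - hi)) ** rho for i in range(n)]

linear = linear_sigmas(10, 14.6, 0.03)
karras = karras_sigmas(10, 14.6, 0.03)

print([round(s, 2) for s in linear])
print([round(s, 2) for s in karras])
```

Both lists start and end at the same noise levels. The Karras list drops fast at the start and crawls at the end, which is the "rough in fast, then refine detail" behavior from the table.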
Seed
A random number that determines the starting noise pattern.
- Same seed + same settings = identical image (on the same hardware and software setup)
- Different seed = completely different image
Important caveat: Seeds don't guarantee consistency across different prompts. Change the prompt and the face changes — even with the same seed.
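The "same seed, same noise" behavior is ordinary seeded randomness. A toy sketch using Python's own random module (real generators produce noise tensors on the GPU, but the idea is identical):

```python
import random

def starting_noise(seed, size=4):
    """Generate a small, reproducible list of Gaussian noise values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]

print(starting_noise(42) == starting_noise(42))  # same seed, same noise
print(starting_noise(42) == starting_noise(43))  # different seed, different noise
```

The seed only fixes the starting noise. The prompt still steers every denoising step after that, which is why changing the prompt changes the face.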
Denoise Strength
Only used in img2img and inpainting (when you already have an image).
Controls how much of the original image to keep vs. how much to regenerate.
| Denoise | What happens |
|---|---|
| 0.0 | No change at all. Pointless |
| 0.2–0.3 | Minor touch-ups. Fix lighting, small details |
| 0.4–0.5 | Moderate changes. Face swaps, hair color changes |
| 0.6–0.7 | Major changes. Only the rough composition survives |
| 0.8–1.0 | Basically a new image. Original is just a loose reference |
Analogy: Denoise is the size of the eraser you hand the painter. Small eraser = fix a detail. Big eraser = wipe the canvas and start over.
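One way to picture what the number does: the original image gets partly re-noised, and only the tail end of the step schedule is run on it. The sketch below follows one common convention, where denoise simply scales the number of steps that actually run. That convention is an assumption here, since different UIs and pipelines implement the details differently.

```python
def steps_actually_run(total_steps, denoise):
    """How many of the scheduled steps run on the input image (one convention)."""
    if not 0.0 <= denoise <= 1.0:
        raise ValueError("denoise must be between 0.0 and 1.0")
    return min(int(total_steps * denoise), total_steps)

print(steps_actually_run(20, 0.5))  # half the schedule: moderate changes
print(steps_actually_run(20, 1.0))  # the full schedule: basically a new image
```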
CFG vs. Denoise — The Most Confusing Pair
These two get mixed up constantly. Here's the difference:
| | CFG | Denoise |
|---|---|---|
| Controls | How much the prompt matters | How much of the original image to keep |
| Used in | All generation | Only img2img / inpainting |
| High value = | Follows prompt strictly | Changes more of the image |
| Low value = | AI has creative freedom | Preserves the original |
Resolution
The pixel dimensions of your output. Common sizes:
- 1024×1024 — Square, general purpose
- 896×1152 — Portrait orientation, good for people
- 1216×832 — Landscape
Every model has an optimal resolution range. Going too far outside it causes problems (weird proportions, repeated patterns).
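One thing the three sizes above have in common: they all spend roughly the same number of pixels as 1024×1024, just in a different shape. A quick check (the helper name is made up):

```python
BASE_PIXELS = 1024 * 1024

def pixel_budget_ratio(width, height):
    """Total pixels relative to a 1024x1024 image."""
    return width * height / BASE_PIXELS

sizes = [(1024, 1024), (896, 1152), (1216, 832)]
ratios = {f"{w}x{h}": round(pixel_budget_ratio(w, h), 3) for w, h in sizes}
print(ratios)
```

When picking a custom aspect ratio, staying near the pixel budget a model is comfortable with is a reasonable starting point.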
Layer 6: Distilled vs. Non-Distilled Models
This is the concept that unlocks why some models are 10× faster than others.
What Is Distillation?
A regular model needs 30–50 steps to generate an image. Each step is a full computation cycle. That's slow.
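Distillation trains a "student" model to replicate a "teacher" model's results in far fewer steps. With this kind of step distillation the student is often the same size as the teacher. The saving is in steps, not in file size.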
It's like studying for an exam. The teacher solves problems in 50 careful steps. The student watches the teacher, learns the shortcuts, and arrives at the same answer in 8 steps.
The Comparison
| | Non-Distilled (Original) | Distilled (Turbo/Schnell/Lightning) |
|---|---|---|
| Steps | 20–50 | 4–8 |
| Speed | Slow | 3–10× faster |
| Quality | Highest (theoretically) | Very close. Sometimes slightly lower |
| CFG | Needs 5–12 | Usually just 1 (guidance baked in) |
| Flexibility | High — lots of knobs to turn | Low — use the recommended settings |
| LoRA training | Straightforward | Needs a Training Adapter (de-distillation) |
Common Pairs
| Teacher (Original) | Student (Distilled) |
|---|---|
| Z-Image Base (ZIB) | Z-Image Turbo (ZIT) |
| Flux.1 Dev | Flux.1 Schnell |
| SDXL Base | SDXL Turbo / SDXL Lightning |
Why Is LoRA Training Harder on Distilled Models?
Distillation compresses the model. Training a LoRA directly on a compressed model breaks its few-step inference ability.
The solution: a Training Adapter that temporarily "decompresses" the model during training. Once training is done, you throw away the adapter — the LoRA works with the distilled model just fine.
Which Should You Choose?
- Need speed? → Distilled. A few seconds per image. Great for iteration and batch work
- Need maximum quality? → Non-distilled. Slower, but potentially better
- Best approach? → Use both. Distilled for quick drafts and exploration. Non-distilled for final polished output
Layer 7: Base Model Families
These are the "platforms" of the SD world. Like choosing between iOS and Android — each has its own ecosystem.
The Major Players (Early 2026)
| Family | Developer | Strengths | Status |
|---|---|---|---|
| Z-Image (ZIT/ZIB) | Alibaba Tongyi | Ultra-fast (8 steps), excellent at Asian faces | Rising star |
| Flux | Black Forest Labs | Highest quality, great text rendering | Most popular right now |
| SDXL | Stability AI | Largest ecosystem, most LoRAs available | Mature but being surpassed |
| SD 1.5 | Stability AI | Oldest, tons of resources | Legacy. Still alive, barely |
Others You Might Encounter
- Hunyuan — Tencent. Big in the Chinese market
- HiDream — Newer open-source entry
- Chroma — Community fork of Flux
- CogVideoX — Video generation model
- Aura Flow — Smaller open-source model
Why So Many?
Every major AI lab is training their own base model. It's like smartphones — iPhone, Samsung, Pixel all do the same thing differently.
Critical point: LoRAs are locked to their base model family. A ZIT LoRA won't work on Flux. An SDXL LoRA won't work on ZIT. Always check compatibility.
Layer 8: Advanced Operations
txt2img (Text to Image)
The fundamental operation. Type a prompt → get an image. Where everyone starts.
img2img (Image to Image)
Feed the AI an existing image plus a prompt → it modifies the image based on your instructions. Use denoise strength to control how much changes.
Inpainting
Select a region of an image with a mask → the AI redraws only that region.
Perfect for fixing faces, changing outfits, or removing unwanted elements while keeping everything else intact.
Outpainting
Extend an image beyond its borders. A half-body portrait → outpaint into a full-body shot.
Face Swap / Head Swap
Replace one person's face with another. The current best approach on many models: inpainting + LoRA.
IP-Adapter
Use a reference image to guide generation — no LoRA training required. Hand it a photo and say "generate in this style" or "generate this person."
Caveat: Not supported on all base models. Works with SDXL and SD 1.5 but not ZIT (as of early 2026).
Virtual Try-On
Give the AI a person photo + a clothing photo → it outputs the person wearing that outfit.
Flux Klein has a dedicated Try-On LoRA that works surprisingly well.
Layer 9: Training
LoRA Training
Teaching the AI to recognize a new person, object, or style.
| Requirement | Details |
|---|---|
| Training images | 15–25 high-quality photos |
| Captions | A .txt file per image describing the content |
| GPU | 20–32 GB VRAM recommended |
| Time | A few hours depending on hardware |
| Tools | Ostris AI Toolkit, kohya-ss |
Trigger Word
A special keyword set during training. Include it in your prompt to activate the LoRA.
Example: You train a LoRA for a character called Hana using the trigger word hfujisawa. When you put hfujisawa in your prompt, the AI knows to use that LoRA's learned features.
Training Adapter
Required for training LoRAs on distilled models (like ZIT Turbo).
Distilled models are compressed. Training directly on them breaks their fast-inference ability. The adapter temporarily "decompresses" the model so training can proceed normally. After training, the adapter is discarded.
Captions
Every training image needs a .txt file describing what's in it.
Format: trigger_word, character description, scene description
The AI learns: "When I see this trigger word, it means this person."
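Writing the caption files by hand gets tedious past a few images. Here is a minimal sketch that writes a caption next to an image in the format above. The function name and the example description are hypothetical, and it assumes your trainer expects a .txt file with the same base name as the image.

```python
from pathlib import Path

def write_caption(image_path, trigger_word, character, scene):
    """Write 'trigger_word, character, scene' to a .txt file beside the image."""
    caption_path = Path(image_path).with_suffix(".txt")
    caption_path.write_text(f"{trigger_word}, {character}, {scene}\n", encoding="utf-8")
    return caption_path

# Example call (creates dataset/hana_01.txt if the dataset folder exists):
# write_caption("dataset/hana_01.png", "hfujisawa",
#               "a woman with short black hair", "standing in a park")
```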
Epoch vs. Steps
| Term | Meaning |
|---|---|
| Step | One weight update. At batch size 1, that means processing one training image once |
| Epoch | Processing all training images once |
Example: 24 images, 3000 steps at batch size 1 = 3000 / 24 = 125 epochs.
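The conversion is simple division, but batch size changes the answer, so it is worth a tiny helper. This sketch assumes one step processes one batch. The function name is made up.

```python
import math

def epochs_for(total_steps, num_images, batch_size=1):
    """How many passes over the dataset a given number of steps amounts to."""
    steps_per_epoch = math.ceil(num_images / batch_size)
    return total_steps / steps_per_epoch

print(epochs_for(3000, 24))                # 125.0, the example above
print(epochs_for(3000, 24, batch_size=4))  # 500.0, same steps but bigger batches
```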
Overfitting
Train too long and the AI memorizes instead of learns. Every output looks identical to the training data — correct face, but zero variety.
The fix: stop training at the right step count. Always generate test samples during training to catch this.
Layer 10: Hardware
VRAM (Video RAM)
The memory on your GPU. The single most important spec for Stable Diffusion.
- Image generation: ~8–16 GB
- LoRA training: ~20–32 GB
- Not enough VRAM = quantize or crash
GPU Comparison
| GPU | VRAM | Image Generation | LoRA Training |
|---|---|---|---|
| RTX 3060 | 12 GB | Barely works | Painful |
| RTX 3090 | 24 GB | Comfortable | Possible with quantization |
| RTX 4090 | 24 GB | Fast | Possible with quantization |
| RTX 5090 | 32 GB | Very fast | Comfortable |
OOM (Out of Memory)
What happens when your model doesn't fit in VRAM. The process crashes.
Fixes: Quantize the model, reduce resolution, close other GPU-hungry applications.
Layer 11: Platforms and Community
CivitAI — The App Store of Stable Diffusion
The largest SD community. Models, images, articles, and cloud generation all in one place.
What you'll find:
- Free model downloads — checkpoints, LoRAs, embeddings, everything
- Image galleries with full settings — see a great image? Click it and view the exact prompt, seed, model, and settings used. This is a goldmine for learning
- Cloud generation — generate images using CivitAI's GPUs (uses Buzz tokens — some free, then paid)
- Model publishing — share your own LoRAs, build a following
Why do people share models for free?
Same reason people open-source code. Community reputation, the satisfaction of building something useful, and sometimes monetization through Early Access (paid early downloads) or tips.
Etiquette: If you use someone's model and like it, leave a rating or comment. Model creators see their download counts — it's what keeps them going.
HuggingFace — GitHub for AI Models
The official repository for AI models.
CivitAI hosts community creations. HuggingFace hosts official releases.
- Flux official model → HuggingFace
- Z-Image Turbo official model → HuggingFace
- Someone's custom ZIT LoRA → CivitAI
Also hosts datasets, interactive demos (Spaces), and is where most training tools download base models from.
Analogy: HuggingFace is the factory parts warehouse. CivitAI is the custom mod shop.
ComfyUI Manager
The plugin marketplace for ComfyUI. One-click installation of custom nodes, extensions, and tools.
Other Useful Spots
| Platform | What it's for |
|---|---|
| Reddit (r/StableDiffusion, r/comfyui) | Discussion, tutorials, troubleshooting |
| GitHub | Source code for tools (ComfyUI itself lives here) |
| YouTube | Video tutorials, workflow walkthroughs |
| Discord | Real-time help from model/tool communities |
| Tensor.Art | Model sharing + cloud generation |
| LiblibAI | Chinese market equivalent of CivitAI |
Where to Start?
CivitAI for community models + inspiration. HuggingFace for official models + training tools. Reddit when you're stuck. These three cover 90% of what you'll need.
Hardware Recommendations for 2026
Alright, the part everyone actually wants to know. What should you buy?
Entry Level: RTX 3060 12GB / RTX 4060 8GB
- Recommended model: ZIT Turbo FP8 (All-in-One single file)
- Why: 8-step generation fits in 12GB with FP8 quantization
- Can do: txt2img, basic img2img
- Can't do: LoRA training (not enough VRAM), running Flux
- Alternative: Skip the GPU entirely. Use CivitAI's cloud generation
Mid-Range: RTX 3090 / RTX 4090 24GB
- Recommended models: ZIT Turbo (full version) + Flux Klein 9B (FP8)
- Why: 24GB handles most models at FP8 precision
- Can do: txt2img, img2img, inpainting, LoRA training (with quantization — be patient)
- LoRA training: Enable quantization, disable sampling during training. It works, just slower
- Precision: Stick with FP8. FP16 may OOM
High-End: RTX 5090 32GB / RTX Pro 6000
- Recommended models: ZIT Turbo BF16 + Flux Klein 9B full + community merges
- Why: 32GB means no compromises. Run everything at full precision
- Can do: Everything. Generation, training, multi-model workflows
- LoRA training: Smooth sailing. ~2 seconds per step
- Precision: Full BF16. No quantization needed
No Dedicated GPU / Laptop / Mac
Don't run locally. Use cloud services instead:
| Service | Cost | Notes |
|---|---|---|
| CivitAI On-site Generation | Free tier + paid | Easiest entry point |
| Google Colab | Free GPU tier | Good for experimenting |
| RunPod / Vast.ai | Pay per hour | Rent a real GPU when you need it |
Mac M-series can technically run SD, but it's slow. Not recommended for training.
Best Model Picks for 2026
| Use Case | Recommended Model | Why |
|---|---|---|
| Fast realistic portraits (Asian faces) | ZIT Turbo | 8-step generation, excellent with Asian features |
| Highest quality portraits | Flux 2 Klein 9B | Quality king. Best text rendering too |
| Photorealistic quick shots | Moody Real Mix (ZIT merge) | Community merge. More natural skin tones |
| LoRA training | Train on ZIT or Flux | ZIT for speed, Flux for quality |
| Virtual try-on / outfit swaps | Flux Klein Try-On LoRA | Currently the best working solution |
| Video generation | WAN 2.1 / CogVideoX | Evolving fast. Worth watching |
Getting Started: 5 Steps
- Install ComfyUI. Free, open-source, where the community lives
- Download ZIT Turbo AIO (FP8). One file. 8-step generation. Instant results
- Browse CivitAI. Look at images people have made. Click through to see their settings. This is the fastest way to learn
- Don't train a LoRA yet. Master basic generation first. Training is an advanced skill
- Join the community. r/StableDiffusion on Reddit, ComfyUI Discord. People are helpful — ask questions
The Stable Diffusion ecosystem moves fast. New models drop every week. But the fundamentals in this glossary are stable. Learn these concepts once and you'll be able to pick up any new tool or model that comes along.

