Use Case · December 28, 2024 · 10 min read

Running Stable Diffusion XL at Scale

Generate thousands of images per hour with SDXL on VectorLay. This guide covers deployment, optimization, and real-world cost analysis.

2.1s per image (1024x1024)
1,700+ images/hour/GPU
$0.0004 per image

Why SDXL on VectorLay?

Stable Diffusion XL is a heavy model. The base model alone needs ~6.5GB of VRAM, and once you add the refiner and leave headroom for optimizations, you want at least 12GB to run comfortably. A 24GB RTX 4090 clears that bar with room to spare.

VectorLay gives you access to a network of 4090s at a fraction of cloud prices. More importantly, you can scale horizontally: add more GPUs when you need them and pay only for what you use.

The Container Setup

We'll use a pre-built container that includes SDXL with several optimizations:

  • PyTorch 2.0 with torch.compile()
  • xFormers for memory-efficient attention
  • FP16 precision for faster inference
  • Model caching to avoid repeated downloads

Dockerfile

FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

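# Install Python, plus a C compiler and headers (torch.compile's Triton backend needs them at runtime)
RUN apt-get update && apt-get install -y python3 python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*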
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install diffusers transformers accelerate xformers fastapi uvicorn

# Download the FP16 SDXL weights at build time (the same variant server.py loads)
RUN python3 -c "from diffusers import StableDiffusionXLPipeline; \
    StableDiffusionXLPipeline.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0', \
    variant='fp16', use_safetensors=True)"

COPY server.py /app/server.py
WORKDIR /app

EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

server.py

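import asyncio
import base64
from concurrent.futures import ThreadPoolExecutor
from io import BytesIO

import torch
from diffusers import StableDiffusionXLPipeline
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load model on startup
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")

# Enable optimizations
pipe.enable_xformers_memory_efficient_attention()
# The first request is slow while the UNet compiles. fullgraph=True raises on
# any graph break; drop it if compilation fails with your xFormers build.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# One worker thread: GPU calls run one at a time and off the event loop,
# so /health keeps responding while an image is being generated
gpu_executor = ThreadPoolExecutor(max_workers=1)

class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = ""
    steps: int = 30
    guidance_scale: float = 7.5
    width: int = 1024
    height: int = 1024

def run_pipeline(request: GenerateRequest):
    # Blocking call; only ever invoked on the gpu_executor thread
    return pipe(
        prompt=request.prompt,
        negative_prompt=request.negative_prompt,
        num_inference_steps=request.steps,
        guidance_scale=request.guidance_scale,
        width=request.width,
        height=request.height
    ).images[0]

@app.post("/generate")
async def generate(request: GenerateRequest):
    loop = asyncio.get_running_loop()
    image = await loop.run_in_executor(gpu_executor, run_pipeline, request)

    # Encode the PIL image as base64 PNG so it fits in a JSON response
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    img_base64 = base64.b64encode(buffer.getvalue()).decode()

    return {"image": img_base64}

@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.get_device_name()}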
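The `/generate` endpoint passes `steps`, `width`, and `height` straight to the pipeline. The SDXL pipeline in diffusers rejects dimensions that are not divisible by 8, and very large values can exhaust VRAM, so it may be worth checking inputs first. A minimal sketch of such a check follows; `validate_generation_params` and all of its limits are illustrative assumptions, not values from VectorLay or diffusers.

```python
def validate_generation_params(width, height, steps,
                               min_side=256, max_side=1536, max_steps=100):
    """Return a list of human-readable problems; an empty list means OK.

    The size and step limits are illustrative defaults (assumptions).
    """
    problems = []
    for name, value in (("width", width), ("height", height)):
        if value % 8 != 0:
            problems.append(f"{name} must be divisible by 8")
        if not min_side <= value <= max_side:
            problems.append(f"{name} must be between {min_side} and {max_side}")
    if not 1 <= steps <= max_steps:
        problems.append(f"steps must be between 1 and {max_steps}")
    return problems


if __name__ == "__main__":
    print(validate_generation_params(1024, 1024, 30))
    print(validate_generation_params(1023, 4096, 0))
```

In server.py this could run at the top of the handler, raising FastAPI's `HTTPException` with the returned list as the detail whenever it is non-empty.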

Deploying on VectorLay

Push your image to a registry (Docker Hub, GHCR, or your private registry), then deploy:

import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="sdxl-prod",
    image="your-registry/sdxl-server:latest",
    gpu_type="rtx-4090",
    replicas=4,  # 4 GPUs for parallel generation
    port=8080,
    health_check_path="/health",
    env={
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512"
    }
)

cluster.wait_until_ready()
print(f"Cluster ready at: {cluster.endpoint}")

Generating Images

With your cluster running, generating images is a simple API call:

import requests
import base64

response = requests.post(
    "https://your-cluster.vectorlay.dev/generate",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "A cyberpunk cityscape at sunset, neon lights reflecting on wet streets, highly detailed, 8k",
        "negative_prompt": "blurry, low quality, distorted",
        "steps": 30,
        "guidance_scale": 7.5
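    },
    timeout=120,  # generous: the first request after a deploy can be slow
)
response.raise_for_status()  # fail loudly instead of decoding an error body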

# Decode and save the image
img_data = base64.b64decode(response.json()["image"])
with open("output.png", "wb") as f:
    f.write(img_data)

Parallel Generation

With multiple replicas, you can generate images in parallel. VectorLay automatically load balances across all healthy replicas:

import asyncio
import aiohttp

async def generate_image(session, prompt):
    async with session.post(
        "https://your-cluster.vectorlay.dev/generate",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"prompt": prompt, "steps": 30}
    ) as response:
        response.raise_for_status()  # surface HTTP errors instead of returning them as results
        return await response.json()

async def batch_generate(prompts):
    async with aiohttp.ClientSession() as session:
        tasks = [generate_image(session, p) for p in prompts]
        return await asyncio.gather(*tasks)

# Generate 100 images in parallel
prompts = [f"A beautiful landscape, variation {i}" for i in range(100)]
results = asyncio.run(batch_generate(prompts))

print(f"Generated {len(results)} images")
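Sending every request at once is fine for 100 prompts, but a much longer list can open connections faster than the replicas drain them. A minimal sketch of capping in-flight requests with `asyncio.Semaphore` follows. `bounded_gather`, its `limit` default, and the stand-in `fake_generate` are illustrative names, not part of VectorLay's API; in practice the worker would wrap `generate_image` from the example above.

```python
import asyncio


async def bounded_gather(worker, items, limit=8):
    """Run worker(item) for every item with at most `limit` calls in flight.

    Results are returned in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_generate(prompt):
    # Stand-in for a real HTTP call so the sketch runs without a cluster.
    await asyncio.sleep(0.01)
    return {"prompt": prompt}


if __name__ == "__main__":
    prompts = [f"A beautiful landscape, variation {i}" for i in range(20)]
    results = asyncio.run(bounded_gather(fake_generate, prompts, limit=4))
    print(f"Generated {len(results)} results")
```

A limit of a small multiple of the replica count keeps every GPU busy without building a long client-side backlog; the right value depends on your workload.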

Performance Benchmarks

We benchmarked SDXL on various GPU types available on VectorLay:

GPU         Time/Image   Images/Hour   Cost/Image
RTX 4090    2.1s         1,714         $0.0004
RTX 4080    3.4s         1,058         $0.0005
RTX 3090    4.2s         857           $0.0004
A100 40GB   1.8s         2,000         $0.0009

Benchmarks: SDXL base, 1024x1024, 30 steps, FP16, xFormers enabled
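The Images/Hour column follows directly from the per-image latency, and each Cost/Image figure implies an hourly GPU price. A small sketch of that arithmetic is below; the hourly rates it prints are backed out of the table, not quoted VectorLay prices, and the function names are illustrative.

```python
SECONDS_PER_HOUR = 3600


def images_per_hour(seconds_per_image):
    """Throughput of one GPU generating images back to back."""
    return SECONDS_PER_HOUR / seconds_per_image


def implied_hourly_rate(cost_per_image, seconds_per_image):
    """Hourly GPU price implied by a per-image cost (derived, not quoted)."""
    return cost_per_image * images_per_hour(seconds_per_image)


if __name__ == "__main__":
    # Latency and cost figures are the ones from the benchmark table.
    benchmarks = [
        ("RTX 4090", 2.1, 0.0004),
        ("RTX 4080", 3.4, 0.0005),
        ("RTX 3090", 4.2, 0.0004),
        ("A100 40GB", 1.8, 0.0009),
    ]
    for gpu, latency, cost in benchmarks:
        rate = implied_hourly_rate(cost, latency)
        print(f"{gpu}: {images_per_hour(latency):.1f} images/hour, "
              f"about ${rate:.2f}/hour implied")
```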

Cost Analysis

Let's compare the cost of generating 100,000 images:

Provider    GPU           Time         Cost
VectorLay   4x RTX 4090   14.6 hours   $40
AWS         4x A10G       18 hours     $288
RunPod      4x RTX 4090   14.6 hours   $76
Replicate   -             -            $400+
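The VectorLay row can be reproduced from the benchmark numbers: 100,000 images at 2.1s each, spread across four GPUs, at $0.0004 per image. A rough sketch, assuming perfect load balancing and no queueing or cold-start overhead; `estimate_job` is an illustrative helper, not a VectorLay function.

```python
def estimate_job(num_images, seconds_per_image, replicas, cost_per_image):
    """Return (wall-clock hours, total cost) for a batch job.

    Assumes every replica stays fully busy for the whole run.
    """
    gpu_seconds = num_images * seconds_per_image
    wall_clock_hours = gpu_seconds / replicas / 3600
    total_cost = num_images * cost_per_image
    return wall_clock_hours, total_cost


if __name__ == "__main__":
    hours, cost = estimate_job(100_000, 2.1, replicas=4, cost_per_image=0.0004)
    print(f"{hours:.1f} hours, ${cost:.0f}")
```

With those inputs the estimate lands on the 14.6 hours and $40 shown in the table. Real runs will take somewhat longer once warmup and network time are included.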

Optimization Tips

  • Use torch.compile(): Reduces inference time by 15-20% after warmup
  • Enable xFormers: Memory-efficient attention uses 30% less VRAM
  • Batch your requests: Group multiple prompts per request if your use case allows
  • Lower steps for drafts: 20 steps is often good enough for previews
  • Cache your models: Mount a persistent volume to avoid re-downloading
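For the batching tip, the diffusers pipeline accepts a list of prompts in a single call, so most of the client-side work is splitting prompts into groups. A sketch of that grouping follows. Note the assumptions: the `/generate` endpoint shown earlier takes one prompt per request, so a batch-accepting endpoint is hypothetical, and the batch size of 4 is arbitrary (larger batches need more VRAM).

```python
def chunked(items, size):
    """Split a list into consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    prompts = [f"A beautiful landscape, variation {i}" for i in range(10)]
    for batch in chunked(prompts, 4):
        # Each batch would become one request to a hypothetical batch
        # endpoint, which would hand the list to pipe(prompt=batch, ...).
        print(len(batch), "prompts in this request")
```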

Auto-Scaling for Variable Load

If your traffic is bursty, you can scale replicas based on queue depth:

# Check current metrics
metrics = cluster.get_metrics()

if metrics.queue_depth > 50:
    cluster.scale(replicas=cluster.replicas + 2)
elif metrics.queue_depth < 5 and cluster.replicas > 1:
    cluster.scale(replicas=max(1, cluster.replicas - 1))
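The check above runs once; in practice you would evaluate it on a schedule and cap how far it can scale. A sketch of the decision as a pure function, using the same thresholds as the snippet, is below. `desired_replicas` and the `max_replicas` cap are illustrative assumptions; the returned value would be what you pass to `cluster.scale(...)`.

```python
def desired_replicas(queue_depth, current, min_replicas=1, max_replicas=8,
                     scale_up_at=50, scale_down_at=5):
    """Return the replica count to request for the observed queue depth.

    Adds 2 replicas when the queue is deep, removes 1 when it is nearly
    empty, and otherwise leaves the count unchanged. The result is always
    kept within [min_replicas, max_replicas].
    """
    if queue_depth > scale_up_at:
        return min(max_replicas, current + 2)
    if queue_depth < scale_down_at:
        return max(min_replicas, current - 1)
    return current


if __name__ == "__main__":
    for depth, current in [(60, 4), (60, 7), (2, 4), (2, 1), (20, 4)]:
        target = desired_replicas(depth, current)
        print(f"queue={depth} current={current} -> target={target}")
```

Keeping the decision separate from the API call makes the policy easy to test, and the gap between the two thresholds stops the cluster from flapping when the queue hovers around a single value.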

Real-World Use Cases

Here's what teams are building with SDXL on VectorLay:

  • E-commerce: Product image variations and lifestyle shots
  • Gaming: Procedural texture and asset generation
  • Marketing: Ad creative generation and A/B testing
  • Stock photography: On-demand illustration generation

Next Steps

Start generating images today

Get $10 in free credits, enough to generate ~25,000 images.