Running Stable Diffusion XL at Scale

Why SDXL on VectorLay?

Stable Diffusion XL is a beast. The base model needs ~6.5GB of VRAM, and once you add the refiner and optimizations you want at least 12GB to run comfortably. A 24GB card like the RTX 4090 clears that bar with room to spare.

VectorLay gives you access to a network of 4090s at a fraction of cloud prices. More importantly, you can scale horizontally—add more GPUs when you need them, pay only for what you use.

The Container Setup

We'll use a pre-built container that includes SDXL with several optimizations:

PyTorch 2.x with torch.compile()
xFormers for memory-efficient attention
FP16 precision for faster inference
Model caching to avoid repeated downloads

Dockerfile

FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

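# Install Python, plus a C toolchain that torch.compile's Triton backend needs at runtime
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip python3-dev build-essential \
    && rm -rf /var/lib/apt/lists/*
# Install torch and xformers from the same CUDA 12.1 index so their versions match
RUN pip3 install torch torchvision xformers --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install diffusers transformers accelerate fastapi uvicorn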

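# Download the FP16 SDXL weights at build time (the same variant server.py loads)
RUN python3 -c "import torch; from diffusers import StableDiffusionXLPipeline; \
    StableDiffusionXLPipeline.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0', \
    torch_dtype=torch.float16, variant='fp16', use_safetensors=True)"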

COPY server.py /app/server.py
WORKDIR /app

EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]

server.py

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from diffusers import StableDiffusionXLPipeline
import base64
from io import BytesIO

app = FastAPI()

# Load model on startup
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")

# Enable optimizations (the first request is slow while torch.compile warms up)
pipe.enable_xformers_memory_efficient_attention()
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = ""
    steps: int = 30
    guidance_scale: float = 7.5
    width: int = 1024
    height: int = 1024

@app.post("/generate")
async def generate(request: GenerateRequest):
    # pipe() blocks the event loop, so each replica serves one request at a time
    image = pipe(
        prompt=request.prompt,
        negative_prompt=request.negative_prompt,
        num_inference_steps=request.steps,
        guidance_scale=request.guidance_scale,
        width=request.width,
        height=request.height
    ).images[0]
    
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    img_base64 = base64.b64encode(buffer.getvalue()).decode()
    
    return {"image": img_base64}

@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.get_device_name()}
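
The request model above accepts any integer for steps, width, and height, so a malformed request only fails once it reaches the pipeline. The following is a minimal sketch of a pre-check that could run before calling the pipeline. The SDXL pipeline requires dimensions divisible by 8; the upper bounds `MAX_STEPS` and `MAX_SIDE` and the function name `validate_request` are assumptions for illustration, not part of the original server.

```python
# Hypothetical limits; tune them to your GPU memory and latency budget.
MAX_STEPS = 100
MAX_SIDE = 2048


def validate_request(steps: int, width: int, height: int) -> list[str]:
    """Return a list of human-readable problems; an empty list means the request is OK."""
    problems = []
    if not 1 <= steps <= MAX_STEPS:
        problems.append(f"steps must be between 1 and {MAX_STEPS}")
    for name, value in (("width", width), ("height", height)):
        if value % 8 != 0:
            problems.append(f"{name} must be divisible by 8")
        if not 8 <= value <= MAX_SIDE:
            problems.append(f"{name} must be between 8 and {MAX_SIDE}")
    return problems


print(validate_request(30, 1024, 1024))
print(validate_request(0, 1000, 1028))
```

Inside the handler, a non-empty result could be turned into a 422 response with FastAPI's HTTPException rather than spending GPU time on a request that will fail.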

Deploying on VectorLay

Push your image to a registry (Docker Hub, GHCR, or your private registry), then deploy:

import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="sdxl-prod",
    image="your-registry/sdxl-server:latest",
    gpu_type="rtx-4090",
    replicas=4,  # 4 GPUs for parallel generation
    port=8080,
    health_check_path="/health",
    env={
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512"
    }
)

cluster.wait_until_ready()
print(f"Cluster ready at: {cluster.endpoint}")

Generating Images

With your cluster running, generating images is a simple API call:

import requests
import base64

response = requests.post(
    "https://your-cluster.vectorlay.dev/generate",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "A cyberpunk cityscape at sunset, neon lights reflecting on wet streets, highly detailed, 8k",
        "negative_prompt": "blurry, low quality, distorted",
        "steps": 30,
        "guidance_scale": 7.5
    },
    timeout=120,  # fail instead of hanging forever if a replica stalls
)
response.raise_for_status()  # surface HTTP errors before trying to decode

# Decode and save the image
img_data = base64.b64decode(response.json()["image"])
with open("output.png", "wb") as f:
    f.write(img_data)
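
If the response is not what the client expects (for example an error body with no image field), the decode step above fails with an unhelpful KeyError or writes garbage to disk. A small helper along these lines can check the payload first. The name `decode_png` is hypothetical; the only assumption about the server is the `{"image": <base64 PNG>}` shape returned by server.py.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file


def decode_png(payload: dict) -> bytes:
    """Extract and validate the image from a /generate response body."""
    if "image" not in payload:
        raise ValueError(f"response has no 'image' field: {sorted(payload)}")
    # validate=True rejects characters outside the base64 alphabet
    data = base64.b64decode(payload["image"], validate=True)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG image")
    return data


# Stand-in payload so the sketch runs without a cluster.
sample = {"image": base64.b64encode(PNG_SIGNATURE + b"pixels").decode()}
print(len(decode_png(sample)), "bytes decoded")
```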

Parallel Generation

With multiple replicas, you can generate images in parallel. VectorLay automatically load balances across all healthy replicas:

import asyncio
import aiohttp

async def generate_image(session, prompt):
    async with session.post(
        "https://your-cluster.vectorlay.dev/generate",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"prompt": prompt, "steps": 30}
    ) as response:
        response.raise_for_status()  # raise on 4xx/5xx instead of returning an error body
        return await response.json()

async def batch_generate(prompts):
    async with aiohttp.ClientSession() as session:
        tasks = [generate_image(session, p) for p in prompts]
        return await asyncio.gather(*tasks)

# Send 100 requests concurrently; the cluster spreads them across its replicas
prompts = [f"A beautiful landscape, variation {i}" for i in range(100)]
results = asyncio.run(batch_generate(prompts))

print(f"Generated {len(results)} images")
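
Sending every request at once leaves all the queuing to the load balancer, and slow requests may hit client timeouts. One common alternative is to cap the number of in-flight requests with a semaphore. The sketch below is written against any coroutine function, so it uses a stand-in `fake_generate` instead of the real HTTP call; `bounded_gather` and the limit of 4 are illustrative choices, not VectorLay recommendations.

```python
import asyncio


async def bounded_gather(worker, items, limit=8):
    """Run worker(item) for every item with at most `limit` calls in flight.

    Results come back in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_generate(prompt):
    # Stand-in for the real HTTP call so the sketch runs without a cluster.
    await asyncio.sleep(0.01)
    return {"prompt": prompt}


results = asyncio.run(
    bounded_gather(fake_generate, [f"variation {i}" for i in range(20)], limit=4)
)
print(f"Generated {len(results)} results")
```

With the client code from this section, the worker would be something like `lambda p: generate_image(session, p)`, and a limit of one or two requests per replica is a reasonable starting point to experiment from.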

Performance Benchmarks

We benchmarked SDXL on various GPU types available on VectorLay:

GPU	Time/Image	Images/Hour	Cost/Image
RTX 4090	2.1s	1,714	$0.0004
RTX 4080	3.4s	1,058	$0.0005
RTX 3090	4.2s	857	$0.0004
A100 40GB	1.8s	2,000	$0.0009

Benchmarks: SDXL base, 1024x1024, 30 steps, FP16, xFormers enabled
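
The Images/Hour column follows directly from Time/Image, and the Cost/Image column implies an hourly GPU price. The sketch below reproduces that arithmetic from the table values. The implied hourly rates are derived from rounded table figures, so treat them as rough estimates rather than published prices.

```python
def images_per_hour(seconds_per_image: float) -> int:
    """Whole images one GPU completes in an hour at a fixed per-image latency."""
    return int(3600 / seconds_per_image)


def implied_hourly_rate(cost_per_image: float, seconds_per_image: float) -> float:
    """GPU-hour price in dollars implied by a per-image cost."""
    return cost_per_image * 3600 / seconds_per_image


# (GPU, Time/Image in seconds, Cost/Image in dollars) from the table above
for gpu, seconds, cost in [
    ("RTX 4090", 2.1, 0.0004),
    ("RTX 4080", 3.4, 0.0005),
    ("RTX 3090", 4.2, 0.0004),
    ("A100 40GB", 1.8, 0.0009),
]:
    print(f"{gpu}: {images_per_hour(seconds)} images/hour, "
          f"~${implied_hourly_rate(cost, seconds):.2f}/hour implied")
```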

Cost Analysis

Let's compare the cost of generating 100,000 images:

Provider	GPU	Time	Cost
VectorLay	4x RTX 4090	14.6 hours	$40
AWS	4x A10G	18 hours	$288
RunPod	4x RTX 4090	14.6 hours	$76
Replicate	—	—	$400+
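
The VectorLay row can be checked from the benchmark numbers: 100,000 images at 2.1s each, split across 4 GPUs. The sketch below does that calculation and also backs out the per-GPU-hour price each row implies. It assumes work divides evenly across GPUs with no idle time or startup cost, which is an idealization.

```python
def job_hours(num_images: int, seconds_per_image: float, num_gpus: int) -> float:
    """Wall-clock hours, assuming work splits evenly across GPUs with no idle time."""
    return num_images * seconds_per_image / num_gpus / 3600


def gpu_hour_rate(total_cost: float, hours: float, num_gpus: int) -> float:
    """Dollars per GPU-hour implied by a total job cost."""
    return total_cost / (hours * num_gpus)


hours = job_hours(100_000, 2.1, 4)
print(f"4x RTX 4090: {hours:.1f} hours")

# (provider, total cost in dollars, hours, GPU count) from the table above
for provider, cost, job_time, gpus in [
    ("VectorLay", 40, 14.6, 4),
    ("AWS", 288, 18, 4),
    ("RunPod", 76, 14.6, 4),
]:
    print(f"{provider}: ${gpu_hour_rate(cost, job_time, gpus):.2f}/GPU-hour implied")
```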

Optimization Tips

Use torch.compile(): Reduces inference time by 15-20% after warmup
Enable xFormers: Memory-efficient attention uses 30% less VRAM
Batch your requests: Group multiple prompts per request if your use case allows
Lower steps for drafts: 20 steps is often good enough for previews
Cache your models: Bake the weights into the image (as in the Dockerfile above) or mount a persistent volume so replicas don't re-download them on every start
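
For the batching tip, the diffusers pipeline accepts a list of prompts in a single call and returns one image per prompt. The /generate endpoint in this article takes a single prompt, so a batched endpoint would be a hypothetical extension. On the client side, the main work is grouping prompts, sketched below. `chunk_prompts` and the batch size of 4 are illustrative assumptions; larger batches need more VRAM.

```python
from typing import Iterator


def chunk_prompts(prompts: list[str], batch_size: int = 4) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `batch_size` prompts."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(prompts), batch_size):
        yield prompts[start:start + batch_size]


prompts = [f"A beautiful landscape, variation {i}" for i in range(10)]
for batch in chunk_prompts(prompts, batch_size=4):
    # On the server, a batch could be passed as pipe(prompt=batch), which
    # returns one image per prompt in .images (hypothetical endpoint change).
    print(len(batch), "prompts in this request")
```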

Auto-Scaling for Variable Load

If your traffic is bursty, you can scale replicas based on queue depth:

# Check current metrics
metrics = cluster.get_metrics()

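if metrics.queue_depth > 50:
    # Backlog is building: add two replicas
    cluster.scale(replicas=cluster.replicas + 2)
elif metrics.queue_depth < 5 and cluster.replicas > 1:
    # Queue is nearly empty: remove one replica, never dropping below one
    cluster.scale(replicas=cluster.replicas - 1)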
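
The check above runs once, and the scale-up branch has no upper bound. Keeping the decision in a pure function makes it easy to test and to clamp. The sketch below uses the same thresholds; `desired_replicas` and the default cap of 8 replicas are assumptions for illustration, not VectorLay SDK features.

```python
def desired_replicas(queue_depth: int, current: int,
                     min_replicas: int = 1, max_replicas: int = 8,
                     high_water: int = 50, low_water: int = 5) -> int:
    """Apply the queue-depth thresholds and clamp the result to [min, max]."""
    if queue_depth > high_water:
        target = current + 2   # backlog building: scale up by two
    elif queue_depth < low_water:
        target = current - 1   # nearly idle: scale down by one
    else:
        target = current       # within the comfortable band: hold steady
    return max(min_replicas, min(max_replicas, target))


for depth, current in [(60, 4), (60, 8), (20, 4), (2, 4), (2, 1)]:
    print(f"queue={depth}, replicas={current} -> {desired_replicas(depth, current)}")
```

A polling loop could call this on a fixed interval and invoke `cluster.scale` only when the target differs from the current count. Leaving a cooldown between changes helps avoid flapping while new replicas are still warming up.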

Real-World Use Cases

Here's what teams are building with SDXL on VectorLay:

E-commerce: Product image variations and lifestyle shots
Gaming: Procedural texture and asset generation
Marketing: Ad creative generation and A/B testing
Stock photography: On-demand illustration generation