Running Stable Diffusion XL at Scale
Generate thousands of images per hour with SDXL on VectorLay. This guide covers deployment, optimization, and real-world cost analysis.
Why SDXL on VectorLay?
Stable Diffusion XL is a beast. The base model needs ~6.5GB of VRAM, and once you add the refiner and leave headroom for the optimizations below, you want at least 12GB to run comfortably. A 24GB RTX 4090 covers that with room to spare.
VectorLay gives you access to a network of 4090s at a fraction of cloud prices. More importantly, you can scale horizontally—add more GPUs when you need them, pay only for what you use.
The Container Setup
We'll build a container image that bundles SDXL with several optimizations:
- PyTorch 2.0 with torch.compile()
- xFormers for memory-efficient attention
- FP16 precision for faster inference
- Model caching to avoid repeated downloads
Dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121
RUN pip3 install diffusers transformers accelerate xformers fastapi uvicorn
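# Download the FP16 SDXL weights at build time (the same variant server.py loads,
# so the cached files are reused at startup instead of being downloaded again)
RUN python3 -c "from diffusers import StableDiffusionXLPipeline; \
StableDiffusionXLPipeline.from_pretrained('stabilityai/stable-diffusion-xl-base-1.0', variant='fp16', use_safetensors=True)"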
COPY server.py /app/server.py
WORKDIR /app
EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
server.py
import base64
import threading
from io import BytesIO

import torch
from diffusers import StableDiffusionXLPipeline
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the model once at startup
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

# Enable optimizations
pipe.enable_xformers_memory_efficient_attention()
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# One pipeline instance should not run two generations at once,
# so GPU access is serialized with a lock.
gpu_lock = threading.Lock()


class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = ""
    steps: int = 30
    guidance_scale: float = 7.5
    width: int = 1024
    height: int = 1024


@app.post("/generate")
def generate(request: GenerateRequest):
    # A plain `def` endpoint runs in FastAPI's thread pool, so a long
    # generation does not block the event loop and /health stays responsive.
    with gpu_lock:
        image = pipe(
            prompt=request.prompt,
            negative_prompt=request.negative_prompt,
            num_inference_steps=request.steps,
            guidance_scale=request.guidance_scale,
            width=request.width,
            height=request.height,
        ).images[0]

    # Encode the PIL image as a base64 PNG for the JSON response
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    img_base64 = base64.b64encode(buffer.getvalue()).decode()
    return {"image": img_base64}


@app.get("/health")
async def health():
    return {"status": "healthy", "gpu": torch.cuda.get_device_name()}
Deploying on VectorLay
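Before deploying, note that GenerateRequest in server.py accepts any integer for steps, width, and height, so a single oversized request can exhaust VRAM or tie up a GPU. A minimal sketch of a bounds check the endpoint could run first is shown here. The limits (`MAX_STEPS`, `MAX_PIXELS`) and the function name `validate_request` are assumptions for illustration, not values from VectorLay or diffusers; the multiple-of-8 rule mirrors the input check the SDXL pipeline applies itself.

```python
# Hypothetical limits; tune them to your GPU memory and latency budget.
MAX_STEPS = 100
MAX_PIXELS = 1536 * 1536


def validate_request(steps: int, width: int, height: int) -> None:
    """Raise ValueError if the generation parameters are out of bounds."""
    if not 1 <= steps <= MAX_STEPS:
        raise ValueError(f"steps must be between 1 and {MAX_STEPS}, got {steps}")
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width % 8 or height % 8:
        raise ValueError("width and height must be multiples of 8")
    if width * height > MAX_PIXELS:
        raise ValueError(f"image area must not exceed {MAX_PIXELS} pixels")


# The defaults from GenerateRequest pass the check without raising.
validate_request(steps=30, width=1024, height=1024)
```

Inside the endpoint, a failed check could be turned into an HTTP 422 by catching the ValueError and raising `fastapi.HTTPException(status_code=422, detail=str(exc))`.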
Push your image to a registry (Docker Hub, GHCR, or your private registry), then deploy:
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="sdxl-prod",
    image="your-registry/sdxl-server:latest",
    gpu_type="rtx-4090",
    replicas=4,  # 4 GPUs for parallel generation
    port=8080,
    health_check_path="/health",
    env={
        "PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:512"
    },
)

cluster.wait_until_ready()
print(f"Cluster ready at: {cluster.endpoint}")
Generating Images
With your cluster running, generating images is a simple API call:
import base64

import requests

response = requests.post(
    "https://your-cluster.vectorlay.dev/generate",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "prompt": "A cyberpunk cityscape at sunset, neon lights reflecting on wet streets, highly detailed, 8k",
        "negative_prompt": "blurry, low quality, distorted",
        "steps": 30,
        "guidance_scale": 7.5,
    },
    timeout=120,  # fail instead of hanging if a replica stalls
)
response.raise_for_status()  # surface HTTP errors before decoding

# Decode and save the image
img_data = base64.b64decode(response.json()["image"])
with open("output.png", "wb") as f:
    f.write(img_data)
Parallel Generation
With multiple replicas, you can generate images in parallel. VectorLay automatically load balances across all healthy replicas:
import asyncio

import aiohttp


async def generate_image(session, prompt):
    async with session.post(
        "https://your-cluster.vectorlay.dev/generate",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        json={"prompt": prompt, "steps": 30},
    ) as response:
        response.raise_for_status()  # raise on HTTP errors instead of parsing an error body
        return await response.json()


async def batch_generate(prompts):
    # One shared session reuses connections across all requests
    async with aiohttp.ClientSession() as session:
        tasks = [generate_image(session, p) for p in prompts]
        return await asyncio.gather(*tasks)


# Generate 100 images in parallel
prompts = [f"A beautiful landscape, variation {i}" for i in range(100)]
results = asyncio.run(batch_generate(prompts))
print(f"Generated {len(results)} images")
Performance Benchmarks
We benchmarked SDXL on various GPU types available on VectorLay:
| GPU | Time/Image | Images/Hour | Cost/Image |
|---|---|---|---|
| RTX 4090 | 2.1s | 1,714 | $0.0004 |
| RTX 4080 | 3.4s | 1,058 | $0.0005 |
| RTX 3090 | 4.2s | 857 | $0.0004 |
| A100 40GB | 1.8s | 2,000 | $0.0009 |
Benchmarks: SDXL base, 1024x1024, 30 steps, FP16, xFormers enabled
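The Images/Hour column follows directly from Time/Image (3,600 seconds divided by seconds per image, truncated to whole images), and multiplying Cost/Image by Images/Hour recovers the hourly GPU price each row implies. A small sketch of that arithmetic, using only the numbers in the table; the function names are illustrative, and the implied hourly rates are derived values, not published prices.

```python
def images_per_hour(seconds_per_image: float) -> float:
    """Throughput of one GPU running back-to-back generations."""
    return 3600.0 / seconds_per_image


def implied_hourly_rate(seconds_per_image: float, cost_per_image: float) -> float:
    """Hourly GPU price implied by a per-image cost at full utilization."""
    return images_per_hour(seconds_per_image) * cost_per_image


# (seconds per image, cost per image) taken from the benchmark table
benchmarks = {
    "RTX 4090": (2.1, 0.0004),
    "RTX 4080": (3.4, 0.0005),
    "RTX 3090": (4.2, 0.0004),
    "A100 40GB": (1.8, 0.0009),
}

for gpu, (secs, cost) in benchmarks.items():
    rate = implied_hourly_rate(secs, cost)
    print(f"{gpu}: {int(images_per_hour(secs)):,} images/hour, ~${rate:.2f}/GPU-hour implied")
```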
Cost Analysis
Let's compare the cost of generating 100,000 images:
| Provider | GPU | Time | Cost |
|---|---|---|---|
| VectorLay | 4x RTX 4090 | 14.6 hours | $40 |
| AWS | 4x A10G | 18 hours | $288 |
| RunPod | 4x RTX 4090 | 14.6 hours | $76 |
| Replicate | — | — | $400+ |
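The VectorLay row can be reproduced from the benchmark figures: 100,000 images spread over four RTX 4090s at 1,714 images per hour each, priced at $0.0004 per image. The sketch below shows that calculation. It assumes perfect load balancing and no warmup or queueing overhead, so real jobs would likely take somewhat longer; `job_estimate` is an illustrative name.

```python
def job_estimate(total_images: int, gpus: int,
                 images_per_hour_per_gpu: float,
                 cost_per_image: float) -> tuple[float, float]:
    """Return (wall-clock hours, total cost) for a batch job."""
    hours = total_images / (gpus * images_per_hour_per_gpu)
    cost = total_images * cost_per_image
    return hours, cost


hours, cost = job_estimate(100_000, gpus=4,
                           images_per_hour_per_gpu=1_714,
                           cost_per_image=0.0004)
print(f"{hours:.1f} hours, ${cost:.0f}")  # 14.6 hours, $40
```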
Optimization Tips
- Use torch.compile(): Reduces inference time by 15-20% after warmup
- Enable xFormers: Memory-efficient attention uses 30% less VRAM
- Batch your requests: Group multiple prompts per request if your use case allows
- Lower steps for drafts: 20 steps is often good enough for previews
- Cache your models: Mount a persistent volume to avoid re-downloading
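For the batching tip, the diffusers SDXL pipeline accepts a list of prompts in a single call, so one option is to split a large prompt list into fixed-size groups on the client and send each group as one request. The helper below is a minimal sketch of the grouping step only. The plural `prompts` payload key, and a server endpoint that accepts it, are assumptions: the server.py shown earlier takes a single `prompt` string and would need to be extended.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payloads(prompts: List[str], batch_size: int = 4, steps: int = 30) -> List[dict]:
    # "prompts" (plural) is a hypothetical request field, not part of
    # the single-prompt API defined in server.py.
    return [{"prompts": group, "steps": steps} for group in chunked(prompts, batch_size)]


payloads = build_payloads([f"A beautiful landscape, variation {i}" for i in range(10)])
print(len(payloads))  # 3 requests: 4 + 4 + 2 prompts
```

Larger batches need more VRAM, so the batch size is worth tuning per GPU type.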
Auto-Scaling for Variable Load
If your traffic is bursty, you can scale replicas based on queue depth:
# Check current metrics
metrics = cluster.get_metrics()

if metrics.queue_depth > 50:
    cluster.scale(replicas=cluster.replicas + 2)
elif metrics.queue_depth < 5 and cluster.replicas > 1:
    cluster.scale(replicas=max(1, cluster.replicas - 1))
Real-World Use Cases
Here's what teams are building with SDXL on VectorLay:
- E-commerce: Product image variations and lifestyle shots
- Gaming: Procedural texture and asset generation
- Marketing: Ad creative generation and A/B testing
- Stock photography: On-demand illustration generation
Next Steps
Start generating images today
Get $10 in free credits—enough to generate ~25,000 images.