Why We Keep Container Deployments Simple (And You Should Too)
We could have built a mini-Kubernetes. Multi-container pods, sidecars, service meshes, the whole works. Here's why we deliberately didn't—and why that decision makes Vectorlay better.
"You're not deploying a microservice mesh. You're running inference."
When we designed Vectorlay's deployment model, we had a choice. We could build a sophisticated orchestration layer with all the bells and whistles—or we could build something radically simple that does one thing extremely well.
We chose simple. Not because it was easier to build (it wasn't), but because simplicity is the feature.
The Model We Chose
Vectorlay uses a dead-simple deployment model:
```
Cluster = deployment unit
├── One container image
├── One GPU type (RTX 4090, 3090, etc.)
├── N replicas (spread across nodes)
└── One stable endpoint URL
```
That's it. No pods. No sidecars. No compose files.
- Cluster: Your deployment unit. Defines container image, GPU type, and replica count.
- Replica: Individual instance running on a GPU node. The system manages placement automatically.
- Endpoint: Stable URL that never changes. The edge proxy routes requests to healthy replicas (sketched just below).
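Here's what that looks like from your application's side: a minimal Python sketch, assuming the `https://xyz.vectorlay.dev` endpoint from the example further down. The `/v1/infer` route and payload shape are placeholders, since the actual API is whatever server your container runs.

```python
# Minimal client sketch: one stable URL, no node tracking.
# The endpoint matches the example below; the /v1/infer path and payload
# shape are placeholders -- they depend on the server inside your container.
import requests

ENDPOINT = "https://xyz.vectorlay.dev"

def infer(payload: dict, retries: int = 3, timeout: int = 30) -> dict:
    """POST an inference request to the cluster's stable endpoint."""
    last_error = None
    for _ in range(retries):
        try:
            resp = requests.post(f"{ENDPOINT}/v1/infer", json=payload, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException as exc:
            # Transient failure: the edge proxy routes the retry to a healthy replica.
            last_error = exc
    raise RuntimeError(f"inference failed after {retries} attempts") from last_error
```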
Want multiple services? Create multiple clusters. Each scales independently, runs on appropriate hardware, and fails independently.
```yaml
# Example: Audio processing pipeline
clusters:
  whisper-transcription:
    image: myorg/whisper-server:latest
    gpu: RTX 4090
    replicas: 3
    endpoint: https://xyz.vectorlay.dev
  llm-processing:
    image: myorg/llama-server:latest
    gpu: RTX 4090
    replicas: 5
    endpoint: https://abc.vectorlay.dev
  tts-synthesis:
    image: myorg/coqui-tts:latest
    gpu: RTX 3090
    replicas: 2
    endpoint: https://def.vectorlay.dev

# Your app calls each endpoint independently:
# audio → whisper-transcription → llm-processing → tts-synthesis
```

What We Considered (And Rejected)
Before settling on single-container clusters, we seriously evaluated more complex models. Here's what we rejected and why.
Option 1: Pod/Sidecar Model
Like Kubernetes pods—co-located containers sharing localhost, network namespace, and lifecycle.
```yaml
# Rejected: Pod model
pod:
  containers:
    - name: inference
      image: model-server
      gpu: 1
    - name: metrics
      image: prometheus-exporter
    - name: auth-proxy
      image: oauth-proxy
```

Use case
Metrics collectors, auth proxies, logging agents running alongside inference servers.
Why we rejected it
Scheduling complexity explodes. Co-location guarantees are hard on distributed consumer hardware. Unclear if anyone actually needs this.
Option 2: Docker Compose-Style YAML
Full multi-container definitions with dependencies, networks, volumes—the familiar docker-compose experience.
```yaml
# Rejected: Compose-style
services:
  inference:
    image: model-server
    gpu: true
    depends_on:
      - redis
      - prometheus
  redis:
    image: redis:alpine
  prometheus:
    image: prom/prometheus
  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
```

Use case
Complex applications with databases, caches, and monitoring all bundled together.
Why we rejected it
Massive complexity for a distributed overlay network. Cross-node networking becomes a nightmare. We'd be fighting our own architecture.
Why Simple Wins for GPU Inference
Here's the thing: GPU inference workloads are fundamentally different from traditional microservices.
1. GPU workloads are typically single-container
You're running vLLM, TGI, Triton, or a custom inference server. That's one process, one container, one GPU. You're not deploying a microservice mesh—you're running inference.
2. Independent scaling matters
Your Whisper transcription needs 3 replicas on 4090s. Your LLM needs 8 replicas. Your TTS needs 2 on cheaper 3090s. Bundling them together means you can't scale independently.
3. Overlay networks are inherently distributed
Our nodes are consumer GPUs scattered globally—gamers, small data centers, crypto miners. Co-location guarantees are almost impossible when hardware is this distributed.
4. Kubernetes already exists
If you need full orchestration with service meshes, network policies, and complex dependency graphs—use GKE, EKS, or AKS. We solve a different problem: cheap GPU inference with auto-failover. That's the whole product.
The Philosophy: YAGNI
YAGNI: "You Aren't Gonna Need It." It's one of the oldest principles in software, and we take it seriously.
"Every abstraction we don't add is a failure mode we don't have."
- Build for real demand: We add features when users actually need them, not when we imagine they might.
- Every feature is maintenance burden: Code you don't write can't have bugs. Abstractions you don't build can't break.
- Ship simple, iterate when users scream: If people desperately need sidecars, we'll know. They'll tell us. Loudly.
CPU-only clusters? Sidecars? Multi-container pods? We'll add them when people actually need them. Not before.
How We Compare
| | Vectorlay | Kubernetes | Serverless GPU |
|---|---|---|---|
| Deploy time | Seconds | Minutes to hours | Seconds |
| Container model | Single | Pods / multi-container | Single |
| Scaling unit | Cluster | Deployment / Pod | Function |
| Networking | Edge proxy | Service mesh | Managed |
| YAML required | None | Lots | Minimal |
| Target user | ML eng shipping fast | Platform teams | Quick experiments |
The Trade-offs (We're Honest About These)
Every design choice has trade-offs. Here's what we give up:
What we give up
- No co-located sidecars
- No built-in service mesh
- No container-to-container localhost
- No complex dependency graphs
- No init containers
What we gain
- Deploy in seconds, not hours
- Clear mental model
- Fewer failure modes
- Simpler debugging
- Lower maintenance burden
What This Means for You
- Deploy in seconds: No YAML wrangling. Pick your image, GPU, and replica count. Done.
- Clear mental model: One cluster = one service = one endpoint. No hidden complexity.
- Compose via endpoints: Build complex pipelines by calling multiple cluster endpoints. Audio → transcription → LLM → TTS. Simple HTTP calls (see the sketch after this list).
- Focus on your model: We handle the infrastructure. You focus on making your model better.
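Here's a sketch of that composition in Python. The endpoint URLs match the audio-pipeline example earlier; the routes (`/transcribe`, `/generate`, `/synthesize`) and payload shapes are placeholders that depend on the servers packaged in your containers, not on anything Vectorlay provides.

```python
# Sketch: compose the audio pipeline by calling each cluster endpoint in turn.
# Endpoint URLs follow the earlier example; routes and payload shapes are
# placeholders -- they depend entirely on the servers inside your containers.
import requests

WHISPER = "https://xyz.vectorlay.dev"   # whisper-transcription cluster
LLM = "https://abc.vectorlay.dev"       # llm-processing cluster
TTS = "https://def.vectorlay.dev"       # tts-synthesis cluster

def run_pipeline(audio_bytes: bytes) -> bytes:
    """audio -> transcription -> LLM -> synthesized speech, via plain HTTP."""
    # 1. Transcribe the audio on the Whisper cluster.
    transcript = requests.post(
        f"{WHISPER}/transcribe", files={"audio": audio_bytes}, timeout=60
    ).json()["text"]

    # 2. Process the transcript on the LLM cluster.
    reply = requests.post(
        f"{LLM}/generate", json={"prompt": transcript}, timeout=60
    ).json()["text"]

    # 3. Synthesize the reply on the TTS cluster and return raw audio bytes.
    return requests.post(
        f"{TTS}/synthesize", json={"text": reply}, timeout=60
    ).content
```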
When to Use What
We're not the right choice for everything. Here's an honest guide:
Use Vectorlay when:
You need fast, cheap GPU inference with auto-failover. You're running ML models and want to ship quickly without infrastructure overhead.
Use Kubernetes when:
You need complex orchestration, service meshes, network policies, and your platform team can manage it. You have enterprise requirements Vectorlay doesn't cover.
Use serverless GPU when:
You're running quick experiments, need zero cold-start management, and don't care about persistent endpoints.
Complexity Is Easy. Simplicity Is Hard.
Anyone can add features. The hard part is knowing what not to build.
"Nodes fail. Consumer GPUs aren't enterprise hardware. Simple systems recover faster."
We're not AWS. We're not trying to be. We're cheap GPU inference with auto-failover, and that's the whole product. Every feature we don't add is a failure mode we don't have, a bug we don't introduce, and complexity we don't maintain.
We chose simple. We chose hard. And we think you'll love it.
Deploy your first cluster in 60 seconds
No YAML required. Pick your image, choose a GPU, set your replica count. That's it.
Get started free