Engineering Philosophy

Why We Keep Container Deployments Simple (And You Should Too)

December 27, 2024
10 min read

We could have built a mini-Kubernetes. Multi-container pods, sidecars, service meshes, the whole works. Here's why we deliberately didn't—and why that decision makes Vectorlay better.

"You're not deploying a microservice mesh. You're running inference."

When we designed Vectorlay's deployment model, we had a choice. We could build a sophisticated orchestration layer with all the bells and whistles—or we could build something radically simple that does one thing extremely well.

We chose simple. Not because it was easier to build (it wasn't), but because simplicity is the feature.

The Model We Chose

Vectorlay uses a dead-simple deployment model:

Cluster = deployment unit
├── One container image
├── One GPU type (RTX 4090, 3090, etc.)
├── N replicas (spread across nodes)
└── One stable endpoint URL

That's it. No pods. No sidecars. No compose files.
  • Cluster: Your deployment unit. Defines container image, GPU type, and replica count.
  • Replica: Individual instance running on a GPU node. The system manages placement automatically.
  • Endpoint: Stable URL that never changes. Edge proxy routes to healthy replicas.

Want multiple services? Create multiple clusters. Each scales independently, runs on appropriate hardware, and fails independently.

// Example: Audio processing pipeline
Clusters:
  whisper-transcription:
    image: myorg/whisper-server:latest
    gpu: RTX 4090
    replicas: 3
    endpoint: https://xyz.vectorlay.dev

  llm-processing:
    image: myorg/llama-server:latest
    gpu: RTX 4090
    replicas: 5
    endpoint: https://abc.vectorlay.dev

  tts-synthesis:
    image: myorg/coqui-tts:latest
    gpu: RTX 3090
    replicas: 2
    endpoint: https://def.vectorlay.dev

// Your app calls each endpoint independently
audio → whisper-transcription → llm-processing → tts-synthesis
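
To make the composition concrete, here is a minimal Python sketch of an app chaining those three endpoints over plain HTTP. The URLs are the placeholders from the example above, and the routes and payload shapes (/transcribe, /generate, /synthesize) are hypothetical stand-ins for whatever your containers actually expose.

# Sketch only: three independent clusters composed with plain HTTP calls.
# Routes and response fields are hypothetical placeholders.
import requests

WHISPER_URL = "https://xyz.vectorlay.dev"  # whisper-transcription cluster
LLM_URL = "https://abc.vectorlay.dev"      # llm-processing cluster
TTS_URL = "https://def.vectorlay.dev"      # tts-synthesis cluster

def run_pipeline(audio_path: str) -> bytes:
    # 1. Audio -> transcript
    with open(audio_path, "rb") as f:
        transcript = requests.post(
            f"{WHISPER_URL}/transcribe", files={"file": f}, timeout=120
        ).json()["text"]

    # 2. Transcript -> LLM completion
    reply = requests.post(
        f"{LLM_URL}/generate", json={"prompt": transcript}, timeout=120
    ).json()["completion"]

    # 3. Completion -> synthesized speech (raw audio bytes)
    return requests.post(
        f"{TTS_URL}/synthesize", json={"text": reply}, timeout=120
    ).content

Because each stage is just an HTTP endpoint, you can swap, scale, or redeploy any cluster without touching the others.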

What We Considered (And Rejected)

Before settling on single-container clusters, we seriously evaluated more complex models. Here's what we rejected and why.

Option 1: Pod/Sidecar Model

Like Kubernetes pods—co-located containers sharing localhost, network namespace, and lifecycle.

// Rejected: Pod model
pod:
  containers:
    - name: inference
      image: model-server
      gpu: 1
    - name: metrics
      image: prometheus-exporter
    - name: auth-proxy
      image: oauth-proxy

Use case

Metrics collectors, auth proxies, logging agents running alongside inference servers.

Why we rejected it

Scheduling complexity explodes. Co-location guarantees are hard on distributed consumer hardware. Unclear if anyone actually needs this.

Option 2: Docker Compose-Style YAML

Full multi-container definitions with dependencies, networks, volumes—the familiar docker-compose experience.

// Rejected: Compose-style
services:
  inference:
    image: model-server
    gpu: true
    depends_on:
      - redis
      - prometheus
  redis:
    image: redis:alpine
  prometheus:
    image: prom/prometheus
  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus

Use case

Complex applications with databases, caches, and monitoring all bundled together.

Why we rejected it

Massive complexity for a distributed overlay network. Cross-node networking becomes a nightmare. We'd be fighting our own architecture.

Why Simple Wins for GPU Inference

Here's the thing: GPU inference workloads are fundamentally different from traditional microservices.

1. GPU workloads are typically single-container

You're running vLLM, TGI, Triton, or a custom inference server. That's one process, one container, one GPU. You're not deploying a microservice mesh—you're running inference.
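
If "one process, one container, one GPU" sounds abstract, here is roughly what such a server looks like. This is a generic sketch using FastAPI and Hugging Face transformers, not Vectorlay-specific code; the model and route are arbitrary examples.

# Generic sketch of a single-process inference server: load one model onto
# one GPU at startup, expose one HTTP route. Model and route are arbitrary.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once; the entire container exists to serve this one model on GPU 0.
generator = pipeline("text-generation", model="distilgpt2", device=0)

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 64

@app.post("/generate")
def generate(req: GenerateRequest):
    out = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": out[0]["generated_text"]}

Package that into an image, point a cluster at it, and placement and failover are handled for you.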

2. Independent scaling matters

Your Whisper transcription needs 3 replicas on 4090s. Your LLM needs 5 replicas. Your TTS needs 2 on cheaper 3090s. Bundling them together means you can't scale them independently.

3. Overlay networks are inherently distributed

Our nodes are consumer GPUs scattered globally—gamers, small data centers, crypto miners. Co-location guarantees are almost impossible when hardware is this distributed.

4. Kubernetes already exists

If you need full orchestration with service meshes, network policies, and complex dependency graphs—use GKE, EKS, or AKS. We solve a different problem: cheap GPU inference with auto-failover. That's the whole product.

The Philosophy: YAGNI

YAGNI: "You Aren't Gonna Need It." It's a core principle borrowed from Extreme Programming, and we take it seriously.

"Every abstraction we don't add is a failure mode we don't have."

  • Build for real demand: We add features when users actually need them, not when we imagine they might.
  • Every feature is maintenance burden: Code you don't write can't have bugs. Abstractions you don't build can't break.
  • Ship simple, iterate when users scream: If people desperately need sidecars, we'll know. They'll tell us. Loudly.

CPU-only clusters? Sidecars? Multi-container pods? We'll add them when people actually need them. Not before.

How We Compare

                  Vectorlay             Kubernetes               Serverless GPU
Deploy time       Seconds               Minutes to hours         Seconds
Container model   Single                Pods / multi-container   Single
Scaling unit      Cluster               Deployment / Pod         Function
Networking        Edge proxy            Service mesh             Managed
YAML required     None                  Lots                     Minimal
Target user       ML eng shipping fast  Platform teams           Quick experiments

The Trade-offs (We're Honest About These)

Every design choice has trade-offs. Here's what we give up:

What we give up

  • No co-located sidecars
  • No built-in service mesh
  • No container-to-container localhost
  • No complex dependency graphs
  • No init containers

What we gain

  • Deploy in seconds, not hours
  • Clear mental model
  • Fewer failure modes
  • Simpler debugging
  • Lower maintenance burden

What This Means for You

  • Deploy in seconds: No YAML wrangling. Pick your image, GPU, and replica count. Done.
  • Clear mental model: One cluster = one service = one endpoint. No hidden complexity.
  • Compose via endpoints: Build complex pipelines by calling multiple cluster endpoints. Audio → transcription → LLM → TTS. Simple HTTP calls.
  • Focus on your model: We handle the infrastructure. You focus on making your model better.

When to Use What

We're not the right choice for everything. Here's an honest guide:

Use Vectorlay when:

You need fast, cheap GPU inference with auto-failover. You're running ML models and want to ship quickly without infrastructure overhead.

Use Kubernetes when:

You need complex orchestration, service meshes, network policies, and your platform team can manage it. You have enterprise requirements Vectorlay doesn't cover.

Use serverless GPU when:

You're running quick experiments, need zero cold-start management, and don't care about persistent endpoints.

Complexity Is Easy. Simplicity Is Hard.

Anyone can add features. The hard part is knowing what not to build.

"Nodes fail. Consumer GPUs aren't enterprise hardware. Simple systems recover faster."

We're not AWS. We're not trying to be. We're cheap GPU inference with auto-failover, and that's the whole product. Every feature we don't add is a failure mode we don't have, a bug we don't introduce, and complexity we don't maintain.

We chose simple. We chose hard. And we think you'll love it.

Deploy your first cluster in 60 seconds

No YAML required. Pick your image, choose a GPU, set your replica count. That's it.

Get started free