How Vectorlay Works: The Big Picture
When we set out to build Vectorlay, we had one guiding principle: nodes will fail. Consumer GPUs aren't enterprise hardware. Machines go offline, networks hiccup, and power flickers. The question isn't if failures happen—it's how you design around them.
This is the first article in a series exploring Vectorlay's architecture. We'll start with the big picture—how all the pieces fit together—then dive deep into each component in subsequent posts.
What is Vectorlay?
At its core, Vectorlay is an overlay network for GPU inference. Think of it like a CDN, but instead of caching static assets at edge locations, we're running ML inference across a distributed fleet of GPU nodes.
The key difference from traditional cloud GPU providers: we don't own the hardware. Anyone can contribute GPUs to the network—gamers with spare 4090s, small data centers, crypto miners looking for new revenue. We handle the orchestration, security, and routing.
The Architecture
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   Client    │────▶│   Edge Proxy    │────▶│   Cluster   │
│ (Your App)  │     │ (Load Balancer) │     │ (Replicas)  │
└─────────────┘     └─────────────────┘     └──────┬──────┘
                                                   │
                    ┌──────────────────────────────┼────────────┐
                    │                              ▼            │
                    │     ┌─────────┐         ┌─────────┐       │
                    │     │ Replica │         │ Replica │ ...   │
                    │     │ (GPU 1) │         │ (GPU 2) │       │
                    │     └────┬────┘         └────┬────┘       │
                    │          │                   │            │
                    │          ▼                   ▼            │
                    │     ┌─────────┐         ┌─────────┐       │
                    │     │  Node   │         │  Node   │       │
                    │     │ (Agent) │         │ (Agent) │       │
                    │     └─────────┘         └─────────┘       │
                    │                                           │
                    │                GPU Network                │
                    └───────────────────────────────────────────┘

Let's break down the key components:
Clusters
A cluster is your deployment unit. When you deploy to Vectorlay, you create a cluster that defines:
- What GPU type you need (RTX 4090, 3090, etc.)
- How many replicas to run
- Your container image (any OCI-compatible image)
- Environment variables and port mappings
The cluster gives you a stable endpoint URL. Behind that URL, traffic is automatically distributed across healthy replicas.
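To make that concrete, here is a rough sketch of what a cluster definition could look like. The field names, GPU type values, and port-mapping shape are illustrative assumptions, not Vectorlay's actual API.

```typescript
// Hypothetical cluster spec; field names and values are illustrative,
// not Vectorlay's actual API.
interface ClusterSpec {
  gpuType: 'RTX_4090' | 'RTX_3090'; // which GPU class to schedule on
  replicas: number;                 // how many independent instances to run
  image: string;                    // any OCI-compatible container image
  env?: Record<string, string>;     // environment variables
  ports?: { container: number; expose: number }[]; // port mappings
}

const cluster: ClusterSpec = {
  gpuType: 'RTX_4090',
  replicas: 3,
  image: 'registry.example.com/my-model:latest',
  env: { MODEL_NAME: 'llama-3-8b' },
  ports: [{ container: 8000, expose: 443 }],
};
```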
Replicas
Replicas are individual instances of your workload. Each replica runs on a single GPU node and can handle inference requests independently.
If you request 3 replicas, we schedule them across 3 different nodes (when possible) for maximum fault tolerance. If one node goes down, the other two keep serving traffic.
Nodes
Nodes are physical machines with GPUs, contributed by providers. Each node runs our agent software that:
- Maintains a connection to the control plane
- Reports hardware specs and health status
- Executes deployment commands
- Runs containers in isolated microVMs
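As a rough illustration of how those responsibilities fit together, here is a minimal agent loop using the `ws` client. The control-plane URL, message types, and payload fields are assumptions for the sketch, not the real protocol.

```typescript
import os from 'node:os';
import WebSocket from 'ws';

// Minimal sketch of an agent loop. The URL, message types, and payload
// fields are illustrative assumptions, not the real Vectorlay protocol.
const ws = new WebSocket('wss://control-plane.example.com/agent');

ws.on('open', () => {
  // Register with hardware specs, then heartbeat on an interval.
  ws.send(JSON.stringify({ type: 'register', hostname: os.hostname(), gpus: ['RTX 4090'] }));
  setInterval(() => {
    ws.send(JSON.stringify({ type: 'heartbeat', ts: Date.now() }));
  }, 15_000);
});

ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'deploy') {
    // Pull the image and start it inside an isolated microVM (elided here).
  }
});
```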
Edge Proxy
The edge proxy is our smart load balancer. It:
- Routes requests to healthy replicas only
- Handles automatic failover when nodes go offline
- Provides TLS termination and rate limiting
- Gives you a stable endpoint regardless of underlying nodes
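The routing logic can be sketched roughly like this; the `Replica` shape and the retry policy are simplified assumptions, not the proxy's actual implementation.

```typescript
// Simplified sketch of health-aware routing with failover.
interface Replica {
  id: string;
  url: string;      // e.g. the node's tunnel endpoint
  healthy: boolean; // maintained by periodic health checks
}

async function forwardWithFailover(
  replicas: Replica[],
  path: string,
  init?: RequestInit,
): Promise<Response> {
  // Only consider replicas currently marked healthy.
  for (const replica of replicas.filter((r) => r.healthy)) {
    try {
      return await fetch(new URL(path, replica.url), init);
    } catch {
      // Mark the replica unhealthy and fall through to the next one;
      // a background health check can restore it later.
      replica.healthy = false;
    }
  }
  throw new Error('No healthy replicas available');
}
```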
The Control Plane
Orchestrating all of this is the control plane:
Agents ──WebSocket──▶ Caddy ──▶ WS Server ──▶ Redis (BullMQ)
                                    │
                                    ▼
                               Supabase DB

- WebSocket Server: Maintains persistent connections with all agents for real-time communication
- Redis + BullMQ: Job queue for reliable deployment command delivery
- Supabase: PostgreSQL database for all persistent state
- Caddy: TLS termination and reverse proxy
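For example, enqueueing a deploy command with BullMQ looks roughly like this. The queue name, payload shape, and retry policy are assumptions for illustration, not the actual schema.

```typescript
import { Queue } from 'bullmq';

// Illustrative sketch: queue name, payload shape, and retry policy are
// assumptions, not Vectorlay's actual schema.
const deployments = new Queue('deployments', {
  connection: { host: 'localhost', port: 6379 },
});

// Attempts + backoff mean a deploy command survives a node that is
// briefly offline: the job just waits in Redis until it can be delivered.
await deployments.add(
  'deploy',
  {
    clusterId: 'cluster-123',
    nodeId: 'node-abc',
    image: 'registry.example.com/my-model:latest',
  },
  {
    attempts: 10,
    backoff: { type: 'exponential', delay: 5_000 },
  },
);
```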
Why This Architecture?
Every design decision stems from our core principle: assume failure.
WebSockets over HTTP Polling
We need to push commands to agents instantly. HTTP polling would add latency and make it harder to detect disconnections.
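A persistent connection also lets the server notice dead agents quickly via ping/pong. The snippet below is the generic `ws` heartbeat pattern, not our exact implementation.

```typescript
import { WebSocketServer, WebSocket } from 'ws';

// Generic ws heartbeat pattern: terminate connections that stop answering pings.
const wss = new WebSocketServer({ port: 8080 });
const alive = new Map<WebSocket, boolean>();

wss.on('connection', (ws) => {
  alive.set(ws, true);
  ws.on('pong', () => alive.set(ws, true));
  ws.on('close', () => alive.delete(ws));
});

// Every 30s: anything that never answered the previous ping is considered dead.
setInterval(() => {
  for (const ws of wss.clients) {
    if (!alive.get(ws)) {
      ws.terminate();
      continue;
    }
    alive.set(ws, false);
    ws.ping();
  }
}, 30_000);
```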
Job Queue for Deployments
If a node disconnects during a deploy command, the job stays queued. When it reconnects, the command executes. No lost deployments.
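In sketch form, the worker side might look like this: if the target node isn't connected, the job throws and BullMQ retries it later, so the command is delivered once the agent comes back. The `connectedAgents` map is a stand-in for the WebSocket server's real connection registry.

```typescript
import { Worker } from 'bullmq';

// Stand-in for the WebSocket server's registry of connected agents.
const connectedAgents = new Map<string, (command: unknown) => Promise<void>>();

const worker = new Worker(
  'deployments',
  async (job) => {
    const { nodeId } = job.data as { nodeId: string };
    const send = connectedAgents.get(nodeId);
    if (!send) {
      // Throwing makes BullMQ retry the job according to its attempts/backoff
      // settings, so the command runs once the agent reconnects.
      throw new Error(`node ${nodeId} is offline, retrying later`);
    }
    await send(job.data);
  },
  { connection: { host: 'localhost', port: 6379 } },
);
```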
Replicas Spread Across Nodes
We actively avoid placing multiple replicas of the same cluster on the same node. One node failure should never take down your entire service.
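The placement rule is essentially soft anti-affinity, sketched below; the node shape and capacity model are simplified assumptions rather than the real scheduler.

```typescript
// Simplified soft anti-affinity: prefer nodes that don't already host a
// replica of this cluster, and only reuse a node when capacity forces it.
interface GpuNode {
  id: string;
  freeGpus: number;
}

function placeReplicas(nodes: GpuNode[], replicaCount: number): string[] {
  const placements: string[] = [];
  const used = new Set<string>();

  while (placements.length < replicaCount) {
    const candidate =
      nodes.find((n) => n.freeGpus > 0 && !used.has(n.id)) ?? // spread first
      nodes.find((n) => n.freeGpus > 0);                      // then reuse
    if (!candidate) throw new Error('Not enough GPU capacity');

    candidate.freeGpus -= 1;
    used.add(candidate.id);
    placements.push(candidate.id);
  }
  return placements;
}
```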
Hardware Isolation via MicroVMs
Consumer GPUs on shared infrastructure need strong isolation. Kata Containers provide VM-level security with container ergonomics.
Dive Deeper
This overview covers the "what"—the rest of this series covers the "how". Each article dives deep into one component:
- The Big Picture (you are here): Overview of the distributed GPU overlay network
- The Control Plane: WebSocket server, node registration, and job queues
- The Agent: Node software, heartbeats, and dependency management
- GPU Passthrough with Kata: VFIO, microVMs, and hardware isolation
- Fault Tolerance: Health checks, failover, and self-healing
Next up: The Control Plane, a deep dive into the WebSocket server, the node registration flow, and how BullMQ ensures reliable command delivery.