How Vectorlay Works: The Big Picture
When we set out to build Vectorlay, we had one guiding principle: nodes will fail. Consumer GPUs aren't enterprise hardware. Machines go offline, networks hiccup, and power flickers. The question isn't if failures happen—it's how you design around them.
This is the first article in a series exploring Vectorlay's architecture. We'll start with the big picture—how all the pieces fit together—then dive deep into each component in subsequent posts.
What is Vectorlay?
At its core, Vectorlay is an overlay network for GPU inference. Think of it like a CDN, but instead of caching static assets at edge locations, we're running ML inference across a distributed fleet of GPU nodes.
The key difference from traditional cloud GPU providers: we don't own the hardware. Anyone can contribute GPUs to the network—gamers with spare 4090s, small data centers, crypto miners looking for new revenue. We handle the orchestration, security, and routing.
The Architecture
┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   Client    │────▶│   Edge Proxy    │────▶│   Cluster   │
│ (Your App)  │     │ (Load Balancer) │     │ (Replicas)  │
└─────────────┘     └─────────────────┘     └──────┬──────┘
                                                   │
                    ┌──────────────────────────────┼────────────┐
                    │                              ▼            │
                    │     ┌─────────┐         ┌─────────┐       │
                    │     │ Replica │         │ Replica │ ...   │
                    │     │ (GPU 1) │         │ (GPU 2) │       │
                    │     └────┬────┘         └────┬────┘       │
                    │          │                   │            │
                    │          ▼                   ▼            │
                    │     ┌─────────┐         ┌─────────┐       │
                    │     │  Node   │         │  Node   │       │
                    │     │ (Agent) │         │ (Agent) │       │
                    │     └─────────┘         └─────────┘       │
                    │                                           │
                    │                GPU Network                │
                    └───────────────────────────────────────────┘

Let's break down the key components:
Clusters
A cluster is your deployment unit. When you deploy to Vectorlay, you create a cluster that defines:
- What GPU type you need (RTX 4090, 3090, etc.)
- How many replicas to run
- Your container image (any OCI-compatible image)
- Environment variables and port mappings
The cluster gives you a stable endpoint URL. Behind that URL, traffic is automatically distributed across healthy replicas.
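To make that concrete, here is a rough sketch of what a cluster definition could look like. The field names, GPU type values, and port-mapping shape are illustrative assumptions, not Vectorlay's actual API.

```typescript
// Hypothetical cluster spec; field names and values are illustrative,
// not Vectorlay's actual API.
interface ClusterSpec {
  gpuType: 'RTX_4090' | 'RTX_3090'; // which GPU class to schedule on
  replicas: number;                 // how many independent instances to run
  image: string;                    // any OCI-compatible container image
  env?: Record<string, string>;     // environment variables
  ports?: { container: number; expose: number }[]; // port mappings
}

const cluster: ClusterSpec = {
  gpuType: 'RTX_4090',
  replicas: 3,
  image: 'registry.example.com/my-model:latest',
  env: { MODEL_NAME: 'llama-3-8b' },
  ports: [{ container: 8000, expose: 443 }],
};
```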
Replicas
Replicas are individual instances of your workload. Each replica runs on a single GPU node and can handle inference requests independently.
If you request 3 replicas, we schedule them across 3 different nodes (when possible) for maximum fault tolerance. If one node goes down, the other two keep serving traffic.
Nodes
Nodes are physical machines with GPUs, contributed by providers. Each node runs our agent software that:
- Maintains a connection to the control plane
- Reports hardware specs and health status
- Executes deployment commands
- Runs containers in isolated microVMs
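As a rough illustration of how those responsibilities fit together, here is a minimal agent loop using the `ws` client. The control-plane URL, message types, and payload fields are assumptions for the sketch, not the real protocol.

```typescript
import os from 'node:os';
import WebSocket from 'ws';

// Minimal sketch of an agent loop. The URL, message types, and payload
// fields are illustrative assumptions, not the real Vectorlay protocol.
const ws = new WebSocket('wss://control-plane.example.com/agent');

ws.on('open', () => {
  // Register with hardware specs, then heartbeat on an interval.
  ws.send(JSON.stringify({ type: 'register', hostname: os.hostname(), gpus: ['RTX 4090'] }));
  setInterval(() => {
    ws.send(JSON.stringify({ type: 'heartbeat', ts: Date.now() }));
  }, 15_000);
});

ws.on('message', (raw) => {
  const msg = JSON.parse(raw.toString());
  if (msg.type === 'deploy') {
    // Pull the image and start it inside an isolated microVM (elided here).
  }
});
```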
Edge Proxy
The edge proxy is our smart load balancer. It:
- Routes requests to healthy replicas only
- Handles automatic failover when nodes go offline
- Provides TLS termination and rate limiting
- Gives you a stable endpoint regardless of underlying nodes
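The routing logic can be sketched roughly like this; the `Replica` shape and the retry policy are simplified assumptions, not the proxy's actual implementation.

```typescript
// Simplified sketch of health-aware routing with failover.
interface Replica {
  id: string;
  url: string;      // e.g. the node's tunnel endpoint
  healthy: boolean; // maintained by periodic health checks
}

async function forwardWithFailover(
  replicas: Replica[],
  path: string,
  init?: RequestInit,
): Promise<Response> {
  // Only consider replicas currently marked healthy.
  for (const replica of replicas.filter((r) => r.healthy)) {
    try {
      return await fetch(new URL(path, replica.url), init);
    } catch {
      // Mark the replica unhealthy and fall through to the next one;
      // a background health check can restore it later.
      replica.healthy = false;
    }
  }
  throw new Error('No healthy replicas available');
}
```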
The Control Plane
Orchestrating all of this is the control plane:
Agents ──WebSocket──▶ Caddy ──▶ WS Server ──▶ Redis (BullMQ)
                                    │
                                    ▼
                               Supabase DB

- WebSocket Server: Maintains persistent connections with all agents for real-time communication
- Redis + BullMQ: Job queue for reliable deployment command delivery
- Supabase: PostgreSQL database for all persistent state
- Caddy: TLS termination and reverse proxy
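For example, enqueueing a deploy command with BullMQ looks roughly like this. The queue name, payload shape, and retry policy are assumptions for illustration, not the actual schema.

```typescript
import { Queue } from 'bullmq';

// Illustrative sketch: queue name, payload shape, and retry policy are
// assumptions, not Vectorlay's actual schema.
const deployments = new Queue('deployments', {
  connection: { host: 'localhost', port: 6379 },
});

// Attempts + backoff mean a deploy command survives a node that is
// briefly offline: the job just waits in Redis until it can be delivered.
await deployments.add(
  'deploy',
  {
    clusterId: 'cluster-123',
    nodeId: 'node-abc',
    image: 'registry.example.com/my-model:latest',
  },
  {
    attempts: 10,
    backoff: { type: 'exponential', delay: 5_000 },
  },
);
```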
Why This Architecture?
Every design decision stems from our core principle: assume failure.
WebSockets over HTTP Polling
We need to push commands to agents instantly. HTTP polling would add latency and make it harder to detect disconnections.
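A persistent connection also lets the server notice dead agents quickly via ping/pong. The snippet below is the generic `ws` heartbeat pattern, not our exact implementation.

```typescript
import { WebSocketServer, WebSocket } from 'ws';

// Generic ws heartbeat pattern: terminate connections that stop answering pings.
const wss = new WebSocketServer({ port: 8080 });
const alive = new Map<WebSocket, boolean>();

wss.on('connection', (ws) => {
  alive.set(ws, true);
  ws.on('pong', () => alive.set(ws, true));
  ws.on('close', () => alive.delete(ws));
});

// Every 30s: anything that never answered the previous ping is considered dead.
setInterval(() => {
  for (const ws of wss.clients) {
    if (!alive.get(ws)) {
      ws.terminate();
      continue;
    }
    alive.set(ws, false);
    ws.ping();
  }
}, 30_000);
```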
Job Queue for Deployments
If a node disconnects during a deploy command, the job stays queued. When it reconnects, the command executes. No lost deployments.
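In sketch form, the worker side might look like this: if the target node isn't connected, the job throws and BullMQ retries it later, so the command is delivered once the agent comes back. The `connectedAgents` map is a stand-in for the WebSocket server's real connection registry.

```typescript
import { Worker } from 'bullmq';

// Stand-in for the WebSocket server's registry of connected agents.
const connectedAgents = new Map<string, (command: unknown) => Promise<void>>();

const worker = new Worker(
  'deployments',
  async (job) => {
    const { nodeId } = job.data as { nodeId: string };
    const send = connectedAgents.get(nodeId);
    if (!send) {
      // Throwing makes BullMQ retry the job according to its attempts/backoff
      // settings, so the command runs once the agent reconnects.
      throw new Error(`node ${nodeId} is offline, retrying later`);
    }
    await send(job.data);
  },
  { connection: { host: 'localhost', port: 6379 } },
);
```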
Replicas Spread Across Nodes
We actively avoid placing multiple replicas of the same cluster on the same node. One node failure should never take down your entire service.
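The placement rule is essentially soft anti-affinity, sketched below; the node shape and capacity model are simplified assumptions rather than the real scheduler.

```typescript
// Simplified soft anti-affinity: prefer nodes that don't already host a
// replica of this cluster, and only reuse a node when capacity forces it.
interface GpuNode {
  id: string;
  freeGpus: number;
}

function placeReplicas(nodes: GpuNode[], replicaCount: number): string[] {
  const placements: string[] = [];
  const used = new Set<string>();

  while (placements.length < replicaCount) {
    const candidate =
      nodes.find((n) => n.freeGpus > 0 && !used.has(n.id)) ?? // spread first
      nodes.find((n) => n.freeGpus > 0);                      // then reuse
    if (!candidate) throw new Error('Not enough GPU capacity');

    candidate.freeGpus -= 1;
    used.add(candidate.id);
    placements.push(candidate.id);
  }
  return placements;
}
```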
Hardware Isolation via MicroVMs
Consumer GPUs on shared infrastructure need strong isolation. Kata Containers provide VM-level security with container ergonomics.
Dive Deeper
This overview covers the "what"—the rest of this series covers the "how". Each article dives deep into one component:
- The Big Picture (you are here): Overview of the distributed GPU overlay network
- The Control Plane: WebSocket server, node registration, and job queues
- The Agent: Node software, heartbeats, and dependency management
- GPU Passthrough with Kata: VFIO, microVMs, and hardware isolation
- Fault Tolerance: Health checks, failover, and self-healing
Next up: The Control Plane, a deep dive into the WebSocket server, the node registration flow, and how BullMQ ensures reliable command delivery.