Architecture Series, Part 1 of 5

How Vectorlay Works: The Big Picture

December 27, 2024
5 min read

When we set out to build Vectorlay, we had one guiding principle: nodes will fail. Consumer GPUs aren't enterprise hardware. Machines go offline, networks hiccup, and power flickers. The question isn't if failures happen—it's how you design around them.

This is the first article in a series exploring Vectorlay's architecture. We'll start with the big picture—how all the pieces fit together—then dive deep into each component in subsequent posts.

What is Vectorlay?

At its core, Vectorlay is an overlay network for GPU inference. Think of it like a CDN, but instead of caching static assets at edge locations, we're running ML inference across a distributed fleet of GPU nodes.

The key difference from traditional cloud GPU providers: we don't own the hardware. Anyone can contribute GPUs to the network—gamers with spare 4090s, small data centers, crypto miners looking for new revenue. We handle the orchestration, security, and routing.

The Architecture

┌─────────────┐     ┌─────────────────┐     ┌─────────────┐
│   Client    │────▶│   Edge Proxy    │────▶│   Cluster   │
│  (Your App) │     │  (Load Balancer)│     │  (Replicas) │
└─────────────┘     └─────────────────┘     └──────┬──────┘
                                                   │
                    ┌──────────────────────────────┼────────┐
                    │                              ▼        │
                    │    ┌─────────┐   ┌─────────┐          │
                    │    │ Replica │   │ Replica │   ...    │
                    │    │ (GPU 1) │   │ (GPU 2) │          │
                    │    └────┬────┘   └────┬────┘          │
                    │         │             │               │
                    │         ▼             ▼               │
                    │    ┌─────────┐   ┌─────────┐          │
                    │    │  Node   │   │  Node   │          │
                    │    │ (Agent) │   │ (Agent) │          │
                    │    └─────────┘   └─────────┘          │
                    │                                       │
                    │             GPU Network               │
                    └───────────────────────────────────────┘

Let's break down the key components:

Clusters

A cluster is your deployment unit. When you deploy to Vectorlay, you create a cluster that defines:

  • What GPU type you need (RTX 4090, 3090, etc.)
  • How many replicas to run
  • Your container image (any OCI-compatible image)
  • Environment variables and port mappings

The cluster gives you a stable endpoint URL. Behind that URL, traffic is automatically distributed across healthy replicas.
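
Concretely, a cluster definition boils down to a handful of fields. Here's a TypeScript sketch of the shape; the field names are illustrative, not our actual API:

interface ClusterSpec {
  name: string;
  gpuType: "RTX 4090" | "RTX 3090";                // what GPU type you need
  replicas: number;                                // how many replicas to run
  image: string;                                   // any OCI-compatible image
  env: Record<string, string>;                     // environment variables
  ports: { container: number; exposed: number }[]; // port mappings
}

const cluster: ClusterSpec = {
  name: "llama-inference",
  gpuType: "RTX 4090",
  replicas: 3,
  image: "ghcr.io/example/llama-server:latest",
  env: { MODEL_ID: "llama-3-8b" },
  ports: [{ container: 8000, exposed: 443 }],
};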

Replicas

Replicas are individual instances of your workload. Each replica runs on a single GPU node and can handle inference requests independently.

If you request 3 replicas, we schedule them across 3 different nodes (when possible) for maximum fault tolerance. If one node goes down, the other two keep serving traffic.

Nodes

Nodes are physical machines with GPUs, contributed by providers. Each node runs our agent software that:

  • Maintains a connection to the control plane
  • Reports hardware specs and health status
  • Executes deployment commands
  • Runs containers in isolated microVMs
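
To make those responsibilities concrete, here's a heavily simplified agent sketch using the ws package. Everything in it (the control-plane URL, the message shapes, the startMicroVm stub) is illustrative, not our production agent:

import WebSocket from "ws";

const CONTROL_PLANE = "wss://control.example.com/agent"; // illustrative URL

function startMicroVm(image: string, env: Record<string, string>): void {
  // Placeholder: the real agent launches the container in an isolated microVM.
  console.log(`would start ${image} in a microVM`, env);
}

function connect(): void {
  const ws = new WebSocket(CONTROL_PLANE);

  ws.on("open", () => {
    // Register with hardware specs, then heartbeat with health status.
    ws.send(JSON.stringify({ type: "register", gpu: "RTX 4090", vramGb: 24 }));
    const heartbeat = setInterval(() => {
      ws.send(JSON.stringify({ type: "health", status: "ok" }));
    }, 10_000);
    ws.on("close", () => clearInterval(heartbeat));
  });

  // Execute deployment commands pushed by the control plane.
  ws.on("message", (raw) => {
    const cmd = JSON.parse(raw.toString());
    if (cmd.type === "deploy") startMicroVm(cmd.image, cmd.env);
  });

  // If the connection drops, retry after a short delay.
  ws.on("close", () => setTimeout(connect, 5_000));
}

connect();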

Edge Proxy

The edge proxy is our smart load balancer. It:

  • Routes requests to healthy replicas only
  • Handles automatic failover when nodes go offline
  • Provides TLS termination and rate limiting
  • Gives you a stable endpoint regardless of underlying nodes
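
The failover behavior amounts to something like this sketch (illustrative types, not the production proxy): try healthy replicas in turn, and take one out of rotation the moment a request to it fails.

interface Replica {
  url: string;
  healthy: boolean;
}

async function proxyRequest(replicas: Replica[], path: string): Promise<Response> {
  for (const replica of replicas.filter((r) => r.healthy)) {
    try {
      return await fetch(`${replica.url}${path}`);
    } catch {
      replica.healthy = false; // failover: take it out of rotation
    }
  }
  throw new Error("no healthy replicas available");
}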

The Control Plane

Orchestrating all of this is the control plane:

Agents ──WebSocket──▶ Caddy ──▶ WS Server ──▶ Redis (BullMQ)
                                    │
                                    ▼
                               Supabase DB

  • WebSocket Server: Maintains persistent connections with all agents for real-time communication
  • Redis + BullMQ: Job queue for reliable deployment command delivery
  • Supabase: PostgreSQL database for all persistent state
  • Caddy: TLS termination and reverse proxy
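
To see how these pieces hand a command from the API to an agent, here's a rough sketch. The Queue/Worker calls are BullMQ's real API; the job payload and the socket registry are assumptions for the example:

import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // Redis

// The API layer enqueues a deployment command...
const deployments = new Queue("deployments", { connection });
await deployments.add("deploy", { nodeId: "node-42", image: "ghcr.io/example/app" });

// ...and a worker in the WebSocket server delivers it to the right agent.
const sockets = new Map<string, { send(msg: string): void }>(); // nodeId -> live socket

new Worker(
  "deployments",
  async (job) => {
    const socket = sockets.get(job.data.nodeId);
    if (!socket) throw new Error("agent offline"); // failure => BullMQ retries later
    socket.send(JSON.stringify({ type: "deploy", image: job.data.image }));
  },
  { connection },
);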

Why This Architecture?

Every design decision stems from our core principle: assume failure.

WebSockets over HTTP Polling

We need to push commands to agents instantly. HTTP polling would add latency and make it harder to detect disconnections.
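
A persistent socket also lets us detect dead agents cheaply. Here's the standard ping/pong liveness pattern from the ws library, with illustrative intervals:

import { WebSocketServer, WebSocket } from "ws";

type AgentSocket = WebSocket & { isAlive?: boolean };

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws: AgentSocket) => {
  ws.isAlive = true;
  ws.on("pong", () => { ws.isAlive = true; });
});

// Every 30s, ping all agents; anyone who never answered is gone.
setInterval(() => {
  for (const ws of wss.clients as Set<AgentSocket>) {
    if (!ws.isAlive) { ws.terminate(); continue; } // disconnect detected
    ws.isAlive = false;
    ws.ping();
  }
}, 30_000);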

Job Queue for Deployments

If a node disconnects during a deploy command, the job stays queued. When it reconnects, the command executes. No lost deployments.
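
With BullMQ this is mostly configuration. A sketch, with illustrative retry settings:

import { Queue } from "bullmq";

const deployments = new Queue("deployments", {
  connection: { host: "localhost", port: 6379 },
});

// If delivery fails (say, the agent is offline), BullMQ re-runs the job
// with exponential backoff instead of dropping it.
await deployments.add(
  "deploy",
  { nodeId: "node-42", image: "ghcr.io/example/app" },
  { attempts: 10, backoff: { type: "exponential", delay: 5_000 } },
);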

Replicas Spread Across Nodes

We actively avoid placing multiple replicas of the same cluster on the same node. One node failure should never take down your entire service.
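
The scheduling rule is plain anti-affinity. In sketch form (illustrative types):

interface GpuNode {
  id: string;
  clusters: Set<string>; // cluster IDs with a replica on this node
}

// Prefer a node that isn't already running this cluster; only double
// up when every available node already hosts a replica.
function pickNode(nodes: GpuNode[], clusterId: string): GpuNode | undefined {
  return nodes.find((n) => !n.clusters.has(clusterId)) ?? nodes[0];
}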

Hardware Isolation via MicroVMs

Consumer GPUs on shared infrastructure need strong isolation. Kata Containers provide VM-level security with container ergonomics.

Dive Deeper

This overview covers the "what"; the rest of the series covers the "how," with each article diving deep into one component.

Next in series

The Control Plane

Deep dive into the WebSocket server, node registration flow, and how BullMQ ensures reliable command delivery.
