The Control Plane
The control plane is the brain of Vectorlay. It coordinates everything: node registration, deployment scheduling, health monitoring, and failover decisions. Let's see how it works.
Architecture Overview
Agents ──WebSocket──▶ Caddy ──▶ WS Server ──▶ Redis (BullMQ)
                                    │
                                    ▼
                               Supabase DB

The control plane consists of four main components working together. Let's examine each one.
The WebSocket Server
We chose WebSockets over HTTP polling for one simple reason: real-time bidirectional communication. The control plane needs to push commands to agents (deploy, stop, scale) while agents need to stream status updates back (heartbeats, deploy results, container health).
The WS server maintains a map of connected agents:
// Connected agents indexed by nodeId
const agents = new Map<string, AgentConnection>();

interface AgentConnection {
  nodeId: string;
  ws: WebSocket;
  lastHeartbeat: Date;
}

When an agent connects, the server verifies its identity (either via node ID for returning nodes, or provisioning token for new registrations), then adds it to the active connections map.
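A connection handler for this might look like the sketch below. This is illustrative rather than Vectorlay's actual code: it assumes the ws library, and verifyIdentity is a hypothetical stand-in for the node-ID/token checks just described.

import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  // The first message must identify the agent: a node ID or a provisioning token
  ws.once("message", async (raw) => {
    const hello = JSON.parse(raw.toString());
    const nodeId = await verifyIdentity(hello); // hypothetical helper
    if (!nodeId) {
      ws.close(4001, "Unauthorized");
      return;
    }
    // Track the live connection; drop it from the map when the socket closes
    agents.set(nodeId, { nodeId, ws, lastHeartbeat: new Date() });
    ws.on("close", () => agents.delete(nodeId));
  });
});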
Message Protocol
All messages follow a simple JSON structure:
interface Message {
  type: string;     // "heartbeat", "deploy", "system_info", etc.
  payload: unknown; // Type-specific data
}

Message types from agents → server:
- heartbeat — Status check with container info
- system_info — Hardware specs and dependency status
- deploy_result — Deployment success or failure
Message types from server → agents:
- deploy — Start a container on this node
- stop — Stop a running container
- registration_success — Node ID assigned
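On the server side, handling these comes down to a switch over msg.type. A minimal sketch, assuming ws is a connected socket and agent is its entry in the connections map (handler bodies elided):

// Sketch: dispatch incoming agent messages by type
ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString()) as Message;
  switch (msg.type) {
    case "heartbeat":
      agent.lastHeartbeat = new Date(); // keep liveness tracking fresh
      break;
    case "system_info":
      // persist hardware specs and dependency status (elided)
      break;
    case "deploy_result":
      // mark the corresponding job succeeded or failed (elided)
      break;
    default:
      console.warn("Unknown message type:", msg.type);
  }
});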
Node Registration Flow
New nodes register using a provisioning token that providers generate from the dashboard. Here's the complete flow:
Provider                   Dashboard                      Control Plane
   │                           │                                │
   │───── Generate token ─────▶│                                │
   │                           │                                │
   │◀──── vtk_abc123... ───────│                                │
   │                           │                                │
   │                           │                                │
Agent (on new node)            │                                │
   │                           │                                │
   │────────── Connect with token + hardware info ─────────────▶│
   │                                                            │
   │                                   Validate token hash      │
   │                                   Create node record       │
   │                                                            │
   │◀────────────── registration_success { node_id } ───────────│
   │                                                            │
   │  Save node_id to disk                                      │
   │                                                            │
   │──────────────────────── system_info ──────────────────────▶│
   │                                                            │
   │◀───────────────────── Ready for deploys ───────────────────│

Token Security
Provisioning tokens follow a secure design:
- Prefix: vtk_ followed by 8 chars for lookup
- Hash storage: Only the SHA-256 hash stored in DB
- One-way: Full token shown once at creation, never again
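For illustration, minting a token with those three properties could look like this sketch (the function name and byte length are ours, not Vectorlay's):

import crypto from "node:crypto";

// Illustrative sketch: mint a provisioning token and derive what gets stored
function mintProvisioningToken() {
  const token = "vtk_" + crypto.randomBytes(24).toString("base64url");
  return {
    token,                      // shown to the provider exactly once
    prefix: token.slice(0, 12), // "vtk_" + 8 chars, stored for fast lookup
    hash: crypto.createHash("sha256").update(token).digest("hex"), // stored in DB
  };
}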
// Token validation in WS server
const tokenPrefix = orgToken.slice(0, 12); // "vtk_" + 8 chars

// Find org by prefix (fast lookup)
const [org] = await db
  .select()
  .from(organizations)
  .where(eq(organizations.provisioningTokenPrefix, tokenPrefix));

// Verify full token hash
const tokenHash = crypto
  .createHash("sha256")
  .update(orgToken)
  .digest("hex");

// Reject unknown prefixes as well as hash mismatches
if (!org || tokenHash !== org.provisioningTokenHash) {
  ws.close(4001, "Invalid provisioning token");
  return;
}

This zero-touch provisioning means providers can spin up new nodes with a single environment variable—no manual registration steps required.
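On the agent side, bootstrap then reduces to "reuse a saved node ID, else present the token". A sketch under assumed names (the env var and file path are illustrative):

import { readFile } from "node:fs/promises";

// Illustrative agent bootstrap: identify with a saved node ID if one exists,
// otherwise fall back to the provisioning token from the environment
async function identityPayload() {
  const saved = await readFile("/var/lib/vectorlay/node_id", "utf8").catch(() => null);
  if (saved) return { node_id: saved.trim() };
  return { provisioning_token: process.env.VECTORLAY_PROVISIONING_TOKEN };
}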
Job Queue with BullMQ
Deploy commands don't go directly to agents. Instead, they go through a Redis-backed job queue (BullMQ). This gives us:
Retry Logic
If an agent disconnects mid-deploy, the job remains queued and retries when the node reconnects.
Rate Limiting
We can control how many concurrent deploys hit each node, preventing overload.
Persistence
Jobs survive control plane restarts. Redis persistence ensures no lost deployments.
Observability
Full job history for debugging. See exactly when jobs were queued, started, and completed.
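On the producing side, scheduling a deploy is just adding a job to the queue. A sketch using BullMQ's Queue API (the retry settings shown here are illustrative, not Vectorlay's actual configuration):

import { Queue } from "bullmq";

const deployQueue = new Queue("deploy", { connection: redis });

// Illustrative enqueue: retry with exponential backoff while the node is offline
await deployQueue.add(
  "deploy",
  { nodeId, replicaId, image, gpuCount, envVars },
  { attempts: 5, backoff: { type: "exponential", delay: 5_000 } },
);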
Deploy Job Flow
// BullMQ Worker - processes deploy jobs from Redis
const deployWorker = new Worker("deploy", async (job) => {
  const { nodeId, replicaId, image, gpuCount, envVars } = job.data;

  const agent = agents.get(nodeId);
  if (!agent) {
    // Node not connected, job will retry
    throw new Error(`Node ${nodeId} not connected`);
  }

  // Create job record in DB for tracking
  const [dbJob] = await db.insert(jobs).values({
    type: "deploy",
    nodeId,
    replicaId,
    status: "processing",
    payload: { image, gpuCount, envVars },
  }).returning();

  // Send command to agent via WebSocket
  agent.ws.send(JSON.stringify({
    type: "deploy",
    payload: {
      job_id: dbJob.id,
      replica_id: replicaId,
      image,
      gpu_count: gpuCount,
      env_vars: envVars,
    },
  }));
}, { connection: redis, concurrency: 10 });

Database Schema
Everything is backed by PostgreSQL (via Supabase). We use Drizzle ORM for type-safe queries. Here's the core schema:
organizations
├── nodes (GPU machines from providers)
│   └── node_gpus (individual GPUs per node)
│
└── clusters (deployment configs)
    └── replicas (running instances)
        └── jobs (deploy/stop commands)

Key tables:
- nodes: Hardware specs, status, last heartbeat time
- clusters: GPU type, replica count, container image, endpoint URL
- replicas: Node assignment, health status, network info
- jobs: Type, payload, status, attempts, timestamps
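As a flavor of what this looks like in Drizzle, here is a minimal sketch of the nodes table. The column set is illustrative, inferred from the fields listed above; the real schema has more columns:

import { pgTable, uuid, text, integer, timestamp } from "drizzle-orm/pg-core";

// Illustrative Drizzle definition of the nodes table
export const nodes = pgTable("nodes", {
  id: uuid("id").primaryKey().defaultRandom(),
  organizationId: uuid("organization_id").notNull(),
  status: text("status").notNull().default("offline"),
  gpuCount: integer("gpu_count"),
  lastHeartbeatAt: timestamp("last_heartbeat_at"),
});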
Continue Reading
The Big Picture
Overview of the distributed GPU overlay network
The Control Plane (You are here)
WebSocket server, node registration, and job queues
The Agent
Node software, heartbeats, and dependency management
GPU Passthrough with Kata
VFIO, microVMs, and hardware isolation
Fault Tolerance
Health checks, failover, and self-healing