The Control Plane
The control plane is the brain of Vectorlay. It coordinates everything: node registration, deployment scheduling, health monitoring, and failover decisions. Let's see how it works.
Architecture Overview
Agents ──WebSocket──▶ Caddy ──▶ WS Server ──▶ Redis (BullMQ)
                                    │
                                    ▼
                               Supabase DB

The control plane consists of four main components working together. Let's examine each one.
The WebSocket Server
We chose WebSockets over HTTP polling for one simple reason: real-time bidirectional communication. The control plane needs to push commands to agents (deploy, stop, scale) while agents need to stream status updates back (heartbeats, deploy results, container health).
The WS server maintains a map of connected agents:
// Connected agents indexed by nodeId
const agents = new Map<string, AgentConnection>();

interface AgentConnection {
  nodeId: string;
  ws: WebSocket;
  lastHeartbeat: Date;
}

When an agent connects, the server verifies its identity (either via node ID for returning nodes, or provisioning token for new registrations), then adds it to the active connections map.
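A connection handler for this might look like the sketch below. This is illustrative rather than Vectorlay's actual code: it assumes the ws library, and verifyIdentity is a hypothetical stand-in for the node-ID/token checks just described.

import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  // The first message must identify the agent: a node ID or a provisioning token
  ws.once("message", async (raw) => {
    const hello = JSON.parse(raw.toString());
    const nodeId = await verifyIdentity(hello); // hypothetical helper
    if (!nodeId) {
      ws.close(4001, "Unauthorized");
      return;
    }
    // Track the live connection; drop it from the map when the socket closes
    agents.set(nodeId, { nodeId, ws, lastHeartbeat: new Date() });
    ws.on("close", () => agents.delete(nodeId));
  });
});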
Message Protocol
All messages follow a simple JSON structure:
interface Message {
  type: string;     // "heartbeat", "deploy", "system_info", etc.
  payload: unknown; // Type-specific data
}

Message types from agents → server:
- heartbeat — Status check with container info
- system_info — Hardware specs and dependency status
- deploy_result — Deployment success or failure
Message types from server → agents:
- deploy — Start a container on this node
- stop — Stop a running container
- registration_success — Node ID assigned
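On the server side, handling these comes down to a switch over msg.type. A minimal sketch, assuming ws is a connected socket and agent is its entry in the connections map (handler bodies elided):

// Sketch: dispatch incoming agent messages by type
ws.on("message", (raw) => {
  const msg = JSON.parse(raw.toString()) as Message;
  switch (msg.type) {
    case "heartbeat":
      agent.lastHeartbeat = new Date(); // keep liveness tracking fresh
      break;
    case "system_info":
      // persist hardware specs and dependency status (elided)
      break;
    case "deploy_result":
      // mark the corresponding job succeeded or failed (elided)
      break;
    default:
      console.warn("Unknown message type:", msg.type);
  }
});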
Node Registration Flow
New nodes register using a provisioning token that providers generate from the dashboard. Here's the complete flow:
Provider                   Dashboard                      Control Plane
   │                           │                                │
   │───── Generate token ─────▶│                                │
   │                           │                                │
   │◀──── vtk_abc123... ───────│                                │
   │                           │                                │
   │                           │                                │
Agent (on new node)            │                                │
   │                           │                                │
   │────────── Connect with token + hardware info ─────────────▶│
   │                                                            │
   │                                   Validate token hash      │
   │                                   Create node record       │
   │                                                            │
   │◀────────────── registration_success { node_id } ───────────│
   │                                                            │
   │  Save node_id to disk                                      │
   │                                                            │
   │──────────────────────── system_info ──────────────────────▶│
   │                                                            │
   │◀───────────────────── Ready for deploys ───────────────────│

Token Security
Provisioning tokens follow a secure design:
- Prefix: vtk_ followed by 8 chars for lookup
- Hash storage: Only the SHA-256 hash stored in DB
- One-way: Full token shown once at creation, never again
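For illustration, minting a token with those three properties could look like this sketch (the function name and byte length are ours, not Vectorlay's):

import crypto from "node:crypto";

// Illustrative sketch: mint a provisioning token and derive what gets stored
function mintProvisioningToken() {
  const token = "vtk_" + crypto.randomBytes(24).toString("base64url");
  return {
    token,                      // shown to the provider exactly once
    prefix: token.slice(0, 12), // "vtk_" + 8 chars, stored for fast lookup
    hash: crypto.createHash("sha256").update(token).digest("hex"), // stored in DB
  };
}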
// Token validation in WS server
const tokenPrefix = orgToken.slice(0, 12); // "vtk_" + 8 chars

// Find org by prefix (fast lookup)
const [org] = await db
  .select()
  .from(organizations)
  .where(eq(organizations.provisioningTokenPrefix, tokenPrefix));

// Verify full token hash
const tokenHash = crypto
  .createHash("sha256")
  .update(orgToken)
  .digest("hex");

// Reject unknown prefixes as well as hash mismatches
if (!org || tokenHash !== org.provisioningTokenHash) {
  ws.close(4001, "Invalid provisioning token");
  return;
}

This zero-touch provisioning means providers can spin up new nodes with a single environment variable—no manual registration steps required.
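On the agent side, bootstrap then reduces to "reuse a saved node ID, else present the token". A sketch under assumed names (the env var and file path are illustrative):

import { readFile } from "node:fs/promises";

// Illustrative agent bootstrap: identify with a saved node ID if one exists,
// otherwise fall back to the provisioning token from the environment
async function identityPayload() {
  const saved = await readFile("/var/lib/vectorlay/node_id", "utf8").catch(() => null);
  if (saved) return { node_id: saved.trim() };
  return { provisioning_token: process.env.VECTORLAY_PROVISIONING_TOKEN };
}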
Job Queue with BullMQ
Deploy commands don't go directly to agents. Instead, they go through a Redis-backed job queue (BullMQ). This gives us:
Retry Logic
If an agent disconnects mid-deploy, the job remains queued and retries when the node reconnects.
Rate Limiting
We can control how many concurrent deploys hit each node, preventing overload.
Persistence
Jobs survive control plane restarts. Redis persistence ensures no lost deployments.
Observability
Full job history for debugging. See exactly when jobs were queued, started, and completed.
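On the producing side, scheduling a deploy is just adding a job to the queue. A sketch using BullMQ's Queue API (the retry settings shown here are illustrative, not Vectorlay's actual configuration):

import { Queue } from "bullmq";

const deployQueue = new Queue("deploy", { connection: redis });

// Illustrative enqueue: retry with exponential backoff while the node is offline
await deployQueue.add(
  "deploy",
  { nodeId, replicaId, image, gpuCount, envVars },
  { attempts: 5, backoff: { type: "exponential", delay: 5_000 } },
);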
Deploy Job Flow
// BullMQ Worker - processes deploy jobs from Redis
const deployWorker = new Worker("deploy", async (job) => {
  const { nodeId, replicaId, image, gpuCount, envVars } = job.data;

  const agent = agents.get(nodeId);
  if (!agent) {
    // Node not connected, job will retry
    throw new Error(`Node ${nodeId} not connected`);
  }

  // Create job record in DB for tracking
  const [dbJob] = await db.insert(jobs).values({
    type: "deploy",
    nodeId,
    replicaId,
    status: "processing",
    payload: { image, gpuCount, envVars },
  }).returning();

  // Send command to agent via WebSocket
  agent.ws.send(JSON.stringify({
    type: "deploy",
    payload: {
      job_id: dbJob.id,
      replica_id: replicaId,
      image,
      gpu_count: gpuCount,
      env_vars: envVars,
    },
  }));
}, { connection: redis, concurrency: 10 });

Database Schema
Everything is backed by PostgreSQL (via Supabase). We use Drizzle ORM for type-safe queries. Here's the core schema:
organizations
├── nodes (GPU machines from providers)
│   └── node_gpus (individual GPUs per node)
│
└── clusters (deployment configs)
    └── replicas (running instances)
        └── jobs (deploy/stop commands)

Key tables:
- nodes: Hardware specs, status, last heartbeat time
- clusters: GPU type, replica count, container image, endpoint URL
- replicas: Node assignment, health status, network info
- jobs: Type, payload, status, attempts, timestamps
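As a flavor of what this looks like in Drizzle, here is a minimal sketch of the nodes table. The column set is illustrative, inferred from the fields listed above; the real schema has more columns:

import { pgTable, uuid, text, integer, timestamp } from "drizzle-orm/pg-core";

// Illustrative Drizzle definition of the nodes table
export const nodes = pgTable("nodes", {
  id: uuid("id").primaryKey().defaultRandom(),
  organizationId: uuid("organization_id").notNull(),
  status: text("status").notNull().default("offline"),
  gpuCount: integer("gpu_count"),
  lastHeartbeatAt: timestamp("last_heartbeat_at"),
});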
Continue Reading
The Big Picture
Overview of the distributed GPU overlay network
The Control Plane (You are here)
WebSocket server, node registration, and job queues
The Agent
Node software, heartbeats, and dependency management
GPU Passthrough with Kata
VFIO, microVMs, and hardware isolation
Fault Tolerance
Health checks, failover, and self-healing