The Agent
The agent is a lightweight daemon that runs on every GPU node in the network. It's the bridge between the control plane and the actual hardware—handling everything from registration to container lifecycle management.
What the Agent Does
The agent is responsible for:
- Maintaining a persistent WebSocket connection to the control plane
- Reporting hardware specs (CPU, RAM, GPUs, storage)
- Checking and reporting dependency status
- Sending heartbeats every 30 seconds
- Executing deploy/stop commands via containerd
- Reporting deployment results back to the control plane
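Wire-wise, that boils down to a handful of JSON message types over the socket. Here is a rough sketch of the shapes involved; the deploy and deploy_result fields match the examples later in this post, the other payloads are abbreviated, and the stop payload shape is our assumption:

// Messages the agent sends to the control plane
type AgentMessage =
  | { type: "system_info"; payload: unknown }   // hardware + dependency report
  | { type: "heartbeat"; payload: unknown }     // current node status (see below)
  | { type: "deploy_result"; payload: { replica_id: string; status: "running" | "failed"; error?: string } };

// Commands the control plane sends to the agent
type ControlCommand =
  | { type: "deploy"; payload: { replica_id: string; image: string; gpu_count: number; env_vars?: Record<string, string> } }
  | { type: "stop"; payload: { replica_id: string } }; // assumed shape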
Startup Flow
When the agent starts, it follows this sequence:
1. Check for saved node config (/var/lib/vectorlay/)
   ├── Found → Load node ID, connect directly
   └── Not found → Check for org token
       ├── Token set → Register as new node
       └── No token → Exit with error
2. Connect to control plane via WebSocket
3. Send system_info (hardware + dependencies)
4. Start heartbeat loop (every 30s)
5. Listen for commands (deploy, stop)
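In code, the branching in step 1 looks roughly like this. loadSavedConfig, registerWithToken, connect, and the VECTORLAY_ORG_TOKEN variable name are placeholders standing in for the agent's real helpers, not its actual API:

// Placeholders for the agent's real helpers (illustrative only)
declare function loadSavedConfig(dir: string): Promise<{ nodeId: string } | null>;
declare function registerWithToken(token: string): Promise<string>;
declare function connect(nodeId: string): Promise<void>;

async function start() {
  // Prefer an identity saved from a previous registration
  const saved = await loadSavedConfig("/var/lib/vectorlay/");
  if (saved) {
    return connect(saved.nodeId);
  }

  // No saved config: fall back to one-time registration with an org token
  const token = process.env.VECTORLAY_ORG_TOKEN; // env var name is illustrative
  if (!token) {
    console.error("No saved node config and no org token; exiting");
    process.exit(1);
  }

  const nodeId = await registerWithToken(token);
  return connect(nodeId);
}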
Hardware Detection
On startup and reconnection, the agent gathers hardware info:
import os from "node:os";
import { exec } from "node:child_process";
import { promisify } from "node:util";

const execAsync = promisify(exec);

async function getHardwareInfo() {
  const hostname = os.hostname();
  const cpuModel = os.cpus()[0]?.model || "Unknown";
  const cpuCores = os.cpus().length;
  const ramGb = Math.round(os.totalmem() / (1024 * 1024 * 1024));

  // Detect GPUs via lspci (works even with VFIO passthrough)
  let gpuInfo: string[] = [];
  try {
    const { stdout } = await execAsync(
      "lspci | grep -i 'vga\\|3d\\|display' | grep -i 'nvidia\\|amd'"
    );
    gpuInfo = stdout.trim().split("\n").filter(Boolean);
  } catch {
    // No GPUs found or lspci not available
  }

  return { hostname, cpuModel, cpuCores, ramGb, gpuInfo };
}

Note that we use lspci instead of nvidia-smi. This is intentional: when GPUs are bound to VFIO for passthrough, they're not visible to host NVIDIA drivers, but lspci still sees them at the PCI level.
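For reference, the lines that end up in gpuInfo look something like this (exact controller and model strings vary by card):

01:00.0 VGA compatible controller: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
41:00.0 3D controller: NVIDIA Corporation GA100 [A100 SXM4 40GB] (rev a1)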
Dependency Checking
Not all nodes are created equal. Some might be missing dependencies. The agent checks for required components on every connection:
// Required dependencies for GPU workloads
containerd      ✓  1.7.2       // Container runtime
nerdctl         ✓  1.5.0       // CLI for containerd
kata-runtime    ✓  3.2.0       // MicroVM runtime
virtiofsd       ✓  1.8.0       // Filesystem passthrough
vfio            ✓  4 modules   // GPU passthrough kernel modules
The check results determine node status:
ready = true → online
All core dependencies installed. Node can accept deployments.
ready = false → degraded
Connected but missing dependencies. Won't receive deployments until fixed.
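The probe itself can be as simple as shelling out to each tool and seeing whether it answers. A minimal sketch, assuming the standard --version flags and lsmod for the VFIO modules (checkDependencies is not the agent's exact function):

// Illustrative dependency probe: shell out to each tool and record
// whether it responds. Not the agent's exact implementation.
import { exec } from "node:child_process";
import { promisify } from "node:util";

const execAsync = promisify(exec);

async function checkDependencies() {
  const checks: Record<string, () => Promise<string>> = {
    containerd: async () => (await execAsync("containerd --version")).stdout,
    nerdctl: async () => (await execAsync("nerdctl --version")).stdout,
    "kata-runtime": async () => (await execAsync("kata-runtime --version")).stdout,
    virtiofsd: async () => (await execAsync("virtiofsd --version")).stdout,
    // VFIO is a set of kernel modules, not a binary
    vfio: async () => (await execAsync("lsmod | grep vfio")).stdout,
  };

  const results: Record<string, { ok: boolean; detail: string }> = {};
  for (const [name, probe] of Object.entries(checks)) {
    try {
      results[name] = { ok: true, detail: (await probe()).trim().split("\n")[0] };
    } catch {
      results[name] = { ok: false, detail: "not found" };
    }
  }

  // Node is "ready" only when every core dependency is present
  const ready = Object.values(results).every((r) => r.ok);
  return { ready, results };
}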
Node Status State Machine
online        Connected, deps ready, accepting deploys
offline       Socket disconnected
degraded      Connected but missing dependencies
draining      Finishing work, no new deploys
maintenance   Manual maintenance mode

State Transitions:

offline ──(connect)──▶ degraded ──(deps ok)──▶ online
online ──(disconnect)──▶ offline
online ──(deps fail)──▶ degraded
online ──(operator)──▶ draining ──▶ maintenance
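Modeled in code, the statuses and the transitions drawn above fit in a few lines (the type and table names are ours, not necessarily the control plane's):

// Node statuses and the transitions shown in the diagram above.
type NodeStatus = "online" | "offline" | "degraded" | "draining" | "maintenance";

const transitions: Record<NodeStatus, NodeStatus[]> = {
  offline: ["degraded"],                        // connect
  degraded: ["online"],                         // deps ok
  online: ["offline", "degraded", "draining"],  // disconnect / deps fail / operator drain
  draining: ["maintenance"],                    // finished draining
  maintenance: [],                              // leaves via operator action (not diagrammed)
};

function canTransition(from: NodeStatus, to: NodeStatus): boolean {
  return transitions[from].includes(to);
}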
Heartbeat System
Every 30 seconds, the agent sends a heartbeat with current status:
// Heartbeat payload
{
  "type": "heartbeat",
  "payload": {
    "status": {
      "containers": [
        { "replica_id": "abc-123", "status": "running" },
        { "replica_id": "def-456", "status": "running" }
      ],
      "gpu_utilization": 78,
      "memory_used_gb": 24,
      "memory_total_gb": 64,
      "storage": {
        "path": "/data",
        "total_gb": 500,
        "used_gb": 120,
        "usage_percent": 24
      }
    }
  }
}

If the control plane doesn't receive a heartbeat for 60 seconds, the node is marked offline and its replicas are rescheduled.
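On the agent side the loop is just a timer around a send. A simplified sketch, where gatherStatus and ws stand in for the agent's real status collection and WebSocket connection:

// Simplified heartbeat loop. gatherStatus() and ws are placeholders.
declare function gatherStatus(): Promise<unknown>;
declare const ws: { send(data: string): void };

let heartbeatTimer: ReturnType<typeof setInterval> | undefined;

function startHeartbeat() {
  heartbeatTimer = setInterval(async () => {
    const status = await gatherStatus();
    ws.send(JSON.stringify({ type: "heartbeat", payload: { status } }));
  }, 30_000); // every 30 seconds
}

function stopHeartbeat() {
  if (heartbeatTimer) clearInterval(heartbeatTimer);
}

The same stopHeartbeat() shows up again in the reconnection handler below.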
Container Lifecycle
When the agent receives a deploy command:
async function handleDeploy(payload: DeployPayload) {
  const { replica_id, image, gpu_count, env_vars } = payload;

  try {
    // Build environment flags (values quoted so spaces survive the shell)
    const envFlags = Object.entries(env_vars || {})
      .map(([k, v]) => `-e ${k}=${JSON.stringify(v)}`)
      .join(" ");

    // Run with Kata runtime for isolation
    const cmd = `nerdctl run -d \
      --runtime=io.containerd.kata.v2 \
      ${envFlags} \
      --name replica-${replica_id} \
      ${image}`;

    await execAsync(cmd);

    // Report success
    send({
      type: "deploy_result",
      payload: { replica_id, status: "running" }
    });
  } catch (err) {
    // Report failure
    send({
      type: "deploy_result",
      payload: {
        replica_id,
        status: "failed",
        error: err instanceof Error ? err.message : String(err)
      }
    });
  }
}

Reconnection Handling
Network hiccups happen. The agent handles disconnections gracefully:
- On disconnect, wait 5 seconds then reconnect
- Re-send system_info to update status
- Running containers continue operating during disconnection
- Pending jobs are picked up on reconnection
ws.on("close", (code, reason) => {
log("warn", `Disconnected: ${code} - ${reason}`);
stopHeartbeat();
// Reconnect after delay
setTimeout(() => {
connect();
}, 5000);
});Continue Reading