Architecture Series · Part 3 of 5

The Agent

December 27, 2024
7 min read

The agent is a lightweight daemon that runs on every GPU node in the network. It's the bridge between the control plane and the actual hardware—handling everything from registration to container lifecycle management.

What the Agent Does

The agent is responsible for:

  • Maintaining a persistent WebSocket connection to the control plane
  • Reporting hardware specs (CPU, RAM, GPUs, storage)
  • Checking and reporting dependency status
  • Sending heartbeats every 30 seconds
  • Executing deploy/stop commands via containerd
  • Reporting deployment results back to the control plane
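Taken together, those duties boil down to a small message protocol over the socket. A rough sketch of the shapes in TypeScript (field names beyond what appears later in this post are illustrative):

// Agent → control plane
type AgentMessage =
  | { type: "system_info"; payload: { hardware: unknown; dependencies: unknown } }
  | { type: "heartbeat"; payload: { status: unknown } }
  | {
      type: "deploy_result";
      payload: { replica_id: string; status: "running" | "failed"; error?: string };
    };

// Control plane → agent
type ControlMessage =
  | {
      type: "deploy";
      payload: {
        replica_id: string;
        image: string;
        gpu_count: number;
        env_vars?: Record<string, string>;
      };
    }
  // The stop payload shape is an assumption; only the command name appears in this post
  | { type: "stop"; payload: { replica_id: string } };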

Startup Flow

When the agent starts, it follows this sequence:

┌─────────────────────────────────────────────────────────────┐
│                     Agent Startup                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Check for saved node config (/var/lib/vectorlay/)       │
│     │                                                       │
│     ├── Found → Load node ID, connect directly              │
│     │                                                       │
│     └── Not found → Check for org token                     │
│           │                                                 │
│           ├── Token set → Register as new node              │
│           │                                                 │
│           └── No token → Exit with error                    │
│                                                             │
│  2. Connect to control plane via WebSocket                  │
│                                                             │
│  3. Send system_info (hardware + dependencies)              │
│                                                             │
│  4. Start heartbeat loop (every 30s)                        │
│                                                             │
│  5. Listen for commands (deploy, stop)                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘
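In code, that flow might look roughly like this. This is a sketch: the config filename, the VECTORLAY_ORG_TOKEN variable, and the helper functions are all illustrative, not the actual agent source.

import { readFile } from "fs/promises";
import WebSocket from "ws";

// Helpers assumed to exist elsewhere in the agent (names are hypothetical)
declare function registerNode(orgToken: string): Promise<string>;
declare function connectToControlPlane(nodeId: string): Promise<WebSocket>;
declare function gatherSystemInfo(): Promise<unknown>;
declare function startHeartbeat(ws: WebSocket): void;
declare function handleCommand(raw: unknown): void;

const CONFIG_PATH = "/var/lib/vectorlay/node.json"; // assumed filename

async function start(): Promise<void> {
  let nodeId: string;
  try {
    // 1. Reuse the saved node identity if this node registered before
    nodeId = JSON.parse(await readFile(CONFIG_PATH, "utf8")).node_id;
  } catch {
    // Not found → register as a new node, or exit if there's no org token
    const orgToken = process.env.VECTORLAY_ORG_TOKEN; // hypothetical env var
    if (!orgToken) {
      console.error("no saved node config and no org token; exiting");
      process.exit(1);
    }
    nodeId = await registerNode(orgToken);
  }

  // 2. Connect  3. Report system info  4. Heartbeats  5. Command loop
  const ws = await connectToControlPlane(nodeId);
  ws.send(JSON.stringify({ type: "system_info", payload: await gatherSystemInfo() }));
  startHeartbeat(ws);
  ws.on("message", (raw) => handleCommand(raw));
}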

Hardware Detection

On startup and reconnection, the agent gathers hardware info:

import os from "os";
import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

async function getHardwareInfo() {
  const hostname = os.hostname();
  const cpuModel = os.cpus()[0]?.model || "Unknown";
  const cpuCores = os.cpus().length;
  const ramGb = Math.round(os.totalmem() / (1024 * 1024 * 1024));

  // Detect GPUs via lspci (works even with VFIO passthrough)
  let gpuInfo: string[] = [];
  try {
    const { stdout } = await execAsync(
      "lspci | grep -i 'vga\\|3d\\|display' | grep -i 'nvidia\\|amd'"
    );
    gpuInfo = stdout.trim().split("\n").filter(Boolean);
  } catch {
    // No GPUs found or lspci not available
  }

  return { hostname, cpuModel, cpuCores, ramGb, gpuInfo };
}

Note that we use lspci instead of nvidia-smi. This is intentional—when GPUs are bound to VFIO for passthrough, they're not visible to host NVIDIA drivers, but lspci still sees them at the PCI level.

Dependency Checking

Not all nodes are created equal. Some might be missing dependencies. The agent checks for required components on every connection:

// Required dependencies for GPU workloads
containerd   ✓  1.7.2      // Container runtime
nerdctl      ✓  1.5.0      // CLI for containerd
kata-runtime ✓  3.2.0      // MicroVM runtime
virtiofsd    ✓  1.8.0      // Filesystem passthrough
vfio         ✓  4 modules  // GPU passthrough kernel modules
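A minimal sketch of what that check could look like, using command -v for the binaries and lsmod for the VFIO modules. The structure is illustrative, not the actual agent code, and version extraction is elided:

import { exec } from "child_process";
import { promisify } from "util";

const execAsync = promisify(exec);

// Binaries from the table above
const REQUIRED_BINARIES = ["containerd", "nerdctl", "kata-runtime", "virtiofsd"];

async function checkDependencies() {
  const missing: string[] = [];

  for (const bin of REQUIRED_BINARIES) {
    try {
      await execAsync(`command -v ${bin}`); // non-zero exit → not on $PATH
    } catch {
      missing.push(bin);
    }
  }

  // GPU passthrough needs the vfio kernel modules loaded
  try {
    await execAsync("lsmod | grep -q '^vfio'"); // grep -q exits non-zero on no match
  } catch {
    missing.push("vfio");
  }

  // `ready` drives the online/degraded distinction below
  return { ready: missing.length === 0, missing };
}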

The check results determine node status:

ready = true → online

All core dependencies installed. Node can accept deployments.

ready = false → degraded

Connected but missing dependencies. Won't receive deployments until fixed.

Node Status State Machine

┌─────────────────────────────────────────────────────────┐
│                     Node Statuses                       │
├─────────────────────────────────────────────────────────┤
│  online      Connected, deps ready, accepting deploys   │
│  offline     Socket disconnected                        │
│  degraded    Connected but missing dependencies         │
│  draining    Finishing work, no new deploys             │
│  maintenance Manual maintenance mode                    │
└─────────────────────────────────────────────────────────┘

State Transitions:
─────────────────
offline ──(connect)──▶ degraded ──(deps ok)──▶ online
online ──(disconnect)──▶ offline
online ──(deps fail)──▶ degraded
online ──(operator)──▶ draining ──▶ maintenance
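The transitions are simple enough to encode as a lookup table. A sketch mirroring the diagram (the event names are mine, and the unlabeled draining → maintenance arrow is assumed to fire once the last replica drains):

type NodeStatus = "online" | "offline" | "degraded" | "draining" | "maintenance";
type NodeEvent =
  | "connect" | "deps_ok" | "deps_fail" | "disconnect" | "operator_drain" | "drained";

// One entry per arrow in the diagram above
const TRANSITIONS: Partial<Record<NodeStatus, Partial<Record<NodeEvent, NodeStatus>>>> = {
  offline:  { connect: "degraded" },
  degraded: { deps_ok: "online" },
  online:   { disconnect: "offline", deps_fail: "degraded", operator_drain: "draining" },
  draining: { drained: "maintenance" },
};

function transition(current: NodeStatus, event: NodeEvent): NodeStatus | undefined {
  return TRANSITIONS[current]?.[event]; // undefined = illegal transition
}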

Heartbeat System

Every 30 seconds, the agent sends a heartbeat with current status:

// Heartbeat payload
{
  "type": "heartbeat",
  "payload": {
    "status": {
      "containers": [
        { "replica_id": "abc-123", "status": "running" },
        { "replica_id": "def-456", "status": "running" }
      ],
      "gpu_utilization": 78,
      "memory_used_gb": 24,
      "memory_total_gb": 64,
      "storage": {
        "path": "/data",
        "total_gb": 500,
        "used_gb": 120,
        "usage_percent": 24
      }
    }
  }
}

If the control plane doesn't receive a heartbeat for 60 seconds, the node is marked offline and its replicas are rescheduled.
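On the agent side, this is just a timer that snapshots local state and ships it. A sketch, assuming hypothetical helpers for the container and resource queries (listContainers, sampleResources, and sampleStorage are illustrative names):

import WebSocket from "ws";

// Hypothetical helpers that would shell out to nerdctl, nvidia-smi, df, etc.
declare function listContainers(): Promise<{ replica_id: string; status: string }[]>;
declare function sampleResources(): Promise<{
  gpu_utilization: number;
  memory_used_gb: number;
  memory_total_gb: number;
}>;
declare function sampleStorage(): Promise<{
  path: string;
  total_gb: number;
  used_gb: number;
  usage_percent: number;
}>;

let heartbeatTimer: NodeJS.Timeout | undefined;

function startHeartbeat(ws: WebSocket): void {
  heartbeatTimer = setInterval(async () => {
    ws.send(JSON.stringify({
      type: "heartbeat",
      payload: {
        status: {
          containers: await listContainers(),
          ...(await sampleResources()),
          storage: await sampleStorage(),
        },
      },
    }));
  }, 30_000); // the 30-second interval from above
}

function stopHeartbeat(): void {
  if (heartbeatTimer) clearInterval(heartbeatTimer);
}

stopHeartbeat is the same hook the reconnection handler further down calls when the socket drops.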

Container Lifecycle

When the agent receives a deploy command:

async function handleDeploy(payload: DeployPayload) {
  const { replica_id, image, gpu_count, env_vars } = payload;

  try {
    // Build environment flags (values quoted so they survive the shell)
    const envFlags = Object.entries(env_vars || {})
      .map(([k, v]) => `-e ${k}='${v}'`)
      .join(" ");

    // Run with Kata runtime for isolation
    const cmd = `nerdctl run -d \
      --runtime=io.containerd.kata.v2 \
      ${envFlags} \
      --name replica-${replica_id} \
      ${image}`;

    await execAsync(cmd);

    // Report success
    send({
      type: "deploy_result",
      payload: { replica_id, status: "running" }
    });
  } catch (err) {
    // Report failure (err is `unknown` in TypeScript catch clauses)
    const error = err instanceof Error ? err.message : String(err);
    send({
      type: "deploy_result",
      payload: {
        replica_id,
        status: "failed",
        error
      }
    });
  }
}
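The stop path is the mirror image. A sketch reusing execAsync and send from above; note the stop_result message type is an assumption, since only deploy_result appears in this post:

async function handleStop(payload: { replica_id: string }) {
  const { replica_id } = payload;
  try {
    // Stop the container, then remove it so the name can be reused
    await execAsync(`nerdctl stop replica-${replica_id}`);
    await execAsync(`nerdctl rm replica-${replica_id}`);
    send({ type: "stop_result", payload: { replica_id, status: "stopped" } });
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err);
    send({ type: "stop_result", payload: { replica_id, status: "failed", error } });
  }
}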

Reconnection Handling

Network hiccups happen. The agent handles disconnections gracefully:

  • On disconnect, wait 5 seconds then reconnect
  • Re-send system_info to update status
  • Running containers continue operating during disconnection
  • Pending jobs are picked up on reconnection

The close handler looks like this:

ws.on("close", (code, reason) => {
  log("warn", `Disconnected: ${code} - ${reason}`);
  stopHeartbeat();
  
  // Reconnect after delay
  setTimeout(() => {
    connect();
  }, 5000);
});

Continue Reading

Next in series: GPU Passthrough with Kata

How we use VFIO and Kata Containers to give workloads direct GPU access while maintaining strong isolation.
