Architecture Series · Part 5 of 5

Fault Tolerance

December 27, 2024
8 min read

This is where everything comes together. Fault tolerance isn't a feature we bolted on—it's baked into every layer of Vectorlay. Nodes fail, networks hiccup, and your inference keeps running.

Design Philosophy: Assume Failure

Consumer GPUs aren't enterprise hardware. Machines go offline, power flickers, networks partition. Our design principle is simple: every component should assume every other component can fail at any time.

This manifests in several ways:

  • Nodes send heartbeats; missing heartbeats trigger failover
  • Replicas are spread across nodes for redundancy
  • Jobs are queued in Redis, surviving WS server restarts
  • The edge proxy only routes to verified-healthy replicas

Heartbeat Monitoring

Every 30 seconds, agents send a heartbeat to the control plane. If 60 seconds pass without a heartbeat, the node is marked offline:

// Heartbeat monitor - runs every 30 seconds
setInterval(async () => {
  const timeout = 60 * 1000; // 60 seconds
  const now = new Date();

  for (const [nodeId, agent] of agents) {
    const elapsed = now.getTime() - agent.lastHeartbeat.getTime();
    
    if (elapsed > timeout) {
      console.log(`⚠️ Node ${nodeId} heartbeat timeout`);
      
      // Close the stale connection
      agent.ws.close(4001, "Heartbeat timeout");
      agents.delete(nodeId);

      // Mark node offline in database
      await db.update(nodes)
        .set({ status: "offline" })
        .where(eq(nodes.id, nodeId));
    }
  }
}, 30 * 1000);
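
On the node side, the agent's half of the protocol is just a timer on its WebSocket connection. A minimal sketch, assuming the agent holds the connection in ws and the message shape shown here (both are illustrative, not the actual agent code):

// Agent side - send a heartbeat every 30 seconds (sketch)
const HEARTBEAT_INTERVAL = 30 * 1000;

setInterval(() => {
  // ws is the agent's WebSocket connection to the control plane
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      type: "heartbeat",
      nodeId,
      sentAt: new Date().toISOString(),
    }));
  }
}, HEARTBEAT_INTERVAL);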

When a node goes offline, its replicas don't immediately fail. The edge proxy stops routing new traffic to them, and the scheduler begins provisioning replacements on other nodes.

Replica Health Checks

Beyond node status, we track individual replica health. Each replica has:

// Replica health fields
{
  healthy: boolean,              // Current health status
  lastHealthCheckAt: timestamp,  // When we last checked
  consecutiveFailures: number,   // Failed checks in a row
}

Health checks run at the application level:

  • HTTP endpoint check (configurable path, default /health)
  • Expected response within timeout (default 5s)
  • 3 consecutive failures → mark unhealthy
  • 1 success after failures → mark healthy again

The edge proxy only routes traffic to replicas where healthy = true. Unhealthy replicas are taken out of rotation until they recover.
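As code, the check loop looks something like the sketch below. The field names match the schema above, but the fetch call, the replica's address and healthPath fields, and the update helper are illustrative assumptions:

// Replica health check (sketch)
async function checkReplica(replica: Replica) {
  let ok = false;
  try {
    const res = await fetch(`http://${replica.address}${replica.healthPath ?? "/health"}`, {
      signal: AbortSignal.timeout(5000), // default 5s timeout
    });
    ok = res.ok;
  } catch {
    ok = false; // timeout or connection error counts as a failed check
  }

  const failures = ok ? 0 : replica.consecutiveFailures + 1;

  await db.update(replicas)
    .set({
      healthy: ok ? true : failures < 3,   // 3 strikes -> unhealthy, 1 success -> healthy
      consecutiveFailures: failures,
      lastHealthCheckAt: new Date(),
    })
    .where(eq(replicas.id, replica.id));
}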

Cluster Status States

Clusters aggregate replica health into overall status:

┌────────────────────────────────────────────────────────────┐
│                    Cluster Statuses                        │
├────────────────────────────────────────────────────────────┤
│  pending     Initial state, waiting for scheduling         │
│  deploying   Replicas being created                        │
│  running     All replicas healthy ✓                        │
│  degraded    Some replicas failed, but cluster operational │
│  failed      All replicas failed                           │
│  stopped     Manually stopped                              │
└────────────────────────────────────────────────────────────┘

The key status is degraded. A degraded cluster is still serving traffic through its healthy replicas—it's just running at reduced capacity.
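Expressed as a type, the statuses in the table map to a simple union (a sketch; the actual schema definition may differ):

// Cluster status values (sketch)
type ClusterStatus =
  | "pending"    // waiting for scheduling
  | "deploying"  // replicas being created
  | "running"    // all replicas healthy
  | "degraded"   // some replicas failed, still serving
  | "failed"     // all replicas failed
  | "stopped";   // manually stopped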

// Status calculation after deploy result
const allReplicas = await db.select()
  .from(replicas)
  .where(eq(replicas.clusterId, cluster.id));

const allRunning = allReplicas.every(r => r.status === "running");
const anyFailed = allReplicas.some(r => r.status === "failed");
const allFailed = allReplicas.every(r => r.status === "failed");

let newStatus;
if (allFailed) {
  newStatus = "failed";
} else if (anyFailed) {
  newStatus = "degraded";  // Still serving, just reduced capacity
} else if (allRunning) {
  newStatus = "running";
} else {
  newStatus = "deploying";
}

await db.update(clusters)
  .set({ status: newStatus })
  .where(eq(clusters.id, cluster.id));

Automatic Recovery

When replicas fail, the system automatically attempts recovery:

Failure Detected
       │
       ▼
┌──────────────────────────┐
│  Mark replica unhealthy  │
│  Stop routing traffic    │
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│  Attempt local restart   │◀──── Same node, if healthy
└────────────┬─────────────┘
             │
             ▼ (if restart fails or node offline)
┌──────────────────────────┐
│  Schedule on new node    │◀──── Find healthy node with capacity
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│  Deploy to new node      │
│  Update cluster status   │
└──────────────────────────┘

This happens automatically—no manual intervention required. The goal is for users to never notice when a node fails.
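A condensed version of that flow as code, purely as a sketch: getNode, restartOnNode, findHealthyNode, and deployToNode are illustrative helpers, not the actual scheduler API.

// Automatic recovery flow (sketch)
async function recoverReplica(replica: Replica) {
  // 1. Take the replica out of rotation
  await db.update(replicas)
    .set({ healthy: false })
    .where(eq(replicas.id, replica.id));

  // 2. Try a local restart if the node itself is still online
  const node = await getNode(replica.nodeId);
  if (node?.status === "online" && await restartOnNode(node, replica)) {
    return;
  }

  // 3. Otherwise, reschedule on another healthy node with capacity
  const target = await findHealthyNode({ excludeNodeId: replica.nodeId });
  if (target) {
    await deployToNode(target, replica);
  }
  // Cluster status is recalculated after either path completes
}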

Replica Distribution Strategy

When scheduling replicas, we actively avoid placing multiple replicas of the same cluster on the same node:

  • Anti-affinity: Replicas spread across different nodes when possible
  • Region awareness: Prefer nodes in different locations for geo-redundancy
  • Capacity matching: Only schedule on nodes with available GPU slots

This means a single node failure should never take down your entire service—other replicas on other nodes keep serving.
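A simplified picture of how a placement decision applies those rules (a sketch; the scoring and the freeGpuSlots and region fields are assumptions, not the real scheduler):

// Anti-affinity placement (sketch)
function pickNode(candidates: Node[], existing: Replica[], gpusNeeded: number): Node | undefined {
  const usedNodes = new Set(existing.map(r => r.nodeId));
  const usedRegions = new Set(existing.map(r => r.region));

  // Prefer nodes (and regions) that don't already host a replica of this cluster
  const score = (n: Node) =>
    (usedNodes.has(n.id) ? 0 : 2) + (usedRegions.has(n.region) ? 0 : 1);

  return candidates
    .filter(n => n.status === "online" && n.freeGpuSlots >= gpusNeeded) // capacity matching
    .sort((a, b) => score(b) - score(a))[0];
}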

Edge Proxy: The Traffic Gate

The edge proxy is the final line of defense. It only routes traffic to replicas that pass all checks:

// Routing decision pseudocode
async function getHealthyReplicas(clusterId: string): Promise<Replica[]> {
  return db.select()
    .from(replicas)
    .where(
      and(
        eq(replicas.clusterId, clusterId),
        eq(replicas.status, "running"),
        eq(replicas.healthy, true),
        // Node must also be online
        exists(
          db.select()
            .from(nodes)
            .where(
              and(
                eq(nodes.id, replicas.nodeId),
                eq(nodes.status, "online")
              )
            )
        )
      )
    );
}

Traffic is load-balanced across healthy replicas. If all replicas are unhealthy, requests receive a 503 (Service Unavailable) while the system attempts recovery.
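In the proxy's request path, that boils down to something like the following sketch; forwardTo is an illustrative helper, and the random pick stands in for whatever load-balancing strategy is actually used:

// Edge proxy request handling (sketch)
async function routeRequest(clusterId: string, req: Request): Promise<Response> {
  const healthy = await getHealthyReplicas(clusterId);

  if (healthy.length === 0) {
    // Nothing to route to - fail fast while recovery runs in the background
    return new Response("Service Unavailable", { status: 503 });
  }

  // Simple load balancing: pick one healthy replica at random
  const target = healthy[Math.floor(Math.random() * healthy.length)];
  return forwardTo(target, req);
}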

Visibility and Monitoring

All of this is visible in the dashboard:

  • Cluster status: Real-time view of running, degraded, or failed states
  • Replica details: See which node each replica runs on, health status, uptime
  • Node health: Provider view of all nodes, their status, and dependency issues
  • Event log: Timeline of failures and automatic recovery actions

Putting It All Together

The fault tolerance system works in layers:

Layer 1: Node Level
  └── Heartbeats detect disconnection in ≤60s
  └── Immediate status update, stop new work
  
Layer 2: Replica Level  
  └── Health checks every 10s
  └── 3 failures → remove from rotation
  └── Auto-restart or reschedule

Layer 3: Cluster Level
  └── Aggregates replica health
  └── Degraded status keeps serving
  └── Full failure triggers alerts

Layer 4: Edge Proxy
  └── Only routes to verified-healthy replicas
  └── Instant failover on health change
  └── No traffic to unhealthy nodes

The result: nodes fail, and your users don't notice.

🎉 Series Complete

You've reached the end!

You now understand how Vectorlay works from top to bottom—the overlay network architecture, control plane, agent software, GPU passthrough with Kata, and fault tolerance mechanisms.
