Fault Tolerance
This is where everything comes together. Fault tolerance isn't a feature we bolted on—it's baked into every layer of Vectorlay. Nodes fail, networks hiccup, and your inference keeps running.
Design Philosophy: Assume Failure
Consumer GPUs aren't enterprise hardware. Machines go offline, power flickers, networks partition. Our design principle is simple: every component should assume every other component can fail at any time.
This manifests in several ways:
- Nodes send heartbeats; missing heartbeats trigger failover
- Replicas are spread across nodes for redundancy
- Jobs are queued in Redis, surviving WS server restarts
- The edge proxy only routes to verified-healthy replicas
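Of the mechanisms above, the Redis-backed job queue is the least visible, so it is worth making concrete. A minimal sketch of the pattern, assuming ioredis and a `vectorlay:jobs` list key (both names are illustrative assumptions, not the real schema):
import Redis from "ioredis";

// Sketch: queued jobs live in a Redis list rather than in WS-server
// memory, so a control-plane restart does not lose pending work.
const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function enqueueJob(job: { clusterId: string; action: string }) {
  await redis.lpush("vectorlay:jobs", JSON.stringify(job));
}

async function nextJob(): Promise<{ clusterId: string; action: string } | null> {
  // BRPOP blocks until a job arrives (or the 5s timeout passes), so a
  // restarted worker simply reconnects and picks up where it left off.
  const res = await redis.brpop("vectorlay:jobs", 5);
  return res ? JSON.parse(res[1]) : null;
}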
Heartbeat Monitoring
Every 30 seconds, agents send a heartbeat to the control plane. If 60 seconds pass without a heartbeat, the node is marked offline:
// Heartbeat monitor - runs every 30 seconds
setInterval(async () => {
  const timeout = 60 * 1000; // 60 seconds
  const now = new Date();

  for (const [nodeId, agent] of agents) {
    const elapsed = now.getTime() - agent.lastHeartbeat.getTime();

    if (elapsed > timeout) {
      console.log(`⚠️ Node ${nodeId} heartbeat timeout`);

      // Close the stale connection
      agent.ws.close(4001, "Heartbeat timeout");
      agents.delete(nodeId);

      // Mark node offline in database
      await db.update(nodes)
        .set({ status: "offline" })
        .where(eq(nodes.id, nodeId));
    }
  }
}, 30 * 1000);
When a node goes offline, its replicas don't immediately fail. The edge proxy stops routing new traffic to them, and the scheduler begins provisioning replacements on other nodes.
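For context, the agent's half of this contract is just a timer on the open WebSocket connection. A minimal sketch: `ws` and `nodeId` come from the agent's registration step, and the message shape is an assumption rather than the real protocol:
import WebSocket from "ws";

// Agent side (sketch): send a heartbeat every 30 seconds over the
// control-plane connection. `ws` and `nodeId` are assumed to exist
// from registration; the message shape is illustrative only.
const HEARTBEAT_INTERVAL_MS = 30 * 1000;

setInterval(() => {
  if (ws.readyState === WebSocket.OPEN) {
    ws.send(JSON.stringify({
      type: "heartbeat",
      nodeId,
      sentAt: new Date().toISOString(),
    }));
  }
}, HEARTBEAT_INTERVAL_MS);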
Replica Health Checks
Beyond node status, we track individual replica health. Each replica has:
// Replica health fields
{
  healthy: boolean,              // Current health status
  lastHealthCheckAt: timestamp,  // When we last checked
  consecutiveFailures: number,   // Failed checks in a row
}
Health checks run at the application level:
- HTTP endpoint check (configurable path, default /health)
- Expected response within timeout (default 5s)
- 3 consecutive failures → mark unhealthy
- 1 success after failures → mark healthy again
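A minimal sketch of that check loop, using the replica fields shown above. The `address` and `healthPath` fields, the `Replica` type, and the `replicas.id` column are assumptions for illustration:
const HEALTH_CHECK_TIMEOUT_MS = 5_000;
const FAILURE_THRESHOLD = 3;

async function checkReplica(replica: Replica): Promise<void> {
  const url = `http://${replica.address}${replica.healthPath ?? "/health"}`;

  let ok = false;
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(HEALTH_CHECK_TIMEOUT_MS) });
    ok = res.ok;
  } catch {
    ok = false; // a timeout or connection error counts as a failed check
  }

  // 3 consecutive failures -> unhealthy; a single success resets the counter.
  const consecutiveFailures = ok ? 0 : replica.consecutiveFailures + 1;
  const healthy = ok || consecutiveFailures < FAILURE_THRESHOLD;

  await db.update(replicas)
    .set({ healthy, consecutiveFailures, lastHealthCheckAt: new Date() })
    .where(eq(replicas.id, replica.id));
}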
The edge proxy only routes traffic to replicas where healthy = true. Unhealthy replicas are taken out of rotation until they recover.
Cluster Status States
Clusters aggregate replica health into overall status:
┌────────────────────────────────────────────────────────────┐
│ Cluster Statuses                                           │
├────────────────────────────────────────────────────────────┤
│ pending     Initial state, waiting for scheduling          │
│ deploying   Replicas being created                         │
│ running     All replicas healthy ✓                         │
│ degraded    Some replicas failed, but cluster operational  │
│ failed      All replicas failed                            │
│ stopped     Manually stopped                               │
└────────────────────────────────────────────────────────────┘
The key status is degraded. A degraded cluster is still serving traffic through its healthy replicas—it's just running at reduced capacity.
// Status calculation after deploy result
const allReplicas = await db.select()
  .from(replicas)
  .where(eq(replicas.clusterId, cluster.id));

const allRunning = allReplicas.every(r => r.status === "running");
const anyFailed = allReplicas.some(r => r.status === "failed");
const allFailed = allReplicas.every(r => r.status === "failed");

let newStatus;
if (allFailed) {
  newStatus = "failed";
} else if (anyFailed) {
  newStatus = "degraded"; // Still serving, just reduced capacity
} else if (allRunning) {
  newStatus = "running";
} else {
  newStatus = "deploying";
}

await db.update(clusters)
  .set({ status: newStatus })
  .where(eq(clusters.id, cluster.id));
Automatic Recovery
When replicas fail, the system automatically attempts recovery:
      Failure Detected
             │
             ▼
┌──────────────────────────┐
│ Mark replica unhealthy   │
│ Stop routing traffic     │
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│ Attempt local restart    │◀──── Same node, if healthy
└────────────┬─────────────┘
             │
             ▼  (if restart fails or node offline)
┌──────────────────────────┐
│ Schedule on new node     │◀──── Find healthy node with capacity
└────────────┬─────────────┘
             │
             ▼
┌──────────────────────────┐
│ Deploy to new node       │
│ Update cluster status    │
└──────────────────────────┘
This happens automatically; no manual intervention is required. The goal is for users to never notice when a node fails.
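One way that decision flow might look in code. The helpers (`getNode`, `tryRestart`, `findHealthyNodeWithCapacity`, `deployReplica`, `updateClusterStatus`) are assumptions standing in for the real scheduler and agent APIs, not Vectorlay's actual implementation:
// Recovery flow sketch; helper names are illustrative, not the real API.
async function recoverReplica(replica: Replica): Promise<void> {
  // 1. Take the replica out of rotation immediately.
  await db.update(replicas)
    .set({ healthy: false })
    .where(eq(replicas.id, replica.id));

  // 2. If the node is still online, try a local restart first.
  const node = await getNode(replica.nodeId);
  if (node?.status === "online" && await tryRestart(replica)) {
    return; // health checks will mark it healthy again once it responds
  }

  // 3. Otherwise, reschedule onto another healthy node with free capacity.
  const target = await findHealthyNodeWithCapacity(replica.clusterId);
  if (target) {
    await deployReplica(replica.clusterId, target.id);
  }
  await updateClusterStatus(replica.clusterId);
}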
Replica Distribution Strategy
When scheduling replicas, we actively avoid placing multiple replicas of the same cluster on the same node:
- Anti-affinity: Replicas spread across different nodes when possible
- Region awareness: Prefer nodes in different locations for geo-redundancy
- Capacity matching: Only schedule on nodes with available GPU slots
This means a single node failure should never take down your entire service—other replicas on other nodes keep serving.
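A minimal sketch of that placement logic, assuming an in-memory `Node` record with `region` and `freeGpuSlots` fields (the type and field names are assumptions):
// Scheduling sketch: filter by capacity, then prefer nodes and regions
// that don't already host a replica of this cluster.
function pickNode(candidates: Node[], hostsOfExistingReplicas: Node[], gpusNeeded: number): Node | undefined {
  const usedNodeIds = new Set(hostsOfExistingReplicas.map(n => n.id));
  const usedRegions = new Set(hostsOfExistingReplicas.map(n => n.region));

  const score = (n: Node): number => {
    let s = 0;
    if (!usedNodeIds.has(n.id)) s += 2;     // anti-affinity: spread across nodes
    if (!usedRegions.has(n.region)) s += 1; // region awareness: spread across locations
    return s;
  };

  return candidates
    .filter(n => n.status === "online" && n.freeGpuSlots >= gpusNeeded) // capacity matching
    .sort((a, b) => score(b) - score(a))[0];
}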
Edge Proxy: The Traffic Gate
The edge proxy is the final line of defense. It only routes traffic to replicas that pass all checks:
// Routing decision pseudocode
async function getHealthyReplicas(clusterId: string): Promise<Replica[]> {
  return db.select()
    .from(replicas)
    .where(
      and(
        eq(replicas.clusterId, clusterId),
        eq(replicas.status, "running"),
        eq(replicas.healthy, true),
        // Node must also be online
        exists(
          db.select()
            .from(nodes)
            .where(
              and(
                eq(nodes.id, replicas.nodeId),
                eq(nodes.status, "online")
              )
            )
        )
      )
    );
}
Traffic is load-balanced across healthy replicas. If all replicas are unhealthy, requests receive a 503 (Service Unavailable) while the system attempts recovery.
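The load-balancing step on top of that query can stay simple. A sketch using round-robin selection; the in-memory counter and the caller's 503 handling are assumptions, not the edge proxy's actual strategy:
// Round-robin over healthy replicas; a null result means the caller
// should answer with 503 Service Unavailable.
const rrCounters = new Map<string, number>();

async function pickTarget(clusterId: string): Promise<Replica | null> {
  const healthy = await getHealthyReplicas(clusterId);
  if (healthy.length === 0) return null;

  const i = (rrCounters.get(clusterId) ?? 0) % healthy.length;
  rrCounters.set(clusterId, i + 1);
  return healthy[i];
}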
Visibility and Monitoring
All of this is visible in the dashboard:
- Cluster status: Real-time view of running, degraded, or failed states
- Replica details: See which node each replica runs on, health status, uptime
- Node health: Provider view of all nodes, their status, and dependency issues
- Event log: Timeline of failures and automatic recovery actions
Putting It All Together
The fault tolerance system works in layers:
Layer 1: Node Level
└── Heartbeats detect disconnection in ≤60s
└── Immediate status update, stop new work

Layer 2: Replica Level
└── Health checks every 10s
└── 3 failures → remove from rotation
└── Auto-restart or reschedule

Layer 3: Cluster Level
└── Aggregates replica health
└── Degraded status keeps serving
└── Full failure triggers alerts

Layer 4: Edge Proxy
└── Only routes to verified-healthy replicas
└── Instant failover on health change
└── No traffic to unhealthy nodes
The result: nodes fail, and your users don't notice.
You've reached the end!
You now understand how Vectorlay works from top to bottom—the overlay network architecture, control plane, agent software, GPU passthrough with Kata, and fault tolerance mechanisms.
Full Series
The Big Picture
Overview of the distributed GPU overlay network
The Control Plane
WebSocket server, node registration, and job queues
The Agent
Node software, heartbeats, and dependency management
GPU Passthrough with Kata
VFIO, microVMs, and hardware isolation
Fault Tolerance (You are here)
Health checks, failover, and self-healing