GPU Passthrough with Kata Containers
Running ML workloads securely on shared infrastructure requires proper isolation. Docker containers alone aren't enough—a malicious container could potentially escape to the host. Here's how we solved this with Kata Containers and VFIO GPU passthrough.
The Problem: Container Isolation Isn't Enough
Standard Docker containers share the host kernel. While Linux namespaces and cgroups provide reasonable isolation for most workloads, they're not suitable when running untrusted code on expensive hardware.
In Vectorlay's model, anyone can submit workloads to run on provider GPUs. We need:
- **Hardware-level isolation:** Workloads can't escape to the host or access other workloads
- **Direct GPU access:** Near-native performance, with no virtualization overhead on the GPU itself
- **Container ergonomics:** Users still push OCI images, using the same workflow they already know
Kata Containers: VMs with Container UX
Kata Containers run each container inside a lightweight VM. You get the security of virtualization with the developer experience of containers.
```
┌─────────────────────────────────────────────────────────┐
│                       Host System                       │
│                                                         │
│  ┌───────────────────────┐   ┌───────────────────────┐  │
│  │     Kata MicroVM      │   │     Kata MicroVM      │  │
│  │  ┌─────────────────┐  │   │  ┌─────────────────┐  │  │
│  │  │   Container A   │  │   │  │   Container B   │  │  │
│  │  │                 │  │   │  │                 │  │  │
│  │  │  Your workload  │  │   │  │ Other workload  │  │  │
│  │  └─────────────────┘  │   │  └─────────────────┘  │  │
│  │                       │   │                       │  │
│  │     Guest Kernel      │   │     Guest Kernel      │  │
│  └───────────────────────┘   └───────────────────────┘  │
│              │                           │              │
│          QEMU/KVM                    QEMU/KVM           │
│              │                           │              │
│   ───────────┴──────────────────────────┴──────────     │
│                       Host Kernel                       │
└─────────────────────────────────────────────────────────┘
```
Key characteristics:
- **Separate kernel:** Each container runs its own guest Linux kernel
- **Fast boot:** Optimized guest images boot in ~100ms
- **OCI compatible:** Works with any container image and integrates with containerd (see the sketch after this list)
- **Hardware passthrough:** PCIe devices can be passed directly to the VM
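Running a workload under Kata looks just like running any other container. A minimal sketch, assuming containerd is wired up with the stock `io.containerd.kata.v2` runtime handler (that wiring is shown under Host Configuration below):

```bash
# Pull an image and run it under the Kata runtime instead of runc
sudo ctr image pull docker.io/library/ubuntu:22.04
sudo ctr run --runtime io.containerd.kata.v2 --rm -t \
  docker.io/library/ubuntu:22.04 kata-demo uname -r

# Prints the guest kernel version, not the host's: proof that the
# container is running inside its own VM
```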
VFIO: Direct GPU Access
VFIO (Virtual Function I/O) is a Linux kernel framework that allows userspace programs to directly access PCIe devices. We use it to pass GPUs directly to Kata VMs.
```
┌────────────────────────────────────────────────────┐
│                    Host System                     │
│  ┌──────────────────────────────────────────────┐  │
│  │                 Kata MicroVM                 │  │
│  │  ┌────────────────────────────────────────┐  │  │
│  │  │             Your Container             │  │  │
│  │  │                                        │  │  │
│  │  │     nvidia-smi ────▶ RTX 4090          │  │  │
│  │  │     (direct access via VFIO)           │  │  │
│  │  └────────────────────────────────────────┘  │  │
│  │                                              │  │
│  │     Guest NVIDIA Driver (545, 550, etc.)     │  │
│  └──────────────────────────────────────────────┘  │
│                         │                          │
│                  VFIO passthrough                  │
│                 (IOMMU translation)                │
│                         │                          │
│  ┌──────────────────────────────────────────────┐  │
│  │                 Physical GPU                 │  │
│  │               NVIDIA RTX 4090                │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
```
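We bind GPUs to `vfio-pci` at boot (shown under Host Configuration), but the binding can also be done by hand through sysfs. A sketch, assuming an RTX 4090 (vendor:device `10de:2684`) at the example PCI address `0000:01:00.0`; find your own with `lspci`:

```bash
# Load the VFIO PCI driver
sudo modprobe vfio-pci

# Identify the GPU's PCI address and vendor:device IDs
lspci -nn | grep -i nvidia

# Detach the GPU from whatever driver currently owns it
echo 0000:01:00.0 | sudo tee /sys/bus/pci/devices/0000:01:00.0/driver/unbind

# Tell vfio-pci to claim devices with these IDs
echo 10de 2684 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
```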
Why Not nvidia-docker?
The traditional approach (nvidia-docker) shares the host's GPU drivers with containers. This has problems for our use case:
- ❌ **Shared kernel attack surface:** Container escapes can access host NVIDIA drivers and potentially other GPUs.
- ❌ **Driver version lock-in:** All containers must use the host's driver version, so you can't run different CUDA versions on the same machine.
- ❌ **Resource interference:** Containers can see and potentially interfere with each other's GPU processes.
With VFIO passthrough:
- ✓ **Complete isolation:** Each VM has exclusive access to its assigned GPU. No shared drivers, no shared memory.
- ✓ **Per-workload drivers:** Each VM runs its own NVIDIA driver version, so CUDA 11 and CUDA 12 can run on the same host.
- ✓ **Native performance:** The GPU isn't virtualized; it's passed directly to the VM, with the same performance as bare metal.
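The difference is easy to see from inside a workload: the guest sees exactly one GPU, driven by the guest's own driver. Illustrative output (driver versions will vary):

```bash
# Inside the Kata guest: only the passed-through GPU is visible
nvidia-smi --query-gpu=name,driver_version --format=csv
# name, driver_version
# NVIDIA GeForce RTX 4090, 550.54.14
```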
Custom Guest Images
Kata Containers need a guest kernel and rootfs image. We build custom images with everything needed for GPU workloads:
```
# Guest image contents
├── Ubuntu 22.04 (minimal)
├── NVIDIA Driver (545, 550, etc.)
├── containerd
├── nvidia-container-toolkit
└── virtiofsd (for filesystem passthrough)
```
Driver Version Matrix
Different GPU generations need different driver versions. We maintain images for multiple driver versions:
| Driver | CUDA | Supported GPUs |
|---|---|---|
| 535 | 12.2 | RTX 30xx, 40xx |
| 545 | 12.3 | RTX 30xx, 40xx |
| 550 | 12.4 | RTX 30xx, 40xx, 50xx |
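When a node comes online, its GPU model has to be mapped to one of these images. A hypothetical sketch of that mapping (`pick_guest_image` is illustrative, not our actual agent code):

```bash
# Hypothetical helper: choose a guest image for a detected GPU model
pick_guest_image() {
  case "$1" in
    *"RTX 50"*)            echo "kata-gpu-guest-550.raw" ;;  # 50xx rows need 550
    *"RTX 40"*|*"RTX 30"*) echo "kata-gpu-guest-545.raw" ;;
    *) echo "unsupported GPU: $1" >&2; return 1 ;;
  esac
}

pick_guest_image "NVIDIA GeForce RTX 4090"   # kata-gpu-guest-545.raw
```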
Build Process
We use libguestfs to build custom guest images:
```bash
#!/bin/bash
# build-guest.sh

# Driver version to bake in (e.g. 535, 545, 550); see the matrix above
DRIVER_VERSION=${DRIVER_VERSION:?set DRIVER_VERSION, e.g. 550}

# Start with an Ubuntu cloud image
wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img

# Resize for our needs
qemu-img resize jammy-server-cloudimg-amd64.img 15G

# Customize with virt-customize
virt-customize -a jammy-server-cloudimg-amd64.img \
  --install containerd,nvidia-driver-$DRIVER_VERSION \
  --run-command "nvidia-ctk runtime configure --runtime=containerd" \
  --selinux-relabel

# Output: kata-gpu-guest-$DRIVER_VERSION.raw
```
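Usage is then a one-liner, with the driver version selecting a row from the matrix above:

```bash
DRIVER_VERSION=550 ./build-guest.sh
```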
Host Configuration
Key insight: the host doesn't need NVIDIA drivers. GPUs are bound to VFIO at boot, and the guest VM runs its own drivers.
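Once the configuration below is in place, this is easy to confirm; a quick check (the `10de:` filter matches NVIDIA's PCI vendor ID):

```bash
# No NVIDIA kernel module should be loaded on the host
lsmod | grep nvidia
# (no output expected)

# The GPU should be claimed by vfio-pci instead
lspci -nnk -d 10de: | grep "Kernel driver in use"
# Kernel driver in use: vfio-pci
```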
IOMMU Setup
VFIO requires IOMMU (Intel VT-d or AMD-Vi) for memory translation:
```
# /etc/default/grub
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt"

# Load VFIO modules before NVIDIA
# /etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci

# Bind GPU to VFIO at boot
# /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:2684,10de:22ba
```
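After a reboot, it's worth verifying that the IOMMU is active and that the GPU sits in a clean IOMMU group (sharing one only with its own audio function). A sketch:

```bash
# Confirm the IOMMU came up (Intel prints DMAR, AMD prints AMD-Vi)
sudo dmesg | grep -i -e DMAR -e AMD-Vi | head

# List every PCI device by IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
  g=${d#/sys/kernel/iommu_groups/}
  printf 'group %s: ' "${g%%/*}"
  lspci -nns "${d##*/}"
done
```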
Kata Configuration
```toml
# /etc/kata-containers/configuration.toml
[hypervisor.qemu]
path = "/usr/bin/qemu-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinux.container"
image = "/opt/vectorlay/kata-gpu-guest-545.raw"

# Enable VFIO for GPU passthrough
hotplug_vfio_on_root_bus = true
pcie_root_port = 2

# Memory and CPU defaults
default_memory = 8192
default_vcpus = 4
```
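containerd then needs a runtime handler that points at this configuration. A minimal excerpt, assuming a stock containerd install (exact plugin paths can vary by containerd version):

```toml
# /etc/containerd/config.toml (excerpt)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
  runtime_type = "io.containerd.kata.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
    ConfigPath = "/etc/kata-containers/configuration.toml"
```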
Performance Characteristics
With VFIO passthrough, GPU operations run at native speed. The overhead is only in:
- **VM boot:** ~100-200ms for the microVM to start
- **CPU operations:** ~2-5% overhead for virtualized CPU
- **Memory access:** Negligible with EPT/NPT
- **GPU compute:** Zero overhead; direct hardware access
For inference workloads where GPU time dominates, total overhead is typically under 1%: ~200ms of boot amortized over even a one-minute job is about 0.3%, and the GPU work itself runs at full speed.