Real-Time AI Inference
Some applications can't wait 2 seconds for a response. Gaming, live video processing, trading systems—they need AI that responds in milliseconds. Here's how to build them.
The Latency Budget
Real-time means different things to different applications:
- <16ms: Interactive games (60fps frame time)
- <100ms: Live video, voice assistants, autocomplete
- <200ms: Chat, recommendations, content moderation
- <500ms: Search, analytics dashboards
Your latency budget gets split across: network RTT, queue wait, model inference, and response serialization. On a distributed platform, minimizing each component matters.
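To see where the budget actually goes, instrument the server-side portion and compare it with what the client observes. Here is a minimal sketch using a FastAPI middleware (the framework used in the example below); the header name is an assumption:

```python
import time

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def add_timing_header(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Expose server-side time so a client can subtract it from total
    # latency and attribute the remainder to network RTT
    response.headers["X-Server-Time-Ms"] = f"{elapsed_ms:.1f}"
    return response
```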
Example: Real-Time Object Detection
Let's build a live video analysis system that detects objects in real time. The goal: process 30 frames per second, with detection results returned before the next frame arrives.
The Model: YOLOv8
YOLOv8 is a good fit for real-time detection: on an RTX 4090, it processes a 640x640 image in ~5ms at FP16 precision.
```dockerfile
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, and OpenCV (pulled in by
# ultralytics) needs a couple of system libraries
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip libgl1 libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

RUN pip3 install ultralytics fastapi uvicorn python-multipart

COPY server.py /app/server.py
WORKDIR /app

# Pre-download model weights so cold starts don't fetch them
RUN python3 -c "from ultralytics import YOLO; YOLO('yolov8n.pt')"

EXPOSE 8080
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8080"]
```

```python
# server.py
import cv2
import numpy as np
from fastapi import FastAPI, File, UploadFile
from ultralytics import YOLO

app = FastAPI()

# Load model once at startup and move it to the GPU
model = YOLO('yolov8n.pt')
model.to('cuda')

# Warm up so the first real request doesn't pay CUDA init cost
dummy = np.zeros((640, 640, 3), dtype=np.uint8)
model(dummy)
@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    # Read and decode the uploaded image
    contents = await file.read()
    nparr = np.frombuffer(contents, np.uint8)
    img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    # Run inference
    results = model(img, verbose=False)[0]

    # Extract detections into a JSON-serializable list
    detections = []
    for box in results.boxes:
        detections.append({
            "class": results.names[int(box.cls)],
            "confidence": float(box.conf),
            "bbox": box.xyxy[0].tolist(),
        })

    return {"detections": detections}
@app.get("/health")
async def health():
    return {"status": "healthy"}
```

Deployment
```python
cluster = client.clusters.create(
    name="yolo-realtime",
    image="your-registry/yolov8-server:latest",
    gpu_type="rtx-4090",
    replicas=4,  # More replicas = lower queue wait
    port=8080,
    health_check_path="/health",
)
```
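Once the cluster reports healthy, a quick way to sanity-check the endpoint is a plain HTTP request. A minimal sketch; the cluster URL is a placeholder for whatever endpoint your deployment exposes:

```python
import requests

# Hypothetical endpoint; substitute your cluster's URL
URL = "https://yolo-realtime.example.dev"

with open("frame.jpg", "rb") as f:
    resp = requests.post(f"{URL}/detect", files={"file": f})

print(resp.json())  # {"detections": [{"class": ..., "confidence": ..., "bbox": ...}]}
```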
Optimization Strategies
1. Model Optimization
- Use TensorRT: NVIDIA's inference optimizer can 2-3x your throughput
- Quantize to INT8: ~2x faster with minimal accuracy loss for many models
- Use smaller model variants: YOLOv8n (nano) vs YOLOv8x (extra-large)
- Batch requests: Process multiple frames together when possible (see the batching sketch after the export example below)
```python
# Export to TensorRT (builds an engine for the GPU it runs on)
from ultralytics import YOLO

model = YOLO('yolov8n.pt')
model.export(format='engine', half=True, device=0)

# Use the TensorRT model
model = YOLO('yolov8n.engine')
```
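Batching amortizes per-call overhead when several frames are in flight at once. Ultralytics accepts a list of images and runs them as a single batch; the frames below are stand-ins for whatever you collect from concurrent requests:

```python
import numpy as np
from ultralytics import YOLO

model = YOLO('yolov8n.pt')

# Stand-in frames; in practice these come from concurrent requests
frames = [np.zeros((640, 640, 3), dtype=np.uint8) for _ in range(4)]

# Passing a list runs all frames in one forward pass
results = model(frames, verbose=False)
for r in results:
    print(len(r.boxes), "detections")
```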
2. Reduce Network Latency
- Deploy close to users: VectorLay has nodes in multiple regions
- Use efficient serialization: Protobuf or MessagePack instead of JSON (sketched after this list)
- Compress payloads: Especially for image/video data
- Keep connections alive: HTTP/2 or WebSockets eliminate connection overhead
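As an example of the serialization point, packing detection results with MessagePack usually yields a smaller, faster-to-parse payload than JSON. A minimal sketch, assuming the msgpack package is installed:

```python
import json

import msgpack

detections = [{"class": "person", "confidence": 0.91,
               "bbox": [12.0, 30.5, 200.2, 340.8]}]

packed = msgpack.packb({"detections": detections})
as_json = json.dumps({"detections": detections}).encode()
print(f"msgpack: {len(packed)} bytes, json: {len(as_json)} bytes")

# Round-trip back to Python objects
unpacked = msgpack.unpackb(packed)
```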
3. Eliminate Queue Wait
The biggest latency killer is waiting in a queue behind other requests. Solutions:
- Overprovision: Run more replicas than your steady-state needs
- Use smaller batches: Lower throughput but faster individual requests
- Implement request priorities: Skip the queue for time-sensitive requests (see the sketch below)
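One way to implement priorities in-process is an asyncio.PriorityQueue, where lower numbers dequeue first so time-sensitive requests jump ahead of background work. A minimal sketch; the priority levels and run_inference stub are illustrative:

```python
import asyncio

queue: asyncio.PriorityQueue = asyncio.PriorityQueue()

REALTIME, BACKGROUND = 0, 10  # illustrative priority levels

def run_inference(frame: bytes) -> dict:
    # Stand-in for the real model call
    return {"detections": []}

async def submit(frame: bytes, priority: int) -> asyncio.Future:
    # Returns a future that resolves with this frame's result
    fut = asyncio.get_running_loop().create_future()
    # Tuples compare element-wise; id() breaks priority ties so the
    # frame bytes and future are never compared
    await queue.put((priority, id(fut), frame, fut))
    return fut

async def worker():
    while True:
        _, _, frame, fut = await queue.get()
        fut.set_result(run_inference(frame))
        queue.task_done()
```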
WebSocket Streaming
For continuous real-time applications (like live video), use WebSockets to eliminate per-request overhead:
```python
# server.py with WebSocket support
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

@app.websocket("/stream")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            # Receive frame
            data = await websocket.receive_bytes()

            # Decode and process
            nparr = np.frombuffer(data, np.uint8)
            img = cv2.imdecode(nparr, cv2.IMREAD_COLOR)
            results = model(img, verbose=False)[0]

            # Extract detections, as in the /detect handler
            detections = [
                {
                    "class": results.names[int(box.cls)],
                    "confidence": float(box.conf),
                    "bbox": box.xyxy[0].tolist(),
                }
                for box in results.boxes
            ]

            # Send results
            await websocket.send_json({"detections": detections})
    except WebSocketDisconnect:
        pass
```

```python
# Client-side streaming
import asyncio
import json

import cv2
import websockets

async def stream_video():
    uri = "wss://your-cluster.vectorlay.dev/stream"
    async with websockets.connect(uri) as websocket:
        cap = cv2.VideoCapture(0)  # Webcam
        while True:
            ret, frame = cap.read()
            if not ret:
                break

            # Send the frame as JPEG bytes
            _, buffer = cv2.imencode('.jpg', frame)
            await websocket.send(buffer.tobytes())

            # Receive detections
            response = await websocket.recv()
            detections = json.loads(response)

            # Draw bounding boxes and display
            for det in detections["detections"]:
                x1, y1, x2, y2 = map(int, det["bbox"])
                cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
            cv2.imshow('Live Detection', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        cap.release()

asyncio.run(stream_video())
```

Latency Breakdown
Here's a typical latency breakdown for a real-time inference request:
| Component | Unoptimized | Optimized |
|---|---|---|
| Network RTT | 50-200ms | 10-30ms |
| Request parsing | 5-10ms | 1-2ms |
| Queue wait | 0-500ms | 0-10ms |
| Model inference | 20-50ms | 5-15ms |
| Response serialization | 5-10ms | 1-2ms |
| Total | 80-770ms | 17-59ms |
Use Cases
- Gaming: Real-time object detection for player assistance, anti-cheat, or NPC behavior
- Live video: Content moderation, face blur, automatic captions
- Robotics: Vision processing for autonomous navigation
- Financial: Real-time sentiment analysis, fraud detection
- IoT: Edge inference for security cameras, industrial sensors
Monitoring Latency
Track these metrics to ensure you're hitting your latency targets (a small percentile sketch follows the list):
- p50, p95, p99 latency: Median and tail latencies
- Queue depth: How many requests are waiting
- Inference time: Pure model execution time
- GPU utilization: Are you leaving performance on the table?
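Computing the percentiles themselves is simple once you record per-request latencies. A minimal sketch with numpy; the sample values are made up:

```python
import numpy as np

# Per-request latencies in ms, e.g. recorded via time.perf_counter()
latencies_ms = np.array([12.1, 14.7, 13.2, 98.4, 15.0, 13.8, 210.5, 14.1])

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")
```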
VectorLay's dashboard shows all these metrics in real-time, and you can set up alerts when latency exceeds your thresholds.