LLM Inference at Scale
Run Llama 3, Mistral, Mixtral, and other open-source models on distributed GPUs. This guide covers deployment, optimization, and production patterns.
Why Self-Host LLMs?
OpenAI and Anthropic are great, but there are compelling reasons to self-host:
- Cost: At sustained volume, self-hosting can be 10-50x cheaper per token than commercial APIs
- Privacy: Your data never leaves your infrastructure
- Customization: Fine-tune models for your specific use case
- Control: No rate limits, no API changes, no vendor lock-in
- Latency: Deploy closer to your users
Choosing the Right Model
The LLM landscape moves fast. Here are our current recommendations:
| Model | VRAM (FP16 weights) | Best For | Recommended GPUs |
|---|---|---|---|
| Llama 3.1 8B | ~16GB | General purpose, fast | RTX 4090 |
| Llama 3.1 70B | ~140GB | Complex reasoning | 2-4x A100 |
| Mistral 7B | ~14GB | Speed + quality balance | RTX 4090 |
| Mixtral 8x7B | ~90GB | MoE efficiency | 2x A100 |
| Qwen 2.5 72B | ~145GB | Multilingual, coding | 2-4x A100 |
| DeepSeek Coder 33B | ~66GB | Code generation | 2x RTX 4090 |
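As a rough rule of thumb, FP16 weights need about 2 bytes per parameter, plus headroom for the KV cache and activations. Here is a back-of-the-envelope sketch of that estimate; the 20% overhead factor is an illustrative assumption, not a measured value:
def estimate_vram_gb(params_billion: float, bytes_per_param: float = 2.0, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weights plus ~20% headroom for KV cache and activations."""
    weights_gb = params_billion * bytes_per_param  # 1B params x 2 bytes ~= 2 GB
    return weights_gb * overhead

# Example: Llama 3.1 8B in FP16 -> ~19 GB with headroom
# (the table's ~16 GB is weights only; both fit on a 24 GB RTX 4090)
print(f"{estimate_vram_gb(8):.0f} GB")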
Deployment with vLLM
vLLM is the gold standard for LLM serving. It provides PagedAttention for efficient memory management, continuous batching for high throughput, and tensor parallelism for large models.
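Before building an image, you can sanity-check a model locally with vLLM's offline API. This is a quick sketch, assuming vLLM is installed and a local GPU is available; it is not part of the VectorLay deployment itself:
from vllm import LLM, SamplingParams

# Load the model onto the local GPU and run one batched generation
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)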
Dockerfile
FROM vllm/vllm-openai:latest

# Pre-download model weights into the image so replicas start fast.
# Llama 3.1 is gated on Hugging Face, so the build needs a token
# (e.g. passed as a build secret).
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('meta-llama/Meta-Llama-3.1-8B-Instruct')"

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
ENV MAX_MODEL_LEN=8192
ENV GPU_MEMORY_UTILIZATION=0.95

EXPOSE 8080

# Reset any ENTRYPOINT from the base image and use shell-form CMD so the
# ENV values above are expanded at container start (exec-form CMD would
# pass "${MODEL_NAME}" through literally).
ENTRYPOINT []
CMD python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_NAME}" \
    --port 8080 \
    --max-model-len "${MAX_MODEL_LEN}" \
    --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}"
Deploying on VectorLay
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

cluster = client.clusters.create(
    name="llama-3-8b",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=3,
    port=8080,
    health_check_path="/health",
    env={
        "HUGGING_FACE_HUB_TOKEN": "hf_xxx",  # For gated models
    }
)

cluster.wait_until_ready()
Making Requests
vLLM exposes an OpenAI-compatible API, so you can use any OpenAI SDK:
from openai import OpenAI

client = OpenAI(
    base_url="https://your-cluster.vectorlay.dev/v1",
    api_key="YOUR_VECTORLAY_API_KEY"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in simple terms."}
    ],
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
Streaming Responses
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem about distributed systems"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Performance Benchmarks
We benchmarked Llama 3.1 8B on VectorLay infrastructure:
| Metric | RTX 4090 | RTX 3090 | A100 40GB |
|---|---|---|---|
| Tokens/second | 85 | 52 | 110 |
| Time to first token | 42ms | 68ms | 35ms |
| Max concurrent users | 150+ | 80 | 200+ |
| Cost per 1M tokens | $0.10 | $0.12 | $0.25 |
Benchmarks: Llama 3.1 8B Instruct, vLLM 0.5.x, context length 4096, 50 concurrent requests
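To reproduce these numbers against your own cluster, time-to-first-token and throughput can be measured with the same streaming API shown above. A minimal single-request sketch; concurrency level, warmup, and prompt mix are up to you and will change the results:
import time
from openai import OpenAI

client = OpenAI(base_url="https://your-cluster.vectorlay.dev/v1", api_key="YOUR_VECTORLAY_API_KEY")

start = time.perf_counter()
first_token_at = None
completion_tokens = 0

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        completion_tokens += 1  # roughly one token per streamed chunk (approximation)

elapsed = time.perf_counter() - start
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Throughput: {completion_tokens / elapsed:.1f} tokens/s")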
Cost Comparison
Let's compare the cost of processing 10 million tokens (roughly 7.5 million words):
| Provider | Model | Cost (10M tokens) |
|---|---|---|
| OpenAI | GPT-4o | $50.00 |
| OpenAI | GPT-4o-mini | $1.50 |
| Anthropic | Claude 3 Sonnet | $30.00 |
| Together AI | Llama 3.1 8B | $1.80 |
| VectorLay | Llama 3.1 8B | $1.00 |
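The self-hosted figures above are just GPU time divided by aggregate throughput. A back-of-the-envelope version of that arithmetic; the hourly rate and throughput below are illustrative placeholders, not VectorLay pricing:
def cost_per_million_tokens(gpu_hourly_usd: float, aggregate_tokens_per_sec: float) -> float:
    """USD per 1M generated tokens for a single GPU replica."""
    tokens_per_hour = aggregate_tokens_per_sec * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Illustrative: a ~$0.35/hr GPU sustaining ~1,000 tokens/s across batched requests
print(f"${cost_per_million_tokens(0.35, 1000):.2f} per 1M tokens")  # ~= $0.10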
Production Patterns
High Availability
For production workloads, run multiple replicas across different nodes:
cluster = client.clusters.create(
    name="llama-prod",
    image="your-registry/vllm-llama:latest",
    gpu_type="rtx-4090",
    replicas=5,  # 5 replicas for redundancy
    # VectorLay automatically spreads replicas across different physical nodes
)
Graceful Degradation
If a node fails, VectorLay automatically routes traffic to healthy replicas and schedules a replacement. Your application sees briefly elevated latency while traffic shifts; pair this with client-side retries (below) so any requests in flight on the failed node are resent rather than surfaced as errors.
Request Retries
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

API_KEY = "YOUR_VECTORLAY_API_KEY"

@retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=1, max=10))
async def generate_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://your-cluster.vectorlay.dev/v1/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60.0
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]
Running Larger Models
For models that don't fit on a single GPU (like Llama 70B or Mixtral), use tensor parallelism. VectorLay supports multi-GPU deployments:
# Dockerfile for Llama 70B on 2x A100
FROM vllm/vllm-openai:latest

ENV MODEL_NAME="meta-llama/Meta-Llama-3.1-70B-Instruct"
ENV TENSOR_PARALLEL_SIZE=2

# Reset any base-image ENTRYPOINT and use shell form so the ENV values expand
ENTRYPOINT []
CMD python -m vllm.entrypoints.openai.api_server \
    --model "${MODEL_NAME}" \
    --tensor-parallel-size "${TENSOR_PARALLEL_SIZE}" \
    --port 8080
Then request two GPUs per replica so vLLM can shard the model across them:
cluster = client.clusters.create(
    name="llama-70b",
    image="your-registry/vllm-llama-70b:latest",
    gpu_type="a100-80gb",
    gpus_per_replica=2,  # 2 GPUs per replica for tensor parallelism
    replicas=2
)
Real-World Use Cases
- Customer support: AI agents that handle 90%+ of tickets
- Code assistants: IDE integrations for code completion and review
- Content generation: Blog posts, product descriptions, marketing copy
- Data extraction: Parse unstructured documents into structured data
- Search & RAG: Combine with vector databases for retrieval-augmented generation (see the sketch below)
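As a concrete illustration of the last item, here is a minimal RAG loop against the OpenAI-compatible endpoint shown earlier. The search_vector_db helper is a hypothetical stand-in for whatever vector store you use (pgvector, Qdrant, Pinecone, etc.):
from openai import OpenAI

client = OpenAI(base_url="https://your-cluster.vectorlay.dev/v1", api_key="YOUR_VECTORLAY_API_KEY")

def search_vector_db(query: str, k: int = 3) -> list[str]:
    # Hypothetical retrieval step: return the k most relevant document chunks
    raise NotImplementedError("wire this to your vector database")

def answer_with_rag(question: str) -> str:
    context = "\n\n".join(search_vector_db(question))
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content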
Next Steps
Start building with LLMs
Get $10 in free credits—enough to process ~100 million tokens.