Deploy Your First Model on VectorLay
Go from zero to running inference in under 10 minutes. This guide walks you through creating an account, deploying a cluster, and making your first API call.
What You'll Need
- A VectorLay account (free to sign up)
- A Docker image with your model (or use our examples)
- Basic familiarity with REST APIs
Step 1: Create Your Account
Head to app.vectorlay.com and sign up with your email or GitHub account. You'll get $10 in free credits to start—enough to run a small model for several hours.
Once you're in, you'll land on the dashboard. This is your command center for managing clusters, viewing usage, and generating API keys.
Step 2: Generate an API Key
Before deploying, you'll need an API key for programmatic access:
- Click Settings in the sidebar
- Navigate to API Keys
- Click Create New Key
- Give it a name like "my-first-deployment"
- Copy the key—you won't see it again!
Security tip: Store your API key in an environment variable, never in code. Treat it like a password.
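For example, export the key in your shell and read it from the environment at runtime. The variable name VECTORLAY_API_KEY below is just a convention for this guide, not something the platform requires:

```python
import os

# Read the key from the environment rather than hardcoding it.
# Set it in your shell first, e.g.: export VECTORLAY_API_KEY="..."
api_key = os.environ.get("VECTORLAY_API_KEY")
if api_key is None:
    raise RuntimeError("VECTORLAY_API_KEY is not set")
```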
Step 3: Create a Cluster
A cluster is a deployment of your containerized model across one or more GPU nodes. Let's create one:
- Click Clusters in the sidebar
- Click New Cluster
- Enter a name (e.g., "llama-inference")
- Choose your GPU type (RTX 4090 recommended for most models)
- Set replica count (start with 1)
- Enter your Docker image URL
Using Our Example Image
Don't have a Docker image ready? Use our example image, which runs a simple HTTP server that responds to inference requests:
```
ghcr.io/vectorlay/examples/echo-server:latest
```

Configuration Options
You can also configure the following (a sketch of setting these via the SDK follows the list):
- Environment variables: Pass secrets and config to your container
- Port: Which port your container listens on (default: 8080)
- Health check path: Endpoint for liveness probes (default: /health)
- Startup timeout: How long to wait for your container to be ready
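If you prefer to set these programmatically, here is a minimal sketch using the Python SDK introduced later in this guide. The parameter names env, port, health_check_path, and startup_timeout are assumptions based on the dashboard fields above, not confirmed SDK arguments:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical: the last four parameter names mirror the dashboard fields
# and may differ in the real SDK.
cluster = client.clusters.create(
    name="llama-inference",
    image="ghcr.io/vectorlay/examples/echo-server:latest",
    gpu_type="rtx-4090",
    replicas=1,
    env={"MODEL_PATH": "/models/llama"},  # environment variables for the container
    port=8080,                            # port your container listens on
    health_check_path="/health",          # liveness probe endpoint
    startup_timeout=300,                  # seconds to wait for readiness
)
```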
Step 4: Wait for Deployment
Click Deploy and watch the magic happen. You'll see your cluster go through these states:
- Pending: Finding available GPU nodes
- Deploying: Pulling your image and starting containers
- Running: Your model is live!
Deployment typically takes 1-3 minutes depending on image size and GPU availability.
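If you're scripting a deployment, you can poll for the Running state instead of watching the dashboard. This sketch assumes the SDK exposes a clusters.get method and a status attribute matching the state names above; neither is confirmed by this guide, which only documents wait_until_ready (shown later):

```python
import time

import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical polling loop: `clusters.get` and `.status` are assumed names.
# The SDK's built-in cluster.wait_until_ready() (shown below) is the
# documented way to do this.
cluster = client.clusters.get("llama-inference")
while cluster.status != "Running":
    print(f"Cluster state: {cluster.status}")
    time.sleep(5)
    cluster = client.clusters.get("llama-inference")
print("Cluster is live!")
```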
Step 5: Make Your First Request
Once your cluster is running, you'll see an endpoint URL on the cluster details page. Let's test it with curl:
```bash
curl -X POST https://your-cluster.vectorlay.dev/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!"}'
```

If you're using the echo server example, you'll get back:
```json
{
  "received": {"prompt": "Hello, world!"},
  "gpu": "NVIDIA RTX 4090",
  "node": "us-west-1-abc123"
}
```
Step 6: Scale Up
Getting more traffic? Scale your cluster with a single click:
- Go to your cluster details page
- Click Scale
- Increase the replica count
- New replicas spin up on additional GPU nodes
VectorLay automatically load balances across all healthy replicas. If a node goes down, traffic is rerouted to remaining replicas and a replacement is scheduled.
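If you'd rather scale from code, something like the following may work with the SDK. Note that cluster.scale is a hypothetical method name, since this guide only documents scaling through the dashboard:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical: `clusters.get` and `scale` are assumed method names,
# not confirmed by this guide.
cluster = client.clusters.get("llama-inference")
cluster.scale(replicas=4)  # new replicas spin up on additional GPU nodes
```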
Using the SDK
Prefer code over the dashboard? Use our Python SDK:
```bash
pip install vectorlay
```

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Create a cluster
cluster = client.clusters.create(
    name="my-model",
    image="your-registry/your-model:latest",
    gpu_type="rtx-4090",
    replicas=2,
)

# Wait for deployment
cluster.wait_until_ready()

# Make inference requests
response = cluster.infer({"prompt": "Hello!"})
print(response)
```

Monitoring & Logs
Once your cluster is running, you can monitor it from the dashboard:
- Metrics: Request count, latency percentiles, GPU utilization
- Logs: Container stdout/stderr streamed in real-time
- Events: Deployment events, scaling actions, health check results
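Logs may also be reachable from the SDK. The accessor below (cluster.logs) is purely an illustrative assumption, as this guide only documents the dashboard views:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")
cluster = client.clusters.get("my-model")

# Hypothetical accessor: `logs` is an assumed method name.
for line in cluster.logs(follow=True):  # stream container stdout/stderr
    print(line)
```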
Next Steps
You've successfully deployed your first model on VectorLay! Here's what to explore next: