Deploy Your First Model on VectorLay
Go from zero to running inference in under 10 minutes. This guide walks you through creating an account, deploying a cluster, and making your first API call.
What You'll Need
- A VectorLay account (free to sign up)
- A Docker image with your model (or use our examples)
- Basic familiarity with REST APIs
Step 1: Create Your Account
Head to app.vectorlay.com and sign up with your email or GitHub account. You'll get $10 in free credits to start—enough to run a small model for several hours.
Once you're in, you'll land on the dashboard. This is your command center for managing clusters, viewing usage, and generating API keys.
Step 2: Generate an API Key
Before deploying, you'll need an API key for programmatic access:
- Click Settings in the sidebar
- Navigate to API Keys
- Click Create New Key
- Give it a name like "my-first-deployment"
- Copy the key—you won't see it again!
Security tip: Store your API key in an environment variable, never in code. Treat it like a password.
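For example, export the key in your shell and read it from the environment at runtime. The variable name VECTORLAY_API_KEY below is just a convention for this guide, not something the platform requires:

```python
import os

# Read the key from the environment rather than hardcoding it.
# Set it in your shell first, e.g.: export VECTORLAY_API_KEY="..."
api_key = os.environ.get("VECTORLAY_API_KEY")
if api_key is None:
    raise RuntimeError("VECTORLAY_API_KEY is not set")
```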
Step 3: Create a Cluster
A cluster is a deployment of your containerized model across one or more GPU nodes. Let's create one:
- Click Clusters in the sidebar
- Click New Cluster
- Enter a name (e.g., "llama-inference")
- Choose your GPU type (RTX 4090 recommended for most models)
- Set replica count (start with 1)
- Enter your Docker image URL
Using Our Example Image
Don't have a Docker image ready? Use our example image, which runs a simple HTTP server that responds to inference requests:
```
ghcr.io/vectorlay/examples/echo-server:latest
```

Configuration Options
You can also configure the following (a sketch of setting these via the SDK follows the list):
- Environment variables: Pass secrets and config to your container
- Port: Which port your container listens on (default: 8080)
- Health check path: Endpoint for liveness probes (default: /health)
- Startup timeout: How long to wait for your container to be ready
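If you prefer to set these programmatically, here is a minimal sketch using the Python SDK introduced later in this guide. The parameter names env, port, health_check_path, and startup_timeout are assumptions based on the dashboard fields above, not confirmed SDK arguments:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical: the last four parameter names mirror the dashboard fields
# and may differ in the real SDK.
cluster = client.clusters.create(
    name="llama-inference",
    image="ghcr.io/vectorlay/examples/echo-server:latest",
    gpu_type="rtx-4090",
    replicas=1,
    env={"MODEL_PATH": "/models/llama"},  # environment variables for the container
    port=8080,                            # port your container listens on
    health_check_path="/health",          # liveness probe endpoint
    startup_timeout=300,                  # seconds to wait for readiness
)
```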
Step 4: Wait for Deployment
Click Deploy and watch the magic happen. You'll see your cluster go through these states:
- Pending: Finding available GPU nodes
- Deploying: Pulling your image and starting containers
- Running: Your model is live!
Deployment typically takes 1-3 minutes depending on image size and GPU availability.
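If you're scripting a deployment, you can poll for the Running state instead of watching the dashboard. This sketch assumes the SDK exposes a clusters.get method and a status attribute matching the state names above; neither is confirmed by this guide, which only documents wait_until_ready (shown later):

```python
import time

import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical polling loop: `clusters.get` and `.status` are assumed names.
# The SDK's built-in cluster.wait_until_ready() (shown below) is the
# documented way to do this.
cluster = client.clusters.get("llama-inference")
while cluster.status != "Running":
    print(f"Cluster state: {cluster.status}")
    time.sleep(5)
    cluster = client.clusters.get("llama-inference")
print("Cluster is live!")
```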
Step 5: Make Your First Request
Once your cluster is running, you'll see an endpoint URL on the cluster details page. Let's test it with curl:
```bash
curl -X POST https://your-cluster.vectorlay.dev/inference \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, world!"}'
```

If you're using the echo server example, you'll get back:
```json
{
  "received": {"prompt": "Hello, world!"},
  "gpu": "NVIDIA RTX 4090",
  "node": "us-west-1-abc123"
}
```
Step 6: Scale Up
Getting more traffic? Scale your cluster with a single click:
- Go to your cluster details page
- Click Scale
- Increase the replica count
- New replicas spin up on additional GPU nodes
VectorLay automatically load balances across all healthy replicas. If a node goes down, traffic is rerouted to remaining replicas and a replacement is scheduled.
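If you'd rather scale from code, something like the following may work with the SDK. Note that cluster.scale is a hypothetical method name, since this guide only documents scaling through the dashboard:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Hypothetical: `clusters.get` and `scale` are assumed method names,
# not confirmed by this guide.
cluster = client.clusters.get("llama-inference")
cluster.scale(replicas=4)  # new replicas spin up on additional GPU nodes
```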
Using the SDK
Prefer code over the dashboard? Use our Python SDK:
```bash
pip install vectorlay
```

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")

# Create a cluster
cluster = client.clusters.create(
    name="my-model",
    image="your-registry/your-model:latest",
    gpu_type="rtx-4090",
    replicas=2,
)

# Wait for deployment
cluster.wait_until_ready()

# Make inference requests
response = cluster.infer({"prompt": "Hello!"})
print(response)
```

Monitoring & Logs
Once your cluster is running, you can monitor it from the dashboard:
- Metrics: Request count, latency percentiles, GPU utilization
- Logs: Container stdout/stderr streamed in real-time
- Events: Deployment events, scaling actions, health check results
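Logs may also be reachable from the SDK. The accessor below (cluster.logs) is purely an illustrative assumption, as this guide only documents the dashboard views:

```python
import vectorlay

client = vectorlay.Client(api_key="YOUR_API_KEY")
cluster = client.clusters.get("my-model")

# Hypothetical accessor: `logs` is an assumed method name.
for line in cluster.logs(follow=True):  # stream container stdout/stderr
    print(line)
```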
Next Steps
You've successfully deployed your first model on VectorLay! Here's what to explore next: