We handle open & closed models differently
Closed-Source Models (e.g., OpenAI, Anthropic, Gemini)
1. You send a request
2. We translate the input
3. We forward the request to the provider
4. The provider computes the response
5. We translate the output
6. You receive the response
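The translation steps above can be sketched as follows. This is an illustrative example only, assuming a unified chat-style request shape; the function name, field names, and provider payload shapes are hypothetical, not the actual implementation.

```python
# Hypothetical sketch of the input-translation step: a unified request
# is mapped onto each provider's expected payload shape. Field names
# are illustrative assumptions, not the real gateway code.

def translate_input(unified: dict, provider: str) -> dict:
    """Map a unified chat request to a provider-specific payload."""
    if provider == "openai":
        # OpenAI-style APIs accept the message list as-is.
        return {"model": unified["model"], "messages": unified["messages"]}
    if provider == "anthropic":
        # Anthropic's Messages API keeps the system prompt separate
        # from the conversation and requires max_tokens.
        system = [m["content"] for m in unified["messages"] if m["role"] == "system"]
        msgs = [m for m in unified["messages"] if m["role"] != "system"]
        return {
            "model": unified["model"],
            "system": "\n".join(system),
            "messages": msgs,
            "max_tokens": unified.get("max_tokens", 1024),
        }
    raise ValueError(f"unknown provider: {provider}")
```

The output-translation step is the mirror image: the provider's response is normalized back into one shared response shape before it is returned to you.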
Open‑Source Models – Serverless GPU Inference
Create a cluster (optional, but recommended)
Model Inference
Automatic scaling
The cluster scales automatically with the number of concurrent requests received. With deep learning models, a single inference typically occupies the entire GPU, so serving 2 concurrent requests at any given moment requires spinning up 2 instances of the model. The cluster tries to keep the number of instances equal to the number of concurrent requests.
• Scaling rule: instances = concurrent requests
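The scaling rule can be sketched in a few lines. The upper cap is an illustrative assumption (a cluster-size limit), not a documented parameter:

```python
def desired_instances(concurrent_requests: int, max_instances: int = 8) -> int:
    """Scaling rule sketch: one full GPU per in-flight request.

    max_instances is a hypothetical cluster-size cap added for
    illustration; the documented rule is simply
    instances = concurrent requests.
    """
    return min(max(concurrent_requests, 0), max_instances)
```

For example, 2 concurrent requests yield 2 model instances, and 0 concurrent requests yield 0 instances, which is what makes the auto-shutdown behavior below possible.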
Auto shutdown
When concurrent requests = 0 for 10 minutes, the cluster is considered idle and shuts down. The default idle timeout is 10 minutes; you can configure it when you create a cluster or after the cluster is created.
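The idle-shutdown check amounts to tracking how long the cluster has had zero in-flight requests. A minimal sketch, assuming a wall-clock timestamp is recorded when the request count drops to zero (the function and variable names are illustrative):

```python
from typing import Optional

IDLE_TIMEOUT_S = 10 * 60  # default idle timeout: 10 minutes, configurable per cluster

def should_shut_down(concurrent_requests: int,
                     idle_since: Optional[float],
                     now: float) -> bool:
    """Return True once the cluster has had zero in-flight requests
    for at least IDLE_TIMEOUT_S seconds."""
    if concurrent_requests > 0 or idle_since is None:
        # Still serving traffic, or the idle clock hasn't started yet.
        return False
    return now - idle_since >= IDLE_TIMEOUT_S
```

Raising the timeout keeps warm instances around longer (fewer cold starts at the cost of idle GPU time); lowering it shuts clusters down sooner.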