Clusters
When you run open source models, we create serverless clusters for you
Open models are great, but they’re not as easy to use as closed models. They require you, the developer, to think about extra steps: which hardware to run on, whether to quantize, and whether to keep your instances always on, shut them off, or try to make them serverless. Running open models requires DevOps; this is especially true when it comes to running large models affordably, at scale.
To make open models easy to use, our approach begins with benchmarking and ends with auto-scaling groups that act like serverless instances.
To begin, we benchmark models to understand how much GPU RAM is required for inference.
We use this information to place models on the right-size GPU.
Below is a simple breakdown of our process.
Our process “behind the scenes”
Add a new model to our sdk...
- Model debugging: we debug the model and ensure it runs in a secure container
- Model benchmarking: we determine how much RAM is required for inference
- Instance selection: we select the cheapest GPU instances with enough RAM for the model
- Model becomes available: we add the model to our sdk, making it available to you
Later, when you want to run the model
Cluster creation
To run an open model, you’ll need a cluster. A cluster is an auto-scaling group configured with the right GPU-backed instances to support running your open model. This auto-scaling group scales up and down for you, allowing open models to scale infinitely with your traffic demands. Like serverless setups, if the instances in your cluster don’t receive traffic after a certain period of time, they shut down, saving you money.
To create a cluster, you’ll use either `model.create()` or `model.run()`.

If you run `model.create()`, you’ll create a cluster for yourself, giving you full control.

If you skip `model.create()` and instead execute `model.run()` first, we’ll automatically run `model.create()` for you using the default params. After the `create` operation succeeds, we then pass your `run` request to the cluster.
import Bytez from "bytez.js";

const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.create({
  timeout: 10, // if no reqs are received for `10` mins, then shut down
  capacity: { // capacity can be thought of as `concurrency`
    min: 0, // scale down to this number
    desired: 1, // try to maintain this number of instances
    max: 1, // scale up to this number
  },
});

console.log({ error, output });
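If you’d rather take the run-first path, here’s a minimal sketch (the input string is a placeholder):

import Bytez from "bytez.js";

const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

// no model.create() here: the first run() creates the cluster for you
// with the default params, then passes this request to it
const { error, output } = await model.run("your model input");
console.log({ error, output });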
A cluster spins up instances. Instances cold boot.
Once a cluster is created, it begins to spin up instances based on the cluster capacity, which is configurable by you, the developer.
Each instance that spins up has a delay before it’s ready to serve traffic.
Let’s define this delay as `cold boot`.
Under the hood, the `cold boot` timeline includes provisioning the instance, which involves low-level operations like the time it takes to virtually attach file systems and network cards.
Once the OS is booted, we have control, and we race to download weights, load them onto a GPU, and make the instance ready for traffic. We’ve optimized many of these steps, like skipping weight downloads where possible, to reduce cold boot time.
Cluster updates
You can update a cluster at any time by running `model.update()`.
Let’s say you want to update your cluster `timeout` to `2` minutes:
import Bytez from "bytez.js";
const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.update({ timeout: 2 });
console.log({ error, output });
Auto-scaling
Clusters are configured with an auto-scaling policy that tries to maintain 1 concurrent request per instance. This is because inference typically needs the entire GPU.
$$\text{Scaling policy to maintain} = \frac{\text{concurrent requests}}{\text{instance count}} = 1$$
For example, if the load balancer detects that your traffic over time is:
$$\frac{2.1 \text{ concurrent requests}}{1 \text{ instance}} = \text{ratio over } 1 = \text{needs to scale up}$$

$$\frac{3.5 \text{ concurrent requests}}{5 \text{ instances}} = \text{ratio below } 1 = \text{needs to scale down}$$
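Here’s a minimal sketch of that arithmetic; the `desiredInstances` helper and its ceiling rounding are illustrative assumptions, not the SDK’s actual scaler:

// illustrative only: how many instances keep the ratio at the target
function desiredInstances(concurrentRequests, targetRatio = 1) {
  // maintain concurrent requests / instance count = targetRatio
  return Math.ceil(concurrentRequests / targetRatio);
}

console.log(desiredInstances(2.1)); // 3, so 1 instance must scale up
console.log(desiredInstances(3.5)); // 4, so 5 instances can scale down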
Cluster deletion
There are two ways a cluster can be deleted:
- You can run `model.delete()`
- Or, if the cluster is `idle` for `N` minutes, it auto-terminates
Idle clusters
A cluster is `idle` when it hasn’t received requests for a certain `N` number of minutes.
`N` here is the `timeout` parameter that’s part of your cluster config. `timeout` can be set during cluster creation or updated while your cluster is alive.
For example, let’s say your cluster has a `timeout` of `2`. If your cluster doesn’t receive requests for 2 minutes, it auto-deletes. You can think of this as a safety measure, in case a developer forgets to run `model.delete()`.
Cluster billing
Clusters launch instances. Instances cost money per second. Your cluster will incur a bill for the total number of instance-seconds it accumulates over its life.
Let’s say a cluster had 2 instances that were active for $t_1 = 300$ seconds and $t_2 = 120$ seconds, respectively.

- The total instance seconds is calculated as: $t_{\text{total}} = t_1 + t_2 = 300\text{s} + 120\text{s} = 420\text{s}$
- Assume the price per second for the instance type is \$0.00001.

The total bill is then calculated using the formula:

$$\text{Bill} = (\text{Total Instance Seconds}) \times (\text{Price Per Second})$$

Plugging in the values:

$$\text{Bill} = 420 \text{ seconds} \times \frac{\$0.00001}{\text{second}} = \$0.0042$$
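Here’s the same arithmetic as a minimal sketch, using the numbers from the example above:

// worked example of the billing formula
const pricePerSecond = 0.00001; // $ per instance-second (example rate)
const instanceSeconds = [300, 120]; // t1 and t2

const totalSeconds = instanceSeconds.reduce((sum, t) => sum + t, 0); // 420
const bill = totalSeconds * pricePerSecond;

console.log(bill.toFixed(4)); // "0.0042"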
Let’s say you want to create an “openai-community/gpt-2” cluster that times out after 2 minutes. You want a concurrency of 2 requests at any given moment, so you set `capacity` to 2.
import Bytez from "bytez.js";
const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.create({
  timeout: 2,
  capacity: { min: 2, max: 2 },
});
console.log({ error, output });
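You can then read the cluster back: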
const { error, output } = await model.read();
console.log({ error, output });
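To scale it up, update the capacity: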
const { error, output } = await model.update({
  capacity: { min: 5, max: 5 },
});
console.log({ error, output });
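When you no longer need the cluster, delete it: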
const { error, output } = await model.delete();
console.log({ error, output });
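And to run the model: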
const input = "your model input";
const { error, output } = await model.run(input);
console.log({ error, output });