Clusters
When you run open source models, we create serverless clusters for you
Open models are great, but they’re not as easy to use as closed models. They require you, the developer, to think about extra steps: which hardware to run on, whether to quantize, and whether to keep your instances always on, shut them off, or try to make them serverless. Running open models requires DevOps; this is especially true when it comes to running large models affordably, at scale.
To make open models easy to use, our approach begins with benchmarking and ends with auto-scaling groups that act like serverless instances.
To begin, we benchmark models to understand how much GPU RAM is required for inference.
We use this information to place models on the right-size GPU.
Below is a simple breakdown of our process.
Our process “behind the scenes”
Add a new model to our sdk...
- Model debugging: we debug the model and ensure it runs in a secure container
- Model benchmarking: we determine how much RAM is required for inference
- Instance selection: we select the cheapest GPU instances with enough RAM for the model
- Model becomes available: we add the model to our sdk, making it available to you
Later, when you want to run the model
Cluster creation
To run an open model, you’ll need a cluster. A cluster is an auto-scaling group configured with the right GPU-backed instances to support running your open model. This auto-scaling group scales up and down for you, allowing open models to scale infinitely with your traffic demands. Like serverless setups, if the instances in your cluster don’t receive traffic after a certain period of time, they shut down, saving you money.
To create a cluster, you’ll use either `model.create()` or `model.run()`.

If you run `model.create()`, you’ll create a cluster for yourself, giving you full control.

If you skip `model.create()` and instead execute `model.run()` first, we’ll automatically run `model.create()` for you using the default params. After the `create` operation succeeds, we then pass your `run` request to the cluster.
import Bytez from "bytez.js";

const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.create({
  timeout: 10, // if no reqs are received for `10` mins, then shut down
  capacity: { // capacity can be thought of as `concurrency`
    min: 0, // scale down to this number
    desired: 1, // try to maintain this number of instances
    max: 1, // scale up to this number
  },
});

console.log({ error, output });
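If you’d rather take the run-first path, here’s a minimal sketch (the input string is a placeholder):

import Bytez from "bytez.js";

const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

// no model.create() here: the first run() creates the cluster for you
// with the default params, then passes this request to it
const { error, output } = await model.run("your model input");
console.log({ error, output });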
A cluster spins up instances. Instances cold boot.
Once a cluster is created, it begins to spin up instances based on the cluster capacity, which is configurable by you, the developer.
Each instance that spins up has a delay before it’s ready to serve traffic.
Let’s define this delay as `cold boot`.
Under the hood, the `cold boot` timeline includes provisioning the instance, which involves low-level operations like the time it takes to virtually attach file systems and network cards.
Once the OS is booted, we have control, and we race to download weights, load them onto a GPU, and make the instance ready for traffic. We’ve optimized many of these steps, like skipping weight downloads where possible, to reduce cold boot time.
Cluster updates
You can update a cluster at any time by running `model.update()`.
Let’s say you want to update your cluster `timeout` to `2` minutes:
import Bytez from "bytez.js";
const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.update({ timeout: 2 });
console.log({ error, output });
Auto-scaling
Clusters are configured with an auto-scaling policy that tries to maintain 1 concurrent request per instance. This is because inference typically needs the entire GPU.
$$\text{Scaling policy to maintain} = \frac{\text{concurrent requests}}{\text{instance count}} = 1$$
For example, if the load balancer detects that your traffic over time is:
$$\frac{2.1 \text{ concurrent requests}}{1 \text{ instance}} = \text{ratio over } 1 = \text{needs to scale up}$$

$$\frac{3.5 \text{ concurrent requests}}{5 \text{ instances}} = \text{ratio below } 1 = \text{needs to scale down}$$
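Here’s a minimal sketch of that arithmetic; the `desiredInstances` helper and its ceiling rounding are illustrative assumptions, not the SDK’s actual scaler:

// illustrative only: how many instances keep the ratio at the target
function desiredInstances(concurrentRequests, targetRatio = 1) {
  // maintain concurrent requests / instance count = targetRatio
  return Math.ceil(concurrentRequests / targetRatio);
}

console.log(desiredInstances(2.1)); // 3, so 1 instance must scale up
console.log(desiredInstances(3.5)); // 4, so 5 instances can scale down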
Cluster deletion
There are two ways a cluster can be deleted:
- You can run `model.delete()`
- Or, if the cluster is `idle` for `N` minutes, it auto-terminates
Idle clusters
A cluster is `idle` when it hasn’t received requests for a certain `N` number of minutes.
`N` here is the `timeout` parameter that’s part of your cluster config. `timeout` can be set during cluster creation or updated while your cluster is alive.
For example, let’s say your cluster has a `timeout` of `2`. If your cluster doesn’t receive requests for 2 minutes, it auto-deletes. You can think of this as a safety measure, in case a developer forgets to run `model.delete()`.
Cluster billing
Clusters launch instances. Instances cost money per second. Your cluster will incur a bill for the total number of instance-seconds it accumulates over its life.
Let’s say a cluster had 2 instances that were active for $t_1 = 300$ seconds and $t_2 = 120$ seconds, respectively.

- The total instance seconds is calculated as: $t_{\text{total}} = t_1 + t_2 = 300\text{s} + 120\text{s} = 420\text{s}$
- Assume the price per second for the instance type is \$0.00001.

The total bill is then calculated using the formula:

$$\text{Bill} = (\text{Total Instance Seconds}) \times (\text{Price Per Second})$$

Plugging in the values:

$$\text{Bill} = 420 \text{ seconds} \times \frac{\$0.00001}{\text{second}} = \$0.0042$$
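Here’s the same arithmetic as a minimal sketch, using the numbers from the example above:

// worked example of the billing formula
const pricePerSecond = 0.00001; // $ per instance-second (example rate)
const instanceSeconds = [300, 120]; // t1 and t2

const totalSeconds = instanceSeconds.reduce((sum, t) => sum + t, 0); // 420
const bill = totalSeconds * pricePerSecond;

console.log(bill.toFixed(4)); // "0.0042"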
Let’s say you want to create an “openai-community/gpt-2” cluster that times out after 2 minutes. You want a concurrency of 2 requests at any given moment, so you set `capacity` to 2.
import Bytez from "bytez.js";
const client = new Bytez("BYTEZ_KEY");
const model = client.model("openai-community/gpt-2");

const { error, output } = await model.create({
  timeout: 2,
  capacity: { min: 2, max: 2 },
});
console.log({ error, output });
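You can then read the cluster back: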
const { error, output } = await model.read();
console.log({ error, output });
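To scale it up, update the capacity: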
const { error, output } = await model.update({
  capacity: { min: 5, max: 5 },
});
console.log({ error, output });
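When you no longer need the cluster, delete it: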
const { error, output } = await model.delete();
console.log({ error, output });
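And to run the model: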
const input = "your model input";
const { error, output } = await model.run(input);
console.log({ error, output });