Closed-Source Models (e.g., OpenAI, Anthropic, Gemini)
Think of us as a smart, multi-lingual translator and secure messenger when you use closed-source models. Our Unified Model Protocol means you use one consistent format for your requests and receive responses in one consistent format, regardless of the underlying provider.

Key Takeaway: For closed-source models, we act as a router and standardization layer. You interact with a single, unified protocol, making it easy to switch between model providers or use multiple providers without changing your code structure. The inference itself happens on the provider’s infrastructure.

Billing: We don’t charge anything for closed-source models. Billing is based on the provider’s pricing, and the provider bills you based on the API key you provide.

The Process:
1. You Send Request: Your app sends an API request using our standardized input format.
2. We Translate Input: We automatically translate your request into the specific format required by the chosen model provider (e.g., OpenAI, Google Gemini).
3. Forward Request: We securely pass your request to the model provider’s API, using your API key, so the provider knows it’s from you.
4. Provider Computes: The provider runs inference on their servers.
5. We Translate Output: We receive the provider’s raw response and translate it to standardized JSON.
6. You Receive Response: Your app gets inference results back in standardized JSON.
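Putting the six steps together, a call could look like the sketch below. This is a minimal illustration, not the documented API: the endpoint URL, header names, and the provider-key header are assumptions; the grounded idea is that one request and response shape works regardless of which provider backs the model.

```python
# Minimal sketch of the unified request/response flow (steps 1-6 above).
# The URL, headers, and field names are assumptions, not the documented API.
import requests

def run_closed_source_model(model_id: str, provider_key: str, prompt: str) -> dict:
    resp = requests.post(
        f"https://api.bytez.com/v2/{model_id}",                    # hypothetical unified endpoint (step 1)
        headers={
            "Authorization": "Key YOUR_BYTEZ_KEY",                 # hypothetical Bytez auth header
            "provider-key": provider_key,                          # your OpenAI/Anthropic/Gemini key, forwarded in step 3 (assumed header name)
        },
        json={"messages": [{"role": "user", "content": prompt}]},  # standardized input format (step 1)
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()                                             # standardized JSON output (steps 5-6)

# Switching providers changes only the model id and provider key; the request
# and response shapes stay the same.
print(run_closed_source_model("openai/gpt-4o-mini", "sk-...", "Summarize the unified protocol."))
print(run_closed_source_model("google/gemini-1.5-flash", "AIza...", "Summarize the unified protocol."))
```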
Open‑Source Models – Serverless GPU Inference
When you run an open‑source model, Bytez spins up a GPU cluster, runs inference on it, and then tears it down when your traffic stops.
You get zero-to-scale performance without managing any infrastructure.

Key Takeaway: We try to make open-source models NoOps for you. We manage the hardware, auto-scaling, and cleanup. You can configure your clusters to make them work for you, and at any moment you can use our Cluster CRUD to take full control.

Billing: Open-source models follow an instance-based billing model. You’re charged for the total instance seconds across all active instances in your cluster. For example, say a cluster had 2 instances, each active for some number of seconds:
- The total instance seconds is the sum of each instance’s active seconds.
- Assume the price per second for the instance type is $0.00001. Your charge is the total instance seconds multiplied by that price (worked example below).
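A quick sketch of that arithmetic, using the $0.00001-per-second rate from the example above; the two instance durations are made up purely for illustration:

```python
# Instance-second billing arithmetic. The durations are hypothetical;
# the $0.00001/second rate is the example rate quoted above.
instance_active_seconds = [3600, 1800]  # instance 1 active for 1 hour, instance 2 for 30 minutes (hypothetical)
price_per_second = 0.00001              # example price for the instance type

total_instance_seconds = sum(instance_active_seconds)  # 3600 + 1800 = 5400
charge = total_instance_seconds * price_per_second     # 5400 * $0.00001 = $0.054

print(f"total instance seconds: {total_instance_seconds}")
print(f"charge: ${charge:.3f}")
```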
How the lifecycle works

0. Create cluster (optional, recommended): You can create a cluster in advance to reduce cold-start latency. This is optional, but recommended for production workloads, as it gives you full control (see the sketch after the notes). Notes:
- Cold start latency ≈ 15‑60s, depending on model size. We try to minimize this.
- Cluster instances are single tenant, meaning you get exclusive access to GPUs.
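A minimal sketch of creating a cluster ahead of time, assuming a hypothetical Cluster CRUD endpoint. The URL, field names, and header are illustrative guesses, not the documented API; check the Cluster CRUD reference for the real shape.

```python
# Hypothetical sketch: pre-create a cluster so production traffic skips the cold start.
# The endpoint, headers, and configuration field names are assumptions.
import requests

resp = requests.post(
    "https://api.bytez.com/v2/clusters",              # hypothetical Cluster CRUD endpoint
    headers={"Authorization": "Key YOUR_BYTEZ_KEY"},  # hypothetical auth header
    json={
        "model": "microsoft/Phi-3-mini-4k-instruct",  # example open-source model to serve
        "min_instances": 1,                           # keep one warm, single-tenant instance (assumed field)
        "max_instances": 4,                           # cap auto-scaling (assumed field)
        "idle_timeout_minutes": 10,                   # default idle timeout, configurable here (assumed field)
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```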
1. Model Inference: When you run a model, if you don’t have a cluster created, we create one for you using the defaults above. We route your request to your model cluster, which load-balances the request across instances.
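In practice that means the first request to a model without a warm cluster pays the cold start (roughly 15-60s, per the note above), while later requests hit already-running instances. The endpoint and payload below are the same hypothetical shapes used in the earlier sketches.

```python
# Hypothetical timing sketch: the first call triggers cluster creation (cold start),
# the second reuses the warm instance. Endpoint and payload shape are assumptions.
import time
import requests

URL = "https://api.bytez.com/v2/microsoft/Phi-3-mini-4k-instruct"  # hypothetical open-source model endpoint
HEADERS = {"Authorization": "Key YOUR_BYTEZ_KEY"}
BODY = {"messages": [{"role": "user", "content": "Say hello."}]}

for label in ("first request (cold start)", "second request (warm cluster)"):
    start = time.time()
    resp = requests.post(URL, headers=HEADERS, json=BODY, timeout=300)
    print(f"{label}: {time.time() - start:.1f}s, status {resp.status_code}")
```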
2. Automatic scaling: The cluster auto-scales based on the number of concurrent requests received. With deep learning models, typically the entire GPU is used for inference, so if you need 2 concurrent requests served at any given moment, we need to spin up 2 instances of the model. The cluster tries to keep the number of instances equal to the number of concurrent requests (see the sketch below).
• Scaling rule: instances = concurrent requests
• More traffic? Cluster scales up
• Less traffic? Cluster scales down
Your open-source model cluster scales to your needs.
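As a rough sketch of that scaling rule; this only illustrates the stated behavior, it is not Bytez’s actual autoscaler, and the min/max bounds are assumed cluster settings:

```python
# Illustration of the "instances = concurrent requests" rule.
# Not Bytez's autoscaler; min/max bounds are assumed cluster configuration.
def desired_instances(concurrent_requests: int, min_instances: int = 0, max_instances: int = 10) -> int:
    """Target one dedicated GPU instance per in-flight request, within bounds."""
    return max(min_instances, min(concurrent_requests, max_instances))

print(desired_instances(0))   # 0  -> no traffic, nothing running
print(desired_instances(2))   # 2  -> two concurrent requests need two GPU instances
print(desired_instances(50))  # 10 -> capped at max_instances
```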
3. Auto shutdown: If concurrent requests = 0 for 10 minutes, the cluster is considered idle and shuts down. By default, the idle timeout is 10 minutes (see above). You can configure this when you create a cluster or after a cluster is created.
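A sketch of the idle-shutdown rule as described; this mimics the stated behavior rather than Bytez’s actual scheduler:

```python
# Illustration of the idle-shutdown rule: zero concurrent requests for the full
# idle timeout (10 minutes by default) means the cluster is torn down.
from datetime import datetime, timedelta

def is_idle(last_request_finished_at: datetime, concurrent_requests: int,
            idle_timeout_minutes: int = 10) -> bool:
    idle_for = datetime.utcnow() - last_request_finished_at
    return concurrent_requests == 0 and idle_for >= timedelta(minutes=idle_timeout_minutes)

# Last request finished 12 minutes ago and nothing is in flight -> shut down.
print(is_idle(datetime.utcnow() - timedelta(minutes=12), concurrent_requests=0))  # True
# A request finished 3 minutes ago -> keep the cluster up.
print(is_idle(datetime.utcnow() - timedelta(minutes=3), concurrent_requests=0))   # False
```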