Open vs Closed models
We handle open & closed models differently
Our API simplifies working with a wide variety of AI models, including both popular closed-source options and flexible open-source alternatives. Although how we handle requests differs behind the scenes depending on the model type, you benefit significantly from our Unified Model Protocol.
This protocol means you can use the same input format to interact with any model on our platform—whether it’s open or closed-source. It makes experimenting and switching between models much easier, almost like swapping Lego bricks in your project. This consistency frees you up to focus purely on building your application logic, rather than managing different provider interfaces.
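As a quick illustration of what that consistency could look like in practice, here is a minimal sketch in Python. The endpoint URL, field names, and model identifiers below are assumptions made for illustration, not the exact API surface:

```python
import requests

# Hypothetical request shape: the endpoint path and field names below are
# illustrative assumptions, not the exact Bytez API.
BASE_URL = "https://api.bytez.example/models"
HEADERS = {"Authorization": "Bearer YOUR_BYTEZ_KEY"}

payload = {
    "messages": [{"role": "user", "content": "Summarize the plot of Hamlet in one sentence."}]
}

# Closed-source model: inference runs on the provider's servers.
closed = requests.post(f"{BASE_URL}/openai/gpt-4o", json=payload, headers=HEADERS)

# Open-source model: inference runs on a serverless GPU cluster we manage.
open_model = requests.post(f"{BASE_URL}/microsoft/Phi-3-mini-4k-instruct", json=payload, headers=HEADERS)

# Both responses come back in the same standardized JSON shape.
print(closed.json())
print(open_model.json())
```

The only thing that changes between the two calls is the model identifier; the request body and the response shape stay the same.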
Now, let’s dive into the specific ways we handle requests under the hood for open vs. closed models to provide this seamless experience.
Closed-Source Models (e.g., OpenAI, Anthropic, Gemini)
Think of us as a smart, multi-lingual translator and secure messenger when you use closed-source models. Our Unified Model Protocol means you use one consistent format for your requests and receive responses in one consistent format, regardless of the underlying provider.
The Process:
1. You send a request: your app sends an API request using our standardized input format.
2. We translate the input: we automatically translate your request into the specific format required by the chosen model provider (e.g., OpenAI, Google Gemini).
3. We forward the request: we securely pass your request to the model provider's API, using your API key, so the provider knows it's from you.
4. The provider computes: the provider runs inference on their servers.
5. We translate the output: we receive the provider's raw response and translate it into standardized JSON.
6. You receive the response: your app gets the inference results back as standardized JSON.
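To make the translation step concrete, here is a rough conceptual sketch of what translating a unified request into an OpenAI-style chat-completions payload might involve. The unified field names are assumptions, and this is not our actual translation code; only the output's model/messages shape reflects OpenAI's real payload format:

```python
# Conceptual sketch of the "We Translate Input" step. The unified request
# fields are assumptions; the output mirrors OpenAI's chat-completions
# payload shape. This is not the actual translation layer.
def translate_to_openai(unified_request: dict) -> dict:
    return {
        "model": unified_request["model"],          # e.g. "gpt-4o"
        "messages": unified_request["messages"],    # same role/content shape
        "max_tokens": unified_request.get("max_tokens", 1024),
        "temperature": unified_request.get("temperature", 1.0),
    }

# A different provider would get its own translator, but the request format
# you send never changes.
```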
Key Takeaway: For closed-source models, we act as a router and standardization layer. You interact with a single, unified protocol, making it easy to switch between model providers or use multiple providers without changing your code structure. The inference itself happens on the provider’s infrastructure.
Billing: We don’t charge anything for closed-source models. You’re billed directly by the provider, at the provider’s pricing, through the API key you provide.
Open‑Source Models – Serverless GPU Inference
When you run an open‑source model, Bytez spins up a GPU cluster, runs inference on it, and then tears it down when your traffic stops. You get zero‑to‑scale performance without managing any infrastructure.
How the lifecycle works
Create cluster (optional, recommended)
You can create a cluster in advance to reduce cold‑start latency. This is optional, but recommended for production workloads, as it gives you full control.
Notes:
- Cold start latency ≈ 15‑60s, depending on model size. We try to minimize this.
- Cluster instances are single tenant, meaning you get exclusive access to GPUs.
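For example, pre-creating a cluster might look roughly like the sketch below. The endpoint, field names, and values shown are illustrative assumptions, not the exact cluster API:

```python
import requests

# Hypothetical cluster-creation call: the endpoint and fields are
# illustrative assumptions, not the exact cluster API.
resp = requests.post(
    "https://api.bytez.example/clusters",
    headers={"Authorization": "Bearer YOUR_BYTEZ_KEY"},
    json={
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "instances": 1,          # keep one instance warm to avoid the 15-60s cold start
        "timeout_minutes": 10,   # idle timeout before auto shutdown
    },
)
print(resp.json())
```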
Model Inference
When you run a model, if you don’t have a cluster created, we create one for you using the defaults above.
We route your request to your model cluster, which load balances the request across instances.
Automatic scaling
The cluster auto-scales based on the number of concurrent requests received.
With deep learning models, typically the entire GPU is used for inference. This means that if you need 2 concurrent requests at any given moment, we need to spin up 2 instances of the model.
The cluster tries to maintain the number of instances equal to the number of concurrent requests.
• Scaling rule: instances = concurrent requests
• More traffic? Cluster scales up
• Less traffic? Cluster scales down
Your open source model cluster scales to your needs.
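The scaling rule itself is simple enough to express in one line; the function below is just an illustration of the instances = concurrent requests rule, not the actual scheduler:

```python
def desired_instances(concurrent_requests: int) -> int:
    # One single-tenant GPU instance per concurrent request.
    # Zero concurrent requests means zero instances (scale to zero).
    return concurrent_requests

assert desired_instances(0) == 0   # idle, eligible for auto shutdown
assert desired_instances(2) == 2   # two requests in flight need two instances
```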
Auto shutdown
If concurrent requests = 0 for 10 minutes, the cluster is considered idle and shuts down.
By default, the timeout is 10 minutes (see above). You can configure this when you create a cluster or after a cluster is created.
Key Takeaway: We try to make open source models NoOps for you. We manage the hardware, auto-scaling, and cleanup. You can configure your clusters to make them work for you, and at any moment, you can use our Cluster CRUD to have full control.
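As a rough sketch of what that control could look like, here is a hypothetical timeout update and a manual teardown using the Cluster CRUD. The endpoints and field names are assumptions for illustration:

```python
import requests

HEADERS = {"Authorization": "Bearer YOUR_BYTEZ_KEY"}
cluster_url = "https://api.bytez.example/clusters/CLUSTER_ID"  # hypothetical endpoint

# Raise the idle timeout from the 10-minute default to 30 minutes.
requests.patch(cluster_url, headers=HEADERS, json={"timeout_minutes": 30})

# Tear the cluster down manually instead of waiting for auto shutdown.
requests.delete(cluster_url, headers=HEADERS)
```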
Billing: Open source models follow an instance-based billing model. You’re charged for the total instance seconds across all active instances in your cluster.
Let’s say a cluster had 2 instances that were active for t1 seconds and t2 seconds, respectively.
- The total instance seconds is calculated as: total instance seconds = t1 + t2
- Assume the price per second for the instance type is $0.00001.
- The total bill is then calculated using the formula: total bill = total instance seconds × price per second
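Plugging in assumed example values, say 120 seconds and 300 seconds of active time, the arithmetic works out like this:

```python
# Illustrative billing arithmetic with assumed active times.
t1, t2 = 120, 300                  # seconds each instance was active (assumed)
price_per_second = 0.00001         # example instance price from above

total_instance_seconds = t1 + t2                        # 420
total_bill = total_instance_seconds * price_per_second
print(f"${total_bill:.5f}")                             # $0.00420
```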
Our goal with open source models is to make them as easy and affordable to use as closed source models.