Closed-Source Models (e.g., OpenAI, Anthropic, Gemini)
Think of us as a smart, multi-lingual translator and secure messenger when you use closed-source models. Our Unified Model Protocol means you use one consistent format for your requests and receive responses in one consistent format, regardless of the underlying provider.

Key Takeaway: For closed-source models, we act as a router and standardization layer. You interact with a single, unified protocol, making it easy to switch between model providers or use multiple providers without changing your code structure. The inference itself happens on the provider’s infrastructure.

Billing: We don’t charge anything for closed-source models. Billing is based on the provider’s pricing, and the provider bills you based on the API key you provide.

The Process:
1. You Send Request: Your app sends an API request using our standardized input format.
2. We Translate Input: We automatically translate your request into the specific format required by the chosen model provider (e.g., OpenAI, Google Gemini).
3. Forward Request: We securely pass your request to the model provider’s API, using your API key, so the provider knows it’s from you.
4. Provider Computes: The provider runs inference on their servers.
5. We Translate Output: We receive the provider’s raw response and translate it to standardized JSON.
6. You Receive Response: Your app gets inference results back in standardized JSON.
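Putting the six steps together, a call could look like the sketch below. This is a minimal illustration, not the documented API: the endpoint URL, header names, and the provider-key header are assumptions; the grounded idea is that one request and response shape works regardless of which provider backs the model.

```python
# Minimal sketch of the unified request/response flow (steps 1-6 above).
# The URL, headers, and field names are assumptions, not the documented API.
import requests

def run_closed_source_model(model_id: str, provider_key: str, prompt: str) -> dict:
    resp = requests.post(
        f"https://api.bytez.com/v2/{model_id}",                    # hypothetical unified endpoint (step 1)
        headers={
            "Authorization": "Key YOUR_BYTEZ_KEY",                 # hypothetical Bytez auth header
            "provider-key": provider_key,                          # your OpenAI/Anthropic/Gemini key, forwarded in step 3 (assumed header name)
        },
        json={"messages": [{"role": "user", "content": prompt}]},  # standardized input format (step 1)
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()                                             # standardized JSON output (steps 5-6)

# Switching providers changes only the model id and provider key; the request
# and response shapes stay the same.
print(run_closed_source_model("openai/gpt-4o-mini", "sk-...", "Summarize the unified protocol."))
print(run_closed_source_model("google/gemini-1.5-flash", "AIza...", "Summarize the unified protocol."))
```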
Open‑Source Models – Serverless GPU Inference
When you run an open‑source model, Bytez spins up a GPU cluster, runs inference on it, and then tears it down when your traffic stops.
You get zero-to-scale performance without managing any infrastructure.

Key Takeaway: We try to make open-source models NoOps for you. We manage the hardware, auto-scaling, and cleanup. You can configure your clusters to make them work for you, and at any moment you can use our Cluster CRUD to take full control.

Billing: Open-source models follow an instance-based billing model. You’re charged for the total instance seconds across all active instances in your cluster. For example, say a cluster had 2 instances, each active for some number of seconds:
- The total instance seconds is the sum of each instance’s active seconds.
- Assume the price per second for the instance type is $0.00001. Your charge is the total instance seconds multiplied by that price (worked example below).
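A quick sketch of that arithmetic, using the $0.00001-per-second rate from the example above; the two instance durations are made up purely for illustration:

```python
# Instance-second billing arithmetic. The durations are hypothetical;
# the $0.00001/second rate is the example rate quoted above.
instance_active_seconds = [3600, 1800]  # instance 1 active for 1 hour, instance 2 for 30 minutes (hypothetical)
price_per_second = 0.00001              # example price for the instance type

total_instance_seconds = sum(instance_active_seconds)  # 3600 + 1800 = 5400
charge = total_instance_seconds * price_per_second     # 5400 * $0.00001 = $0.054

print(f"total instance seconds: {total_instance_seconds}")
print(f"charge: ${charge:.3f}")
```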
How the lifecycle works

0. Create cluster (optional, recommended): You can create a cluster in advance to reduce cold-start latency. This is optional, but recommended for production workloads, as it gives you full control (see the sketch after the notes). Notes:
- Cold start latency ≈ 15‑60s, depending on model size. We try to minimize this.
- Cluster instances are single tenant, meaning you get exclusive access to GPUs.
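A minimal sketch of creating a cluster ahead of time, assuming a hypothetical Cluster CRUD endpoint. The URL, field names, and header are illustrative guesses, not the documented API; check the Cluster CRUD reference for the real shape.

```python
# Hypothetical sketch: pre-create a cluster so production traffic skips the cold start.
# The endpoint, headers, and configuration field names are assumptions.
import requests

resp = requests.post(
    "https://api.bytez.com/v2/clusters",              # hypothetical Cluster CRUD endpoint
    headers={"Authorization": "Key YOUR_BYTEZ_KEY"},  # hypothetical auth header
    json={
        "model": "microsoft/Phi-3-mini-4k-instruct",  # example open-source model to serve
        "min_instances": 1,                           # keep one warm, single-tenant instance (assumed field)
        "max_instances": 4,                           # cap auto-scaling (assumed field)
        "idle_timeout_minutes": 10,                   # default idle timeout, configurable here (assumed field)
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```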
1. Model Inference: When you run a model, if you don’t have a cluster created, we create one for you using the defaults above. We route your request to your model cluster, which load-balances the request across instances.
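In practice that means the first request to a model without a warm cluster pays the cold start (roughly 15-60s, per the note above), while later requests hit already-running instances. The endpoint and payload below are the same hypothetical shapes used in the earlier sketches.

```python
# Hypothetical timing sketch: the first call triggers cluster creation (cold start),
# the second reuses the warm instance. Endpoint and payload shape are assumptions.
import time
import requests

URL = "https://api.bytez.com/v2/microsoft/Phi-3-mini-4k-instruct"  # hypothetical open-source model endpoint
HEADERS = {"Authorization": "Key YOUR_BYTEZ_KEY"}
BODY = {"messages": [{"role": "user", "content": "Say hello."}]}

for label in ("first request (cold start)", "second request (warm cluster)"):
    start = time.time()
    resp = requests.post(URL, headers=HEADERS, json=BODY, timeout=300)
    print(f"{label}: {time.time() - start:.1f}s, status {resp.status_code}")
```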
2. Automatic scaling: The cluster auto-scales based on the number of concurrent requests received. With deep learning models, typically the entire GPU is used for inference, so if you need 2 concurrent requests served at any given moment, we need to spin up 2 instances of the model. The cluster tries to keep the number of instances equal to the number of concurrent requests (see the sketch below).
• Scaling rule: instances = concurrent requests
• More traffic? Cluster scales up
• Less traffic? Cluster scales down
Your open-source model cluster scales to your needs.
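As a rough sketch of that scaling rule; this only illustrates the stated behavior, it is not Bytez’s actual autoscaler, and the min/max bounds are assumed cluster settings:

```python
# Illustration of the "instances = concurrent requests" rule.
# Not Bytez's autoscaler; min/max bounds are assumed cluster configuration.
def desired_instances(concurrent_requests: int, min_instances: int = 0, max_instances: int = 10) -> int:
    """Target one dedicated GPU instance per in-flight request, within bounds."""
    return max(min_instances, min(concurrent_requests, max_instances))

print(desired_instances(0))   # 0  -> no traffic, nothing running
print(desired_instances(2))   # 2  -> two concurrent requests need two GPU instances
print(desired_instances(50))  # 10 -> capped at max_instances
```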
3. Auto shutdown: If concurrent requests = 0 for 10 minutes, the cluster is considered idle and shuts down. By default, the idle timeout is 10 minutes (see above). You can configure this when you create a cluster or after a cluster is created.
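A sketch of the idle-shutdown rule as described; this mimics the stated behavior rather than Bytez’s actual scheduler:

```python
# Illustration of the idle-shutdown rule: zero concurrent requests for the full
# idle timeout (10 minutes by default) means the cluster is torn down.
from datetime import datetime, timedelta

def is_idle(last_request_finished_at: datetime, concurrent_requests: int,
            idle_timeout_minutes: int = 10) -> bool:
    idle_for = datetime.utcnow() - last_request_finished_at
    return concurrent_requests == 0 and idle_for >= timedelta(minutes=idle_timeout_minutes)

# Last request finished 12 minutes ago and nothing is in flight -> shut down.
print(is_idle(datetime.utcnow() - timedelta(minutes=12), concurrent_requests=0))  # True
# A request finished 3 minutes ago -> keep the cluster up.
print(is_idle(datetime.utcnow() - timedelta(minutes=3), concurrent_requests=0))   # False
```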