Every open-source model on Bytez is available as a Docker image. Pull it, run it, and make requests to localhost.
Images are hosted on Docker Hub under the bytez namespace. The image name is the model ID, lowercased (Docker image names must be lowercase), with / replaced by _.
# Pattern: bytez/{org}_{model-name}
docker pull bytez/qwen_qwen3-4b
Find model IDs at bytez.com/models or via the List Models API.
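For example, you can browse model IDs from the command line. The endpoint path and auth header below are assumptions; verify them against the List Models API reference:
# List available models (endpoint and header are assumptions - check the API docs)
curl -H "Authorization: Key YOUR_BYTEZ_KEY" \
  https://api.bytez.com/models/v2/list/models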
docker run -d \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  bytez/qwen_qwen3-4b
This runs the container in the background. Get your API key at bytez.com/api/key.
View Logs
# Follow logs (live)
docker logs -f <container_id>

# View recent logs
docker logs <container_id>
To run attached and watch logs directly, replace -d with -it. Press Ctrl+C to stop.
Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| KEY | Yes | - | Your Bytez API key (for analytics and update notifications) |
| PORT | No | 80 | Port the server listens on inside the container |
| DEVICE | No | auto | Where to load weights: auto, cuda, or cpu |
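Since PORT only controls the listener inside the container, changing it means adjusting the -p mapping to match. A minimal sketch using the same image as above:
# Listen on 9000 inside the container and map it to host port 9000
docker run -d \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=9000 \
  -p 9000:9000 \
  bytez/qwen_qwen3-4b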
Docker Options
| Option | Description |
|--------|-------------|
| --gpus all | Enable GPU acceleration (requires NVIDIA drivers + CUDA) |
| -v /local/path:/server/model | Mount a local directory for weight caching |
| -p HOST:CONTAINER | Map a container port to a host port |
Run on GPU
docker run -d \
  --gpus all \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  bytez/qwen_qwen3-4b
Run on CPU
docker run -d \
  -e DEVICE=cpu \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  bytez/qwen_qwen3-4b
Cache Weights Locally
Avoid re-downloading weights on every run by mounting a local directory:
docker run -d \
  --gpus all \
  -v /path/to/cache:/server/model \
  -e HF_HOME=/server/model \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  bytez/qwen_qwen3-4b
If you’re going to create the same model container multiple times, caching is highly recommended for large models (70B+); otherwise the weights can take hours to download on every run.
Once the container is running, send POST requests to /run.
Chat Models
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "system", "content": "You are a helpful assistant" },
      { "role": "user", "content": "What is the capital of France?" }
    ],
    "stream": false,
    "params": {
      "max_new_tokens": 100,
      "temperature": 0.7
    }
  }'
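If you have jq installed, piping the response through it makes the returned JSON easier to read:
# Pretty-print the response (-s silences curl's progress output)
curl -s -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}]}' | jq .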
Streaming
Set "stream": true to receive tokens as they’re generated:
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Write a haiku about coding" }
    ],
    "stream": true
  }'
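curl buffers its output stream by default, which can make streamed tokens appear in bursts; the -N (--no-buffer) flag prints them as they arrive:
# -N disables curl's output buffering so tokens print as soon as they are received
curl -N -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      { "role": "user", "content": "Write a haiku about coding" }
    ],
    "stream": true
  }'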
Different model tasks require different inputs. Here’s a quick reference:
| Task | Required Fields | Example |
|------|-----------------|---------|
| chat | messages | {"messages": [{"role": "user", "content": "Hi"}]} |
| text-generation | text | {"text": "Once upon a time"} |
| image-text-to-text | messages with image | {"messages": [{"role": "user", "content": [{"type": "text", "text": "Describe this"}, {"type": "image", "url": "..."}]}]} |
| text-to-image | text | {"text": "A cat in space"} |
| automatic-speech-recognition | url or base64 | {"url": "https://example.com/audio.wav"} |
| feature-extraction | text | {"text": "Embed this sentence"} |
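For example, a text-generation model takes a bare text field instead of messages. This sketch assumes the same /run endpoint serves every task and that params works as in the chat example; support for individual params may vary by task:
# Text-generation: send a prompt in "text" rather than "messages"
curl -X POST http://localhost:8000/run \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time",
    "params": { "max_new_tokens": 50 }
  }'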

Full HTTP Reference

See complete request/response params and examples for all 30+ task types.
Create a self-contained image with weights baked in - no internet required at runtime.
Step 1: Run the container once to download weights
docker run -d \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  --name my-model \
  bytez/qwen_qwen3-4b
Wait for the model to fully load (check with docker logs -f my-model). Once ready, stop it:
docker stop my-model
Step 2: Save as a new image
docker commit my-model my-model-offline
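You can confirm the committed image exists before relying on it; since step 1 downloaded the weights into the container’s filesystem, it should be much larger than the base image:
# List the new image and its size
docker image ls my-model-offline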
Step 3: Run offline (no internet needed)
docker run -d \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  -p 8000:8000 \
  my-model-offline
To verify it’s truly offline, add --network none to the run command.
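Note that --network none removes all network interfaces, so the -p mapping has no effect; confirm readiness through the logs instead. A minimal sketch:
# Run with networking disabled; if the model loads, the image is truly self-contained
docker run -d \
  --network none \
  -e KEY=YOUR_BYTEZ_KEY \
  -e PORT=8000 \
  --name offline-test \
  my-model-offline

# Watch the logs to confirm nothing tries to download
docker logs -f offline-test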
Optional: Export for another machine
# Save to a file
docker save my-model-offline -o my-model-offline.tar

# Load on another machine
docker load -i my-model-offline.tar
Troubleshooting

Container won’t start
Check that Docker is installed and running. For GPU support, ensure you have NVIDIA drivers and the NVIDIA Container Toolkit installed (a quick check appears after this list).

Out of memory
Try DEVICE=auto to split the model across GPU and CPU memory. For large models, you may need more VRAM or system RAM.

Slow first request
The first request loads model weights into memory. Subsequent requests are fast. Use weight caching (the -v mount) to speed up container restarts.

Model only works with a specific DEVICE setting
Some models only support auto, cuda, or cpu. If one doesn’t work, try another.
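To confirm the NVIDIA Container Toolkit is wired up, run nvidia-smi inside a throwaway CUDA container (the image tag here is only an example; any CUDA base image works):
# Should print your GPU table; a failure here points to a driver/toolkit problem
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi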
