# Supported Models
The vLLM Spyre plugin relies on model code implemented by the Foundation Model Stack.
## Configurations
The following models have been verified to run on vLLM Spyre with the listed configurations.
### Decoder Models
Static Batching:

| Model          | AIUs | Prompt Length | New Tokens | Batch Size |
|----------------|------|---------------|------------|------------|
| Granite-3.3-8b | 4    | 7168          | 1024       | 4          |
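In static batching mode, the prompt length, new-token count, and batch size above correspond to the plugin's warmup shapes. A minimal launch sketch for that row, assuming the `VLLM_SPYRE_WARMUP_*` environment-variable names used by the plugin (verify them against your installed version):

```shell
# Warmup shape matching the table row: prompt length 7168, 1024 new
# tokens, batch size 4 (variable names assumed from the vllm-spyre docs).
export VLLM_SPYRE_WARMUP_PROMPT_LENS=7168
export VLLM_SPYRE_WARMUP_NEW_TOKENS=1024
export VLLM_SPYRE_WARMUP_BATCH_SIZES=4

# Four AIUs map to tensor parallel size 4.
vllm serve ibm-granite/granite-3.3-8b-instruct --tensor-parallel-size 4
```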
Continuous Batching:

| Model                | AIUs | Context Length | Batch Size |
|----------------------|------|----------------|------------|
| Granite-3.3-8b       | 1    | 3072           | 16         |
| Granite-3.3-8b       | 4    | 32768          | 32         |
| Granite-3.3-8b (FP8) | 1    | 3072           | 16         |
| Granite-3.3-8b (FP8) | 4    | 32768          | 32         |
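A continuous-batching launch maps onto standard vLLM flags: context length to `--max-model-len`, batch size to `--max-num-seqs`, and AIUs to `--tensor-parallel-size`. A sketch for the 4-AIU, 32k row, assuming the `VLLM_SPYRE_USE_CB` toggle name from the plugin's configuration (check your version's docs):

```shell
# Enable continuous batching in the Spyre plugin (env var name assumed).
export VLLM_SPYRE_USE_CB=1

# tp=4, 32k context, batch size 32, matching the table row above.
vllm serve ibm-granite/granite-3.3-8b-instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```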
### Encoder Models

| Model                                 | AIUs | Context Length | Batch Size |
|---------------------------------------|------|----------------|------------|
| Granite-Embedding-125m (English)      | 1    | 512            | 1          |
| Granite-Embedding-125m (English)      | 1    | 512            | 64         |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 1          |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 64         |
| BAAI/BGE-Reranker (v2-m3)             | 1    | 8192           | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 64         |
## Runtime Validation

At runtime, the Spyre engine validates the requested model and configuration against the supported entries in `vllm_spyre/config/supported_configurations.yaml`. If the requested model or configuration is not found, a warning is logged.
```yaml
# Parameters:
# - cb: True, for continuous batching; False, for static batching mode
# - tp_size: tensor parallel size
# - max_model_len: context length (prompt_length + max_new_tokens)
# - max_num_seqs: number of sequences in a batch (per instance)
# - warmup_shapes: [(fixed_prompt_length, max_new_tokens, batch_size)]
- model: "ibm-granite/granite-3.3-8b-instruct"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[2048, 1024, 16]] },
    { cb: False, tp_size: 4, warmup_shapes: [[6144, 2048, 1]] },
    { cb: False, tp_size: 4, warmup_shapes: [[7168, 1024, 4]] },
    { cb: True, tp_size: 1, max_model_len: 3072, max_num_seqs: 16 },
    { cb: True, tp_size: 1, max_model_len: 8192, max_num_seqs: 4 },
    { cb: True, tp_size: 2, max_model_len: 8192, max_num_seqs: 4 },
    { cb: True, tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-3.3-8b-instruct-FP8"
  configs: [
    { cb: True, tp_size: 1, max_model_len: 3072, max_num_seqs: 16 },
    { cb: True, tp_size: 4, max_model_len: 16384, max_num_seqs: 4 },
    { cb: True, tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-embedding-125m-english"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "ibm-granite/granite-embedding-278m-multilingual"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "BAAI/bge-reranker-v2-m3"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[8192, 0, 1]] },
  ]
- model: "BAAI/bge-reranker-large"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "sentence-transformers/all-roberta-large-v1"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[128, 0, 8]] },
  ]
```