
# Supported Models

The vLLM Spyre plugin relies on model code implemented by the Foundation Model Stack.

## Configurations

The following models have been verified to run on vLLM Spyre with the listed configurations.

### Decoder Models

**Static Batching:**

| Model          | AIUs | Prompt Length | New Tokens | Batch Size |
|----------------|------|---------------|------------|------------|
| Granite-3.3-8b | 4    | 7168          | 1024       | 4          |
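In static batching mode, each warmup shape fixes the prompt length, number of new tokens, and batch size ahead of time, and the effective context length is simply the prompt length plus the new tokens. A minimal sketch of that relationship (the function name here is illustrative, not part of the plugin's API):

```python
def context_length(prompt_length: int, max_new_tokens: int) -> int:
    """Effective context length implied by a static-batching warmup shape."""
    return prompt_length + max_new_tokens

# The verified Granite-3.3-8b static-batching shape above:
print(context_length(7168, 1024))  # 8192
```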

**Continuous Batching:**

| Model                | AIUs | Context Length | Batch Size |
|----------------------|------|----------------|------------|
| Granite-3.3-8b       | 1    | 3072           | 16         |
| Granite-3.3-8b       | 4    | 32768          | 32         |
| Granite-3.3-8b (FP8) | 1    | 3072           | 16         |
| Granite-3.3-8b (FP8) | 4    | 32768          | 32         |
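As a sketch, one of the verified continuous-batching configurations above maps onto the standard vLLM CLI flags as shown below; the exact command and any Spyre-specific environment variables may differ in your deployment:

```shell
# Serve Granite-3.3-8b on 4 AIUs with the verified 32k-context,
# batch-32 continuous-batching configuration.
vllm serve ibm-granite/granite-3.3-8b-instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```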

### Encoder Models

| Model                                 | AIUs | Context Length | Batch Size |
|---------------------------------------|------|----------------|------------|
| Granite-Embedding-125m (English)      | 1    | 512            | 1          |
| Granite-Embedding-125m (English)      | 1    | 512            | 64         |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 1          |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 64         |
| BAAI/BGE-Reranker (v2-m3)             | 1    | 8192           | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 64         |

## Runtime Validation

At runtime, the Spyre engine validates the requested model and configuration against the supported entries in `vllm_spyre/config/supported_configurations.yaml`. If the requested model or configuration is not listed, a warning is logged.

```yaml
# Parameters:
#  - cb: True, for continuous batching; False, for static batching mode
#  - tp_size: tensor parallel size
#  - max_model_len: context length (prompt_length + max_new_tokens)
#  - max_num_seqs: number of sequences in a batch (per instance)
#  - warmup_shapes: [(fixed_prompt_length, max_new_tokens, batch_size)]

- model: "ibm-granite/granite-3.3-8b-instruct"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[2048, 1024, 16]] },
    { cb: False, tp_size: 4, warmup_shapes: [[6144, 2048,  1]] },
    { cb: False, tp_size: 4, warmup_shapes: [[7168, 1024,  4]] },
    { cb: True,  tp_size: 1, max_model_len: 3072,  max_num_seqs: 16 },
    { cb: True,  tp_size: 1, max_model_len: 8192,  max_num_seqs: 4 },
    { cb: True,  tp_size: 2, max_model_len: 8192,  max_num_seqs: 4 },
    { cb: True,  tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-3.3-8b-instruct-FP8"
  configs: [
    { cb: True, tp_size: 1, max_model_len: 3072,  max_num_seqs: 16 },
    { cb: True, tp_size: 4, max_model_len: 16384, max_num_seqs: 4 },
    { cb: True, tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-embedding-125m-english"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "ibm-granite/granite-embedding-278m-multilingual"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "BAAI/bge-reranker-v2-m3"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[8192, 0, 1]] },
  ]
- model: "BAAI/bge-reranker-large"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "sentence-transformers/all-roberta-large-v1"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[128, 0, 8]] },
  ]
```