
# Supported Models

The vLLM Spyre plugin relies on model code implemented by the Foundation Model Stack.

## Configurations

The following models have been verified to run on vLLM Spyre with the listed configurations.

### Decoder Models

**Static Batching:**

| Model          | AIUs | Prompt Length | New Tokens | Batch Size |
|----------------|------|---------------|------------|------------|
| Granite-3.3-8b | 4    | 7168          | 1024       | 4          |
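In static batching mode, each warmup shape fixes the prompt length, number of new tokens, and batch size ahead of time, and the effective context length is simply the prompt length plus the new tokens. A minimal sketch of that relationship (the function name here is illustrative, not part of the plugin's API):

```python
def context_length(prompt_length: int, max_new_tokens: int) -> int:
    """Effective context length implied by a static-batching warmup shape."""
    return prompt_length + max_new_tokens

# The verified Granite-3.3-8b static-batching shape above:
print(context_length(7168, 1024))  # 8192
```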

**Continuous Batching:**

| Model                | AIUs | Context Length | Batch Size |
|----------------------|------|----------------|------------|
| Granite-3.3-8b       | 1    | 3072           | 16         |
| Granite-3.3-8b       | 4    | 32768          | 32         |
| Granite-3.3-8b (FP8) | 1    | 3072           | 16         |
| Granite-3.3-8b (FP8) | 4    | 32768          | 32         |
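As a sketch, one of the verified continuous-batching configurations above maps onto the standard vLLM CLI flags as shown below; the exact command and any Spyre-specific environment variables may differ in your deployment:

```shell
# Serve Granite-3.3-8b on 4 AIUs with the verified 32k-context,
# batch-32 continuous-batching configuration.
vllm serve ibm-granite/granite-3.3-8b-instruct \
    --tensor-parallel-size 4 \
    --max-model-len 32768 \
    --max-num-seqs 32
```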

### Encoder Models

| Model                                 | AIUs | Context Length | Batch Size |
|---------------------------------------|------|----------------|------------|
| Granite-Embedding-125m (English)      | 1    | 512            | 1          |
| Granite-Embedding-125m (English)      | 1    | 512            | 64         |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 1          |
| Granite-Embedding-278m (Multilingual) | 1    | 512            | 64         |
| BAAI/BGE-Reranker (v2-m3)             | 1    | 8192           | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 1          |
| BAAI/BGE-Reranker (Large)             | 1    | 512            | 64         |

## Runtime Validation

At runtime, the Spyre engine validates the requested model and configuration against the supported entries in `vllm_spyre/config/supported_configurations.yaml`. If the requested model or configuration is not listed, a warning is logged.

```yaml
# Parameters:
#  - cb: True, for continuous batching; False, for static batching mode
#  - tp_size: tensor parallel size
#  - max_model_len: context length (prompt_length + max_new_tokens)
#  - max_num_seqs: number of sequences in a batch (per instance)
#  - warmup_shapes: [(fixed_prompt_length, max_new_tokens, batch_size)]

- model: "ibm-granite/granite-3.3-8b-instruct"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[2048, 1024, 16]] },
    { cb: False, tp_size: 4, warmup_shapes: [[6144, 2048,  1]] },
    { cb: False, tp_size: 4, warmup_shapes: [[7168, 1024,  4]] },
    { cb: True,  tp_size: 1, max_model_len: 3072,  max_num_seqs: 16 },
    { cb: True,  tp_size: 1, max_model_len: 8192,  max_num_seqs: 4 },
    { cb: True,  tp_size: 2, max_model_len: 8192,  max_num_seqs: 4 },
    { cb: True,  tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-3.3-8b-instruct-FP8"
  configs: [
    { cb: True, tp_size: 1, max_model_len: 3072,  max_num_seqs: 16 },
    { cb: True, tp_size: 4, max_model_len: 16384, max_num_seqs: 4 },
    { cb: True, tp_size: 4, max_model_len: 32768, max_num_seqs: 32 },
  ]
- model: "ibm-granite/granite-embedding-125m-english"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "ibm-granite/granite-embedding-278m-multilingual"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "BAAI/bge-reranker-v2-m3"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[8192, 0, 1]] },
  ]
- model: "BAAI/bge-reranker-large"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[512, 0, 64]] },
  ]
- model: "sentence-transformers/all-roberta-large-v1"
  configs: [
    { cb: False, tp_size: 1, warmup_shapes: [[128, 0, 8]] },
  ]
```