Output Tests¶
Note
Unless otherwise specified, all continuous batching tests run with max_model_len=256.
Verification of vLLM output by comparing with HF.

Run python -m pytest tests/e2e/test_spyre_basic.py.
test_output¶
test_output(model: ModelInfo, tp_size: int, backend: str, cb: int, max_num_seqs: int, max_model_len: int, warmup_shapes: DecodeWarmupShapes, monkeypatch: MonkeyPatch, use_llm_cache) -> None
The warmup is based on a single shape. After the warmup, one request with the provided prompts is sent to vLLM. The same prompts are also fed to HF. The generated output, including text, token IDs, and logprobs, is verified to be identical between vLLM and HF.
Configuration for CB (parameters are combinatorial):
- max_num_seqs: 4
- tensor parallelism: 1, 2, 4, 8
- number of prompts: 4 (Chicken soup prompts)
- max tokens: 20 (same for all the prompts)
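The comparison this test performs can be sketched roughly as follows. This is not the test's actual code: the model name and prompt are placeholders, the real test additionally compares token IDs and logprobs position by position, and it runs against the Spyre backend rather than the default one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-3.3-8b-instruct"  # placeholder; the test parametrizes the model
PROMPTS = ["How do you make chicken soup?"]    # the test uses 4 such prompts
MAX_TOKENS = 20

# Greedy generation with vLLM.
llm = LLM(model=MODEL, max_model_len=256, max_num_seqs=4)
params = SamplingParams(max_tokens=MAX_TOKENS, temperature=0.0, logprobs=1)
vllm_texts = [out.outputs[0].text for out in llm.generate(PROMPTS, params)]

# Greedy generation with HF on the same prompts.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
hf_texts = []
for prompt in PROMPTS:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=MAX_TOKENS, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    hf_texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

# The real test compares text, token IDs, and logprobs; only text is shown here.
assert vllm_texts == hf_texts
```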
Source code in tests/e2e/test_spyre_basic.py
test_batch_handling¶
test_batch_handling(model: ModelInfo, backend: str, cb: int, warmup_shapes, max_num_seqs: int, max_model_len: int, monkeypatch: MonkeyPatch, use_llm_cache)
Test that the Spyre worker correctly handles continuous batches of requests that finish after different numbers of forward passes.
Configuration for CB (parameters are combinatorial):
- max_num_seqs: 2
- number of prompts: 4 (Chicken soup prompts)
- max tokens: [5, 20, 10, 5]
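A rough sketch of the scenario this test exercises, under stated assumptions: the model name and prompts are illustrative placeholders, and the real test verifies the generated outputs rather than only the token counts shown here.

```python
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-3.3-8b-instruct"  # placeholder; the test parametrizes the model
PROMPTS = [  # illustrative stand-ins for the 4 chicken soup prompts
    "How do you make chicken soup?",
    "List the ingredients of chicken soup.",
    "How long should chicken soup simmer?",
    "Can chicken soup be frozen?",
]
MAX_TOKENS = [5, 20, 10, 5]  # per-request limits, so requests finish at different steps

# max_num_seqs=2: at most 2 sequences decode at once, so the scheduler must
# continuously swap finished requests out and queued requests in.
llm = LLM(model=MODEL, max_model_len=256, max_num_seqs=2)

# One SamplingParams per prompt; ignore_eos forces exactly max_tokens tokens.
params = [SamplingParams(max_tokens=n, temperature=0.0, ignore_eos=True)
          for n in MAX_TOKENS]

outputs = llm.generate(PROMPTS, params)
for out, n in zip(outputs, MAX_TOKENS):
    assert len(out.outputs[0].token_ids) == n
```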