Output Tests¶
Note
Unless otherwise specified, all continuous batching tests run with max_model_len=256.
Verification of vLLM output by comparing with HF.

Run python -m pytest tests/e2e/test_spyre_basic.py.
test_output¶
test_output(model: ModelInfo, tp_size: int, backend: str, cb: int, max_num_seqs: int, max_model_len: int, warmup_shapes: DecodeWarmupShapes, monkeypatch: MonkeyPatch, use_llm_cache) -> None
The warmup is based on a single shape. After the warmup, one request with the provided prompts is sent to vLLM. The same prompts are also fed to HF. The generated output, including text, token IDs, and logprobs, is verified to be identical between vLLM and HF.
Configuration for CB (parameters are combinatorial):
- max_num_seqs: 4
- tensor parallelism: 1, 2, 4, 8
- number of prompts: 4 (Chicken soup prompts)
- max tokens: 20 (same for all the prompts)
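The comparison this test performs can be sketched roughly as follows. This is not the test's actual code: the model name and prompt are placeholders, the real test additionally compares token IDs and logprobs position by position, and it runs against the Spyre backend rather than the default one.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-3.3-8b-instruct"  # placeholder; the test parametrizes the model
PROMPTS = ["How do you make chicken soup?"]    # the test uses 4 such prompts
MAX_TOKENS = 20

# Greedy generation with vLLM.
llm = LLM(model=MODEL, max_model_len=256, max_num_seqs=4)
params = SamplingParams(max_tokens=MAX_TOKENS, temperature=0.0, logprobs=1)
vllm_texts = [out.outputs[0].text for out in llm.generate(PROMPTS, params)]

# Greedy generation with HF on the same prompts.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
hf_texts = []
for prompt in PROMPTS:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=MAX_TOKENS, do_sample=False)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    hf_texts.append(tokenizer.decode(new_tokens, skip_special_tokens=True))

# The real test compares text, token IDs, and logprobs; only text is shown here.
assert vllm_texts == hf_texts
```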
Source code in tests/e2e/test_spyre_basic.py
test_batch_handling¶
test_batch_handling(model: ModelInfo, backend: str, cb: int, warmup_shapes, max_num_seqs: int, max_model_len: int, monkeypatch: MonkeyPatch, use_llm_cache)
Test that the Spyre worker correctly handles continuous batches of requests that finish after different numbers of forward passes.
Configuration for CB (parameters are combinatorial):
- max_num_seqs: 2
- number of prompts: 4 (Chicken soup prompts)
- max tokens: [5, 20, 10, 5]
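A rough sketch of the scenario this test exercises, under stated assumptions: the model name and prompts are illustrative placeholders, and the real test verifies the generated outputs rather than only the token counts shown here.

```python
from vllm import LLM, SamplingParams

MODEL = "ibm-granite/granite-3.3-8b-instruct"  # placeholder; the test parametrizes the model
PROMPTS = [  # illustrative stand-ins for the 4 chicken soup prompts
    "How do you make chicken soup?",
    "List the ingredients of chicken soup.",
    "How long should chicken soup simmer?",
    "Can chicken soup be frozen?",
]
MAX_TOKENS = [5, 20, 10, 5]  # per-request limits, so requests finish at different steps

# max_num_seqs=2: at most 2 sequences decode at once, so the scheduler must
# continuously swap finished requests out and queued requests in.
llm = LLM(model=MODEL, max_model_len=256, max_num_seqs=2)

# One SamplingParams per prompt; ignore_eos forces exactly max_tokens tokens.
params = [SamplingParams(max_tokens=n, temperature=0.0, ignore_eos=True)
          for n in MAX_TOKENS]

outputs = llm.generate(PROMPTS, params)
for out, n in zip(outputs, MAX_TOKENS):
    assert len(out.outputs[0].token_ids) == n
```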