Using Red Hat OpenShift AI
Red Hat OpenShift AI (RHOAI) is a cloud-native AI platform that bundles many popular model-serving and model-management projects, including KServe.
This example shows how to use KServe with RHOAI to deploy a model on OpenShift, using a modelcar image so that the model is pulled from a container registry without requiring any connection to the Hugging Face Hub.
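A modelcar image packages the model weights as an OCI image that KServe pulls from a container registry alongside the runtime container. The prebuilt images in registry.redhat.io/rhelai1 used below already follow this layout; as a hedged sketch of how such an image could be built yourself, assuming the weights have already been downloaded into a local granite-3.1-8b-instruct/ directory (KServe's modelcar convention expects the files under /models), with a placeholder registry and tag:

```bash
# Hypothetical modelcar build; registry, tag, and local path are placeholders.
cat > Containerfile <<'EOF'
FROM registry.access.redhat.com/ubi9/ubi-micro
# KServe's modelcar convention: model files live under /models
COPY granite-3.1-8b-instruct /models
EOF

podman build -t my-registry.example.com/modelcar-granite-3-1-8b-instruct:latest .
podman push my-registry.example.com/modelcar-granite-3-1-8b-instruct:latest
```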
Deploying with KServe
Prerequisites

- A running OpenShift cluster with RHOAI installed
- Image pull credentials for registry.redhat.io/rhelai1 (see the example secret after this list)
- Spyre accelerators available in the cluster
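The InferenceService manifest below pulls the modelcar image using an image pull secret named oci-registry. A minimal sketch of creating that secret, assuming registry service-account credentials for registry.redhat.io (substitute your own username and token):

```bash
oc create secret docker-registry oci-registry \
  --docker-server=registry.redhat.io \
  --docker-username=<registry-service-account-username> \
  --docker-password=<registry-service-account-token>
```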
- Create a ServingRuntime to serve your models:
```bash
oc apply -f - <<EOF
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-spyre-runtime
  annotations:
    openshift.io/display-name: vLLM IBM Spyre ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["ibm.com/aiu_pf"]'
  labels:
    opendatahub.io/dashboard: "true"
spec:
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/ibm-aiu/vllm-spyre:latest.amd64
      args:
        - /mnt/models
        - --served-model-name={{.Name}}
      env:
        - name: HF_HOME
          value: /tmp/hf_home
        # Static batching configurations can also be set on each InferenceService
        - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
          value: '4'
        - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
          value: '1024'
        - name: VLLM_SPYRE_WARMUP_NEW_TOKENS
          value: '256'
      ports:
        - containerPort: 8000
          protocol: TCP
EOF
```
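The VLLM_SPYRE_WARMUP_* variables above pin the runtime to a single static-batching shape (batch size 4, 1024 prompt tokens, 256 new tokens). As the comment in the manifest notes, they can also be set per model; a hedged sketch of such an override, assuming KServe's usual merging of the predictor's model container fields into the runtime's kserve-container:

```yaml
# Hypothetical per-model override inside an InferenceService spec;
# values here are illustrative, not tuned recommendations.
spec:
  predictor:
    model:
      env:
        - name: VLLM_SPYRE_WARMUP_BATCH_SIZES
          value: '1'
        - name: VLLM_SPYRE_WARMUP_PROMPT_LENS
          value: '2048'
        - name: VLLM_SPYRE_WARMUP_NEW_TOKENS
          value: '512'
```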
- Create an InferenceService for each model you want to deploy. This example deploys the Granite model ibm-granite/granite-3.1-8b-instruct:

```bash
oc apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: granite-3-1-8b-instruct
    serving.kserve.io/deploymentMode: RawDeployment
  name: granite-3-1-8b-instruct
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    imagePullSecrets:
      - name: oci-registry
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          ibm.com/aiu_pf: '1'
        requests:
          ibm.com/aiu_pf: '1'
      runtime: vllm-spyre-runtime
      storageUri: 'oci://registry.redhat.io/rhelai1/modelcar-granite-3-1-8b-instruct:1.5'
      volumeMounts:
        - mountPath: /dev/shm
          name: shm
    schedulerName: aiu-scheduler
    tolerations:
      - effect: NoSchedule
        key: ibm.com/aiu_pf
        operator: Exists
    volumes:
      # This volume may need to be larger for bigger models and running tensor-parallel inference with more cards
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
EOF
```
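Startup includes pulling the modelcar image and compiling the warmup shapes on the Spyre accelerator, which can take several minutes. A couple of hedged commands to watch progress, assuming the resource names from the manifest above and KServe's standard pod labels:

```bash
# Wait for the InferenceService to report READY=True
oc get inferenceservice granite-3-1-8b-instruct -w

# Follow the runtime container logs to watch model loading and warmup
oc logs -f -c kserve-container \
  -l serving.kserve.io/inferenceservice=granite-3-1-8b-instruct
```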
- To test your InferenceService, refer to the KServe documentation on model inference with vLLM.
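For a quick in-cluster smoke test, the runtime serves vLLM's OpenAI-compatible API on port 8000. A hedged example, assuming the default RawDeployment predictor service name and that --served-model-name resolved to the InferenceService name (replace <namespace> with your project):

```bash
curl -s http://granite-3-1-8b-instruct-predictor.<namespace>.svc.cluster.local:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "granite-3-1-8b-instruct",
        "prompt": "Tell me about the IBM Spyre accelerator.",
        "max_tokens": 64
      }'
```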