Update README.md (#2) (c4148ed93adc727a2572dc847322ee7706c1b777)
Co-authored-by: Jenny Y <[email protected]>

README.md CHANGED
  content: How should I explain the Internet?
library_name: transformers
---

<h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
  Phi-4
  <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
</h1>

<a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
  <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
</a>

[Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905)
| **Primary Use Cases** | Our model is designed to accelerate research on language models and to serve as a building block for generative AI powered features. It is intended for general purpose AI systems and applications (primarily in English) that require:<br><br>1. Memory/compute constrained environments.<br>2. Latency bound scenarios.<br>3. Reasoning and logic. |
| **Out-of-Scope Use Cases** | Our model is not specifically designed or evaluated for all downstream purposes, thus:<br><br>1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.<br>2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model's focus on English.<br>3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |

## Deployment

This model can be deployed efficiently using vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.

Deploy on <strong>vLLM</strong>

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "RedHatAI/phi-4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Format the request with the model's chat template before generation
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

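Once a server is running (for example via `vllm serve RedHatAI/phi-4`), any OpenAI-style client can talk to it. As a minimal sketch, the JSON request body posted to `/v1/chat/completions` might look like the following; the served model name and sampling values here are illustrative assumptions:

```python
import json

# Hypothetical request body for the OpenAI-compatible chat-completions
# endpoint; adjust "model" to match the name the server registers.
payload = {
    "model": "RedHatAI/phi-4",
    "messages": [
        {"role": "user", "content": "How should I explain the Internet?"}
    ],
    "max_tokens": 256,
    "temperature": 0.7,
}

body = json.dumps(payload)
print(body)
```

The same payload works unchanged with the official `openai` Python client pointed at the server's base URL.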
<details>
<summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>

```bash
$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
    --ipc=host \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
    --name=vllm \
    registry.access.redhat.com/rhaiis/rh-vllm-cuda \
    vllm serve \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --enforce-eager --model RedHatAI/phi-4
```

See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>

```bash
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4:1.5
```

```bash
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/phi-4 --gpu 1

# Chat with model
ilab model chat --model ~/.cache/instructlab/models/phi-4
```

See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
</details>

<details>
<summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: phi-4 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: phi-4 # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# make sure first to be in the project where you want to deploy the model
# oc project <project-name>

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <cluster-ingress-domain> below:
# - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```
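Because the request above sets `"stream": true`, the response arrives as Server-Sent Events rather than a single JSON object. A minimal client-side sketch for collecting the streamed text, assuming the usual OpenAI-style `data:` framing (the helper name and sample chunks below are illustrative, not part of the API):

```python
import json

# Sketch: gather the streamed completion text from raw SSE lines.
# Assumes one JSON chunk per "data:" line, terminated by "data: [DONE]".
def extract_text(sse_lines):
    text = []
    for line in sse_lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text.append(delta["content"])
    return "".join(text)

# Two synthetic chunks in the shape streamed chat completions use:
sample = [
    'data: {"choices": [{"delta": {"content": "Bees "}}]}',
    'data: {"choices": [{"delta": {"content": "fly!"}}]}',
    "data: [DONE]",
]
print(extract_text(sample))  # -> Bees fly!
```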

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
</details>

## Data Overview

### Training Datasets