robgreenberg3 and jennyyyi committed
Commit d18f5a7 · verified · 1 Parent(s): 1b3b0d1

Update README.md (#2)

- Update README.md (c4148ed93adc727a2572dc847322ee7706c1b777)

Co-authored-by: Jenny Y <[email protected]>

Files changed (1):
  1. README.md +195 -2
README.md CHANGED
@@ -20,8 +20,14 @@ widget:
   content: How should I explain the Internet?
  library_name: transformers
  ---
-
- # Phi-4 Model Card
+ <h1 style="display: flex; align-items: center; gap: 10px; margin: 0;">
+ Phi-4
+ <img src="https://www.redhat.com/rhdc/managed-files/Catalog-Validated_model_0.png" alt="Model Icon" width="40" style="margin: 0; padding: 0;" />
+ </h1>
+
+ <a href="https://www.redhat.com/en/products/ai/validated-models" target="_blank" style="margin: 0; padding: 0;">
+ <img src="https://www.redhat.com/rhdc/managed-files/Validated_badge-Dark.png" alt="Validated Badge" width="250" style="margin: 0; padding: 0;" />
+ </a>
 
  [Phi-4 Technical Report](https://arxiv.org/pdf/2412.08905)
 
@@ -50,6 +56,193 @@ library_name: transformers
  | **Primary Use Cases** | Our model is designed to accelerate research on language models, for use as a building block for generative AI powered features. It is intended for general purpose AI systems and applications (primarily in English) which require:<br><br>1. Memory/compute constrained environments.<br>2. Latency bound scenarios.<br>3. Reasoning and logic. |
  | **Out-of-Scope Use Cases** | Our model is not specifically designed or evaluated for all downstream purposes, thus:<br><br>1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios.<br>2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English.<br>3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under. |
 
+ ## Deployment
+
+ This model can be deployed efficiently on vLLM, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI, as shown in the examples below.
+
+ Deploy on <strong>vLLM</strong>
+
+ ```python
+ from vllm import LLM, SamplingParams
+ from transformers import AutoTokenizer
+
+ model_id = "RedHatAI/phi-4"
+ number_gpus = 1
+
+ sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ prompt = "Give me a short introduction to large language models."
+
+ llm = LLM(model=model_id, tensor_parallel_size=number_gpus)
+ outputs = llm.generate(prompt, sampling_params)
+
+ generated_text = outputs[0].outputs[0].text
+ print(generated_text)
+ ```
+
+ vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+
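A minimal sketch of that OpenAI-compatible path using only the standard library; the `localhost:8000` address and the served model name `RedHatAI/phi-4` are assumptions based on vLLM's serving defaults, not taken from this card:

```python
# Build a chat-completions request for a vLLM OpenAI-compatible server.
# Assumption: a server started with `vllm serve RedHatAI/phi-4` is
# listening on localhost:8000 (vLLM's default port).
import json
from urllib import request

payload = {
    "model": "RedHatAI/phi-4",
    "messages": [{"role": "user", "content": "How should I explain the Internet?"}],
    "max_tokens": 128,
}

req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```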
+ <details>
+ <summary>Deploy on <strong>Red Hat AI Inference Server</strong></summary>
+
+ ```bash
+ $ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
+   --ipc=host \
+   --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
+   --env "HF_HUB_OFFLINE=0" \
+   -v ~/.cache/vllm:/home/vllm/.cache \
+   --name=vllm \
+   registry.access.redhat.com/rhaiis/rh-vllm-cuda \
+   vllm serve \
+   --tensor-parallel-size 1 \
+   --max-model-len 32768 \
+   --enforce-eager --model RedHatAI/phi-4
+ ```
+
+ See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat Enterprise Linux AI</strong></summary>
+
+ ```bash
+ # Download model from Red Hat Registry via docker
+ # Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
+ ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4:1.5
+ ```
+
+ ```bash
+ # Serve model via ilab
+ ilab model serve --model-path ~/.cache/instructlab/models/phi-4 --gpu 1
+
+ # Chat with model
+ ilab model chat --model ~/.cache/instructlab/models/phi-4
+ ```
+
+ See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.
+ </details>
+
+ <details>
+ <summary>Deploy on <strong>Red Hat OpenShift AI</strong></summary>
+
+ ```yaml
+ # Setting up the vLLM server with a ServingRuntime
+ # Save as: vllm-servingruntime.yaml
+ apiVersion: serving.kserve.io/v1alpha1
+ kind: ServingRuntime
+ metadata:
+   name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
+   annotations:
+     openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
+     opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   annotations:
+     prometheus.io/port: '8080'
+     prometheus.io/path: '/metrics'
+   multiModel: false
+   supportedModelFormats:
+     - autoSelect: true
+       name: vLLM
+   containers:
+     - name: kserve-container
+       image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
+       command:
+         - python
+         - -m
+         - vllm.entrypoints.openai.api_server
+       args:
+         - "--port=8080"
+         - "--model=/mnt/models"
+         - "--served-model-name={{.Name}}"
+       env:
+         - name: HF_HOME
+           value: /tmp/hf_home
+       ports:
+         - containerPort: 8080
+           protocol: TCP
+ ```
+
+ ```yaml
+ # Attach the model to the vLLM server. This is an NVIDIA template
+ # Save as: inferenceservice.yaml
+ apiVersion: serving.kserve.io/v1beta1
+ kind: InferenceService
+ metadata:
+   annotations:
+     openshift.io/display-name: phi-4 # OPTIONAL CHANGE
+     serving.kserve.io/deploymentMode: RawDeployment
+   name: phi-4 # specify model name. This value will be used to invoke the model in the payload
+   labels:
+     opendatahub.io/dashboard: 'true'
+ spec:
+   predictor:
+     maxReplicas: 1
+     minReplicas: 1
+     model:
+       modelFormat:
+         name: vLLM
+       name: ''
+       resources:
+         limits:
+           cpu: '2'            # this is model specific
+           memory: 8Gi         # this is model specific
+           nvidia.com/gpu: '1' # this is accelerator specific
+         requests:             # same comment for this block
+           cpu: '1'
+           memory: 4Gi
+           nvidia.com/gpu: '1'
+       runtime: vllm-cuda-runtime # must match the ServingRuntime name above
+       storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4:1.5
+     tolerations:
+       - effect: NoSchedule
+         key: nvidia.com/gpu
+         operator: Exists
+ ```
+
+ ```bash
+ # Make sure first to be in the project where you want to deploy the model
+ # oc project <project-name>
+
+ # Apply both resources to run the model
+
+ # Apply the ServingRuntime
+ oc apply -f vllm-servingruntime.yaml
+
+ # Apply the InferenceService
+ oc apply -f inferenceservice.yaml
+ ```
+
+ ```bash
+ # Replace <inference-service-name> and <cluster-ingress-domain> below:
+ # - Run `oc get inferenceservice` to find your URL if unsure.
+
+ # Call the server using curl:
+ curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "phi-4",
+     "stream": true,
+     "stream_options": {
+       "include_usage": true
+     },
+     "max_tokens": 1,
+     "messages": [
+       {
+         "role": "user",
+         "content": "How can a bee fly when its wings are so small?"
+       }
+     ]
+   }'
+ ```
+
+ See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
+ </details>
+
+
  ## Data Overview
 
  ### Training Datasets