Update README.md
README.md
CHANGED
@@ -229,7 +229,13 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 
 ### Huggingface Gradio Apps:
 
-- The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2
+- The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2
+(for text generation), and WhisperX (for audio transcription). The script sets up tokenizers and processors for handling inputs and defines a custom residual
+block (SimpleResBlock) to transform embeddings for more stable learning. After loading pretrained and fine-tuned weights for both the projection and residual layers,
+it implements the model_generate_ans function, which processes inputs from different modalities, combines their embeddings, and generates responses sequentially.
+This model handles tasks like image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app features a Gradio
+interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a web interface. This interactive
+application is designed for seamless, multi-input AI tasks using advanced model architectures.
 
 https://huggingface.co/spaces/Shivdutta/S30-MultiModalGPT
 
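
The description added in this change mentions a residual block that transforms embeddings and a step that combines embeddings from several modalities. The following is a minimal sketch of those two ideas in plain Python lists, not the code from app.py. The function names (`layer_norm`, `residual_block`, `combine_modalities`), the normalize-then-add structure, and the image, audio, text ordering are all assumptions made for illustration; the real script presumably operates on tensors.

```python
from typing import List, Optional

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]


def residual_block(x: Vector, weight: List[Vector]) -> Vector:
    """Hypothetical residual block: normalize the input, apply a linear
    map, then add the result back to the normalized input (skip connection).
    This structure is assumed, not taken from SimpleResBlock."""
    normed = layer_norm(x)
    projected = [sum(w * v for w, v in zip(row, normed)) for row in weight]
    return [n + p for n, p in zip(normed, projected)]


def combine_modalities(
    image: Optional[List[Vector]] = None,
    audio: Optional[List[Vector]] = None,
    text: Optional[List[Vector]] = None,
) -> List[Vector]:
    """Concatenate per-modality embedding sequences along the sequence axis,
    skipping any modality that was not provided. The ordering is assumed."""
    sequence: List[Vector] = []
    for part in (image, audio, text):
        if part:
            sequence.extend(part)
    return sequence


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    image_tokens = [residual_block([0.5, -0.5], identity)]
    text_tokens = [[0.1, 0.2], [0.3, 0.4]]
    combined = combine_modalities(image=image_tokens, text=text_tokens)
    print(len(combined), "embedding vectors in the combined sequence")
```

The skip connection means a block whose linear map is all zeros passes its normalized input through unchanged, which is one common reason residual blocks are described as stabilizing training.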