Spaces: Shivdutta/S30-MultiModalGPT

Shivdutta committed on Oct 9, 2024

Commit 154fe42 (verified) · 1 Parent(s): eb10d89

Update README.md

Files changed (1)

README.md CHANGED

@@ -229,7 +229,13 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 ### Huggingface Gradio Apps:
-    -    The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2 (for text generation), and WhisperX (for audio transcription). The script sets up tokenizers and processors for handling inputs and defines a custom residual block (SimpleResBlock) to transform embeddings for more stable learning. After loading pretrained and fine-tuned weights for both the projection and residual layers, it implements the model_generate_ans function, which processes inputs from different modalities, combines their embeddings, and generates responses sequentially. This model handles tasks like image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app features a Gradio interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a web interface. This interactive application is designed for seamless, multi-input AI tasks using advanced model architectures.
+    -    The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2
+    (for text generation), and WhisperX (for audio transcription). The script sets up tokenizers and processors for handling inputs and defines a custom residual
+    block (SimpleResBlock) to transform embeddings for more stable learning. After loading pretrained and fine-tuned weights for both the projection and residual layers,
+    it implements the model_generate_ans function, which processes inputs from different modalities, combines their embeddings, and generates responses sequentially.
+    This model handles tasks like image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app features a Gradio
+    interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a web interface. This interactive
+    application is designed for seamless, multi-input AI tasks using advanced model architectures.
 https://huggingface.co/spaces/Shivdutta/S30-MultiModalGPT
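
The README paragraph in this diff mentions two steps that are easier to follow with a small example: a residual block that transforms embeddings, and the combination of image, audio, and text embeddings into one sequence. The following is a minimal pure-Python sketch of those two ideas only. It is not the code from app.py: the function names `layer_norm`, `residual_block`, and `combine_modalities` are hypothetical, the normalize-then-add structure is an assumption about what SimpleResBlock does, and the image, audio, text ordering is assumed for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Scale a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Hypothetical residual step: normalize, then add the transform's output back.

    The skip connection (normed + transform(normed)) means the block only has
    to learn a correction on top of its input, which tends to stabilize training.
    """
    normed = layer_norm(x)
    return [a + b for a, b in zip(normed, transform(normed))]


def combine_modalities(
    image: List[Vector], audio: List[Vector], text: List[Vector]
) -> List[Vector]:
    """Concatenate per-modality embedding sequences along the sequence axis.

    The image, audio, text order is an assumption made for this sketch.
    """
    return image + audio + text


# Toy usage with 4-dimensional embeddings and a stand-in transform.
halve = lambda v: [0.5 * a for a in v]
image_emb = [residual_block([1.0, 2.0, 3.0, 4.0], halve)]
audio_emb = [[0.1, 0.2, 0.3, 0.4]]
text_emb = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
sequence = combine_modalities(image_emb, audio_emb, text_emb)
print(len(sequence), len(sequence[0]))
```

In a real pipeline the combined sequence would be passed to the language model as input embeddings; here it is just a list of vectors, so the sketch stays runnable without any model weights.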