Commit 154fe42 by Shivdutta · verified · 1 Parent(s): eb10d89

Update README.md

Files changed (1)
  1. README.md +7 -1
README.md CHANGED
@@ -229,7 +229,13 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 
  ### Huggingface Gradio Apps:
 
- - The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2 (for text generation), and WhisperX (for audio transcription). The script sets up tokenizers and processors for handling inputs and defines a custom residual block (SimpleResBlock) to transform embeddings for more stable learning. After loading pretrained and fine-tuned weights for both the projection and residual layers, it implements the model_generate_ans function, which processes inputs from different modalities, combines their embeddings, and generates responses sequentially. This model handles tasks like image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app features a Gradio interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a web interface. This interactive application is designed for seamless, multi-input AI tasks using advanced model architectures.
+ - The app.py script is a multimodal AI application that integrates image, audio, and text inputs using pre-trained models like CLIP (for vision tasks), Phi-2
+ (for text generation), and WhisperX (for audio transcription). The script sets up tokenizers and processors for handling inputs and defines a custom residual
+ block (SimpleResBlock) to transform embeddings for more stable learning. After loading pretrained and fine-tuned weights for both the projection and residual layers,
+ it implements the model_generate_ans function, which processes inputs from different modalities, combines their embeddings, and generates responses sequentially.
+ This model handles tasks like image embedding extraction, audio transcription and embedding, and text tokenization to predict responses. The app features a Gradio
+ interface where users can upload images, record or upload audio, and submit text queries, receiving multimodal answers through a web interface. This interactive
+ application is designed for seamless, multi-input AI tasks using advanced model architectures.
 
  https://huggingface.co/spaces/Shivdutta/S30-MultiModalGPT
 
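
The README paragraph in this diff mentions a residual block (SimpleResBlock) that transforms embeddings and a step that combines embeddings from the image, audio, and text inputs. The following is a minimal, dependency-free sketch of those two ideas and is not the code from app.py. The layer layout (layer norm, two linear layers with a GELU in between, and a skip connection), the random initialization, the concatenation order, and the helper name `combine_embeddings` are all assumptions made for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one embedding vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(x, weight, bias):
    """Apply y = W x + b, where weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def gelu(x):
    """Exact GELU activation, applied element-wise."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


class SimpleResBlock:
    """Hypothetical residual block: x + Linear(GELU(Linear(LayerNorm(x))))."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init_matrix():
            # Small random weights; a real app would load trained weights.
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(dim)]

        self.w1, self.b1 = init_matrix(), [0.0] * dim
        self.w2, self.b2 = init_matrix(), [0.0] * dim

    def __call__(self, x):
        hidden = gelu(linear(layer_norm(x), self.w1, self.b1))
        update = linear(hidden, self.w2, self.b2)
        # Skip connection: the block only learns a correction to its input.
        return [xi + ui for xi, ui in zip(x, update)]


def combine_embeddings(image_embeds, audio_embeds, text_embeds):
    """Join per-token embeddings along the sequence axis (assumed order)."""
    return image_embeds + audio_embeds + text_embeds


if __name__ == "__main__":
    block = SimpleResBlock(dim=4)
    image_tokens = [block([1.0, 2.0, 3.0, 4.0]) for _ in range(2)]
    audio_tokens = [[0.5] * 4]
    text_tokens = [[0.1] * 4 for _ in range(3)]
    sequence = combine_embeddings(image_tokens, audio_tokens, text_tokens)
    print(len(sequence), "tokens of width", len(sequence[0]))
```

Because of the skip connection, a block whose second linear layer is all zeros returns its input unchanged, which is one common reason residual blocks are considered easier to train. In a setup like the one the README describes, the combined sequence would then be passed to the language model for generation; that step is omitted here.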