ghostai1 committed on
Commit 0cab41e · verified · 1 Parent(s): 376cb8f

Update index.html

Files changed (1)
  1. index.html +41 -43
index.html CHANGED
@@ -75,7 +75,7 @@
  <!-- Header -->
  <header class="text-center mb-5">
  <h1>XTTVS-MED</h1>
- <p class="lead">Real-time 4-Bit Semantic Voice Cloning for Emergency & Accessibility</p>
  <p><strong>Chris Coleman</strong> &mdash; GhostAI Labs<br>
  <strong>Dr. Anthony Becker, M.D.</strong> &mdash; Medical Advisor
  </p>
@@ -85,17 +85,19 @@
  <section id="overview" class="mb-5">
  <h2>1. Overview</h2>
  <p>
- XTTVS-MED fuses aggressive 4-bit quantization, LoRA speaker/emotion adapters, and a float-aligned CBR-RTree scheduler to generate emotion-aware, multilingual speech with sub-second latency on devices with ≥6 GB VRAM.
  </p>
  <div class="diagram mermaid">
  flowchart LR
- T["Input Text"] --> S["Split & Preprocess<br/>(Mel-Spectrogram)"]
  S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
  L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
  Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
  C --> M["XTTSv2 Transformer"]
  M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
- V --> A["Output Audio"]
  </div>
  </section>
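The overview above, like the flowchart's FloatBin node, names an FP32→FP16→INT4 cascade without showing its implementation. A minimal sketch of symmetric per-tensor 4-bit quantization under that reading (NumPy-based; quantize_int4/dequantize_int4 are illustrative names, not repository functions):

<pre>
# Illustrative sketch: symmetric per-tensor INT4 quantization through an FP16
# intermediate, mirroring the FP32 -> FP16 -> INT4 cascade named in the diagram.
import numpy as np

def quantize_int4(w_fp32: np.ndarray):
    w_fp16 = w_fp32.astype(np.float16)                    # stage 1: FP32 -> FP16
    scale = max(float(np.abs(w_fp16).max()), 1e-8) / 7.0  # INT4 value range is [-8, 7]
    q = np.clip(np.round(w_fp16 / scale), -8, 7).astype(np.int8)  # stage 2: -> INT4 (stored in int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale                   # approximate FP32 reconstruction

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int4(w)
print(np.abs(w - dequantize_int4(q, s)).max())            # worst-case quantization error
</pre>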
@@ -107,36 +109,33 @@ sequenceDiagram
  participant U as User
  participant G as Gradio UI
  participant A as FastAPI
- participant M as Model Server
  participant D as Disk(outputs/)
- U->>G: Enter text & params
- G->>A: POST /voice
- A->>M: gen_voice(...)
- M-->>A: audio + metrics
- A->>G: return file + info
  A->>D: save MP3
  </div>
  <pre>
- // Pseudocode: CBR-RTree insertion & retrieval
- def insert(node, v4, t_fp):
-     if not node: return Node(v4, t_fp)
-     if t_fp < node.t_fp:
-         node.left = insert(node.left, v4, t_fp)
-     else:
-         node.right = insert(node.right, v4, t_fp)
-     return node
-
- def retrieve(node, t_target):
-     if not node: return None
-     child = node.left if abs(node.left.t_fp - t_target) < abs(node.right.t_fp - t_target) else node.right
-     return retrieve(child, t_target) or child
  </pre>
  </section>
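The removed pseudocode keys the CBR-RTree on float timestamps, but its retrieve step dereferences node.left and node.right without checking for missing children. A runnable reading of the same structure with that edge case guarded (Node fields follow the pseudocode; this is a sketch, not the repository's implementation):

<pre>
# Runnable reading of the insert/retrieve pseudocode: a binary tree keyed on
# float timestamps (t_fp) holding 4-bit feature vectors (v4).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    v4: list
    t_fp: float
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def insert(node: Optional[Node], v4, t_fp: float) -> Node:
    if node is None:
        return Node(v4, t_fp)
    if t_fp < node.t_fp:
        node.left = insert(node.left, v4, t_fp)
    else:
        node.right = insert(node.right, v4, t_fp)
    return node

def retrieve(node: Optional[Node], t_target: float) -> Optional[Node]:
    # Walk the search path for t_target and keep the node whose key is closest.
    if node is None:
        return None
    child = node.left if t_target < node.t_fp else node.right
    deeper = retrieve(child, t_target)
    if deeper and abs(deeper.t_fp - t_target) < abs(node.t_fp - t_target):
        return deeper
    return node

root = None
for t in (0.10, 0.25, 0.40):
    root = insert(root, v4=[0] * 4, t_fp=t)
print(retrieve(root, 0.27).t_fp)   # -> 0.25, the nearest stored timestamp
</pre>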
  <!-- Performance -->
  <section id="performance" class="mb-5">
  <h2>3. Hardware Scalability & Throughput</h2>
- <p>Fully on-premise, HIPAA/GDPR compliant, running on:</p>
  <div class="diagram mermaid">
  flowchart TB
  HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
@@ -180,47 +179,46 @@ flowchart TB

  <!-- Translation + Quick LoRA -->
  <section id="translation" class="mb-5">
- <h2>4. Translation & Quick LoRA Epoch Training</h2>
  <p>
- XTTVS-MED auto-detects 50+ languages in ≤200 ms via an acoustic n-gram classifier.
- For unsupported dialects, a <strong>quick LoRA epoch</strong>—using 1–2 hrs of local audio—adapts the base model in under 30 minutes.
  </p>
  <div class="diagram mermaid">
  flowchart LR
- D["Dialect Audio (1–2 hrs)"]
  --> P["Preprocess & Align"]
- --> T["Train LoRA Epoch<br/>(5–10 epochs)"]
  --> U["Updated Adapters"]
- --> M["Inference Pipeline"]
  </div>
  <ul>
- <li><strong>Step 1:</strong> Record ~1 hr of target dialect speech.</li>
- <li><strong>Step 2:</strong> Extract Mel-spectrograms, align to transcripts.</li>
- <li><strong>Step 3:</strong> Train LoRA adapters for speaker + dialect (5–10 epochs, 30 min).</li>
- <li><strong>Step 4:</strong> Deploy updated adapters; new dialect instantaneously available.</li>
  </ul>
  </section>

  <!-- Clinical Impact -->
  <section id="impact" class="mb-5">
- <h2>5. Clinical Impact & Data Science</h2>
  <p>
- Each second saved in emergency care reduces mortality risk by ~7%. XTTVS-MED’s
- 200 ms detection + <1 s synthesis can improve survival by 10–15% for non-native speakers.
  </p>
  <div class="row">
  <div class="col-md-6">
  <div class="accessibility-box">
- ⚠️ “Blood pressure critically low—initiate IV fluids immediately.”
- <br>[Dual-text & audio UI]
  </div>
  </div>
  <div class="col-md-6">
- <p><strong>Dataset & Validation:</strong></p>
  <ul>
- <li>600 hrs multilingual clinical dialogues</li>
  <li>ANOVA on MOS (p &lt; 0.01)</li>
- <li>Speaker similarity ≥ 92%; intelligibility MOS ≥ 4.5/5</li>
  </ul>
  </div>
  </div>
@@ -231,7 +229,7 @@ flowchart LR
  <h2>6. BibTeX</h2>
  <pre>@article{coleman2025xttvmed,
  author = {Coleman, Chris and Becker, Anthony},
- title = {XTTVS-MED: Real-Time Semantic 4-Bit Voice Cloning to Prevent Medical Miscommunication},
  journal = {GhostAI Labs},
  year = {2025}
  }</pre>
 
  <!-- Header -->
  <header class="text-center mb-5">
  <h1>XTTVS-MED</h1>
+ <p class="lead">Real-time 4-Bit Semantic Voice Cloning & Voice-to-Voice Translation</p>
  <p><strong>Chris Coleman</strong> &mdash; GhostAI Labs<br>
  <strong>Dr. Anthony Becker, M.D.</strong> &mdash; Medical Advisor
  </p>
 
  <section id="overview" class="mb-5">
  <h2>1. Overview</h2>
  <p>
+ XTTVS-MED fuses Whisper ASR, 4-bit quantization, LoRA adapters, and a float-aligned CBR-RTree scheduler
+ to deliver sub-second, emotion-aware, multilingual voice-to-voice translation on devices ≥6 GB VRAM.
  </p>
  <div class="diagram mermaid">
  flowchart LR
+ A["Input Audio"] --> W["Whisper ASR<br/>(Transcribe/Detect Lang)"]
+ W --> S["Normalize & Preprocess<br/>(Mel-Spectrogram)"]
  S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
  L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
  Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
  C --> M["XTTSv2 Transformer"]
  M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
+ V --> B["Output Audio"]
  </div>
  </section>
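The updated overview adds Whisper ASR as the front end. A minimal sketch of that transcription and language-detection step with the openai-whisper package (the model size and file path are placeholder assumptions):

<pre>
# Front-end step only: transcribe the input audio and detect its language with
# openai-whisper before handing the text to the TTS pipeline.
import whisper

model = whisper.load_model("base")             # model size is an assumption
result = model.transcribe("input_audio.wav")   # placeholder input path
text, lang = result["text"], result["language"]
print(lang, text)
</pre>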
  participant U as User
  participant G as Gradio UI
  participant A as FastAPI
+ participant W as Whisper
+ participant M as TTS Server
  participant D as Disk(outputs/)
+ U->>G: Record/Input Audio
+ G->>A: POST /voice2voice
+ A->>W: Whisper.transcribe(audio)
+ W-->>A: text + lang
+ A->>M: gen_voice(text, lang, settings)
+ M-->>A: synthesized audio + metrics
+ A->>G: return output audio & info
  A->>D: save MP3
  </div>
  <pre>
+ // Pseudocode: Voice-to-Voice pipeline with CBR-RTree
+ def voice2voice(audio):
+     text, lang = whisper.transcribe(audio)
+     v4, t_fp = preprocess(text)
+     node = insert(None, v4, t_fp)
+     best = retrieve(node, t_fp)
+     return tts.generate(text, adapter=best.adapter)
  </pre>
  </section>
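The sequence diagram and pseudocode above outline the /voice2voice round trip: Gradio posts audio to FastAPI, Whisper transcribes it and detects the language, the TTS server synthesizes, and the result is saved under outputs/. A compressed sketch of that flow follows; the repository's gen_voice internals are not shown in this diff, so the XTTSv2 call here uses Coqui TTS as an assumed backend and writes WAV rather than the MP3 named in the diagram:

<pre>
# Sketch of the /voice2voice round trip: persist upload, ASR with Whisper,
# synthesize with XTTSv2 via Coqui TTS (assumed backend), return the file.
import uuid
import whisper
from fastapi import FastAPI, UploadFile
from fastapi.responses import FileResponse
from TTS.api import TTS

app = FastAPI()
asr = whisper.load_model("base")
xtts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

@app.post("/voice2voice")
async def voice2voice(audio: UploadFile):
    in_path = f"outputs/{uuid.uuid4()}.wav"
    with open(in_path, "wb") as f:
        f.write(await audio.read())            # save upload to outputs/
    result = asr.transcribe(in_path)           # text + detected language
    out_path = in_path.replace(".wav", "_out.wav")
    xtts.tts_to_file(text=result["text"], language=result["language"],
                     speaker_wav=in_path, file_path=out_path)  # clone input speaker
    return FileResponse(out_path, media_type="audio/wav")
</pre>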
  <!-- Performance -->
  <section id="performance" class="mb-5">
  <h2>3. Hardware Scalability & Throughput</h2>
+ <p>On-premise, HIPAA/GDPR compliant, supporting:</p>
  <div class="diagram mermaid">
  flowchart TB
  HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
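The latency figures above compare hardware tiers by end-to-end synthesis time; a common companion metric is the real-time factor (synthesis time divided by generated audio duration). A generic measurement sketch, where synthesize() stands in for whichever backend is deployed:

<pre>
# Generic latency / real-time-factor measurement; synthesize() is a placeholder
# for the deployed backend, not a function from this repository.
import time

def measure_rtf(synthesize, text: str, sample_rate: int = 24000):
    start = time.perf_counter()
    audio = synthesize(text)                    # 1-D array of output samples
    latency = time.perf_counter() - start
    rtf = latency / (len(audio) / sample_rate)  # RTF < 1.0 means faster than real time
    return latency, rtf
</pre>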
 

  <!-- Translation + Quick LoRA -->
  <section id="translation" class="mb-5">
+ <h2>4. Quick LoRA Epoch Training</h2>
  <p>
+ For unsupported dialects: record 1–2 hrs of local speech, then train LoRA adapters—5–10 epochs in <strong>30 min</strong>—to extend coverage instantly.
  </p>
  <div class="diagram mermaid">
  flowchart LR
+ D["Dialect Samples (1–2 hrs)"]
  --> P["Preprocess & Align"]
+ --> T["Train LoRA Epochs<br/>(5–10)"]
  --> U["Updated Adapters"]
+ --> I["Immediate Inference"]
  </div>
  <ul>
+ <li>Step 1: Capture ~1 hr dialect audio.</li>
+ <li>Step 2: Generate aligned spectrograms.</li>
+ <li>Step 3: Fine-tune LoRA adapters (30 min).</li>
+ <li>Step 4: Deploy instantly for voice-to-voice.</li>
  </ul>
  </section>
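Step 3 above trains LoRA adapters for the new dialect, but the training code itself is not part of this diff. A sketch of how such adapters are typically attached with the Hugging Face peft library (rank, alpha, and target module names are illustrative assumptions):

<pre>
# Sketch: attaching LoRA adapters for a quick dialect fine-tune with the
# Hugging Face peft library. Hyperparameters are illustrative only.
from peft import LoraConfig, get_peft_model

def add_dialect_adapters(base_model, rank: int = 8):
    config = LoraConfig(
        r=rank,                                # adapter rank
        lora_alpha=16,                         # scaling factor
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # attention projections (assumed names)
    )
    model = get_peft_model(base_model, config)
    model.print_trainable_parameters()         # only the adapter weights are trainable
    return model

# After 5-10 epochs over the aligned dialect data, save just the adapters with
# model.save_pretrained("adapters/<dialect>") and load them at inference time.
</pre>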
  <!-- Clinical Impact -->
  <section id="impact" class="mb-5">
+ <h2>5. Clinical Impact & Validation</h2>
  <p>
+ Every second saved reduces mortality by ~7%.
+ Audio-to-audio translation in <1 s can improve survival by 10–15% for non-native speakers.
  </p>
  <div class="row">
  <div class="col-md-6">
  <div class="accessibility-box">
+ ⚠️ “Blood pressure critically low—initiate IV fluids immediately.”<br/>
+ [Dual-text & audio UI]
  </div>
  </div>
  <div class="col-md-6">
+ <p><strong>Dataset & Metrics:</strong></p>
  <ul>
+ <li>600 hrs clinical dialogues</li>
  <li>ANOVA on MOS (p &lt; 0.01)</li>
+ <li>Speaker similarity ≥ 92%; MOS intelligibility ≥ 4.5/5</li>
  </ul>
  </div>
  </div>
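The validation bullets report a one-way ANOVA over MOS ratings (p < 0.01). For reference, the same test with SciPy on placeholder score groups (not the study's data):

<pre>
# Illustration of the reported test, not the study data: one-way ANOVA over
# MOS ratings grouped by condition.
from scipy.stats import f_oneway

mos_baseline  = [3.8, 4.0, 3.9, 4.1, 3.7]   # placeholder ratings
mos_xttvs     = [4.5, 4.6, 4.4, 4.7, 4.5]
mos_reference = [4.8, 4.7, 4.9, 4.6, 4.8]

f_stat, p_value = f_oneway(mos_baseline, mos_xttvs, mos_reference)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # p < 0.01 would indicate a significant difference
</pre>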
 
  <h2>6. BibTeX</h2>
  <pre>@article{coleman2025xttvmed,
  author = {Coleman, Chris and Becker, Anthony},
+ title = {XTTVS-MED: Real-Time Voice-to-Voice Semantic Cloning to Prevent Medical Miscommunication},
  journal = {GhostAI Labs},
  year = {2025}
  }</pre>