Spaces:
Running
Running
Update index.html
Browse files- index.html +41 -43
index.html
CHANGED
@@ -75,7 +75,7 @@
|
|
75 |
<!-- Header -->
|
76 |
<header class="text-center mb-5">
|
77 |
<h1>XTTVS-MED</h1>
|
78 |
-
<p class="lead">Real-time 4-Bit Semantic Voice Cloning
|
79 |
<p><strong>Chris Coleman</strong> — GhostAI Labs<br>
|
80 |
<strong>Dr. Anthony Becker, M.D.</strong> — Medical Advisor
|
81 |
</p>
|
@@ -85,17 +85,19 @@
|
|
85 |
<section id="overview" class="mb-5">
|
86 |
<h2>1. Overview</h2>
|
87 |
<p>
|
88 |
-
XTTVS-MED fuses
|
|
|
89 |
</p>
|
90 |
<div class="diagram mermaid">
|
91 |
flowchart LR
|
92 |
-
|
|
|
93 |
S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
|
94 |
L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
|
95 |
Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
|
96 |
C --> M["XTTSv2 Transformer"]
|
97 |
M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
|
98 |
-
V -->
|
99 |
</div>
|
100 |
</section>
|
101 |
|
@@ -107,36 +109,33 @@ sequenceDiagram
|
|
107 |
participant U as User
|
108 |
participant G as Gradio UI
|
109 |
participant A as FastAPI
|
110 |
-
participant
|
|
|
111 |
participant D as Disk(outputs/)
|
112 |
-
U->>G:
|
113 |
-
G->>A: POST /
|
114 |
-
A->>
|
115 |
-
|
116 |
-
A->>
|
|
|
|
|
117 |
A->>D: save MP3
|
118 |
</div>
|
119 |
<pre>
|
120 |
-
// Pseudocode:
|
121 |
-
def
|
122 |
-
|
123 |
-
|
124 |
-
|
125 |
-
|
126 |
-
|
127 |
-
return node
|
128 |
-
|
129 |
-
def retrieve(node, t_target):
|
130 |
-
if not node: return None
|
131 |
-
child = node.left if abs(node.left.t_fp - t_target) < abs(node.right.t_fp - t_target) else node.right
|
132 |
-
return retrieve(child, t_target) or child
|
133 |
</pre>
|
134 |
</section>
|
135 |
|
136 |
<!-- Performance -->
|
137 |
<section id="performance" class="mb-5">
|
138 |
<h2>3. Hardware Scalability & Throughput</h2>
|
139 |
-
<p>
|
140 |
<div class="diagram mermaid">
|
141 |
flowchart TB
|
142 |
HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
|
@@ -180,47 +179,46 @@ flowchart TB
|
|
180 |
|
181 |
<!-- Translation + Quick LoRA -->
|
182 |
<section id="translation" class="mb-5">
|
183 |
-
<h2>4.
|
184 |
<p>
|
185 |
-
|
186 |
-
For unsupported dialects, a <strong>quick LoRA epoch</strong>—using 1–2 hrs of local audio—adapts the base model in under 30 minutes.
|
187 |
</p>
|
188 |
<div class="diagram mermaid">
|
189 |
flowchart LR
|
190 |
-
D["Dialect
|
191 |
--> P["Preprocess & Align"]
|
192 |
-
--> T["Train LoRA
|
193 |
--> U["Updated Adapters"]
|
194 |
-
-->
|
195 |
</div>
|
196 |
<ul>
|
197 |
-
<li
|
198 |
-
<li
|
199 |
-
<li
|
200 |
-
<li
|
201 |
</ul>
|
202 |
</section>
|
203 |
|
204 |
<!-- Clinical Impact -->
|
205 |
<section id="impact" class="mb-5">
|
206 |
-
<h2>5. Clinical Impact &
|
207 |
<p>
|
208 |
-
|
209 |
-
|
210 |
</p>
|
211 |
<div class="row">
|
212 |
<div class="col-md-6">
|
213 |
<div class="accessibility-box">
|
214 |
-
⚠️ “Blood pressure critically low—initiate IV fluids immediately
|
215 |
-
|
216 |
</div>
|
217 |
</div>
|
218 |
<div class="col-md-6">
|
219 |
-
<p><strong>Dataset &
|
220 |
<ul>
|
221 |
-
<li>600 hrs
|
222 |
<li>ANOVA on MOS (p < 0.01)</li>
|
223 |
-
<li>Speaker similarity ≥ 92%; intelligibility
|
224 |
</ul>
|
225 |
</div>
|
226 |
</div>
|
@@ -231,7 +229,7 @@ flowchart LR
|
|
231 |
<h2>6. BibTeX</h2>
|
232 |
<pre>@article{coleman2025xttvmed,
|
233 |
author = {Coleman, Chris and Becker, Anthony},
|
234 |
-
title = {XTTVS-MED: Real-Time
|
235 |
journal = {GhostAI Labs},
|
236 |
year = {2025}
|
237 |
}</pre>
|
|
|
75 |
<!-- Header -->
|
76 |
<header class="text-center mb-5">
|
77 |
<h1>XTTVS-MED</h1>
|
78 |
+
<p class="lead">Real-time 4-Bit Semantic Voice Cloning & Voice-to-Voice Translation</p>
|
79 |
<p><strong>Chris Coleman</strong> — GhostAI Labs<br>
|
80 |
<strong>Dr. Anthony Becker, M.D.</strong> — Medical Advisor
|
81 |
</p>
|
|
|
85 |
<section id="overview" class="mb-5">
|
86 |
<h2>1. Overview</h2>
|
87 |
<p>
|
88 |
+
XTTVS-MED fuses Whisper ASR, 4-bit quantization, LoRA adapters, and a float-aligned CBR-RTree scheduler
|
89 |
+
to deliver sub-second, emotion-aware, multilingual voice-to-voice translation on devices ≥6 GB VRAM.
|
90 |
</p>
|
91 |
<div class="diagram mermaid">
|
92 |
flowchart LR
|
93 |
+
A["Input Audio"] --> W["Whisper ASR<br/>(Transcribe/Detect Lang)"]
|
94 |
+
W --> S["Normalize & Preprocess<br/>(Mel-Spectrogram)"]
|
95 |
S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
|
96 |
L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
|
97 |
Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
|
98 |
C --> M["XTTSv2 Transformer"]
|
99 |
M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
|
100 |
+
V --> B["Output Audio"]
|
101 |
</div>
|
102 |
</section>
|
103 |
|
|
|
109 |
participant U as User
|
110 |
participant G as Gradio UI
|
111 |
participant A as FastAPI
|
112 |
+
participant W as Whisper
|
113 |
+
participant M as TTS Server
|
114 |
participant D as Disk(outputs/)
|
115 |
+
U->>G: Record/Input Audio
|
116 |
+
G->>A: POST /voice2voice
|
117 |
+
A->>W: Whisper.transcribe(audio)
|
118 |
+
W-->>A: text + lang
|
119 |
+
A->>M: gen_voice(text, lang, settings)
|
120 |
+
M-->>A: synthesized audio + metrics
|
121 |
+
A->>G: return output audio & info
|
122 |
A->>D: save MP3
|
123 |
</div>
|
124 |
<pre>
|
125 |
+
// Pseudocode: Voice-to-Voice pipeline with CBR-RTree
|
126 |
+
def voice2voice(audio):
|
127 |
+
text, lang = whisper.transcribe(audio)
|
128 |
+
v4, t_fp = preprocess(text)
|
129 |
+
node = insert(None, v4, t_fp)
|
130 |
+
best = retrieve(node, t_fp)
|
131 |
+
return tts.generate(text, adapter=best.adapter)
|
|
|
|
|
|
|
|
|
|
|
|
|
132 |
</pre>
|
133 |
</section>
|
134 |
|
135 |
<!-- Performance -->
|
136 |
<section id="performance" class="mb-5">
|
137 |
<h2>3. Hardware Scalability & Throughput</h2>
|
138 |
+
<p>On-premise, HIPAA/GDPR compliant, supporting:</p>
|
139 |
<div class="diagram mermaid">
|
140 |
flowchart TB
|
141 |
HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
|
|
|
179 |
|
180 |
<!-- Translation + Quick LoRA -->
|
181 |
<section id="translation" class="mb-5">
|
182 |
+
<h2>4. Quick LoRA Epoch Training</h2>
|
183 |
<p>
|
184 |
+
For unsupported dialects: record 1–2 hrs of local speech, then train LoRA adapters—5–10 epochs in <strong>30 min</strong>—to extend coverage instantly.
|
|
|
185 |
</p>
|
186 |
<div class="diagram mermaid">
|
187 |
flowchart LR
|
188 |
+
D["Dialect Samples (1–2 hrs)"]
|
189 |
--> P["Preprocess & Align"]
|
190 |
+
--> T["Train LoRA Epochs<br/>(5–10)"]
|
191 |
--> U["Updated Adapters"]
|
192 |
+
--> I["Immediate Inference"]
|
193 |
</div>
|
194 |
<ul>
|
195 |
+
<li>Step 1: Capture ~1 hr dialect audio.</li>
|
196 |
+
<li>Step 2: Generate aligned spectrograms.</li>
|
197 |
+
<li>Step 3: Fine-tune LoRA adapters (30 min).</li>
|
198 |
+
<li>Step 4: Deploy instantly for voice-to-voice.</li>
|
199 |
</ul>
|
200 |
</section>
|
201 |
|
202 |
<!-- Clinical Impact -->
|
203 |
<section id="impact" class="mb-5">
|
204 |
+
<h2>5. Clinical Impact & Validation</h2>
|
205 |
<p>
|
206 |
+
Every second saved reduces mortality by ~7%.
|
207 |
+
Audio-to-audio translation in <1 s can improve survival by 10–15% for non-native speakers.
|
208 |
</p>
|
209 |
<div class="row">
|
210 |
<div class="col-md-6">
|
211 |
<div class="accessibility-box">
|
212 |
+
⚠️ “Blood pressure critically low—initiate IV fluids immediately.”<br/>
|
213 |
+
[Dual-text & audio UI]
|
214 |
</div>
|
215 |
</div>
|
216 |
<div class="col-md-6">
|
217 |
+
<p><strong>Dataset & Metrics:</strong></p>
|
218 |
<ul>
|
219 |
+
<li>600 hrs clinical dialogues</li>
|
220 |
<li>ANOVA on MOS (p < 0.01)</li>
|
221 |
+
<li>Speaker similarity ≥ 92%; MOS intelligibility ≥ 4.5/5</li>
|
222 |
</ul>
|
223 |
</div>
|
224 |
</div>
|
|
|
229 |
<h2>6. BibTeX</h2>
|
230 |
<pre>@article{coleman2025xttvmed,
|
231 |
author = {Coleman, Chris and Becker, Anthony},
|
232 |
+
title = {XTTVS-MED: Real-Time Voice-to-Voice Semantic Cloning to Prevent Medical Miscommunication},
|
233 |
journal = {GhostAI Labs},
|
234 |
year = {2025}
|
235 |
}</pre>
|