Spaces:

ghostai1
/

GHOSTVOICECBR

Running

App Files Files Community

ghostai1 commited on May 24

Commit

0cab41e

verified ·

1 Parent(s): 376cb8f

Update index.html

Browse files

Files changed (1) hide show

index.html +41 -43

index.html CHANGED Viewed

@@ -75,7 +75,7 @@
     <!-- Header -->
     <header class="text-center mb-5">
       <h1>XTTVS-MED</h1>
-      <p class="lead">Real-time 4-Bit Semantic Voice Cloning for Emergency & Accessibility</p>
       <p><strong>Chris Coleman</strong> &mdash; GhostAI Labs<br>
          <strong>Dr. Anthony Becker, M.D.</strong> &mdash; Medical Advisor
       </p>
@@ -85,17 +85,19 @@
     <section id="overview" class="mb-5">
       <h2>1. Overview</h2>
       <p>
-        XTTVS-MED fuses aggressive 4-bit quantization, LoRA speaker/emotion adapters, and a float-aligned CBR-RTree scheduler to generate emotion-aware, multilingual speech with sub-second latency on devices with ≥6 GB VRAM.
       </p>
       <div class="diagram mermaid">
 flowchart LR
-  T["Input Text"] --> S["Split & Preprocess<br/>(Mel-Spectrogram)"]
   S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
   L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
   Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
   C --> M["XTTSv2 Transformer"]
   M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
-  V --> A["Output Audio"]
       </div>
     </section>
@@ -107,36 +109,33 @@ sequenceDiagram
   participant U as User
   participant G as Gradio UI
   participant A as FastAPI
-  participant M as Model Server
   participant D as Disk(outputs/)
-  U->>G: Enter text & params
-  G->>A: POST /voice
-  A->>M: gen_voice(...)
-  M-->>A: audio + metrics
-  A->>G: return file + info
   A->>D: save MP3
       </div>
       <pre>
-// Pseudocode: CBR-RTree insertion & retrieval
-def insert(node, v4, t_fp):
-    if not node: return Node(v4, t_fp)
-    if t_fp < node.t_fp:
-        node.left  = insert(node.left, v4, t_fp)
-    else:
-        node.right = insert(node.right, v4, t_fp)
-    return node
-def retrieve(node, t_target):
-    if not node: return None
-    child = node.left if abs(node.left.t_fp - t_target) < abs(node.right.t_fp - t_target) else node.right
-    return retrieve(child, t_target) or child
       </pre>
     </section>
     <!-- Performance -->
     <section id="performance" class="mb-5">
       <h2>3. Hardware Scalability & Throughput</h2>
-      <p>Fully on-premise, HIPAA/GDPR compliant, running on:</p>
       <div class="diagram mermaid">
 flowchart TB
   HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
@@ -180,47 +179,46 @@ flowchart TB
     <!-- Translation + Quick LoRA -->
     <section id="translation" class="mb-5">
-      <h2>4. Translation & Quick LoRA Epoch Training</h2>
       <p>
-        XTTVS-MED auto-detects 50+ languages in ≤200 ms via an acoustic n-gram classifier.
-        For unsupported dialects, a <strong>quick LoRA epoch</strong>—using 1–2 hrs of local audio—adapts the base model in under 30 minutes.
       </p>
       <div class="diagram mermaid">
 flowchart LR
-  D["Dialect Audio (1–2 hrs)"]
   --> P["Preprocess & Align"]
-  --> T["Train LoRA Epoch<br/>(5–10 epochs)"]
   --> U["Updated Adapters"]
-  --> M["Inference Pipeline"]
       </div>
       <ul>
-        <li><strong>Step 1:</strong> Record ~1 hr of target dialect speech.</li>
-        <li><strong>Step 2:</strong> Extract Mel-spectrograms, align to transcripts.</li>
-        <li><strong>Step 3:</strong> Train LoRA adapters for speaker + dialect (5–10 epochs, 30 min).</li>
-        <li><strong>Step 4:</strong> Deploy updated adapters; new dialect instantaneously available.</li>
       </ul>
     </section>
     <!-- Clinical Impact -->
     <section id="impact" class="mb-5">
-      <h2>5. Clinical Impact & Data Science</h2>
       <p>
-        Each second saved in emergency care reduces mortality risk by ~7%. XTTVS-MED’s
-        200 ms detection +  <1 s synthesis can improve survival by 10–15% for non-native speakers.
       </p>
       <div class="row">
         <div class="col-md-6">
           <div class="accessibility-box">
-            ⚠️ “Blood pressure critically low—initiate IV fluids immediately.”
-            <br>[Dual-text & audio UI]
           </div>
         </div>
         <div class="col-md-6">
-          <p><strong>Dataset & Validation:</strong></p>
           <ul>
-            <li>600 hrs multilingual clinical dialogues</li>
             <li>ANOVA on MOS (p &lt; 0.01)</li>
-            <li>Speaker similarity ≥ 92%; intelligibility MOS ≥ 4.5/5</li>
           </ul>
         </div>
       </div>
@@ -231,7 +229,7 @@ flowchart LR
       <h2>6. BibTeX</h2>
       <pre>@article{coleman2025xttvmed,
   author    = {Coleman, Chris and Becker, Anthony},
-  title     = {XTTVS-MED: Real-Time Semantic 4-Bit Voice Cloning to Prevent Medical Miscommunication},
   journal   = {GhostAI Labs},
   year      = {2025}
 }</pre>

     <!-- Header -->
     <header class="text-center mb-5">
       <h1>XTTVS-MED</h1>
+      <p class="lead">Real-time 4-Bit Semantic Voice Cloning & Voice-to-Voice Translation</p>
       <p><strong>Chris Coleman</strong> &mdash; GhostAI Labs<br>
          <strong>Dr. Anthony Becker, M.D.</strong> &mdash; Medical Advisor
       </p>
     <section id="overview" class="mb-5">
       <h2>1. Overview</h2>
       <p>
+        XTTVS-MED fuses Whisper ASR, 4-bit quantization, LoRA adapters, and a float-aligned CBR-RTree scheduler
+        to deliver sub-second, emotion-aware, multilingual voice-to-voice translation on devices ≥6 GB VRAM.
       </p>
       <div class="diagram mermaid">
 flowchart LR
+  A["Input Audio"] --> W["Whisper ASR<br/>(Transcribe/Detect Lang)"]
+  W --> S["Normalize & Preprocess<br/>(Mel-Spectrogram)"]
   S --> L["LoRA Adapters<br/>(Speaker/Emotion/Urgency)"]
   L --> Q["FloatBin Quantization<br/>(FP32→FP16→INT4)"]
   Q --> C["CBR-RTree Scheduler<br/>(Urgency/Pitch/Emotion)"]
   C --> M["XTTSv2 Transformer"]
   M --> V["Vocoder<br/>(WaveRNN/HiFiGAN)"]
+  V --> B["Output Audio"]
       </div>
     </section>
   participant U as User
   participant G as Gradio UI
   participant A as FastAPI
+  participant W as Whisper
+  participant M as TTS Server
   participant D as Disk(outputs/)
+  U->>G: Record/Input Audio
+  G->>A: POST /voice2voice
+  A->>W: Whisper.transcribe(audio)
+  W-->>A: text + lang
+  A->>M: gen_voice(text, lang, settings)
+  M-->>A: synthesized audio + metrics
+  A->>G: return output audio & info
   A->>D: save MP3
       </div>
       <pre>
+// Pseudocode: Voice-to-Voice pipeline with CBR-RTree
+def voice2voice(audio):
+    text, lang = whisper.transcribe(audio)
+    v4, t_fp = preprocess(text)
+    node = insert(None, v4, t_fp)
+    best = retrieve(node, t_fp)
+    return tts.generate(text, adapter=best.adapter)
       </pre>
     </section>
     <!-- Performance -->
     <section id="performance" class="mb-5">
       <h2>3. Hardware Scalability & Throughput</h2>
+      <p>On-premise, HIPAA/GDPR compliant, supporting:</p>
       <div class="diagram mermaid">
 flowchart TB
   HF200["HF200 Cluster<br/>0.15 s"] --> H100["DGX H100<br/>0.25 s"]
     <!-- Translation + Quick LoRA -->
     <section id="translation" class="mb-5">
+      <h2>4. Quick LoRA Epoch Training</h2>
       <p>
+        For unsupported dialects: record 1–2 hrs of local speech, then train LoRA adapters—5–10 epochs in <strong>30 min</strong>—to extend coverage instantly.
       </p>
       <div class="diagram mermaid">
 flowchart LR
+  D["Dialect Samples (1–2 hrs)"]
   --> P["Preprocess & Align"]
+  --> T["Train LoRA Epochs<br/>(5–10)"]
   --> U["Updated Adapters"]
+  --> I["Immediate Inference"]
       </div>
       <ul>
+        <li>Step 1: Capture ~1 hr dialect audio.</li>
+        <li>Step 2: Generate aligned spectrograms.</li>
+        <li>Step 3: Fine-tune LoRA adapters (30 min).</li>
+        <li>Step 4: Deploy instantly for voice-to-voice.</li>
       </ul>
     </section>
     <!-- Clinical Impact -->
     <section id="impact" class="mb-5">
+      <h2>5. Clinical Impact & Validation</h2>
       <p>
+        Every second saved reduces mortality by ~7%.
+        Audio-to-audio translation in <1 s can improve survival by 10–15% for non-native speakers.
       </p>
       <div class="row">
         <div class="col-md-6">
           <div class="accessibility-box">
+            ⚠️ “Blood pressure critically low—initiate IV fluids immediately.”<br/>
+            [Dual-text & audio UI]
           </div>
         </div>
         <div class="col-md-6">
+          <p><strong>Dataset & Metrics:</strong></p>
           <ul>
+            <li>600 hrs clinical dialogues</li>
             <li>ANOVA on MOS (p &lt; 0.01)</li>
+            <li>Speaker similarity ≥ 92%; MOS intelligibility ≥ 4.5/5</li>
           </ul>
         </div>
       </div>
       <h2>6. BibTeX</h2>
       <pre>@article{coleman2025xttvmed,
   author    = {Coleman, Chris and Becker, Anthony},
+  title     = {XTTVS-MED: Real-Time Voice-to-Voice Semantic Cloning to Prevent Medical Miscommunication},
   journal   = {GhostAI Labs},
   year      = {2025}
 }</pre>