abocha committed
Commit d48101f · 1 parent: d44dfc0
Files changed (6):
  1. .gitignore +1 -1
  2. README.md +65 -6
  3. app.py +141 -131
  4. utils/merge_audio.py +112 -69
  5. utils/openai_tts.py +139 -83
  6. utils/script_parser.py +58 -26
.gitignore CHANGED
@@ -1,4 +1,4 @@
-pycache/
+__pycache__/
 *.pyc
 *.pyo
 *.pyd
README.md CHANGED
@@ -1,12 +1,71 @@
 ---
-title: Esl Dialogue Tts
-emoji: 📈
-colorFrom: indigo
-colorTo: yellow
+title: Dialogue TTS
+emoji: 🗣️🎙️
+colorFrom: blue
+colorTo: green
 sdk: gradio
-sdk_version: 5.29.0
 app_file: app.py
 pinned: false
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# Dialogue Script to Speech Synthesis
+
+This Hugging Face Space converts dialogue scripts into speech using OpenAI's TTS models (`tts-1`, `tts-1-hd`, `gpt-4o-mini-tts`).
+
+## Features
+
+* **Input Script**: Provide a dialogue script with lines in the format `[Speaker] Utterance`.
+* **TTS Models**: Choose from `tts-1`, `tts-1-hd`, or `gpt-4o-mini-tts`.
+* **Voice Configuration**:
+  * **Single Global Voice**: Use one voice for all speakers.
+  * **Random per Speaker**: Assigns a random voice to each speaker, kept consistent within a run.
+  * **A/B Round Robin**: Cycles through the available voices across the unique speakers.
+* **Detailed Per-Speaker UI**: Configure voice, speed (for `tts-1`/`tts-1-hd`), and emotional vibe or custom instructions (for `gpt-4o-mini-tts`) for each speaker individually.
+* **Output**:
+  * A ZIP file containing individual MP3s for each line.
+  * A single merged MP3 of the entire dialogue with custom pauses.
+* **Cost Estimation**: Displays an estimated cost before generating audio.
+* **NSFW Check**: Optional content safety check using an external API (if `NSFW_API_URL_TEMPLATE` is configured).
+
+## How to Use
+
+1. **Enter your dialogue script** in the text area. Example:
+   ```
+   [Alice] Hello Bob, how are you today?
+   [Bob] I'm doing great, Alice! Thanks for asking.
+   [Narrator] And so their conversation began.
+   ```
+2. **Select the TTS Model**.
+3. **Set the pause duration** (in milliseconds) between lines for the merged audio.
+4. **Choose a Speaker Configuration Method**:
+   * For "Single Voice (Global)", select the voice.
+   * For "Detailed Configuration...", click "Load/Refresh Per-Speaker Settings UI" and adjust the settings for each speaker.
+   * The other methods assign voices automatically.
+5. (Optional) Adjust **Global Speed** or **Global Instructions** if applicable to your chosen model and configuration.
+6. Click **"Calculate Cost"** to see an estimate.
+7. Click **"Generate Audio"**.
+8. Download the ZIP file, or listen to/download the merged MP3.
+
+## Secrets
+
+This Space requires the following secrets to be set in the Hugging Face Space settings:
+
+* `OPENAI_API_KEY`: Your OpenAI API key.
+* `NSFW_API_URL_TEMPLATE` (Optional): URL template for NSFW checking, e.g., `https://api.example.com/check?text={text}`. The `{text}` placeholder will be URL-encoded.
+* `MODEL_DEFAULT` (Optional): Default TTS model (e.g., `tts-1-hd`).
+
+## Smoke Test Script
+
+Use the following script to test basic functionality:
+
+```
+[Gandalf] You shall not pass!
+[Frodo] I will take the Ring to Mordor.
+[Gandalf] So be it.
+```
+
+Choose your desired model and settings (e.g., "Random per Speaker"), then generate.
+
+## Deployment
+
+This application is designed to be deployed as a Hugging Face Space.
+Ensure `ffmpeg` is available (handled by `container.yaml` for Classic Spaces).
+Set the necessary secrets in your Space settings on Hugging Face Hub.
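
Editor's note: the `[Speaker] Utterance` format above is parsed by `utils/script_parser.py` (+58 -26 in this commit; its hunks are not reproduced on this page). As a rough illustration only — the regex, function name, and return shape below are assumptions for this sketch, not the Space's actual parser:

```python
import re

# Hypothetical sketch: one dialogue line per text line, "[Speaker] Utterance".
LINE_RE = re.compile(r"^\[(?P<speaker>[^\]]+)\]\s*(?P<text>.+)$")

def parse_dialogue_script_sketch(script_text: str):
    """Return (parsed_lines, total_chars) for lines matching [Speaker] Utterance."""
    parsed, total_chars = [], 0
    for i, raw in enumerate(script_text.splitlines()):
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        m = LINE_RE.match(raw)
        if not m:
            # app.py catches ValueError from the real parser, so the sketch raises one too
            raise ValueError(f"Line {i + 1} is not in '[Speaker] Utterance' format: {raw!r}")
        parsed.append({"id": len(parsed), "speaker": m.group("speaker").strip(), "text": m.group("text").strip()})
        total_chars += len(m.group("text"))
    return parsed, total_chars

# Example: parse_dialogue_script_sketch("[Alice] Hi!\n[Bob] Hello!")
# -> ([{'id': 0, 'speaker': 'Alice', 'text': 'Hi!'},
#      {'id': 1, 'speaker': 'Bob', 'text': 'Hello!'}], 9)
```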
app.py CHANGED
@@ -46,13 +46,13 @@ SPEAKER_CONFIG_METHODS = [
46
  "Single Voice (Global)",
47
  "Random per Speaker",
48
  "A/B Round Robin",
49
- "Detailed Configuration (Per Speaker UI)" # New Method Name
50
  ]
51
  DEFAULT_SPEAKER_CONFIG_METHOD = "Random per Speaker"
52
- APP_AVAILABLE_VOICES = ALL_TTS_VOICES.copy()
53
 
54
  PREDEFINED_VIBES = {
55
- "None": "", # No specific instruction
56
  "Calm": "Speak in a calm, composed, and relaxed manner.",
57
  "Excited": "Speak with an energetic, enthusiastic, and lively tone.",
58
  "Happy": "Speak with a cheerful, bright, and joyful voice.",
@@ -63,7 +63,7 @@ PREDEFINED_VIBES = {
63
  "Formal": "Speak in a clear, precise, and professional tone, suitable for a formal address.",
64
  "Authoritative": "Speak with a commanding, confident, and firm voice.",
65
  "Friendly": "Speak in a warm, approachable, and amiable manner.",
66
- "Custom...": "CUSTOM" # Special value indicating custom text should be used
67
  }
68
  VIBE_CHOICES = list(PREDEFINED_VIBES.keys())
69
  DEFAULT_VIBE = "None"
@@ -72,45 +72,32 @@ def get_speakers_from_script(script_text):
72
  if not script_text.strip(): return []
73
  try:
74
  parsed_lines, _ = parse_dialogue_script(script_text)
75
- return sorted(list(set(p["speaker"] for p in parsed_lines)))
 
 
 
 
 
 
 
76
  except ValueError: return []
77
 
78
 
79
  def handle_dynamic_input_change(new_value, current_configs_state_dict, speaker_name, config_key, tts_model):
80
- """
81
- Updates the gr.State dictionary when a dynamic UI element changes.
82
- current_configs_state_dict is the raw dictionary from gr.State.
83
- """
84
  if speaker_name not in current_configs_state_dict:
85
  current_configs_state_dict[speaker_name] = {}
86
 
87
  current_configs_state_dict[speaker_name][config_key] = new_value
88
-
89
- # Special handling for Vibe -> Custom Instructions visibility (Simpler: custom textbox always visible)
90
- # For this iteration, custom textbox is always visible. Backend decides to use it.
91
-
92
- # Determine visibility/interactivity of speed slider for this specific speaker's UI (if we were to update it directly)
93
- # This is complex to do from a generic handler. Better to set initial visibility in load_refresh_per_speaker_ui.
94
- # Global tts_model_dropdown change will refresh the whole dynamic UI if needed for speed/instr applicability.
95
-
96
  return current_configs_state_dict
97
 
98
 
99
  def load_refresh_per_speaker_ui(script_text, current_configs_state_dict, tts_model):
100
- """
101
- Generates the dynamic UI components (accordions) for each speaker.
102
- Returns a list of Gradio components and the updated state.
103
- """
104
  unique_speakers = get_speakers_from_script(script_text)
105
  new_ui_components = []
106
 
107
- # Ensure state dict is not None (Gradio might pass None initially for gr.State)
108
  if current_configs_state_dict is None:
109
  current_configs_state_dict = {}
110
 
111
- # Update state for any new speakers or remove speakers no longer in script
112
- # (Optional: more complex logic could be to remove speakers from state if not in script)
113
- # For now, just add new ones with defaults if not present.
114
  for speaker_name in unique_speakers:
115
  if speaker_name not in current_configs_state_dict:
116
  current_configs_state_dict[speaker_name] = {
@@ -119,7 +106,6 @@ def load_refresh_per_speaker_ui(script_text, current_configs_state_dict, tts_mod
119
  "vibe": DEFAULT_VIBE,
120
  "custom_instructions": ""
121
  }
122
- # Ensure all keys exist for existing speakers (e.g., if new fields added)
123
  current_configs_state_dict[speaker_name].setdefault("voice", APP_AVAILABLE_VOICES[0])
124
  current_configs_state_dict[speaker_name].setdefault("speed", 1.0)
125
  current_configs_state_dict[speaker_name].setdefault("vibe", DEFAULT_VIBE)
@@ -128,42 +114,37 @@ def load_refresh_per_speaker_ui(script_text, current_configs_state_dict, tts_mod
128
 
129
  if not unique_speakers:
130
  new_ui_components.append(gr.Markdown("No speakers detected in the script, or script is empty. Type a script and click 'Load/Refresh' again."))
131
- # Return current (possibly empty) state and the markdown message
132
  return new_ui_components, current_configs_state_dict
133
 
134
 
135
  for speaker_name in unique_speakers:
136
- speaker_cfg = current_configs_state_dict[speaker_name] # Should exist now
137
 
138
- # Determine if speed/instructions are applicable for the current global TTS model
139
  speed_interactive = tts_model in ["tts-1", "tts-1-hd"]
140
- instructions_relevant = tts_model == "gpt-4o-mini-tts" # Vibe/Custom is primarily for this
141
 
142
  with gr.Accordion(label=f"Settings for: {speaker_name}", open=False) as speaker_accordion:
143
- # Voice Dropdown
144
  voice_dd = gr.Dropdown(
145
  label="Voice", choices=APP_AVAILABLE_VOICES, value=speaker_cfg["voice"], interactive=True
146
  )
147
  voice_dd.change(
148
  fn=partial(handle_dynamic_input_change, speaker_name=speaker_name, config_key="voice", tts_model=tts_model),
149
- inputs=[voice_dd, speaker_configs_state], # Pass the component itself and the state
150
  outputs=[speaker_configs_state]
151
  )
152
 
153
- # Speed Slider
154
  speed_slider_label = "Speech Speed" + (" (Active for tts-1/hd)" if speed_interactive else " (N/A for this model)")
155
  speed_slider = gr.Slider(
156
  label=speed_slider_label, minimum=0.25, maximum=4.0, value=speaker_cfg["speed"],
157
  step=0.05, interactive=speed_interactive
158
  )
159
- if speed_interactive: # Only attach listener if interactive
160
- speed_slider.release( # Use release to avoid too many updates during drag
161
  fn=partial(handle_dynamic_input_change, speaker_name=speaker_name, config_key="speed", tts_model=tts_model),
162
  inputs=[speed_slider, speaker_configs_state],
163
  outputs=[speaker_configs_state]
164
  )
165
 
166
- # Vibe Dropdown
167
  vibe_label = "Vibe/Emotion Preset" + (" (For gpt-4o-mini-tts)" if instructions_relevant else " (Less impact on other models)")
168
  vibe_dd = gr.Dropdown(
169
  label=vibe_label, choices=VIBE_CHOICES, value=speaker_cfg["vibe"], interactive=True
@@ -174,16 +155,15 @@ def load_refresh_per_speaker_ui(script_text, current_configs_state_dict, tts_mod
174
  outputs=[speaker_configs_state]
175
  )
176
 
177
- # Custom Instructions Textbox
178
  custom_instr_label = "Custom Instructions"
179
  custom_instr_placeholder = "Only used if Vibe is 'Custom...'. Overrides Vibe."
180
  custom_instr_tb = gr.Textbox(
181
  label=custom_instr_label,
182
  value=speaker_cfg["custom_instructions"],
183
  placeholder=custom_instr_placeholder,
184
- lines=2, interactive=True # Always interactive, backend logic decides if used
185
  )
186
- custom_instr_tb.input( # Use input for real-time typing updates
187
  fn=partial(handle_dynamic_input_change, speaker_name=speaker_name, config_key="custom_instructions", tts_model=tts_model),
188
  inputs=[custom_instr_tb, speaker_configs_state],
189
  outputs=[speaker_configs_state]
@@ -196,7 +176,6 @@ def load_refresh_per_speaker_ui(script_text, current_configs_state_dict, tts_mod
196
  async def handle_script_processing(
197
  dialogue_script: str, tts_model: str, pause_ms: int,
198
  speaker_config_method: str, global_voice_selection: str,
199
- # No more df_value, instead we use speaker_configs_state_dict from gr.State
200
  speaker_configs_state_dict: dict,
201
  global_speed: float,
202
  global_instructions: str, progress=gr.Progress(track_tqdm=True)):
@@ -204,65 +183,65 @@ async def handle_script_processing(
     if not OPENAI_API_KEY or not async_openai_client: return None, None, "Error: OPENAI_API_KEY missing."
     if not dialogue_script.strip(): return None, None, "Error: Script empty."
 
-    job_audio_path_prefix = os.path.join(tempfile.gettempdir(), "current_job_audio")
+    # Create a job-specific temporary directory and ensure it's clean
+    job_audio_path_prefix = os.path.join(tempfile.gettempdir(), f"dialogue_tts_job_{random.randint(10000, 99999)}")
     if os.path.exists(job_audio_path_prefix): shutil.rmtree(job_audio_path_prefix)
     os.makedirs(job_audio_path_prefix, exist_ok=True)
 
     try:
         parsed_lines, _ = parse_dialogue_script(dialogue_script)
-        if not parsed_lines: return None, None, "Error: No valid lines."
-    except ValueError as e: return None, None, f"Script error: {str(e)}"
+        if not parsed_lines:
+            shutil.rmtree(job_audio_path_prefix)
+            return None, None, "Error: No valid lines found in script."
+    except ValueError as e:
+        shutil.rmtree(job_audio_path_prefix)
+        return None, None, f"Script parsing error: {str(e)}"
 
-    # Ensure state dict is usable
     if speaker_configs_state_dict is None: speaker_configs_state_dict = {}
 
+    # --- Voice assignment map for Random and A/B per Speaker ---
+    speaker_voice_map = {}
+    if speaker_config_method in ["Random per Speaker", "A/B Round Robin"]:
+        unique_script_speakers_for_map = get_speakers_from_script(dialogue_script)
+        if speaker_config_method == "Random per Speaker":
+            for spk_name in unique_script_speakers_for_map:
+                speaker_voice_map[spk_name] = random.choice(APP_AVAILABLE_VOICES)
+        elif speaker_config_method == "A/B Round Robin":
+            for i, spk_name in enumerate(unique_script_speakers_for_map):
+                # Ensure APP_AVAILABLE_VOICES is not empty to prevent modulo by zero
+                if APP_AVAILABLE_VOICES:
+                    speaker_voice_map[spk_name] = APP_AVAILABLE_VOICES[i % len(APP_AVAILABLE_VOICES)]
+                else:  # Fallback if voice list is somehow empty
+                    speaker_voice_map[spk_name] = "alloy"  # Default OpenAI voice
+    # --- End voice assignment map ---
+
     tasks, line_audio_files = [], [None] * len(parsed_lines)
     for i, line_data in enumerate(parsed_lines):
         speaker_name = line_data["speaker"]
 
-        # Determine voice, speed, instructions for this line
-        line_voice = global_voice_selection
+        line_voice = global_voice_selection  # Default for "Single Voice (Global)" or fallback
         line_speed = global_speed
         line_instructions = global_instructions if global_instructions and global_instructions.strip() else None
 
         if speaker_config_method == "Detailed Configuration (Per Speaker UI)":
             spk_cfg = speaker_configs_state_dict.get(speaker_name, {})
-            line_voice = spk_cfg.get("voice", global_voice_selection) # Fallback to global if needed
-
-            # Speed: per-speaker if tts-1/hd and set, else global if tts-1/hd, else API default
+            line_voice = spk_cfg.get("voice", global_voice_selection)
             if tts_model in ["tts-1", "tts-1-hd"]:
                 line_speed = spk_cfg.get("speed", global_speed)
-
-            # Instructions: primarily for gpt-4o-mini-tts
             if tts_model == "gpt-4o-mini-tts":
                 vibe = spk_cfg.get("vibe", DEFAULT_VIBE)
                 custom_instr = spk_cfg.get("custom_instructions", "").strip()
-                if vibe == "Custom..." and custom_instr:
-                    line_instructions = custom_instr
-                elif vibe != "None" and vibe != "Custom...":
-                    line_instructions = PREDEFINED_VIBES.get(vibe, "")
-                # If vibe is None or Custom with no text, line_instructions might remain global or become ""
-                if not line_instructions and global_instructions and global_instructions.strip(): # Fallback to global if specific instructions are empty
-                    line_instructions = global_instructions
-                elif not line_instructions : # Ensure it's None if truly no instruction
-                    line_instructions = None
-
-
-        elif speaker_config_method == "Random per Speaker":
-            # Simplified: assign random now, could be cached as before for consistency within run
-            line_voice = random.choice(APP_AVAILABLE_VOICES)
-        elif speaker_config_method == "A/B Round Robin":
-            # Simplified: assign A/B now
-            unique_script_speakers = get_speakers_from_script(dialogue_script) # Re-get for this logic
-            speaker_idx = unique_script_speakers.index(speaker_name) if speaker_name in unique_script_speakers else 0
-            line_voice = APP_AVAILABLE_VOICES[speaker_idx % len(APP_AVAILABLE_VOICES)]
+                if vibe == "Custom..." and custom_instr: line_instructions = custom_instr
+                elif vibe != "None" and vibe != "Custom...": line_instructions = PREDEFINED_VIBES.get(vibe, "")
+                if not line_instructions and global_instructions and global_instructions.strip(): line_instructions = global_instructions
+                elif not line_instructions: line_instructions = None
+        elif speaker_config_method == "Random per Speaker" or speaker_config_method == "A/B Round Robin":
+            line_voice = speaker_voice_map.get(speaker_name, global_voice_selection)  # Use mapped voice
 
-        # Fallback for speed if not tts-1/hd (API won't use it anyway)
-        if tts_model not in ["tts-1", "tts-1-hd"]:
-            line_speed = 1.0 # API default, won't be sent
+        if tts_model not in ["tts-1", "tts-1-hd"]: line_speed = 1.0
 
         out_fn = os.path.join(job_audio_path_prefix, f"line_{line_data['id']}.mp3")
-        progress(i / len(parsed_lines), desc=f"Line {i+1}/{len(parsed_lines)} ({speaker_name})")
+        progress(i / len(parsed_lines), desc=f"Synthesizing: Line {i+1}/{len(parsed_lines)} ({speaker_name})")
         tasks.append(synthesize_speech_line(
             client=async_openai_client, text=line_data["text"], voice=line_voice,
             output_path=out_fn, model=tts_model, speed=line_speed,
@@ -271,102 +250,125 @@ async def handle_script_processing(
 
     results = await asyncio.gather(*tasks, return_exceptions=True)
     for idx, res in enumerate(results):
-        if isinstance(res, Exception): print(f"Error line {parsed_lines[idx]['id']}: {res}")
-        elif res is None: print(f"Skipped/failed line {parsed_lines[idx]['id']}")
-        else: line_audio_files[idx] = res
+        if isinstance(res, Exception): print(f"Error synthesizing line {parsed_lines[idx]['id']}: {res}")
+        elif res is None: print(f"Skipped or failed synthesizing line {parsed_lines[idx]['id']}")
+        else: line_audio_files[parsed_lines[idx]['id']] = res  # Store by original line ID if non-sequential
+
+    # Filter for valid, existing files, using the original parsed_lines order for merge
+    files_for_merge = []
+    for p_line in parsed_lines:
+        file_path = line_audio_files[p_line['id']]
+        if file_path and os.path.exists(file_path) and os.path.getsize(file_path) > 0:
+            files_for_merge.append(file_path)
+        else:
+            files_for_merge.append(None)  # Keep placeholder for correct ordering if a line failed
 
-    valid_files = [f for f in line_audio_files if f and os.path.exists(f) and os.path.getsize(f) > 0]
-    if not valid_files:
-        shutil.rmtree(job_audio_path_prefix); return None, None, "Error: No audio synthesized."
+    valid_files_for_zip = [f for f in files_for_merge if f]
+
+    if not valid_files_for_zip:
+        shutil.rmtree(job_audio_path_prefix); return None, None, "Error: No audio was successfully synthesized."
 
     zip_fn = os.path.join(job_audio_path_prefix, "dialogue_lines.zip")
-    with zipfile.ZipFile(zip_fn, 'w') as zf: [zf.write(p, os.path.basename(p)) for p in valid_files]
+    with zipfile.ZipFile(zip_fn, 'w') as zf:
+        for f_path in valid_files_for_zip:
+            zf.write(f_path, os.path.basename(f_path))
 
     merged_fn = os.path.join(job_audio_path_prefix, "merged_dialogue.mp3")
-    merged_path = merge_mp3_files([f for f in line_audio_files if f], merged_fn, pause_ms)
+    # Pass only existing files to merge_mp3_files, maintaining order
+    ordered_files_to_merge = [f for f in files_for_merge if f]
+    merged_path = merge_mp3_files(ordered_files_to_merge, merged_fn, pause_ms)
 
-    status = f"{len(valid_files)}/{len(parsed_lines)} lines. "
-    if len(valid_files) < len(parsed_lines): status += "Some failed. "
-    if not merged_path and len(valid_files) > 0: status += "Merge failed. "
-    elif not merged_path: status += "No audio."
-    else: status += "Generated."
+    status = f"Successfully processed {len(valid_files_for_zip)} out of {len(parsed_lines)} lines. "
+    if len(valid_files_for_zip) < len(parsed_lines): status += "Some lines may have failed. "
+    if not merged_path and len(valid_files_for_zip) > 0: status += "Merging audio failed. "
+    elif not merged_path: status = "No audio to merge."  # Overrides previous status if all failed before merge
+    else: status += "Merged audio generated."
+
+    # Note: job_audio_path_prefix (temp dir) is not explicitly deleted here.
+    # Gradio File/Audio components copy the file, so the temp dir can be cleaned
+    # by the OS or a cleanup routine if this Space were long-running.
+    # For HF Spaces, /tmp is ephemeral anyway. For robustness, could add shutil.rmtree(job_audio_path_prefix)
+    # after files are served, but need to ensure Gradio has finished with them.
+    # For now, rely on new unique dir per run and ephemeral /tmp.
 
     return (zip_fn if os.path.exists(zip_fn) else None,
             merged_path if merged_path and os.path.exists(merged_path) else None,
             status)
 
 
 def handle_calculate_cost(dialogue_script: str, tts_model: str):
-    # ... (same as before) ...
-    if not dialogue_script.strip(): return "Cost: $0.00 (Empty)"
+    if not dialogue_script.strip(): return "Cost: $0.00 (Script is empty)"
     try:
         parsed, chars = parse_dialogue_script(dialogue_script)
-        if not parsed: return "Cost: $0.00 (No lines)"
+        if not parsed: return "Cost: $0.00 (No valid lines in script)"
         cost = calculate_cost(chars, len(parsed), tts_model)
-        return f"Est. Cost: ${cost:.6f}"
-    except Exception as e: return f"Cost calc error: {str(e)}"
+        # Using .6f for precision, especially for char-based cost
+        return f"Estimated Cost for {len(parsed)} lines ({chars} chars): ${cost:.6f}"
+    except ValueError as e:  # Catch script length error from parser
+        return f"Cost calculation error: {str(e)}"
+    except Exception as e:
+        return f"An unexpected error occurred during cost calculation: {str(e)}"
 
 
 with gr.Blocks(theme=gr.themes.Soft()) as demo:
-    gr.Markdown("# Dialogue Script to Speech (Dynamic Per-Speaker UI)")
+    gr.Markdown("# Dialogue Script to Speech (OpenAI TTS)")
     if not OPENAI_API_KEY or not async_openai_client:
-        gr.Markdown("<h3 style='color:red;'>Warning: OPENAI_API_KEY not set.</h3>")
+        gr.Markdown("<h3 style='color:red;'>⚠️ Warning: OPENAI_API_KEY secret is not set or invalid. Audio generation will fail. Please configure it in your Space settings.</h3>")
 
-    # State to hold detailed speaker configurations
     speaker_configs_state = gr.State({})
 
     with gr.Row():
         with gr.Column(scale=2):
-            script_input = gr.TextArea(label="Dialogue Script", placeholder="[S1] Hi!\n[S2] Hello!", lines=10)
+            script_input = gr.TextArea(label="Dialogue Script", placeholder="[Speaker1] Hello world!\n[Speaker2] How are you today?", lines=10)
         with gr.Column(scale=1):
             tts_model_dropdown = gr.Dropdown(TTS_MODELS_AVAILABLE, label="TTS Model", value=MODEL_DEFAULT)
-            pause_input = gr.Number(label="Pause (ms)", value=500, minimum=0, maximum=5000, step=50)
-            global_speed_input = gr.Slider(minimum=0.25, maximum=4.0, value=1.0, step=0.05, label="Global Speed", visible=(MODEL_DEFAULT in ["tts-1", "tts-1-hd"]), interactive=True)
-            global_instructions_input = gr.Textbox(label="Global Instructions", placeholder="e.g., Speak calmly.", visible=(MODEL_DEFAULT == "gpt-4o-mini-tts"), interactive=True, lines=2)
+            pause_input = gr.Number(label="Pause Between Lines (ms)", value=500, minimum=0, maximum=5000, step=50)
+            global_speed_input = gr.Slider(minimum=0.25, maximum=4.0, value=1.0, step=0.05, label="Global Speed (for tts-1/hd)", visible=(MODEL_DEFAULT in ["tts-1", "tts-1-hd"]), interactive=True)
+            global_instructions_input = gr.Textbox(label="Global Instructions (for gpt-4o-mini-tts)", placeholder="e.g., Speak with a calm tone.", visible=(MODEL_DEFAULT == "gpt-4o-mini-tts"), interactive=True, lines=2)
 
-    gr.Markdown("### Speaker Configuration Method")
+    gr.Markdown("### Speaker Voice & Style Configuration")
     speaker_config_method_dropdown = gr.Dropdown(
-        SPEAKER_CONFIG_METHODS, label="Method", value=DEFAULT_SPEAKER_CONFIG_METHOD
+        SPEAKER_CONFIG_METHODS, label="Configuration Method", value=DEFAULT_SPEAKER_CONFIG_METHOD
     )
 
-    # UI for "Single Voice (Global)"
     with gr.Group(visible=(DEFAULT_SPEAKER_CONFIG_METHOD == "Single Voice (Global)")) as single_voice_group:
         global_voice_dropdown = gr.Dropdown(
-            APP_AVAILABLE_VOICES, label="Global Voice", value=APP_AVAILABLE_VOICES[0], interactive=True
+            APP_AVAILABLE_VOICES, label="Global Voice", value=APP_AVAILABLE_VOICES[0] if APP_AVAILABLE_VOICES else "alloy", interactive=True
        )
 
-    # UI for "Detailed Configuration (Per Speaker UI)"
     with gr.Column(visible=(DEFAULT_SPEAKER_CONFIG_METHOD == "Detailed Configuration (Per Speaker UI)")) as detailed_per_speaker_ui_group:
         load_per_speaker_ui_button = gr.Button("Load/Refresh Per-Speaker Settings UI (from Script Above)")
-        gr.Markdown("<small>Click button above to populate settings for each speaker found in the script. Settings are per-speaker.</small>")
-        # This column will be populated by the output of load_per_speaker_ui_button
+        gr.Markdown("<small>Click button above to populate settings for each speaker found in the script. Settings are applied per-speaker. If script changes, click again to refresh.</small>")
         dynamic_speaker_ui_area = gr.Column(elem_id="dynamic_ui_area_for_speakers")
 
 
     with gr.Row():
-        calculate_cost_button = gr.Button("Calculate Cost")
+        calculate_cost_button = gr.Button("Calculate Estimated Cost")
         generate_button = gr.Button("Generate Audio", variant="primary")
 
     cost_output = gr.Textbox(label="Estimated Cost", interactive=False)
     with gr.Row():
-        individual_lines_zip_output = gr.File(label="Download ZIP")
-        merged_dialogue_mp3_output = gr.Audio(label="Merged MP3", type="filepath")
-        status_output = gr.Textbox(label="Status", interactive=False, lines=1)
+        individual_lines_zip_output = gr.File(label="Download Individual Lines (ZIP)")
+        merged_dialogue_mp3_output = gr.Audio(label="Play/Download Merged Dialogue (MP3)", type="filepath")
+        status_output = gr.Textbox(label="Status", interactive=False, lines=2, max_lines=5)
 
-    # --- Event Handlers ---
     def update_model_controls_visibility(selected_model, script_text_for_refresh, current_speaker_configs_for_refresh):
-        # When model changes, also refresh the dynamic UI because speed/instr applicability changes
-        # This means load_refresh_per_speaker_ui will be called.
-        new_dynamic_ui, updated_state = load_refresh_per_speaker_ui(script_text_for_refresh, current_speaker_configs_for_refresh, selected_model)
+        new_dynamic_ui_components, updated_state = load_refresh_per_speaker_ui(script_text_for_refresh, current_speaker_configs_for_refresh, selected_model)
 
-        is_tts1 = selected_model in ["tts-1", "tts-1-hd"]
-        is_gpt_mini = selected_model == "gpt-4o-mini-tts"
+        is_tts1_family = selected_model in ["tts-1", "tts-1-hd"]
+        is_gpt_mini_tts = selected_model == "gpt-4o-mini-tts"
 
+        # It's crucial that dynamic_speaker_ui_area receives the *list* of components.
+        # If it's wrapped in a gr.update, it might not render correctly unless gr.update(children=...)
+        # Direct assignment seems to be what Gradio expects when outputting to a Column/Row that acts as a container.
         return {
-            global_speed_input: gr.update(visible=is_tts1, interactive=is_tts1),
-            global_instructions_input: gr.update(visible=is_gpt_mini, interactive=is_gpt_mini),
-            dynamic_speaker_ui_area: new_dynamic_ui, # Return the actual list of components
+            global_speed_input: gr.update(visible=is_tts1_family, interactive=is_tts1_family),
+            global_instructions_input: gr.update(visible=is_gpt_mini_tts, interactive=is_gpt_mini_tts),
+            dynamic_speaker_ui_area: new_dynamic_ui_components,
             speaker_configs_state: updated_state
         }
+
     tts_model_dropdown.change(
         fn=update_model_controls_visibility,
         inputs=[tts_model_dropdown, script_input, speaker_configs_state],
@@ -376,7 +378,6 @@ with gr.Blocks(theme=gr.themes.Soft()) as demo:
     def update_speaker_config_method_visibility(method):
         is_single = (method == "Single Voice (Global)")
         is_detailed_per_speaker = (method == "Detailed Configuration (Per Speaker UI)")
-        # Add more if other methods exist...
         return {
             single_voice_group: gr.update(visible=is_single),
             detailed_per_speaker_ui_group: gr.update(visible=is_detailed_per_speaker),
@@ -390,40 +391,49 @@ with gr.Blocks(theme=gr.themes.Soft()) as demo:
     load_per_speaker_ui_button.click(
         fn=load_refresh_per_speaker_ui,
         inputs=[script_input, speaker_configs_state, tts_model_dropdown],
-        # Output the list of components to the column, and the updated state to the state component
         outputs=[dynamic_speaker_ui_area, speaker_configs_state]
     )
 
     calculate_cost_button.click(fn=handle_calculate_cost, inputs=[script_input, tts_model_dropdown], outputs=[cost_output])
 
-    # Generate button now takes speaker_configs_state as input
     generate_button.click(
         fn=handle_script_processing,
         inputs=[
             script_input, tts_model_dropdown, pause_input,
             speaker_config_method_dropdown, global_voice_dropdown,
-            speaker_configs_state, # Pass the state object
+            speaker_configs_state,
             global_speed_input, global_instructions_input
         ],
         outputs=[individual_lines_zip_output, merged_dialogue_mp3_output, status_output])
 
-    gr.Markdown("## Examples")
+    gr.Markdown("## Example Scripts")
+    example_script_1 = "[Alice] Hello Bob, this is a test using the detailed configuration method.\n[Bob] Hi Alice! I'm Bob, and I'll have my own voice settings.\n[Alice] Let's see how this sounds."
+    example_script_2 = "[Narrator] This is a short story.\n[CharacterA] Once upon a time...\n[Narrator] ...there was a Gradio app.\n[CharacterB] And it could talk!"
+
     gr.Examples(
         examples=[
-            ["[Alice] Hello from Alice!\n[Bob] Bob here, testing the dynamic UI.", "tts-1-hd", 300, "Detailed Configuration (Per Speaker UI)", APP_AVAILABLE_VOICES[0], {}, 1.0, ""],
-            ["[Narrator] Just one line, using global.", "tts-1", 0, "Single Voice (Global)", "fable", {}, 1.2, ""],
+            [example_script_1, "tts-1-hd", 300, "Detailed Configuration (Per Speaker UI)", APP_AVAILABLE_VOICES[0] if APP_AVAILABLE_VOICES else "alloy", {}, 1.0, ""],
+            [example_script_2, "gpt-4o-mini-tts", 200, "Random per Speaker", APP_AVAILABLE_VOICES[0] if APP_AVAILABLE_VOICES else "alloy", {}, 1.0, "Speak with a gentle, storytelling voice for the narrator."],
+            ["[Solo] Just one line, using global voice and speed.", "tts-1", 0, "Single Voice (Global)", "fable", {}, 1.2, ""],
         ],
-        # Note: speaker_configs_state is passed as an empty dict {} for examples.
-        # The user would click "Load/Refresh Per-Speaker UI" after an example loads.
+        # speaker_configs_state is passed as an empty dict {} for examples.
+        # For "Detailed Configuration", the user should click "Load/Refresh Per-Speaker UI" after an example loads to populate the UI.
         inputs=[
             script_input, tts_model_dropdown, pause_input,
             speaker_config_method_dropdown, global_voice_dropdown,
             speaker_configs_state,
            global_speed_input, global_instructions_input
         ],
+        # Outputs for examples are not strictly necessary to pre-compute if cache_examples=False
+        # but defining them can help Gradio understand the flow.
+        # We can make the example click run the full processing.
         outputs=[individual_lines_zip_output, merged_dialogue_mp3_output, status_output],
-        fn=handle_script_processing, cache_examples=False)
+        fn=handle_script_processing,
+        cache_examples=False  # Set to True if pre-computation is desired and feasible
+    )
 
 if __name__ == "__main__":
-    if os.name == 'nt': asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
-    demo.launch(debug=True)
+    # Required for Windows if using asyncio with ProactorEventLoop which can be default
+    if os.name == 'nt':
+        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+    demo.launch(debug=True)  # Debug=True for development, remove for production/HF Space
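
Editor's note on the app.py changes above: `handle_script_processing` now builds `speaker_voice_map` once per run, so "Random per Speaker" and "A/B Round Robin" give each speaker a stable voice across all of their lines instead of re-deciding per line. A self-contained sketch of just that mapping step (the voice list below is assumed for illustration; the app uses `APP_AVAILABLE_VOICES`):

```python
import random

# Illustrative only: mirrors the commit's per-run mapping with an assumed voice list.
VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]

def build_voice_map(speakers, method, rng=random):
    """Assign one voice per unique speaker, kept consistent for the whole run."""
    if method == "Random per Speaker":
        return {name: rng.choice(VOICES) for name in speakers}
    if method == "A/B Round Robin":
        # Speaker i gets voice i modulo the number of voices.
        return {name: VOICES[i % len(VOICES)] for i, name in enumerate(speakers)}
    return {}

# Example: build_voice_map(["Gandalf", "Frodo"], "A/B Round Robin")
# -> {'Gandalf': 'alloy', 'Frodo': 'echo'}
```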
utils/merge_audio.py CHANGED
@@ -4,92 +4,135 @@ import os
 def merge_mp3_files(file_paths, output_filename, pause_ms=500):
     """
     Merges multiple MP3 files into a single MP3 file with a specified pause
-    between each segment.
+    between each segment. Skips missing or empty files.
+    Args:
+        file_paths (list): A list of paths to MP3 files to merge.
+                           Can contain None entries for files that failed synthesis; these will be skipped.
+        output_filename (str): The path to save the merged MP3 file.
+        pause_ms (int): Duration of silence in milliseconds to add between segments.
+    Returns:
+        str: The path to the merged MP3 file if successful, None otherwise.
     """
     if not file_paths:
+        print("Warning: No file paths provided for merging.")
         return None
 
-    combined = AudioSegment.empty()
-    pause_segment = AudioSegment.silent(duration=pause_ms) if pause_ms > 0 else AudioSegment.empty()
+    valid_segments = []
+    for file_path in file_paths:
+        if file_path and os.path.exists(file_path) and os.path.getsize(file_path) > 0:
+            try:
+                segment = AudioSegment.from_mp3(file_path)
+                valid_segments.append(segment)
+            except Exception as e:
+                print(f"Error loading audio segment from {file_path}: {e}. Skipping this file.")
+        elif file_path:  # File path provided but file is missing or empty
+            print(f"Warning: File {file_path} is missing or empty. Skipping.")
+        # If file_path is None, it's silently skipped (already handled upstream)
 
-    for i, file_path in enumerate(file_paths):
-        if not os.path.exists(file_path) or os.path.getsize(file_path) == 0:
-            print(f"Warning: File {file_path} is missing or empty. Skipping.")
-            continue
-        try:
-            segment = AudioSegment.from_mp3(file_path)
-            combined += segment
-            if i < len(file_paths) - 1: # Don't add pause after the last segment
-                combined += pause_segment
-        except Exception as e:
-            print(f"Error processing file {file_path}: {e}. Skipping.")
-            continue
-
-    if len(combined) == 0:
+    if not valid_segments:
         print("No valid audio segments found to merge.")
         return None
 
+    # Start with the first valid segment
+    combined_audio = valid_segments[0]
+
+    # Add subsequent segments with pauses
+    if len(valid_segments) > 1:
+        pause_segment = AudioSegment.silent(duration=max(0, pause_ms))  # Ensure pause_ms is not negative
+        for segment in valid_segments[1:]:
+            combined_audio += pause_segment
+            combined_audio += segment
+
     try:
-        combined.export(output_filename, format="mp3")
-        return output_filename
+        # Export the combined audio to MP3 format
+        # May require ffmpeg/libav to be installed and accessible in PATH
+        combined_audio.export(output_filename, format="mp3")
+        if os.path.exists(output_filename) and os.path.getsize(output_filename) > 0:
+            return output_filename
+        else:
+            print(f"Error: Merged file {output_filename} was not created or is empty after export.")
+            return None
     except Exception as e:
         print(f"Error exporting merged MP3 to {output_filename}: {e}")
         return None
 
+# Helper function to create dummy MP3 files for testing (requires pydub and ffmpeg)
+def _create_dummy_mp3(filename, duration_ms=1000, text_for_log="dummy"):
+    try:
+        # Create a silent audio segment
+        silence = AudioSegment.silent(duration=duration_ms)
+        # Export it as an MP3 file
+        silence.export(filename, format="mp3")
+        print(f"Successfully created dummy MP3: {filename} (duration: {duration_ms}ms) for '{text_for_log}'")
+        return True
+    except Exception as e:
+        print(f"Could not create dummy MP3 '{filename}'. Ensure ffmpeg is installed and accessible. Error: {e}")
+        return False
+
 if __name__ == '__main__':
-    # Create dummy mp3 files for testing (requires ffmpeg to be installed and pydub)
-    # This test assumes you have some small MP3s or can generate them.
-    # For a self-contained test, you might need to generate silent MP3s.
+    print("--- Testing merge_mp3_files ---")
 
-    print("This script is intended to be used as a module.")
-    print("To test, ensure you have some MP3 files and call merge_mp3_files directly.")
-    # Example:
-    # create_dummy_mp3("dummy1.mp3", duration_ms=1000)
-    # create_dummy_mp3("dummy2.mp3", duration_ms=1500)
-    # merge_mp3_files(["dummy1.mp3", "dummy2.mp3"], "merged_output.mp3", pause_ms=200)
-    # os.remove("dummy1.mp3")
-    # os.remove("dummy2.mp3")
-    # os.remove("merged_output.mp3")
-
-    # Helper to create dummy files if needed for a more robust test
-    def create_dummy_mp3(filename, duration_ms=1000):
-        try:
-            silence = AudioSegment.silent(duration=duration_ms)
-            silence.export(filename, format="mp3")
-            print(f"Created dummy file: {filename}")
-        except Exception as e:
-            print(f"Could not create dummy MP3 {filename}. Ensure ffmpeg is installed and accessible. Error: {e}")
-
-
-    # Create dummy files for testing
-    dummy_files_exist = True
-    try:
-        create_dummy_mp3("test_dummy1.mp3", 1000)
-        create_dummy_mp3("test_dummy2.mp3", 1500)
-    except Exception:
-        dummy_files_exist = False
-        print("Skipping merge test as dummy files could not be created (ffmpeg issue?).")
-
-    if dummy_files_exist:
-        print("\nTesting merge_mp3_files...")
-        files_to_merge = ["test_dummy1.mp3", "test_dummy2.mp3", "non_existent_file.mp3"]
-        output_merged = "test_merged_audio.mp3"
+    test_output_dir = "test_audio_merge_output"
+    os.makedirs(test_output_dir, exist_ok=True)
+
+    dummy_files = []
+    # Create some dummy MP3 files for the test
+    if _create_dummy_mp3(os.path.join(test_output_dir, "dummy1.mp3"), 1000, "Segment 1"):
+        dummy_files.append(os.path.join(test_output_dir, "dummy1.mp3"))
+    if _create_dummy_mp3(os.path.join(test_output_dir, "dummy2.mp3"), 1500, "Segment 2"):
+        dummy_files.append(os.path.join(test_output_dir, "dummy2.mp3"))
+
+    # Test case 1: Merge existing files
+    if len(dummy_files) == 2:
+        output_merged_1 = os.path.join(test_output_dir, "merged_test1.mp3")
+        print(f"\nAttempting to merge: {dummy_files} with 300ms pause into {output_merged_1}")
+        result_path_1 = merge_mp3_files(dummy_files, output_merged_1, pause_ms=300)
+        if result_path_1 and os.path.exists(result_path_1):
+            print(f"SUCCESS: Merged audio created at: {result_path_1} (Size: {os.path.getsize(result_path_1)} bytes)")
+        else:
+            print(f"FAILURE: Merging test case 1 failed.")
+    else:
+        print("\nSkipping merge test case 1 due to failure in creating dummy files.")
 
-        result_path = merge_mp3_files(files_to_merge, output_merged, pause_ms=300)
-        if result_path and os.path.exists(result_path):
-            print(f"Successfully merged audio to: {result_path}")
-            # Simple check: merged file should be larger than individual (roughly)
-            merged_size = os.path.getsize(result_path)
-            dummy1_size = os.path.getsize("test_dummy1.mp3")
-            print(f"Size of {result_path}: {merged_size} bytes (dummy1 was {dummy1_size})")
-            if merged_size > dummy1_size : # crude check
-                print("Merge test seems OK.")
-            else:
-                print("Merged file size issue.")
-            os.remove(result_path)
+    # Test case 2: Include a non-existent file and a None entry
+    files_with_issues = [
+        dummy_files[0] if dummy_files else None,
+        os.path.join(test_output_dir, "non_existent_file.mp3"),
+        None,  # Representing a failed synthesis
+        dummy_files[1] if len(dummy_files) > 1 else None
+    ]
+    # Filter out None from the list if dummy files weren't created
+    files_with_issues_filtered = [f for f in files_with_issues if f is not None or isinstance(f, str)]
+
+    if any(f for f in files_with_issues_filtered if f and os.path.exists(f)):  # Proceed if at least one valid dummy file exists
+        output_merged_2 = os.path.join(test_output_dir, "merged_test2_with_issues.mp3")
+        print(f"\nAttempting to merge: {files_with_issues_filtered} with 500ms pause into {output_merged_2}")
+        result_path_2 = merge_mp3_files(files_with_issues_filtered, output_merged_2, pause_ms=500)
+        if result_path_2 and os.path.exists(result_path_2):
+            print(f"SUCCESS (with skips): Merged audio created at: {result_path_2} (Size: {os.path.getsize(result_path_2)} bytes)")
         else:
-            print("Failed to merge audio.")
+            print(f"NOTE: Merging test case 2 might result in fewer segments or failure if no valid files remained.")
+    else:
+        print("\nSkipping merge test case 2 as no valid dummy files were available.")
+
+    # Test case 3: Empty list of files
+    output_merged_3 = os.path.join(test_output_dir, "merged_test3_empty.mp3")
+    print(f"\nAttempting to merge an empty list of files into {output_merged_3}")
+    result_path_3 = merge_mp3_files([], output_merged_3, pause_ms=100)
+    if result_path_3 is None:
+        print("SUCCESS: Correctly handled empty file list (returned None).")
+    else:
+        print(f"FAILURE: Expected None for empty file list, got {result_path_3}")
+
+    # Test case 4: List with only None or invalid paths
+    output_merged_4 = os.path.join(test_output_dir, "merged_test4_all_invalid.mp3")
+    print(f"\nAttempting to merge list with only invalid/None files into {output_merged_4}")
+    result_path_4 = merge_mp3_files([None, "non_existent.mp3"], output_merged_4, pause_ms=100)
+    if result_path_4 is None:
+        print("SUCCESS: Correctly handled list with only invalid/None files (returned None).")
+    else:
+        print(f"FAILURE: Expected None for all-invalid list, got {result_path_4}")
 
-    # Clean up dummy files
-    if os.path.exists("test_dummy1.mp3"): os.remove("test_dummy1.mp3")
-    if os.path.exists("test_dummy2.mp3"): os.remove("test_dummy2.mp3")
+    print(f"\nTest finished. Check ./{test_output_dir}/ for any generated files.")
+    # You might want to add shutil.rmtree(test_output_dir) here for cleanup after visual inspection.
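
Editor's note: as the comments in the hunk above say, pydub's `AudioSegment.from_mp3` and `export` shell out to ffmpeg (or libav), which is why the README's Deployment section requires `ffmpeg`. A minimal preflight check, offered only as a sketch and not part of this commit:

```python
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg (or avconv) binary is on PATH, which pydub needs for MP3 I/O."""
    return shutil.which("ffmpeg") is not None or shutil.which("avconv") is not None

if not ffmpeg_available():
    print("Warning: ffmpeg not found on PATH; pydub MP3 load/export will fail.")
```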
 
utils/openai_tts.py CHANGED
@@ -3,49 +3,64 @@ import os
  import time
  from openai import AsyncOpenAI, OpenAIError, RateLimitError
  import httpx # For NSFW check

- # Expanded list of voices based on recent OpenAI documentation
- OPENAI_VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer', 'ash', 'ballad', 'coral', 'sage', 'verse']

- # Concurrency limiter
  MAX_CONCURRENT_REQUESTS = 2
  semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

- # Retry mechanism
  MAX_RETRIES = 3
- INITIAL_BACKOFF_SECONDS = 1

  async def is_content_safe(text: str, api_url_template: str | None) -> bool:
      """
      Checks if the content is safe using an external NSFW API.
-     Returns True if safe or if API URL is not provided, False if unsafe.
      """
      if not api_url_template:
-         return True

      if "{text}" not in api_url_template:
-         print("Warning: NSFW_API_URL_TEMPLATE does not contain {text} placeholder. Skipping NSFW check.")
-         return True

      try:
-         encoded_text = httpx.utils.quote(text)
-         url = api_url_template.format(text=encoded_text)

-         async with httpx.AsyncClient() as client:
-             response = await client.get(url, timeout=10.0)

-         if response.status_code == 200:
-             return True
-         else:
-             print(f"NSFW Check: API request failed or content flagged. Status: {response.status_code}, Response: {response.text[:200]}")
-             return False
      except httpx.RequestError as e:
-         print(f"NSFW Check: API request error: {e}")
-         return False
      except Exception as e:
          print(f"NSFW Check: An unexpected error occurred: {e}")
-         return False
-

  async def synthesize_speech_line(
      client: AsyncOpenAI,
@@ -53,116 +68,157 @@ async def synthesize_speech_line(
      voice: str,
      output_path: str,
      model: str = "tts-1-hd",
-     speed: float = 1.0,
-     instructions: str | None = None,
      nsfw_api_url_template: str | None = None,
-     line_index: int = -1
  ) -> str | None:
      """
      Synthesizes a single line of text to speech using OpenAI TTS.
-     Includes speed and instructions parameters based on model compatibility.
-     Retries on RateLimitError with exponential backoff.
      Returns the output_path if successful, None otherwise.
      """
      if nsfw_api_url_template:
          if not await is_content_safe(text, nsfw_api_url_template):
-             print(f"Line {line_index if line_index != -1 else 'N/A'}: Content flagged as NSFW. Skipping synthesis.")
-             return None

      current_retry = 0
      backoff_seconds = INITIAL_BACKOFF_SECONDS

      async with semaphore:
-         while current_retry < MAX_RETRIES:
              try:
                  request_params = {
                      "model": model,
-                     "voice": voice,
                      "input": text,
-                     "response_format": "mp3"
                  }

-                 # Add speed if model supports it and speed is not default
                  if model in ["tts-1", "tts-1-hd"]:
-                     if speed is not None and speed != 1.0: # OpenAI default is 1.0
-                         # Ensure speed is within valid range for safety, though UI should also constrain this
-                         clamped_speed = max(0.25, min(speed, 4.0))
-                         request_params["speed"] = clamped_speed

-                 # Add instructions if model supports it and instructions are provided
-                 # Assuming gpt-4o-mini-tts supports it, and tts-1/tts-1-hd do not.
-                 if model not in ["tts-1", "tts-1-hd"] and instructions: # Example: gpt-4o-mini-tts
-                     request_params["instructions"] = instructions

                  response = await client.audio.speech.create(**request_params)
                  await response.astream_to_file(output_path)
-                 return output_path

              except RateLimitError as e:
                  current_retry += 1
-                 if current_retry >= MAX_RETRIES:
-                     print(f"Line {line_index if line_index != -1 else ''}: Max retries reached for RateLimitError. Error: {e}")
                      return None
-                 print(f"Line {line_index if line_index != -1 else ''}: Rate limit hit. Retrying in {backoff_seconds}s... (Attempt {current_retry}/{MAX_RETRIES})")
                  await asyncio.sleep(backoff_seconds)
-                 backoff_seconds *= 2
-             except OpenAIError as e:
-                 print(f"Line {line_index if line_index != -1 else ''}: OpenAI API error: {e}")
                  return None
-             except Exception as e:
-                 print(f"Line {line_index if line_index != -1 else ''}: An unexpected error occurred during synthesis: {e}")
-                 return None
-         return None

  if __name__ == '__main__':
      async def main_test():
          api_key = os.getenv("OPENAI_API_KEY")
          if not api_key:
-             print("OPENAI_API_KEY not set. Skipping test.")
              return

          client = AsyncOpenAI(api_key=api_key)

-         test_lines = [
-             {"id": 0, "speaker": "Alice", "text": "Hello, this is a test line for Alice, spoken quickly."},
-             {"id": 1, "speaker": "Bob", "text": "And this is Bob, testing his voice with instructions.", "instructions": "Speak in a deep, resonant voice."},
-             {"id": 2, "speaker": "Alice", "text": "A short reply, spoken slowly.", "speed": 0.8},
-             {"id": 3, "speaker": "Charlie", "text": "Charlie here, normal speed."}
          ]

-         temp_dir = "test_audio_output_enhanced"
-         os.makedirs(temp_dir, exist_ok=True)

-         tasks = []
-         for i, line_data in enumerate(test_lines):
-             # Test with specific models to check param compatibility
-             # For Alice (speed): tts-1-hd. For Bob (instructions): gpt-4o-mini-tts
-             current_model = "tts-1-hd"
-             if "instructions" in line_data:
-                 current_model = "gpt-4o-mini-tts" # Example, ensure this model is available for your key
-
-             voice = OPENAI_VOICES[i % len(OPENAI_VOICES)]
-             output_file = os.path.join(temp_dir, f"line_{line_data['id']}_{current_model}.mp3")

-             tasks.append(
                  synthesize_speech_line(
-                     client,
-                     line_data["text"],
-                     voice,
-                     output_file,
-                     model=current_model,
-                     speed=line_data.get("speed", 1.0),
                      instructions=line_data.get("instructions"),
                      line_index=line_data['id']
                  )
              )

-         results = await asyncio.gather(*tasks)

-         successful_files = [r for r in results if r]
-         print(f"\nSuccessfully synthesized {len(successful_files)} out of {len(test_lines)} lines.")
-         for f_path in successful_files:
-             print(f" - {f_path}")

-     if os.name == 'nt':
          asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
      asyncio.run(main_test())
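For reference, a quick check of the URL-encoding behaviour the new version below switches to: `urllib.parse.quote` is the standard-library way to make text query-string safe (httpx does not expose a public `quote` helper, which is why the `httpx.utils.quote` call above was fragile). A small sketch:

```python
from urllib.parse import quote

# Spaces and '&' must be percent-encoded before substitution into a query string.
print(quote("hello world & friends"))   # hello%20world%20%26%20friends
print(quote("is this safe?", safe=""))  # is%20this%20safe%3F (safe="" also encodes '/')
```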
 
  import time
  from openai import AsyncOpenAI, OpenAIError, RateLimitError
  import httpx # For NSFW check
+ import urllib.parse # For URL-encoding text in the NSFW check

+ # Voices available for OpenAI TTS models (tts-1, tts-1-hd, gpt-4o-mini-tts).
+ # As of May 2024 these are the primary documented voices; ash, ballad, coral, sage and verse
+ # have only been mentioned for GPT-4o's native voice capabilities.
+ OPENAI_VOICES = ['alloy', 'echo', 'fable', 'onyx', 'nova', 'shimmer']
+ # If gpt-4o-mini-tts explicitly supports more or different voices, this list may need adjustment,
+ # or the app could query available voices if an API endpoint for that exists. For now, assume these are common.

+ # Concurrency limiter for OpenAI API calls
  MAX_CONCURRENT_REQUESTS = 2
  semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)

+ # Retry mechanism parameters
  MAX_RETRIES = 3
+ INITIAL_BACKOFF_SECONDS = 1.0 # Start with 1 second
+ MAX_BACKOFF_SECONDS = 16.0 # Cap backoff to avoid excessively long waits

  async def is_content_safe(text: str, api_url_template: str | None) -> bool:
      """
      Checks if the content is safe using an external NSFW API.
+     Returns True if the content is safe, if no API URL is provided, or if the check fails open.
+     Returns False if the content is flagged as unsafe by the API.
      """
      if not api_url_template:
+         return True # No NSFW check configured, assume safe

      if "{text}" not in api_url_template:
+         print(f"Warning: NSFW_API_URL_TEMPLATE ('{api_url_template}') does not contain a {{text}} placeholder. Skipping NSFW check.")
+         return True # Configuration error, fail open (assume safe)

      try:
+         encoded_text = urllib.parse.quote(text) # Ensure the text is URL-safe
+         url = api_url_template.replace("{text}", encoded_text) # Use replace for simplicity

+         # Use a timeout for the external API call
+         async with httpx.AsyncClient(timeout=10.0) as client:
+             response = await client.get(url)

+         response.raise_for_status() # Raises an exception for 4xx/5xx responses
+
+         # Assuming the API returns a specific response to indicate safety.
+         # This part needs to be adapted to the actual API's response format.
+         # For example, if it returns JSON: `data = response.json()`.
+         # If it returns 200 for safe and non-200 for unsafe, raise_for_status handles it.
+         # For this placeholder, we assume 200 means safe.
+         return True # Content is safe based on the API response
+
+     except httpx.HTTPStatusError as e:
+         # Log specific HTTP errors from the NSFW API
+         print(f"NSFW Check: API request failed. Status: {e.response.status_code}. URL: {e.request.url}. Response: {e.response.text[:200]}")
+         # Depending on policy, you might "fail closed" (treat as unsafe) or "fail open"
+         return False # Content flagged as unsafe, or an API error
      except httpx.RequestError as e:
+         print(f"NSFW Check: API request error: {e}. URL: {e.request.url if e.request else 'N/A'}")
+         return True # Fail open (assume safe) on network/request errors so TTS is not blocked
      except Exception as e:
          print(f"NSFW Check: An unexpected error occurred: {e}")
+         return True # Fail open (assume safe) on other unexpected errors

  async def synthesize_speech_line(
      client: AsyncOpenAI,

      voice: str,
      output_path: str,
      model: str = "tts-1-hd",
+     speed: float = 1.0, # Speed parameter (0.25 to 4.0). Default 1.0.
+     instructions: str | None = None, # For models like gpt-4o-mini-tts
      nsfw_api_url_template: str | None = None,
+     line_index: int = -1 # For logging purposes
  ) -> str | None:
      """
      Synthesizes a single line of text to speech using OpenAI TTS.
+     Handles rate limiting with exponential backoff and NSFW checks.
      Returns the output_path if successful, None otherwise.
      """
+     if not text.strip():
+         print(f"Line {line_index if line_index != -1 else '(unknown)'}: Input text is empty. Skipping synthesis.")
+         return None
+
      if nsfw_api_url_template:
          if not await is_content_safe(text, nsfw_api_url_template):
+             print(f"Line {line_index if line_index != -1 else '(unknown)'}: Content flagged as potentially unsafe. Skipping synthesis.")
+             return None # Skip synthesis for flagged content

      current_retry = 0
      backoff_seconds = INITIAL_BACKOFF_SECONDS

+     # Acquire the semaphore before entering the retry loop
      async with semaphore:
+         while current_retry <= MAX_RETRIES:
              try:
                  request_params = {
                      "model": model,
                      "input": text,
+                     "voice": voice,
+                     "response_format": "mp3" # Explicitly request mp3
                  }

+                 # Add speed if the model is tts-1 or tts-1-hd and speed is not the default 1.0
                  if model in ["tts-1", "tts-1-hd"]:
+                     # The OpenAI API speed range is 0.25 to 4.0.
+                     # Clamp speed to be safe, although the UI should also enforce this.
+                     clamped_speed = max(0.25, min(float(speed), 4.0))
+                     if clamped_speed != 1.0: # Only send if not the default
+                         request_params["speed"] = clamped_speed

+                 # Add instructions if provided and the model is gpt-4o-mini-tts (or another future model supporting them);
+                 # tts-1 and tts-1-hd do not support an 'instructions' parameter.
+                 if model == "gpt-4o-mini-tts" and instructions and instructions.strip():
+                     request_params["instructions"] = instructions.strip()
+
+                 # Log the request params being sent (excluding sensitive parts like the full text)
+                 # print(f"Line {line_index}: Sending request to OpenAI TTS with params: {{'model': '{model}', 'voice': '{voice}', 'speed': {request_params.get('speed', 1.0)}, 'has_instructions': {bool(request_params.get('instructions'))}}}")

                  response = await client.audio.speech.create(**request_params)
+
+                 # Stream the response to file
                  await response.astream_to_file(output_path)
+
+                 # Verify the file was created and has content
+                 if os.path.exists(output_path) and os.path.getsize(output_path) > 0:
+                     return output_path
+                 else:
+                     print(f"Line {line_index if line_index != -1 else ''}: Synthesis appeared to succeed but the output file is missing or empty: {output_path}")
+                     return None # File not created, or empty
+
              except RateLimitError as e:
                  current_retry += 1
+                 if current_retry > MAX_RETRIES:
+                     print(f"Line {line_index if line_index != -1 else ''}: Max retries reached due to RateLimitError. Error: {e}")
                      return None
+
+                 # Exponential backoff; jitter could be added, but simple doubling for now
+                 print(f"Line {line_index if line_index != -1 else ''}: Rate limit hit (Attempt {current_retry}/{MAX_RETRIES}). Retrying in {backoff_seconds:.2f}s...")
                  await asyncio.sleep(backoff_seconds)
+                 backoff_seconds = min(backoff_seconds * 2, MAX_BACKOFF_SECONDS) # Increase backoff, capped at the max
+
+             except OpenAIError as e: # Catch other specific OpenAI errors
+                 print(f"Line {line_index if line_index != -1 else ''}: OpenAI API error during synthesis: {type(e).__name__} - {e}")
                  return None
+
+             except Exception as e: # Catch any other unexpected errors
+                 print(f"Line {line_index if line_index != -1 else ''}: An unexpected error occurred during synthesis: {type(e).__name__} - {e}")
+                 # current_retry += 1 # Could also retry on generic errors if deemed transient
+                 # if current_retry > MAX_RETRIES: return None
+                 # await asyncio.sleep(backoff_seconds)
+                 # backoff_seconds = min(backoff_seconds * 2, MAX_BACKOFF_SECONDS)
+                 return None # For most unexpected errors it is safer not to retry indefinitely
+
+         # Reached only if the loop exits after max retries without returning output_path
+         print(f"Line {line_index if line_index != -1 else ''}: Failed to synthesize after all retries or due to a non-retryable error.")
+         return None

  if __name__ == '__main__':
      async def main_test():
          api_key = os.getenv("OPENAI_API_KEY")
          if not api_key:
+             print("OPENAI_API_KEY environment variable not set. Skipping test.")
              return

+         # Test with a mock NSFW API template.
+         # Replace it with a real one if you have one, or set it to None to disable the check.
+         mock_nsfw_template = "https://api.example.com/nsfw_check?text={text}" # This will likely fail open
+
          client = AsyncOpenAI(api_key=api_key)

+         test_lines_data = [
+             {"id": 0, "text": "Hello from Alloy, this is a test of standard tts-1-hd.", "voice": "alloy", "model": "tts-1-hd", "speed": 1.0},
+             {"id": 1, "text": "Echo here, speaking a bit faster.", "voice": "echo", "model": "tts-1-hd", "speed": 1.3},
+             {"id": 2, "text": "Fable, narrating slowly and calmly.", "voice": "fable", "model": "tts-1", "speed": 0.8},
+             {"id": 3, "text": "This is Onyx with instructions for gpt-4o-mini-tts: speak with a deep, commanding voice.", "voice": "onyx", "model": "gpt-4o-mini-tts", "instructions": "Speak with a very deep, commanding and slightly robotic voice."},
+             {"id": 4, "text": "Nova, testing default speed with tts-1.", "voice": "nova", "model": "tts-1"},
+             {"id": 5, "text": "Shimmer testing gpt-4o-mini-tts without specific instructions.", "voice": "shimmer", "model": "gpt-4o-mini-tts"},
+             {"id": 6, "text": "This line contains potentially naughty words that might be flagged.", "voice": "alloy", "model": "tts-1-hd", "nsfw_check": True}, # Test NSFW
+             {"id": 7, "text": "", "voice": "echo", "model": "tts-1"}, # Test empty text
          ]

+         temp_output_dir = "test_audio_output_openai_tts"
+         os.makedirs(temp_output_dir, exist_ok=True)
+         print(f"Test audio will be saved in ./{temp_output_dir}/")

+         synthesis_tasks = []
+         for line_data in test_lines_data:
+             output_file_path = os.path.join(temp_output_dir, f"line_{line_data['id']}_{line_data['voice']}_{line_data['model']}.mp3")

+             nsfw_url = mock_nsfw_template if line_data.get("nsfw_check") else None
+
+             synthesis_tasks.append(
                  synthesize_speech_line(
+                     client=client,
+                     text=line_data["text"],
+                     voice=line_data["voice"],
+                     output_path=output_file_path,
+                     model=line_data["model"],
+                     speed=line_data.get("speed", 1.0), # Default speed if not specified
                      instructions=line_data.get("instructions"),
+                     nsfw_api_url_template=nsfw_url,
                      line_index=line_data['id']
                  )
              )

+         results = await asyncio.gather(*synthesis_tasks)
+
+         successful_files_count = 0
+         print("\n--- Test Synthesis Results ---")
+         for i, result_path in enumerate(results):
+             if result_path and os.path.exists(result_path):
+                 print(f"SUCCESS: Line {test_lines_data[i]['id']} -> {result_path} (Size: {os.path.getsize(result_path)} bytes)")
+                 successful_files_count += 1
+             else:
+                 print(f"FAILURE or SKIP: Line {test_lines_data[i]['id']} (Text: '{test_lines_data[i]['text'][:30]}...')")

+         print(f"\nSuccessfully synthesized {successful_files_count} out of {len(test_lines_data)} lines.")
+         print(f"Please check the ./{temp_output_dir}/ directory for output files.")

+     # Run the async test
+     if os.name == 'nt': # Required for the Windows asyncio selector policy
          asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
      asyncio.run(main_test())
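A minimal driver for `synthesize_speech_line` as defined above (a sketch, assuming the repo root is on `sys.path` and `OPENAI_API_KEY` is set; with `MAX_CONCURRENT_REQUESTS = 2` at most two requests run concurrently, and rate-limited calls back off 1s -> 2s -> 4s before giving up):

```python
import asyncio
import os

from openai import AsyncOpenAI

from utils.openai_tts import synthesize_speech_line

async def demo() -> None:
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # One line, keyword arguments matching the signature above.
    path = await synthesize_speech_line(
        client,
        text="Quick smoke test of a single line.",
        voice="nova",
        output_path="demo_line.mp3",
        model="tts-1",  # 'speed' only applies to tts-1 / tts-1-hd
        speed=1.1,
        line_index=0,
    )
    print("written:" if path else "failed/skipped:", path)

asyncio.run(demo())
```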
utils/script_parser.py CHANGED
@@ -2,15 +2,15 @@ import re
  import math

  MAX_SCRIPT_LENGTH = 10000 # characters
- TTS_1_HD_COST_PER_CHAR = 0.00003 # $30 / 1M chars
- GPT_4O_MINI_TTS_COST_PER_SECOND = 0.015 / 60 # $0.015 / minute
- CHARS_PER_SECOND_ESTIMATE = 10 # Rough estimate for TTS duration

  def parse_dialogue_script(script_text):
      """
-     Parses a dialogue script into a list of (index, speaker, utterance) tuples.
      Input format: "[Speaker] Utterance" per line.
-     Lines not matching the format are attempted to be parsed as "[Default] Utterance".
      """
      lines = script_text.strip().split('\n')
      parsed_lines = []
@@ -22,22 +22,24 @@ def parse_dialogue_script(script_text):
      for i, line_content in enumerate(lines):
          line_content = line_content.strip()
          if not line_content:
-             continue

          match = re.match(r'\[(.*?)\]\s*(.*)', line_content)
          if match:
              speaker, utterance = match.groups()
              utterance = utterance.strip()
          else:
-             # If no speaker tag, assign a default speaker or handle as per requirements
-             # For now, let's assume the whole line is an utterance by a "Narrator" or similar
-             speaker = "Narrator" # Or consider raising an error/warning
-             utterance = line_content.strip()

-         if not utterance: # Skip if utterance is empty after parsing
              continue

-         parsed_lines.append({"id": i, "speaker": speaker.strip(), "text": utterance})
          total_chars += len(utterance)

      return parsed_lines, total_chars
@@ -46,38 +48,68 @@ def calculate_cost(total_chars, num_lines, model_name="tts-1-hd"):
      """
      Calculates the estimated cost for TTS processing.
      """
-     if model_name == "tts-1-hd":
-         cost = total_chars * TTS_1_HD_COST_PER_CHAR
      elif model_name == "gpt-4o-mini-tts":
-         # Estimate duration: total_chars / X chars per second
-         # This is a very rough estimate. Actual duration depends on OpenAI's model.
-         estimated_seconds = total_chars / CHARS_PER_SECOND_ESTIMATE
          cost = estimated_seconds * GPT_4O_MINI_TTS_COST_PER_SECOND
-     else:
-         raise ValueError(f"Unknown model for cost calculation: {model_name}")
      return cost

  if __name__ == '__main__':
-     sample_script = """
  [Alice] Hello Bob, how are you?
  [Bob] I'm fine, Alice. And you?
  This is a line without a speaker tag.
  [Charlie] Just listening in.
  """
-     parsed, chars = parse_dialogue_script(sample_script)
      print("Parsed Lines:")
      for p_line in parsed:
          print(p_line)
-     print(f"\nTotal Characters: {chars}")

      cost_hd = calculate_cost(chars, len(parsed), "tts-1-hd")
      print(f"Estimated cost for tts-1-hd: ${cost_hd:.6f}")

      cost_gpt_mini = calculate_cost(chars, len(parsed), "gpt-4o-mini-tts")
-     print(f"Estimated cost for gpt-4o-mini-tts: ${cost_gpt_mini:.6f}")

-     long_script = "a" * (MAX_SCRIPT_LENGTH + 1)
      try:
-         parse_dialogue_script(long_script)
      except ValueError as e:
-         print(f"Error for long script: {e}")

  import math

  MAX_SCRIPT_LENGTH = 10000 # characters
+ TTS_1_HD_COST_PER_CHAR = 0.00003 # $30 / 1M chars (the tts-1-hd rate; tts-1 is $15 / 1M, so this over-estimates tts-1)
+ GPT_4O_MINI_TTS_COST_PER_SECOND = 0.015 / 60 # $0.015 / minute for gpt-4o-mini-tts
+ CHARS_PER_SECOND_ESTIMATE = 12 # Average characters spoken per second, for duration estimation

  def parse_dialogue_script(script_text):
      """
+     Parses a dialogue script into a list of dictionaries, each representing one line.
      Input format: "[Speaker] Utterance" per line.
+     Lines not matching the format are assigned to a "Narrator" speaker.
      """
      lines = script_text.strip().split('\n')
      parsed_lines = []

      for i, line_content in enumerate(lines):
          line_content = line_content.strip()
          if not line_content:
+             continue # Skip empty lines

          match = re.match(r'\[(.*?)\]\s*(.*)', line_content)
          if match:
              speaker, utterance = match.groups()
+             speaker = speaker.strip()
              utterance = utterance.strip()
+             if not speaker: # The speaker tag is empty, as in "[] Text"
+                 speaker = "UnknownSpeaker"
          else:
+             # No speaker tag: treat the whole line as an utterance by "Narrator"
+             speaker = "Narrator"
+             utterance = line_content # Already stripped

+         if not utterance: # Skip if the utterance is empty after parsing (e.g. "[Speaker]" with no text)
              continue

+         parsed_lines.append({"id": i, "speaker": speaker, "text": utterance})
          total_chars += len(utterance)

      return parsed_lines, total_chars

      """
      Calculates the estimated cost for TTS processing.
      """
+     cost = 0.0
+     if model_name in ["tts-1", "tts-1-hd"]: # tts-1 is actually cheaper; the tts-1-hd rate gives a conservative estimate for both
+         cost = total_chars * TTS_1_HD_COST_PER_CHAR
      elif model_name == "gpt-4o-mini-tts":
+         # Estimate duration as total_chars / CHARS_PER_SECOND_ESTIMATE. This is a rough estimate;
+         # the actual duration depends on the voice and the text.
+         # Note: OpenAI's tts-1 models are priced per character, but the spec for this app prices
+         # gpt-4o-mini-tts by duration: seconds x ($0.015 / 60), i.e. $0.015 per minute.
+         # The duration-based formula from the spec is used here.
+         if CHARS_PER_SECOND_ESTIMATE <= 0: # Avoid division by zero
+             estimated_seconds = total_chars / 10.0 # Fallback chars/sec
+         else:
+             estimated_seconds = total_chars / CHARS_PER_SECOND_ESTIMATE
          cost = estimated_seconds * GPT_4O_MINI_TTS_COST_PER_SECOND
+     else: # Fall back to character-based costing for any other tts-1-like model
+         cost = total_chars * TTS_1_HD_COST_PER_CHAR
+         # Alternatively: raise ValueError(f"Unknown model for cost calculation: {model_name}")
      return cost

  if __name__ == '__main__':
+     sample_script_1 = """
  [Alice] Hello Bob, how are you?
  [Bob] I'm fine, Alice. And you?
  This is a line without a speaker tag.
  [Charlie] Just listening in.
+ [] This line has an empty speaker tag.
+ [EmptySpeakerText]
  """
+     print("--- Test Case 1: Mixed Script ---")
+     parsed, chars = parse_dialogue_script(sample_script_1)
      print("Parsed Lines:")
      for p_line in parsed:
          print(p_line)
+     print(f"\nTotal Characters for TTS: {chars}")

      cost_hd = calculate_cost(chars, len(parsed), "tts-1-hd")
      print(f"Estimated cost for tts-1-hd: ${cost_hd:.6f}")

+     cost_tts1 = calculate_cost(chars, len(parsed), "tts-1")
+     print(f"Estimated cost for tts-1: ${cost_tts1:.6f}")
+
+     # Test the cost for gpt-4o-mini-tts using the per-second formula
      cost_gpt_mini = calculate_cost(chars, len(parsed), "gpt-4o-mini-tts")
+     print(f"Estimated cost for gpt-4o-mini-tts (at {CHARS_PER_SECOND_ESTIMATE} chars/sec): ${cost_gpt_mini:.6f}")
+
+     print("\n--- Test Case 2: Long Script (Boundary Check) ---")
+     long_script_text = "[SpeakerA] " + "a" * (MAX_SCRIPT_LENGTH - 11) # 11 chars for "[SpeakerA] "
+     parsed_long, chars_long = parse_dialogue_script(long_script_text)
+     print(f"Long script (length {len(long_script_text)} chars) parsed successfully. TTS Chars: {chars_long}")

      try:
+         too_long_script = "a" * (MAX_SCRIPT_LENGTH + 1)
+         parse_dialogue_script(too_long_script)
      except ValueError as e:
+         print(f"Correctly caught error for too-long script: {e}")
+
+     print("\n--- Test Case 3: Empty and Invalid Scripts ---")
+     parsed_empty, chars_empty = parse_dialogue_script("")
+     print(f"Empty script: Parsed lines: {len(parsed_empty)}, Chars: {chars_empty}")
+     parsed_blank_lines, chars_blank_lines = parse_dialogue_script("\n\n[Speaker]\n\n")
+     print(f"Script with blank/invalid lines: Parsed lines: {len(parsed_blank_lines)}, Chars: {chars_blank_lines} (Result: {parsed_blank_lines})")