hellorahulk committed
Commit 2302206 · 1 Parent(s): 58d3731

Add video caption app with Whisper auto-captioning and styling options

Files changed (5)
  1. .gitignore +41 -0
  2. README.md +25 -0
  3. app.py +633 -0
  4. requirements.txt +9 -0
  5. setup.sh +24 -0
.gitignore ADDED
@@ -0,0 +1,41 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # Distribution / packaging
+ dist/
+ build/
+ *.egg-info/
+
+ # Virtual environments
+ venv/
+ env/
+ ENV/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # Temporary files
+ temp/
+ tmp/
+ *.temp
+ *.tmp
+
+ # OS-specific files
+ .DS_Store
+ Thumbs.db
+
+ # Model weights/large files
+ *.pt
+ *.pth
+ *.model
+
+ # Logs
+ logs/
+ *.log
+
+ # Testing
+ .coverage
+ htmlcov/
+ .pytest_cache/
README.md CHANGED
@@ -9,4 +9,29 @@ app_file: app.py
  pinned: false
  ---
 
+ # Video Caption Generator
+
+ This tool lets you add captions to your videos with precise control over styling and positioning. You can either auto-generate captions with Whisper speech recognition or provide your own in SRT, ASS, or VTT format.
+
+ ## Features
+
+ - **Auto Caption Generation**: Extract and transcribe audio from your video using OpenAI's Whisper model
+ - **Manual Caption Support**: Provide your own captions in popular formats (SRT, ASS, VTT)
+ - **Customizable Styling**: Control the font, size, color, and positioning of captions
+ - **High-Quality Output**: Burn captions directly into your video with FFmpeg
+
+ ## How to Use
+
+ 1. Upload your video file
+ 2. Choose whether to auto-generate captions or provide your own
+ 3. Customize the font, size, color, and alignment
+ 4. Click "Generate Captioned Video" and wait for processing
+ 5. Download the resulting video with embedded captions
+
+ Ideal for creating accessible content, adding subtitles to multilingual videos, or emphasizing key information in educational material.
+
+ ## Note
+
+ Processing time depends on video length and complexity. Auto-captioning runs Whisper and may take longer for larger files.
+
  Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
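The manual-caption path described above accepts SRT among other formats. For reference (this sample is illustrative, not part of the commit), a minimal two-cue SRT file of the kind the app's parser expects looks like this — timestamps use `HH:MM:SS,mmm`, with a comma before the milliseconds, and cues are separated by a blank line:

```
1
00:00:01,000 --> 00:00:04,000
Welcome to the video.

2
00:00:04,500 --> 00:00:08,000
Captions can span
multiple lines.
```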
app.py ADDED
@@ -0,0 +1,633 @@
+ import os
+ import tempfile
+ import gradio as gr
+ import ffmpeg
+ import logging
+ import whisper as openai_whisper  # Renamed to avoid potential conflicts
+ import numpy as np
+ import torch
+ import datetime
+ import subprocess
+ import shlex
+ from pathlib import Path
+ import re  # For parsing ASS/SRT
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
+ logger = logging.getLogger(__name__)
+
+ # Define fonts directory - adapt for Hugging Face environment if needed
+ FONTS_DIR = '/usr/share/fonts/truetype'  # Common Linux font location
+ # Check common font locations for other OSes if needed
+ if not os.path.exists(FONTS_DIR) and os.path.exists('/System/Library/Fonts'):  # macOS
+     FONTS_DIR = '/System/Library/Fonts'
+ elif not os.path.exists(FONTS_DIR) and os.path.exists(r'C:\Windows\Fonts'):  # Windows (raw string avoids backslash escapes)
+     FONTS_DIR = r'C:\Windows\Fonts'
+
+ FONT_PATHS = {}
+ ACCEPTABLE_FONTS = ['Arial', 'Helvetica', 'Times New Roman']  # Start with common fallbacks
+ try:
+     if FONTS_DIR and os.path.exists(FONTS_DIR):
+         logger.info(f"Searching for fonts in: {FONTS_DIR}")
+         found_fonts = []
+         for root, dirs, files in os.walk(FONTS_DIR):
+             for file in files:
+                 if file.lower().endswith(('.ttf', '.otf', '.ttc')):
+                     font_path = os.path.join(root, file)
+                     font_name = os.path.splitext(file)[0]
+                     # Basic name cleanup
+                     base_font_name = re.sub(r'[-_ ]?(bold|italic|regular|medium|light|condensed)?$', '', font_name, flags=re.IGNORECASE)
+                     if base_font_name not in FONT_PATHS:
+                         FONT_PATHS[base_font_name] = font_path
+                         found_fonts.append(base_font_name)
+         if found_fonts:
+             ACCEPTABLE_FONTS = sorted(set(found_fonts + ACCEPTABLE_FONTS))
+             logger.info(f"Found system fonts: {ACCEPTABLE_FONTS}")
+         else:
+             logger.warning(f"No font files found in {FONTS_DIR}. Using defaults.")
+     else:
+         logger.warning(f"Font directory {FONTS_DIR} not found. Using defaults: {ACCEPTABLE_FONTS}")
+ except Exception as e:
+     logger.warning(f"Could not load system fonts from {FONTS_DIR}: {e}. Using defaults: {ACCEPTABLE_FONTS}")
+
+ # Global variable for the Whisper model to avoid reloading
+ whisper_model = None
+
+ def generate_style_line(options):
+     """Generate an ASS style line from options, using common defaults.
+     ASS colours are &HAABBGGRR (alpha, blue, green, red); PrimaryColour
+     here uses &H00BBGGRR, i.e. fully opaque, as expected by FFmpeg's ass filter.
+     """
+     # Convert a hex color-picker value (#RRGGBB) to ASS format (&H00BBGGRR)
+     def hex_to_ass_bgr(hex_color):
+         hex_color = hex_color.lstrip('#')
+         if len(hex_color) == 6:
+             r, g, b = tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
+             return f"&H00{b:02X}{g:02X}{r:02X}"
+         return '&H00FFFFFF'  # Default to white if the format is wrong
+
+     primary_color_ass = hex_to_ass_bgr(options.get('primary_color', '#FFFFFF'))
+
+     style_options = {
+         'Name': 'Default',
+         'Fontname': options.get('font_name', 'Arial'),  # Ensure this font is accessible to FFmpeg
+         'Fontsize': options.get('font_size', 24),
+         'PrimaryColour': primary_color_ass,
+         'SecondaryColour': '&H000000FF',  # Often unused, but good to define
+         'OutlineColour': '&H00000000',  # Black outline
+         'BackColour': '&H80000000',  # Semi-transparent black background/shadow
+         'Bold': 0,  # Use -1 for True, 0 for False in ASS
+         'Italic': 0,
+         'Underline': 0,
+         'StrikeOut': 0,
+         'ScaleX': 100,
+         'ScaleY': 100,
+         'Spacing': 0,
+         'Angle': 0,
+         'BorderStyle': 1,  # 1 = Outline + Shadow
+         'Outline': 2,  # Outline thickness
+         'Shadow': 1,  # Shadow distance
+         'Alignment': options.get('alignment', 2),  # 2 = Bottom Center
+         'MarginL': 10,
+         'MarginR': 10,
+         'MarginV': 10,  # Bottom margin
+         'Encoding': 1  # Default ANSI encoding
+     }
+     logger.info(f"Generated ASS Style Options: {style_options}")
+     return f"Style: {','.join(map(str, style_options.values()))}"
+
+ def transcribe_audio(audio_path, progress=None):
+     """Transcribe audio using the Whisper ASR model."""
+     global whisper_model
+     logger.info(f"Starting transcription for: {audio_path}")
+     try:
+         if whisper_model is None:
+             safe_progress_update(progress, 0.1, "Loading Whisper model...")
+             device = "cuda" if torch.cuda.is_available() else "cpu"
+             logger.info(f"Using device: {device} for Whisper")
+             # Use a smaller model when only a CPU is available, to speed things up
+             model_size = "base" if device == "cuda" else "tiny.en"  # or "tiny" for multilingual audio
+             logger.info(f"Loading Whisper model size: {model_size}")
+             whisper_model = openai_whisper.load_model(model_size, device=device)
+         safe_progress_update(progress, 0.3, "Model loaded, processing audio...")
+
+         result = whisper_model.transcribe(audio_path, fp16=torch.cuda.is_available())
+         logger.info(f"Transcription result (first 100 chars): {str(result)[:100]}")
+         safe_progress_update(progress, 0.7, "Transcription complete, formatting captions...")
+         return result
+     except Exception:
+         logger.exception(f"Error transcribing audio: {audio_path}")  # logger.exception includes the traceback
+         raise
+
+ def format_time(seconds):
+     """Format time for ASS (H:MM:SS.cc, i.e. hundredths of a second)."""
+     hundredths = int((seconds % 1) * 100)
+     s = int(seconds) % 60
+     m = int(seconds / 60) % 60
+     h = int(seconds / 3600)
+     return f"{h}:{m:02d}:{s:02d}.{hundredths:02d}"
+
+ def format_time_srt(seconds):
+     """Format time for SRT (HH:MM:SS,mmm)."""
+     ms = int((seconds % 1) * 1000)
+     s = int(seconds) % 60
+     m = int(seconds / 60) % 60
+     h = int(seconds / 3600)
+     return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
+
+ def generate_srt_from_transcript(segments):
+     """Convert Whisper segments to SRT format."""
+     srt_content = ""
+     for i, segment in enumerate(segments):
+         start_time = format_time_srt(segment["start"])
+         end_time = format_time_srt(segment["end"])
+         text = segment["text"].strip()
+         srt_content += f"{i+1}\n{start_time} --> {end_time}\n{text}\n\n"
+     logger.info(f"Generated SRT (first 200 chars): {srt_content[:200]}")
+     return srt_content.strip()
+
+ def generate_ass_dialogue_line(segment, style_name='Default'):
+     """Generate a single ASS dialogue line from a segment."""
+     start_time = format_time(segment["start"])
+     end_time = format_time(segment["end"])
+     text = segment["text"].strip().replace('\n', '\\N')  # Replace newline with the ASS newline
+     # Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
+     return f"Dialogue: 0,{start_time},{end_time},{style_name},,0,0,0,,{text}"
+
+ def generate_ass_from_transcript(segments, style_options):
+     """Convert Whisper segments to ASS format, including the style header."""
+     style_line = generate_style_line(style_options)
+     ass_header = f"""[Script Info]
+ Title: Generated Captions
+ ScriptType: v4.00+
+ WrapStyle: 0
+ PlayResX: 384
+ PlayResY: 288
+ ScaledBorderAndShadow: yes
+
+ [V4+ Styles]
+ Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
+ {style_line}
+
+ [Events]
+ Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
+ """
+     dialogue_lines = [generate_ass_dialogue_line(seg) for seg in segments]
+     full_ass_content = ass_header + "\n".join(dialogue_lines)
+     logger.info(f"Generated ASS (first 300 chars): {full_ass_content[:300]}")
+     return full_ass_content
+
+ def extract_audio(video_path, output_path):
+     """Extract audio from a video file using an ffmpeg subprocess."""
+     logger.info(f"Attempting to extract audio from {video_path} to {output_path}")
+     try:
+         command = [
+             "ffmpeg", "-i", video_path,
+             "-vn",  # No video
+             "-acodec", "pcm_s16le",  # Standard WAV format
+             "-ac", "1",  # Mono
+             "-ar", "16000",  # 16 kHz sample rate (common for ASR)
+             "-y",  # Overwrite output
+             output_path
+         ]
+         logger.info(f"Running audio extraction command: {' '.join(map(shlex.quote, command))}")
+         process = subprocess.run(
+             command,
+             stdout=subprocess.PIPE,
+             stderr=subprocess.PIPE,
+             text=True,
+             encoding='utf-8',  # Explicitly set encoding
+             check=False
+         )
+
+         if process.returncode != 0:
+             logger.error(f"FFmpeg audio extraction error (Code {process.returncode}):\nSTDOUT:\n{process.stdout}\nSTDERR:\n{process.stderr}")
+             return False, f"FFmpeg failed (Code {process.returncode}): {process.stderr[:500]}..."
+
+         if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
+             logger.error(f"Audio extraction failed: Output file not created or empty. FFmpeg stderr: {process.stderr}")
+             return False, f"Output audio file not created or empty. FFmpeg stderr: {process.stderr[:500]}..."
+
+         logger.info(f"Audio extracted successfully to {output_path}, size: {os.path.getsize(output_path)} bytes")
+         return True, ""
+     except Exception as e:
+         logger.exception(f"Exception during audio extraction from {video_path}")
+         return False, str(e)
+
+ def run_ffmpeg_with_subtitles(video_path, subtitle_path, output_path, style_options=None):
+     """Burn subtitles into a video using an ffmpeg subprocess.
+
+     Args:
+         video_path: Path to the input video
+         subtitle_path: Path to the ASS subtitle file
+         output_path: Path to save the output video
+         style_options: Optional style parameters (not used directly, but kept for consistency)
+
+     Returns:
+         tuple: (success, error_message)
+     """
+     logger.info(f"Attempting to burn subtitles from {subtitle_path} into {video_path}")
+
+     # Check that the subtitle file exists and is not empty
+     if not os.path.exists(subtitle_path) or os.path.getsize(subtitle_path) == 0:
+         return False, f"Subtitle file {subtitle_path} does not exist or is empty"
+
+     # Check that the video file exists
+     if not os.path.exists(video_path):
+         return False, f"Video file {video_path} does not exist"
+
+     # Validate the video file using ffprobe
+     try:
+         probe_cmd = [
+             "ffprobe", "-v", "error",
+             "-select_streams", "v:0",
+             "-show_entries", "stream=codec_name,width,height",
+             "-of", "json",
+             video_path
+         ]
+         probe_result = subprocess.run(
+             probe_cmd,
+             stdout=subprocess.PIPE,
+             stderr=subprocess.PIPE,
+             text=True,
+             encoding='utf-8'
+         )
+
+         if probe_result.returncode != 0:
+             logger.error(f"FFprobe validation failed: {probe_result.stderr}")
+             return False, f"FFprobe validation failed: {probe_result.stderr[:200]}..."
+     except Exception as e:
+         logger.exception(f"Exception during video validation: {video_path}")
+         return False, f"Video validation failed: {str(e)}"
+
+     try:
+         # The subtitle path must be escaped for the filtergraph;
+         # on Windows, backslashes need special handling
+         subtitle_path_esc = subtitle_path.replace('\\', '\\\\')
+
+         command = [
+             "ffmpeg",
+             "-i", video_path,
+             "-vf", f"ass='{subtitle_path_esc}'",
+             "-c:v", "libx264",  # H.264 for broad compatibility
+             "-preset", "medium",  # Balance between speed and quality
+             "-crf", "23",  # Reasonable quality setting (lower is better)
+             "-c:a", "aac",  # AAC audio
+             "-b:a", "128k",  # Decent audio bitrate
+             "-movflags", "+faststart",  # Optimize for web playback
+             "-y",  # Overwrite output if it exists
+             output_path
+         ]
+
+         logger.info(f"Running subtitle burn command: {' '.join(map(shlex.quote, command))}")
+
+         process = subprocess.run(
+             command,
+             stdout=subprocess.PIPE,
+             stderr=subprocess.PIPE,
+             text=True,
+             encoding='utf-8',
+             check=False
+         )
+
+         if process.returncode != 0:
+             logger.error(f"FFmpeg subtitle burn error (Code {process.returncode}):\nSTDOUT:\n{process.stdout}\nSTDERR:\n{process.stderr}")
+             return False, f"FFmpeg failed (Code {process.returncode}): {process.stderr[:500]}..."
+
+         # Verify the output file was created and is not empty
+         if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
+             logger.error(f"Subtitle burning failed: Output file not created or empty. FFmpeg stderr: {process.stderr}")
+             return False, f"Output video file not created or empty. FFmpeg stderr: {process.stderr[:500]}..."
+
+         logger.info(f"Subtitles burned successfully, output: {output_path}, size: {os.path.getsize(output_path)} bytes")
+         return True, ""
+
+     except Exception as e:
+         logger.exception(f"Exception during subtitle burning: {video_path}")
+         return False, str(e)
+
+ def safe_progress_update(progress_callback, value, desc=""):
+     """Safely update progress without crashing if progress_callback is None or fails."""
+     if progress_callback is not None:
+         try:
+             progress_callback(value, desc)
+         except Exception:
+             # Silently ignore progress-update errors to avoid flooding the logs
+             pass
+
+ def parse_srt_to_dialogue(srt_content):
+     """Basic SRT parser: convert content to a list of dialogue events for ASS conversion."""
+     dialogue = []
+     # Regex to find index, timecodes, and text blocks;
+     # allows comma or period as the milliseconds separator
+     pattern = re.compile(
+         r'^\s*(\d+)\s*$\n?'  # Index line
+         r'(\d{1,2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*'  # Start time
+         r'(\d{1,2}):(\d{2}):(\d{2})[,.](\d{3})\s*$\n'  # End time
+         r'(.*?)(?=\n\s*\n\d+\s*$|\Z)',  # Text block (non-greedy) until blank line + next index, or end of string
+         re.DOTALL | re.MULTILINE
+     )
+
+     logger.info("Attempting to parse SRT/VTT content...")
+     matches_found = 0
+     last_index = 0
+     for match in pattern.finditer(srt_content):
+         matches_found += 1
+         try:
+             index = int(match.group(1))
+             sh, sm, ss, sms = map(int, match.group(2, 3, 4, 5))
+             eh, em, es, ems = map(int, match.group(6, 7, 8, 9))
+             start_sec = sh * 3600 + sm * 60 + ss + sms / 1000.0
+             end_sec = eh * 3600 + em * 60 + es + ems / 1000.0
+             text = match.group(10).strip().replace('\n', '\\N')  # Replace newline with the ASS \N
+
+             # Basic validation
+             if end_sec < start_sec:
+                 logger.warning(f"SRT parse warning: End time {end_sec} before start time {start_sec} at index {index}. Skipping.")
+                 continue
+             if not text:
+                 logger.warning(f"SRT parse warning: Empty text content at index {index}. Skipping.")
+                 continue
+
+             dialogue.append({'start': start_sec, 'end': end_sec, 'text': text})
+             last_index = match.end()
+
+         except Exception as e:
+             logger.warning(f"Could not parse SRT block starting near index {match.group(1)}: {e}")
+
+     # Check whether parsing consumed a reasonable amount of the input
+     if matches_found > 0 and last_index < len(srt_content) * 0.8:
+         logger.warning(f"SRT parsing finished early. Found {matches_found} blocks, but stopped near character {last_index} of {len(srt_content)}. Input format might be inconsistent.")
+     elif matches_found == 0 and len(srt_content) > 10:
+         logger.error(f"SRT parsing failed. No dialogue blocks found in content starting with: {srt_content[:100]}...")
+
+     logger.info(f"Parsed {len(dialogue)} dialogue events from SRT/VTT content.")
+     return dialogue
+
+ def parse_ass_to_dialogue(ass_content):
+     """Basic ASS parser to extract dialogue events."""
+     dialogue = []
+     # Regex for an ASS Dialogue line - capturing groups are non-optional where possible
+     # Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
+     pattern = re.compile(
+         r'^Dialogue:\s*'
+         r'(?P<layer>\d+),\s*'
+         r'(?P<start>\d+:\d{2}:\d{2}\.\d{2}),\s*'
+         r'(?P<end>\d+:\d{2}:\d{2}\.\d{2}),\s*'
+         r'(?P<style>[^,]*),\s*'  # Style name
+         r'(?P<name>[^,]*),\s*'  # Actor name
+         r'(?P<marginL>\d+),\s*'
+         r'(?P<marginR>\d+),\s*'
+         r'(?P<marginV>\d+),\s*'
+         r'(?P<effect>[^,]*),\s*'  # Effect
+         r'(?P<text>.*?)$',  # Text (rest of line)
+         re.IGNORECASE
+     )
+
+     # Helper to convert H:MM:SS.cc to seconds
+     def time_to_seconds(time_str):
+         try:
+             parts = time_str.split(':')
+             h = int(parts[0])
+             m = int(parts[1])
+             s_parts = parts[2].split('.')
+             s = int(s_parts[0])
+             cs = int(s_parts[1])
+             return h * 3600 + m * 60 + s + cs / 100.0
+         except Exception as e:
+             logger.error(f"Failed to parse time string '{time_str}': {e}")
+             return 0.0  # Return 0 on failure to avoid crashing, but log it
+
+     logger.info("Attempting to parse ASS content...")
+     lines_parsed = 0
+     for line in ass_content.splitlines():
+         line = line.strip()
+         if not line.lower().startswith('dialogue:'):
+             continue
+
+         match = pattern.match(line)
+         if match:
+             lines_parsed += 1
+             try:
+                 start_sec = time_to_seconds(match.group('start'))
+                 end_sec = time_to_seconds(match.group('end'))
+                 text = match.group('text').strip()  # \N line breaks are kept as-is per the ASS spec
+
+                 if end_sec < start_sec:
+                     logger.warning(f"ASS parse warning: End time {end_sec} before start time {start_sec} in line: '{line}'. Skipping.")
+                     continue
+                 if not text:
+                     logger.warning(f"ASS parse warning: Empty text content in line: '{line}'. Skipping.")
+                     continue
+
+                 dialogue.append({'start': start_sec, 'end': end_sec, 'text': text})
+             except Exception as e:
+                 logger.warning(f"Could not parse ASS dialogue line: '{line}'. Error: {e}")
+         else:
+             logger.warning(f"ASS dialogue line did not match expected pattern: '{line}'")
+
+     if lines_parsed == 0 and len(ass_content) > 50:  # Only complain if the content was substantial
+         logger.error(f"ASS parsing failed. No dialogue lines matched the expected pattern in content starting with: {ass_content[:200]}...")
+
+     logger.info(f"Parsed {len(dialogue)} dialogue events from {lines_parsed} matched ASS lines.")
+     return dialogue
+
+ def process_video_with_captions(video, captions, caption_type, font_name, font_size,
+                                 primary_color, alignment, auto_caption):
+     """Main processing function."""
+     progress = gr.Progress(track_tqdm=True)
+     temp_dir = None
+     try:
+         progress(0, desc="Initializing...")
+         temp_dir = tempfile.mkdtemp()
+         logger.info(f"Created temp dir: {temp_dir}")
+
+         video_path = os.path.join(temp_dir, "input_video.mp4")
+         output_path = os.path.join(temp_dir, "output_video.mp4")
+         # Only the final subtitle file is needed
+         final_ass_path = os.path.join(temp_dir, "captions_final.ass")
+
+         # --- Handle Video Input ---
+         progress(0.05, desc="Saving video...")
+         import shutil
+         if hasattr(video, 'name') and video.name and os.path.exists(video.name):
+             shutil.copy(video.name, video_path)
+             logger.info(f"Copied input video from Gradio temp file {video.name} to {video_path}")
+         elif isinstance(video, str) and os.path.exists(video):
+             shutil.copy(video, video_path)
+             logger.info(f"Copied input video from path {video} to {video_path}")
+         else:
+             raise gr.Error("Could not access uploaded video file. Please try uploading again.")
+
+         # --- Prepare Styles ---
+         progress(0.1, desc="Preparing styles...")
+         generated_captions_display_text = ""
+         alignment_map = {"Bottom Center": 2, "Bottom Left": 1, "Bottom Right": 3}
+         style_options = {
+             'font_name': font_name,
+             'font_size': font_size,
+             'primary_color': primary_color,
+             'alignment': alignment_map.get(alignment, 2)
+         }
+
+         # --- Auto-Generate or Process Provided Captions ---
+         dialogue_events = []  # Holds {'start': float, 'end': float, 'text': str}
+
+         if auto_caption:
+             logger.info("Auto-generating captions...")
+             progress(0.15, desc="Extracting audio...")
+             audio_path = os.path.join(temp_dir, "audio.wav")
+             success, error_msg = extract_audio(video_path, audio_path)
+             if not success:
+                 raise gr.Error(f"Audio extraction failed: {error_msg}")
+
+             progress(0.25, desc="Transcribing audio...")
+             transcript = transcribe_audio(audio_path, progress=progress)
+             if not transcript or not transcript.get("segments"):
+                 raise gr.Error("No speech detected.")
+             dialogue_events = transcript["segments"]  # Use segments directly
+             progress(0.6, desc="Generating ASS captions...")
+
+         else:  # Use provided captions
+             logger.info(f"Using provided {caption_type} captions.")
+             if not captions or captions.strip() == "":
+                 raise gr.Error("Caption input is empty.")
+
+             progress(0.6, desc=f"Processing {caption_type} captions...")
+             if caption_type.lower() == 'ass':
+                 logger.info("Parsing provided ASS content.")
+                 dialogue_events = parse_ass_to_dialogue(captions)
+                 if not dialogue_events:
+                     raise gr.Error("Could not parse dialogue lines from provided ASS content.")
+             elif caption_type.lower() in ['srt', 'vtt']:
+                 logger.info(f"Parsing provided {caption_type} content.")
+                 dialogue_events = parse_srt_to_dialogue(captions)
+                 if not dialogue_events:
+                     raise gr.Error(f"Could not parse provided {caption_type} content.")
+             else:
+                 raise gr.Error(f"Unsupported caption type: {caption_type}")
+
+         # --- Generate Final ASS File ---
+         if not dialogue_events:
+             raise gr.Error("No caption dialogue events found or generated.")
+
+         logger.info(f"Generating final ASS file with {len(dialogue_events)} events and UI styles.")
+         final_ass_content = generate_ass_from_transcript(dialogue_events, style_options)
+         generated_captions_display_text = final_ass_content  # Show the final generated ASS
+
+         with open(final_ass_path, 'w', encoding='utf-8') as f:
+             f.write(final_ass_content)
+         logger.info(f"Written final styled ASS to {final_ass_path}")
+
+         # Verify file creation
+         if not os.path.exists(final_ass_path) or os.path.getsize(final_ass_path) == 0:
+             raise gr.Error(f"Internal error: Failed to write final ASS file to {final_ass_path}")
+
+         # --- Burn Subtitles ---
+         progress(0.7, desc="Burning subtitles into video...")
+         success, error_msg = run_ffmpeg_with_subtitles(
+             video_path, final_ass_path, output_path, style_options
+         )
+         if not success:
+             logger.error(f"Subtitle burning failed. Video: {video_path}, ASS: {final_ass_path}")
+             raise gr.Error(f"FFmpeg failed to burn subtitles: {error_msg}")
+
+         progress(1.0, desc="Processing complete!")
+         logger.info(f"Output video generated: {output_path}")
+
+         return output_path, generated_captions_display_text
+
+     except Exception as e:
+         logger.exception("Error in process_video_with_captions")
+         if temp_dir and os.path.exists(temp_dir):
+             try:
+                 files = os.listdir(temp_dir)
+                 logger.error(f"Files in temp dir {temp_dir} during error: {files}")
+             except Exception as list_e:
+                 logger.error(f"Could not list temp dir {temp_dir}: {list_e}")
+         if isinstance(e, gr.Error):
+             raise
+         raise gr.Error(f"An unexpected error occurred: {str(e)}")
+
+ # Function to toggle interactivity
+ def toggle_captions_input(auto_generate):
+     """Toggle the interactivity of the captions input."""
+     return gr.update(interactive=not auto_generate)
+
+ # --- Gradio Interface ---
+ with gr.Blocks(title="Video Caption Generator") as app:
+     gr.Markdown("## Video Caption Generator")
+     gr.Markdown("Upload a video, choose styling, and add captions. Use auto-generation or provide your own SRT/ASS/VTT.")
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             gr.Markdown("**Input & Options**")
+             video_input = gr.Video(label="Upload Video")
+             auto_caption = gr.Checkbox(label="Auto-generate captions (overrides the field below)", value=False)
+             captions_input = gr.Textbox(
+                 label="Or Enter Captions Manually",
+                 placeholder="1\n00:00:01,000 --> 00:00:05,000\nHello World\n\n2\n...",
+                 lines=8,
+                 interactive=True
+             )
+             caption_type = gr.Dropdown(
+                 choices=["srt", "ass", "vtt"],
+                 value="srt",
+                 label="Format (if providing captions manually)"
+             )
+
+             gr.Markdown("**Caption Styling** (applied to auto-generated or converted ASS)")
+             with gr.Row():
+                 font_name = gr.Dropdown(
+                     choices=ACCEPTABLE_FONTS,
+                     value=ACCEPTABLE_FONTS[0] if ACCEPTABLE_FONTS else "Arial",
+                     label="Font"
+                 )
+                 font_size = gr.Slider(minimum=10, maximum=60, value=24, step=1, label="Font Size")
+             with gr.Row():
+                 primary_color = gr.ColorPicker(value="#FFFFFF", label="Text Color")
+                 alignment = gr.Dropdown(
+                     choices=["Bottom Center", "Bottom Left", "Bottom Right"],
+                     value="Bottom Center",
+                     label="Alignment"
+                 )
+
+             process_btn = gr.Button("Generate Captioned Video", variant="primary")
+
+         with gr.Column(scale=1):
+             gr.Markdown("**Output**")
+             video_output = gr.Video(label="Captioned Video")
+             generated_captions_output = gr.Textbox(
+                 label="Generated Captions (ASS format if auto-generated)",
+                 lines=10,
+                 interactive=False
+             )
+
+     # Link the checkbox to the captions input's interactivity
+     auto_caption.change(
+         fn=toggle_captions_input,
+         inputs=[auto_caption],
+         outputs=[captions_input]
+     )
+
+     # Wire the main processing function to the button
+     process_btn.click(
+         fn=process_video_with_captions,
+         inputs=[
+             video_input,
+             captions_input,
+             caption_type,
+             font_name,
+             font_size,
+             primary_color,
+             alignment,
+             auto_caption
+         ],
+         outputs=[video_output, generated_captions_output],
+         # api_name="generate_captions"
+     )
+
+ # Launch the app
+ if __name__ == "__main__":
+     app.launch(debug=True, share=False)  # Debug enabled for local testing
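A detail of app.py worth calling out is the colour handling: the UI's `#RRGGBB` picker value is converted into the BGR-ordered value that ASS styles expect. As a standalone sketch of that conversion (an adaptation of the `hex_to_ass_bgr` helper above, written here with an explicit `00` alpha byte, meaning fully opaque):

```python
def hex_to_ass_bgr(hex_color: str) -> str:
    """Convert a #RRGGBB colour-picker value to an ASS &H00BBGGRR colour.

    ASS stores colours as &HAABBGGRR: alpha first, then the channels
    in blue-green-red order; an alpha of 00 means fully opaque.
    """
    hex_color = hex_color.lstrip('#')
    if len(hex_color) != 6:
        return '&H00FFFFFF'  # Fall back to opaque white on malformed input
    r, g, b = tuple(int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return f"&H00{b:02X}{g:02X}{r:02X}"

print(hex_to_ass_bgr('#FF8000'))  # orange: blue=00, green=80, red=FF
```

Note how the red and blue channels swap positions relative to the web hex form, which is the usual source of "wrong colour" bugs when writing ASS styles by hand.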
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ gradio>=3.50.2
+ ffmpeg-python>=0.2.0
+ opencv-python-headless>=4.8.0
+ numpy>=1.22.0
+ openai-whisper>=20231117
+ tqdm>=4.66.0
+ torch>=2.0.0
+ transformers>=4.35.0
+ pathlib>=1.0.1
setup.sh ADDED
@@ -0,0 +1,24 @@
+ #!/bin/bash
+
+ # Install FFmpeg if not already installed
+ if ! command -v ffmpeg &> /dev/null
+ then
+     echo "FFmpeg not found, installing..."
+     apt-get update && apt-get install -y ffmpeg
+ else
+     echo "FFmpeg is already installed"
+ fi
+
+ # Install FFprobe if not already installed (it ships with FFmpeg, but check to be safe)
+ if ! command -v ffprobe &> /dev/null
+ then
+     echo "FFprobe not found, installing..."
+     apt-get update && apt-get install -y ffmpeg
+ else
+     echo "FFprobe is already installed"
+ fi
+
+ # Make sure the scripts have appropriate permissions in case they need execution
+ chmod -R 755 .
+
+ echo "Setup complete!"