# Advanced Speech Processing with faster-whisper

Welcome to the advanced speech processing utility, built on the Whisper large-v2 model running on the CTranslate2
inference engine via `faster-whisper`. This tool is designed for high-performance speech recognition and processing; it
supports a wide range of languages and can handle video inputs for slide detection and audio transcription.

## Features

- **Language Support**: Extensive language support covering major global languages for speech recognition tasks.
- **Video Processing**: Download MP4 files from links and extract audio content for transcription.
- **Slide Detection**: Detect and sort presentation slides from video lectures or meetings.
- **Audio Transcription**: Leverage the Whisper large-v2 model to transcribe audio content with high accuracy.

## Getting Started

To begin using this utility, set up the `WhisperModel` from the `faster_whisper` package with the provided language
configurations. The `EndpointHandler` class is your main interface for processing the data.
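For orientation, here is a sketch of the underlying `faster_whisper` call, using that package's documented `WhisperModel` API. The helper name, model size, and device settings are illustrative choices, not taken from this project; the hosted endpoint presumably wraps a similar call inside its `EndpointHandler`.

```python
# Hypothetical helper showing direct, local use of faster-whisper.
def transcribe_file(path, language="en"):
    """Transcribe a local audio file with the Whisper large-v2 model."""
    from faster_whisper import WhisperModel  # lazy import: heavy dependency

    # device/compute_type are illustrative; use device="cpu", compute_type="int8"
    # on machines without a CUDA GPU.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, info = model.transcribe(path, language=language, task="transcribe")
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

`transcribe` returns a lazy generator of segments plus an info object; materializing it into a list, as above, triggers the actual decoding.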

### Example Usage

```python
import requests
import os

# Sample payload: base64-encoded audio (or a link to an MP4 video) plus the desired language
DATA = {
    "inputs": "<base64_encoded_audio_string>",
    "link": "<your_mp4_video_link>",
    "language": "en",  # Choose from supported languages
    "task": "transcribe",
    "type": "audio"  # Use "link" for video files
}

HF_ACCESS_TOKEN = os.environ.get("HF_TRANSCRIPTION_ACCESS_TOKEN")
API_URL = os.environ.get("HF_TRANSCRIPTION_ENDPOINT")

HEADERS = {
    "Authorization": f"Bearer {HF_ACCESS_TOKEN}",
    "Content-Type": "application/json"
}

response = requests.post(API_URL, headers=HEADERS, json=DATA)
print(response.json())

# The response will contain transcribed audio and detected slides if a video link was provided
```
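The `inputs` field above expects base64-encoded audio bytes. A minimal helper to produce that string from a local file (the function name is our own, not part of the utility):

```python
import base64


def encode_audio(path):
    """Read a local audio file and return the base64 string for the "inputs" field."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

Pass the returned string as `DATA["inputs"]` with `"type": "audio"`.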

### Processing Video Files

To process video files, the `process_video` function downloads the MP4 file, extracts the audio, and passes it to the
Whisper model for transcription. It also utilizes the `Detector` and `SlideSorter` classes to identify and sort
presentation slides within the video.
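The `Detector`'s internals aren't shown here, but slide detection in lecture video typically reduces to thresholding the difference between consecutive frames: a large jump suggests a slide change. A simplified, hypothetical sketch of that idea, with frames as flat lists of pixel intensities rather than the OpenCV arrays the real classes would use:

```python
def detect_slide_changes(frames, threshold=0.1):
    """Return frame indices where a new slide likely appears.

    frames: list of equal-length sequences of pixel intensities in [0, 1].
    A change is flagged when the mean absolute difference between a frame
    and its predecessor exceeds the threshold.
    """
    changes = []
    for i in range(1, len(frames)):
        prev, curr = frames[i - 1], frames[i]
        mean_diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        if mean_diff > threshold:
            changes.append(i)
    return changes
```

A `SlideSorter` would then order and deduplicate the frames flagged by such a detector.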

### Error Handling

Comprehensive logging and error handling are in place to ensure you're informed of each step's success or failure.
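The handler's exact logging setup isn't shown, but the pattern described, per-step logging with failures reported rather than raised, can be sketched as follows (function and logger names are hypothetical):

```python
import logging

logger = logging.getLogger("speech_processing")


def run_step(name, fn, *args, **kwargs):
    """Run one pipeline step, logging its outcome and returning a status dict."""
    try:
        result = fn(*args, **kwargs)
        logger.info("step %r succeeded", name)
        return {"ok": True, "result": result}
    except Exception as exc:
        logger.exception("step %r failed", name)
        return {"ok": False, "error": str(exc)}
```

Wrapping each stage (download, audio extraction, transcription, slide detection) this way lets a partial failure surface in the response instead of aborting the whole request.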

## Installation

Ensure that you have the following dependencies installed, in addition to `faster-whisper` itself:

```plaintext
opencv-python~=4.8.1.78
numpy~=1.26.1
Pillow~=10.0.1
tqdm~=4.66.1
requests~=2.31.0
moviepy~=1.0.3
scipy~=1.11.3
```

Install them using pip with the provided `requirements.txt` file:

```bash
pip install -r requirements.txt
```

## Languages Supported

This tool supports a wide range of languages, making it highly versatile for global applications. The full list of
supported languages can be found in the `language` section of the old README.

## License

This project is available under the MIT license.

## More Information

For more information about the original Whisper large-v2 model, please refer to
its [model card on Hugging Face](https://huggingface.co/openai/whisper-large-v2).

---