# Advanced Speech Processing with faster-whisper
Welcome to the advanced speech processing utility built on the Whisper large-v2 model converted for the CTranslate2
inference engine. This tool is designed for high-performance speech recognition, supporting a wide array of
languages and handling video inputs for slide detection and audio transcription.
## Features
- **Language Support**: Extensive language support covering major global languages for speech recognition tasks.
- **Video Processing**: Download MP4 files from links and extract audio content for transcription.
- **Slide Detection**: Detect and sort presentation slides from video lectures or meetings.
- **Audio Transcription**: Leverage the Whisper large-v2 model to transcribe audio content with high accuracy.
## Getting Started
To begin using this utility, set up the `WhisperModel` from the `faster_whisper` package with the provided language
configurations. The `EndpointHandler` class is your main interface for processing the data.
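The handler's internals are not shown in this README; as a rough illustration, a minimal sketch of how a handler might route requests based on the payload's `type` field (the function name and branches below are illustrative assumptions, not the actual implementation):

```python
# Illustrative sketch only: the real EndpointHandler wraps the Whisper
# large-v2 model. The branch values mirror the payload schema used below.
def route_request(data: dict) -> str:
    """Pick a processing path based on the payload's "type" field."""
    request_type = data.get("type")
    if request_type == "audio":
        # "inputs" carries a base64-encoded audio string
        return "transcribe_audio"
    if request_type == "link":
        # "link" carries an MP4 URL: download, extract audio, detect slides
        return "process_video"
    raise ValueError(f"Unsupported request type: {request_type!r}")
```

For example, `route_request({"type": "link", "link": "..."})` would select the video-processing path.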
### Example Usage
```python
import os

import requests

# Sample payload: the base64-encoded audio (or a video link) plus the
# desired language and task for transcription.
DATA = {
    "inputs": "<base64_encoded_audio_string>",
    "link": "<your_mp4_video_link>",
    "language": "en",  # choose from the supported languages
    "task": "transcribe",
    "type": "audio",  # use "link" for video files
}

HF_ACCESS_TOKEN = os.environ.get("HF_TRANSCRIPTION_ACCESS_TOKEN")
API_URL = os.environ.get("HF_TRANSCRIPTION_ENDPOINT")

HEADERS = {
    # Hugging Face endpoints expect a bearer token
    "Authorization": f"Bearer {HF_ACCESS_TOKEN}",
    "Content-Type": "application/json",
}

response = requests.post(API_URL, headers=HEADERS, json=DATA)
print(response.json())

# The response contains the transcription and, if a video link was
# provided, the detected slides.
```
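The `inputs` field expects the raw audio bytes as a base64 string. Encoding a local file for the payload can be done with the standard library (the filename here is a placeholder, not a file shipped with this project):

```python
import base64


def encode_audio(path: str) -> str:
    """Read an audio file and return its contents as a base64 text string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


# DATA["inputs"] = encode_audio("lecture.mp3")
```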
### Processing Video Files
To process video files, the `process_video` function downloads the MP4 file, extracts the audio, and passes it to the
Whisper model for transcription. It also utilizes the `Detector` and `SlideSorter` classes to identify and sort
presentation slides within the video.
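The `Detector` and `SlideSorter` internals are not documented here, but slide-change detection generally reduces to comparing consecutive frames. A simplified sketch using the mean absolute pixel difference between frames (the function name and threshold are assumptions for illustration, not the actual classes):

```python
import numpy as np


def detect_slide_changes(frames: list, threshold: float = 10.0) -> list:
    """Return indices of frames that differ enough from their predecessor
    to suggest a slide transition, using mean absolute pixel difference."""
    changes = []
    for i in range(1, len(frames)):
        # Cast to a signed type so the subtraction does not wrap around
        diff = np.mean(
            np.abs(frames[i].astype(np.int16) - frames[i - 1].astype(np.int16))
        )
        if diff > threshold:
            changes.append(i)
    return changes
```

In practice the frames would be sampled from the downloaded video (e.g. with OpenCV, which is among the dependencies) rather than held in memory all at once.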
### Error Handling
Comprehensive logging and error handling are in place to ensure you're informed of each step's success or failure.
## Installation
Ensure that you have the following dependencies installed (in addition to the `faster-whisper` package itself):
```plaintext
opencv-python~=4.8.1.78
numpy~=1.26.1
Pillow~=10.0.1
tqdm~=4.66.1
requests~=2.31.0
moviepy~=1.0.3
scipy~=1.11.3
```
Install them using pip with the provided `requirements.txt` file:
```bash
pip install -r requirements.txt
```
## Languages Supported
This tool supports a wide range of languages, making it suitable for global applications. The full list of
supported languages can be found in the `language` section of the old README.
## License
This project is available under the MIT license.
## More Information
For more information about the original Whisper large-v2 model, please refer to
its [model card on Hugging Face](https://huggingface.co/openai/whisper-large-v2).
---