import json
import os
import cv2

from transformers import BlipProcessor, BlipForConditionalGeneration

# model_id = "Salesforce/blip-image-captioning-base"
model_id = "Salesforce/blip-image-captioning-large"
captioning_processor = BlipProcessor.from_pretrained(model_id)
captioning_model = BlipForConditionalGeneration.from_pretrained(model_id)


def extract_frames(video_path, output_folder, interval_ms=2000) -> None:
    """
    Extracts frames from a video into an output folder at a specified time
    interval. Frames are saved as *.jpg images.

    Args:
        video_path:     Path to the video file to sample.
        output_folder:  The output directory for the extracted frames.
        interval_ms:    The sampling interval in milliseconds.
                        NOTE: No anti-aliasing filter is applied.
    """
    os.makedirs(output_folder, exist_ok=True)

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Frames to skip between samples; clamp to at least 1 so the modulo
    # check below cannot divide by zero when fps is reported as 0
    interval_frames = max(1, int(fps * interval_ms * 0.001))

    frame_count = 0
    saved_frame_count = 0

    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Keep only selected frames
        if frame_count % interval_frames == 0:
            frame_filename = os.path.join(
                output_folder,
                f"frame_{saved_frame_count:04d}.jpg"
            )
            cv2.imwrite(frame_filename, frame)
            saved_frame_count += 1

        frame_count += 1

    cap.release()

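Both functions derive the sampling stride with the same arithmetic; as a minimal standalone sketch (the helper name `sampling_stride` is hypothetical, not part of this module), the conversion from milliseconds to a frame count works like this:

```python
def sampling_stride(fps: float, interval_ms: int = 2000) -> int:
    """Number of frames to skip between samples, clamped to at least 1
    so a modulo test against it can never divide by zero."""
    return max(1, int(fps * interval_ms * 0.001))

# At 30 fps with a 2000 ms interval, every 60th frame is kept.
print(sampling_stride(30.0))       # 60
print(sampling_stride(0.0))        # clamped to 1
```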

def extract_frame_captions(
    video_path,
    interval_ms=2000
) -> str:
    """
    Extracts frame captions from a video at a specified time
    interval.

    Args:
        video_path:     Path to the video file to sample.
        interval_ms:    The sampling interval in milliseconds.
                        NOTE: No anti-aliasing filter is applied.

    Returns:
        A JSON-encoded list of frame captions, one per sampled frame.
    """
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Could not open video: {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS)
    # Frames to skip between samples; clamp to at least 1 so the modulo
    # check below cannot divide by zero when fps is reported as 0
    interval_frames = max(1, int(fps * interval_ms * 0.001))

    frame_count = 0
    saved_frame_count = 0

    captions = []
    while True:
        ret, frame = cap.read()
        if not ret:
            break

        # Keep only selected frames
        if frame_count % interval_frames == 0:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            inputs = captioning_processor(
                frame,
                return_tensors="pt"
            )
            out = captioning_model.generate(**inputs)
            cur_caption = captioning_processor.decode(
                out[0], skip_special_tokens=True
            )
            captions.append(cur_caption)
            saved_frame_count += 1

        frame_count += 1

    cap.release()
    return json.dumps(captions)
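Because `extract_frame_captions` returns a JSON string rather than a list, callers need to decode it. A usage sketch, assuming a video named `input.mp4` exists in the working directory (the path is hypothetical):

```python
import json
import os

if __name__ == "__main__":
    demo_video = "input.mp4"  # hypothetical path; replace with a real file
    if os.path.exists(demo_video):
        # Decode the JSON payload back into a Python list of captions
        captions = json.loads(extract_frame_captions(demo_video))
        for i, caption in enumerate(captions):
            print(f"frame {i}: {caption}")
```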