Action Detection with CNN-GRU on MobileNetV2

Overview

This model classifies human actions in video using a CNN-GRU architecture: a MobileNetV2 (1.0, 224) backbone extracts per-frame features, and a GRU aggregates them over time. It was trained on the UCF101 dataset.
It is best suited to recognizing actions in short, trimmed video clips.

Model Details

Base model: google/mobilenet_v2_1.0_224
Architecture: CNN-GRU
Dataset: UCF101 - Action Recognition Dataset (https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos)
Task: Video Classification (Action Recognition)
Metrics: Accuracy
License: MIT

Usage

Requirements

pip install torch torchvision opencv-python

Example Code

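import cv2

from action_model import load_action_model, preprocess_frames, predict_action

VIDEO_PATH = "path_to_video.mp4"
SEQ_LEN = 16  # frames per clip, matching the training sequence length

# Load model
model = load_action_model(model_path="best_model.pt", device="cpu", num_classes=5)

# Read the first SEQ_LEN frames from the video
cap = cv2.VideoCapture(VIDEO_PATH)
if not cap.isOpened():
    raise IOError(f"Could not open video: {VIDEO_PATH}")
frames = []
while len(frames) < SEQ_LEN:
    ret, frame = cap.read()
    if not ret:
        break  # end of video or read error
    frames.append(frame)  # OpenCV returns frames as BGR arrays
cap.release()
if not frames:
    raise ValueError(f"No frames could be read from: {VIDEO_PATH}")

# Preprocess frames for model input
clip_tensor = preprocess_frames(frames, seq_len=SEQ_LEN, resize=(112, 112))

# Predict action
result = predict_action(model, clip_tensor, device="cpu")
print(result)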
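
The example above feeds the model the first 16 frames of the video, which covers only the opening moments of a longer clip. A common alternative is to spread the 16 frames evenly across the whole clip. The helper below is a hypothetical sketch of that idea (sample_frame_indices is not part of action_model), and whether it matches how the model was trained is an assumption to verify.

```python
from typing import List


def sample_frame_indices(num_frames: int, seq_len: int = 16) -> List[int]:
    """Return seq_len frame indices spread evenly over range(num_frames)."""
    if num_frames <= 0 or seq_len <= 0:
        raise ValueError("num_frames and seq_len must be positive")
    # Take the centre of each of seq_len equal segments.
    # Videos shorter than seq_len repeat some frames.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / seq_len))
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    print(sample_frame_indices(32, 16))
```

With all frames of a clip loaded into a list, the selection would be `[frames[i] for i in sample_frame_indices(len(frames))]`.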

Training & Evaluation

Trained on UCF101 split 1 with a MobileNetV2 backbone.
Sequence length: 16 frames per clip.
Metric: Top-1 classification accuracy.
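
For reference, top-1 accuracy is the fraction of clips whose highest-scoring class equals the ground-truth label. The function below is a small illustrative sketch in plain Python; its name and input layout (one list of class scores per clip) are assumptions, not the evaluation code used for this model.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of samples whose arg-max class matches the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score is the predicted class.
        predicted = max(range(len(row)), key=row.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


if __name__ == "__main__":
    demo_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
    demo_labels = [1, 0, 0]
    print(top1_accuracy(demo_scores, demo_labels))
```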

Intended Use & Limitations

Intended for:

Video analytics
Educational research
Baseline for video action recognition tasks

Limitations:

Predicts only the subset of UCF101 classes it was trained on
Expects short, trimmed video clips
Not robust to out-of-domain videos or very low-resolution input
