---
license: mit
datasets:
- abdallahwagih/ucf101-videos
metrics:
- accuracy
base_model:
- google/mobilenet_v2_1.0_224
pipeline_tag: video-classification
tags:
- action-recognition
- cnn-gru
- video-classification
- ucf101
- action
- mobilenetv2
- deep-learning
- pytorch
---

# Action Detection with CNN-GRU on MobileNetV2

## Overview

This model performs human action classification on videos using a CNN-GRU architecture built on top of **MobileNetV2 (1.0, 224)** features and trained on the [UCF101](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos) dataset.

It is well-suited for recognizing actions from short trimmed video clips.

***

## Model Details

- **Base model:** `google/mobilenet_v2_1.0_224`
- **Architecture:** CNN-GRU



- **Dataset:** [UCF101 - Action Recognition Dataset](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos)
- **Task:** Video Classification (Action Recognition)
- **Metrics:** Accuracy
- **License:** MIT

***

## Usage

### Requirements

```bash
pip install torch torchvision opencv-python
```

### Example Code

```python
import cv2

from action_model import load_action_model, preprocess_frames, predict_action

SEQ_LEN = 16  # frames per clip, matching the training sequence length

# Load the model; num_classes must match the checkpoint
model = load_action_model(model_path="best_model.pt", device="cpu", num_classes=5)

# Read the first SEQ_LEN frames from the video (OpenCV returns BGR arrays)
cap = cv2.VideoCapture("path_to_video.mp4")
if not cap.isOpened():
    raise IOError("Could not open path_to_video.mp4")
frames = []
while len(frames) < SEQ_LEN:
    ret, frame = cap.read()
    if not ret:
        break
    frames.append(frame)
cap.release()

# Preprocess frames for model input
clip_tensor = preprocess_frames(frames, seq_len=SEQ_LEN, resize=(112, 112))

# Predict action
result = predict_action(model, clip_tensor, device="cpu")
print(result)
```
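
The example above uses only the first 16 frames of the video. For longer videos, one alternative is to spread the 16 frames evenly across the whole clip. The helper below is a hypothetical sketch of that index selection: the name `sample_frame_indices` is not part of `action_model`, and whether uniform sampling matches how this model was trained is an assumption that should be checked before relying on it.

```python
from typing import List


def sample_frame_indices(num_frames: int, seq_len: int = 16) -> List[int]:
    """Return seq_len frame indices spread evenly from the first to the last frame.

    Videos shorter than seq_len are handled by repeating indices.
    """
    if num_frames <= 0:
        raise ValueError("video has no frames")
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    if seq_len == 1:
        return [0]
    last = num_frames - 1
    # Integer arithmetic keeps the first index at 0 and the last at num_frames - 1
    return [(i * last) // (seq_len - 1) for i in range(seq_len)]
```

With the `frames` list from the example, the selected clip would be `[frames[i] for i in sample_frame_indices(len(frames))]`, provided all frames were read rather than only the first 16.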

***

## Training & Evaluation

- Trained on UCF101 split 1 with a MobileNetV2 backbone.
- Sequence length: 16 frames per clip.
- Metric: Top-1 classification accuracy.
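
Top-1 accuracy counts a clip as correct when its highest-scoring class equals the ground-truth label. A minimal sketch of the computation, assuming per-clip class scores are available as plain Python lists (the function name `top1_accuracy` and the input layout are illustrative and not taken from the training code):

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of clips whose highest-scoring class matches the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("at least one clip is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this clip's row
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)
```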

***

## Intended Use & Limitations

**Intended for:**

- Video analytics
- Educational research
- Baseline for video action recognition tasks

**Limitations:**

- Predicts only the subset of UCF101 classes it was trained on
- Needs short, trimmed video clips
- Not robust to out-of-domain videos or very low-resolution input

***

## Tags

`action` · `cnn-gru` · `video-classification` · `ucf101` · `mobilenetv2` · `deep-learning` · `torch`