	---
	license: mit
	datasets:
	- abdallahwagih/ucf101-videos
	metrics:
	- accuracy
	base_model:
	- google/mobilenet_v2_1.0_224
	pipeline_tag: video-classification

	tags:
	- action-recognition
	- cnn-gru
	- video-classification
	- ucf101
	- action
	- mobilenetv2
	- deep-learning
	- pytorch
	---

	# Action Detection with CNN-GRU on MobileNetV2

	## Overview

	This model classifies human actions in videos using a CNN-GRU architecture: MobileNetV2 (1.0, 224) extracts features from each frame, and a GRU aggregates them over time. It was trained on the [UCF101](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos) dataset.
	It is best suited to recognizing actions in short, trimmed video clips.

	***

	## Model Details

	- Base model: `google/mobilenet_v2_1.0_224`
	- Architecture: CNN-GRU

	![CNN-GRU Architecture](./cnn_architecture.png)

	- Dataset: [UCF101 - Action Recognition Dataset](https://www.kaggle.com/datasets/abdallahwagih/ucf101-videos)
	- Task: Video Classification (Action Recognition)
	- Metrics: Accuracy
	- License: MIT

	***

	## Usage

	### Requirements

	```bash
	pip install torch torchvision opencv-python
	```

	### Example Code

	```python
	from action_model import load_action_model, preprocess_frames, predict_action
	import cv2

	# Load model
	model = load_action_model(model_path="best_model.pt", device="cpu", num_classes=5)

	# Read frames from video
	cap = cv2.VideoCapture("path_to_video.mp4")
	frames = []
	while True:
	    ret, frame = cap.read()
	    if not ret:  # no more frames (or the file could not be read)
	        break
	    frames.append(frame)
	cap.release()

	# Preprocess frames for model input
	clip_tensor = preprocess_frames(frames[:16], seq_len=16, resize=(112,112))

	# Predict action
	result = predict_action(model, clip_tensor, device="cpu")
	print(result)
	```
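
The example above passes the first 16 frames of the video to the model. For clips where the action happens later, one common alternative is to sample frames evenly across the whole video. The sketch below shows that idea in plain Python; `uniform_frame_indices` and `sample_frames` are hypothetical helpers for illustration and are not part of `action_model`.

```python
from __future__ import annotations


def uniform_frame_indices(num_frames: int, seq_len: int = 16) -> list[int]:
    """Return seq_len indices spread evenly over a video of num_frames frames."""
    if num_frames <= 0:
        raise ValueError("video has no frames")
    if seq_len <= 0:
        raise ValueError("seq_len must be positive")
    # Take the centre of each of seq_len equal-width segments.
    # Videos shorter than seq_len repeat frames, so the length is always seq_len.
    return [
        min(num_frames - 1, int((i + 0.5) * num_frames / seq_len))
        for i in range(seq_len)
    ]


def sample_frames(frames: list, seq_len: int = 16) -> list:
    """Pick seq_len evenly spaced items from a list of decoded frames."""
    return [frames[i] for i in uniform_frame_indices(len(frames), seq_len)]
```

With these helpers, `sample_frames(frames, 16)` could be used in place of `frames[:16]` in the example above, assuming `preprocess_frames` accepts any list of 16 frames.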

	***

	## Training & Evaluation

	- Trained on UCF101 split 1 with a MobileNetV2 backbone.
	- Sequence length: 16 frames per clip.
	- Metric: Top-1 classification accuracy.
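
For reference, top-1 accuracy is the fraction of clips whose highest-scoring class matches the ground-truth label. The sketch below computes it in plain Python; the function name `top1_accuracy` and the list-of-scores input format are assumptions for this example, not the evaluation code used for this model.

```python
def top1_accuracy(scores_per_clip, labels):
    """Fraction of clips whose highest-scoring class equals the true label."""
    if len(scores_per_clip) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one clip is required")
    correct = 0
    for scores, label in zip(scores_per_clip, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)
```

Each inner list holds one score per class for a single clip, and `labels` holds the matching integer class indices.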

	***

	## Intended Use & Limitations

	Intended for:
	- Video analytics
	- Educational research
	- Baseline for video action recognition tasks

	Limitations:
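	- Predicts only the subset of UCF101 classes it was trained on (the usage example loads a checkpoint with `num_classes=5`)
	- Expects short, trimmed video clips (16 frames per clip)
	- Not robust to out-of-domain videos or very low-resolution input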

	***

	## Tags

	`action` · `cnn-gru` · `video-classification` · `ucf101` · `mobilenetv2` · `deep-learning` · `pytorch`