Detailed Notes on Real-Time Hand Gesture Recognition Project

Overview of Real-Time Hand Gesture Recognition

  • Team Introduction: Aditi, Gloria, Kunal
  • Focus: Exploring, experimenting, debugging for a final project on real-time hand gesture recognition.
  • Scope of the Presentation:
    • Review original research.
    • Discuss dataset utilized.
    • Improvements made to the model.
    • Results achieved.
    • Real-world applications and ethical considerations in AI.

Importance of Hand Gesture Recognition

  • Significance: Vital tool for enhancing human-computer interaction in various applications:
    • Virtual reality navigation
    • Surgical robot control
    • Sign language interpretation
  • Challenges in Accuracy:
    • Dynamic Nature of Gestures: Variance in gestures among individuals (e.g., shape, speed, angle).
    • Real-Time Performance Requirement: Latency can disrupt the user experience or hinder assistive technologies.
    • Environmental Variability: Impact of lighting, background movement, and occlusions on model performance.

The Central Challenge

  • Question Addressed: How can we improve accuracy and generalizability of hand gesture recognition models using accessible resources while ensuring efficiency for real-time use?
  • Reference Research: Influential NVIDIA paper (2016): "Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D CNNs."
  • Modeling Framework: Two-stage deep learning framework:
    • Gesture Detector: Shallow 3D CNN for gesture detection in video streams.
    • Gesture Classifier: Deep ResNeXt-101 network for gesture classification.
    • Key Feature: Early classification allows gesture detection before completion.
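The early-classification idea above can be sketched as a sliding window over the frame stream, where a cheap detector gates a heavier classifier. This is only an illustration: detector_score and classifier_probs below are hypothetical placeholders standing in for the paper's shallow 3D CNN and ResNeXt-101 networks, and the "frames" are reduced to scalar summaries for brevity.

```python
import numpy as np

def detector_score(clip):
    """Hypothetical stand-in for the shallow 3D CNN detector:
    returns a gesture/no-gesture score for a short clip."""
    return float(clip.mean() > 0.5)  # placeholder logic

def classifier_probs(clip, num_classes=5):
    """Hypothetical stand-in for the deep classifier:
    returns class probabilities for a clip."""
    idx = int(clip.mean() * (num_classes - 1)) % num_classes
    p = np.full(num_classes, 0.05)
    p[idx] = 1.0
    return p / p.sum()  # placeholder logic

def early_classify(stream, window=8, det_thresh=0.5, cls_thresh=0.6):
    """Slide over the frame stream; run the heavy classifier only while
    the cheap detector fires, and emit a label as soon as the classifier
    is confident -- i.e. possibly before the gesture finishes."""
    for t in range(window, len(stream) + 1):
        clip = stream[t - window:t]
        if detector_score(clip) < det_thresh:
            continue
        probs = classifier_probs(clip)
        if probs.max() >= cls_thresh:
            return t, int(probs.argmax())  # frame index, predicted class
    return None  # no confident gesture found
```

The key property shown is that the classifier can commit at frame t, before the full gesture window has played out, which is what makes the two-stage design suitable for real-time use.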

Dataset Utilized and Results

  • Datasets: EgoGesture dataset and NVIDIA Dynamic Hand Gesture dataset, each with thousands of samples.
  • Performance Metrics: Model achieved over 90% classification accuracy and reduced inference times.
  • Limitations Noted: Requires extensive computational resources, and the full source code is not publicly available.

Implementation Using Keras ImageDataGenerator

  • Baseline Model: Initial setup using the IPN Hand dataset, with over 4,000 labeled gestures (video segments paired with annotations).
  • Early Model Limitations: No gesture detection stage or data augmentation; the model struggled with overfitting and lacked temporal modeling.
  • Data Augmentation Strategy: Used the Keras ImageDataGenerator class to apply transformations such as:
    • Random rotations.
    • Brightness variation.
    • Horizontal flips.
    • Width and height shifts.
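The transformations listed above can be configured in ImageDataGenerator roughly as follows; this is a minimal sketch, and the ranges, image sizes, and class count are illustrative choices, not the project's exact settings:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation matching the transformations listed above;
# the ranges here are illustrative, not the project's exact values.
datagen = ImageDataGenerator(
    rotation_range=15,            # random rotations (degrees)
    brightness_range=(0.8, 1.2),  # brightness variation
    horizontal_flip=True,         # horizontal flips
    width_shift_range=0.1,        # width shifts (fraction of width)
    height_shift_range=0.1,       # height shifts (fraction of height)
    rescale=1.0 / 255,            # normalize pixel values
)

# Example: stream augmented batches from an in-memory array of frames.
frames = np.random.randint(0, 256, size=(32, 64, 64, 3)).astype("float32")
labels = np.random.randint(0, 14, size=(32,))  # illustrative class count
batches = datagen.flow(frames, labels, batch_size=8)
x_batch, y_batch = next(batches)  # one augmented batch per iteration
```

Because the transformations are applied on the fly, every epoch sees a slightly different version of each frame, which is what drives the regularization effect described in the next section.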

Model Improvement via Data Augmentation

  • Training Accuracy Results:
    • Baseline accuracy: 49%
    • After augmentation: 53.4% training accuracy, with validation accuracy rising from 50% to 57.7%.
  • Stability Observed: Smoother training loss curves indicated the model was learning rather than merely memorizing the training data.
  • Goal Achieved: Addressed overfitting and improved generalization without altering model architecture.

Strengths and Limitations of Data Augmentation

  • Strengths:
    • Easy implementation, no structural changes needed.
    • Enhanced model generalization and reduced overfitting, especially in early training stages.
  • Limitations:
    • Cannot substitute for complex temporal modeling necessary for capturing motion.
    • Accuracy plateaued with excessive augmentation.

Ethical Considerations in Gesture Recognition

  • Privacy Concerns: Use of always-on cameras raises surveillance issues.
  • Bias Risk: Inadequate diversity in datasets may lead to unequal performance among different users.
  • Potential Misuse: Spoofing gestures to trigger unintended actions poses risks.
  • Environmental Effects: Continuous real-time processing increases energy consumption, and with it carbon emissions and device heat.

Recommendations for Ethical Implementation

  • Transparency: Inform users about when and how their gestures are tracked.
  • Diverse Datasets: Train models on varied data to minimize bias.
  • Energy Efficiency: Utilize edge devices for processing.

Project Takeaways

  • Technical Insights:
    • Smooth setup of CNN and data loader in Google Colab.
    • A simple data augmentation implementation yielded a meaningful improvement in results.
  • Challenges Faced:
    • Raw data matching and label formatting required debugging.
  • Future Directions:
    • Explore full video clips and more advanced architectures (e.g., 3D CNNs, LSTMs).
    • Work towards real-time systems operational on low-cost devices.
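The 3D-CNN direction mentioned above could look roughly like the sketch below: convolving over (time, height, width) lets the model capture motion that frame-wise 2D convolutions cannot. The layer sizes, clip length, and class count are illustrative assumptions, not a tested design:

```python
import numpy as np
from tensorflow.keras import layers, models

NUM_CLASSES = 14  # illustrative; match the dataset's gesture classes

# A small 3D CNN over 16-frame RGB clips: kernels span the time axis,
# so the network can respond to motion patterns, not just static poses.
model = models.Sequential([
    layers.Input(shape=(16, 64, 64, 3)),  # (frames, height, width, channels)
    layers.Conv3D(16, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(32, kernel_size=3, activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=2),
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Forward pass on a dummy batch of clips to confirm the output shape.
dummy = np.random.rand(2, 16, 64, 64, 3).astype("float32")
probs = model.predict(dummy, verbose=0)
```

A model like this is heavier than the 2D baseline, which is why the low-cost-device goal above would likely require pruning, quantization, or a smaller backbone.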

Conclusion

  • Project Results:
    • A 2D CNN enhanced with data augmentation outperformed the baseline model on the IPN Hand dataset.
    • Observed stability in training and reduced overfitting while utilizing lightweight approaches.
  • Future Goals:
    • Assess capabilities of advanced architectures and optimization techniques for improved gesture recognition solutions.