C3D Superlite for Real Time Gesture Recognition

A UTMIST Project by Mustafa Khan, Chaojun Chen, Yiyang Tang, Charles Yuan, and Yue Zhuang

Introduction

Hand gesture recognition seeks to classify the purposeful movements of human hands. It can be used in human-computer interaction, for example to control web pages and games or to perform robot control. The challenges of this task are recognizing gestures accurately, integrating gesture recognition capabilities into existing systems, and performing the recognition in real time. The team compared the performance of two architectures, Conv3D and C3D Superlite, and proposed a lightweight gesture recognition system (Augma) that addresses these three challenges using 3D convolutions, achieving near state-of-the-art performance on the 20bn-jester validation set.

Related Work

Convolutional Neural Networks (CNNs) achieve state-of-the-art performance on object detection and classification tasks when applied to static images [1]. They have also been extended successfully to recognition tasks on video data [2].

There are various approaches that use CNNs to extract spatio-temporal information from video data. The first is 2D CNNs, which treat stacked video frames as multi-channel inputs for action classification. The second is 3D CNNs, which use 3D convolutions and 3D pooling to capture discriminative features along both the spatial and temporal dimensions. The third, the Temporal Segment Network (TSN), divides video data into segments and extracts information from color and optical flow modalities for action recognition [3]. More recently, the Temporal Relation Network (TRN) has built on top of TSN to investigate temporal dependencies between video frames at multiple time scales [4].
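To make the distinction between the first two approaches concrete, the short TensorFlow snippet below (an illustrative sketch, not code from the project) contrasts how a 2D convolution treats stacked frames as channels with how a 3D convolution slides its kernel over time as well as space:

```python
# Illustrative sketch: 2D vs. 3D convolution on a short video clip.
import tensorflow as tf

clip = tf.random.normal((1, 30, 112, 112, 1))  # (batch, frames, H, W, channels)

# 2D CNN view: fold the 30 frames into the channel axis and convolve spatially.
frames_as_channels = tf.transpose(clip[..., 0], perm=[0, 2, 3, 1])  # (1, 112, 112, 30)
conv2d = tf.keras.layers.Conv2D(16, kernel_size=3, padding="same")
print(conv2d(frames_as_channels).shape)  # (1, 112, 112, 16) - temporal order is folded away

# 3D CNN view: a 3x3x3 kernel slides over time as well as space.
conv3d = tf.keras.layers.Conv3D(16, kernel_size=3, padding="same")
print(conv3d(clip).shape)  # (1, 30, 112, 112, 16) - temporal structure preserved
```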

Network Architectures

This work implements and compares two architectures. The first, Conv3D, tests the limits of 3D CNNs by conducting hand gesture recognition using stacked layers of 3D convolutions. The second, C3D Superlite, is inspired by [5], where features are extracted from video frames by a CNN and an LSTM is applied for global temporal modelling.

Figure 1. An illustration of 3D convolutions

Conv3D

Figure 2. Conv3D architecture.

The Conv3D net has 4 convolutional layers, each followed by batch normalization, an ELU activation and a pooling layer. All 3D convolution kernels are 3 × 3 × 3 with stride 1 in both the spatial and temporal dimensions. The 3D pooling layers are denoted Pool 1 to Pool 4. All pooling kernels have size 2 × 2 × 2, except for Pool 1, which is 1 × 2 × 2. The first fully connected layer takes 12800 inputs and outputs 512 units, while the final layer takes 512 inputs and has 6 outputs, corresponding to the number of classes being used.
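A minimal Keras sketch of this architecture is shown below. The filter counts of the first three convolutional layers appear only in Figure 2, so the values used here (64, 128, 256) are assumptions; "valid" padding, max pooling, and 512 filters in the last convolutional layer are also assumptions, though they are consistent with the stated 12800-unit input to the first fully connected layer (1 × 5 × 5 × 512 after Pool 4 on a 30 × 112 × 112 × 1 clip).

```python
# A hedged sketch of the Conv3D architecture in Keras (filter counts assumed).
from tensorflow.keras import layers, models

NUM_CLASSES = 6  # Swiping Up/Down/Left/Right, No gesture, Doing other things

def conv_block(x, filters, pool_size):
    # 3x3x3 convolution (stride 1) -> batch normalization -> ELU -> 3D pooling
    # (max pooling is assumed; the text only says "pooling layers").
    x = layers.Conv3D(filters, kernel_size=3, strides=1, padding="valid")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("elu")(x)
    x = layers.MaxPooling3D(pool_size=pool_size)(x)
    return x

def build_conv3d(input_shape=(30, 112, 112, 1)):
    inputs = layers.Input(shape=input_shape)
    # Filter counts below are assumptions; only the final 512 is implied by the
    # stated 12800-unit flatten (1 x 5 x 5 x 512 after Pool 4).
    x = conv_block(inputs, 64, pool_size=(1, 2, 2))   # Pool 1
    x = conv_block(x, 128, pool_size=(2, 2, 2))       # Pool 2
    x = conv_block(x, 256, pool_size=(2, 2, 2))       # Pool 3
    x = conv_block(x, 512, pool_size=(2, 2, 2))       # Pool 4
    x = layers.Flatten()(x)                           # 12800 units
    x = layers.Dense(512, activation="elu")(x)        # activation assumed
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_conv3d()
model.summary()
```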

C3D Superlite

Figure 3. C3D Superlite architecture.

The C3D Superlite net has 6 convolutional layers, 4 max-pooling layers, 1 LSTM and 1 fully connected layer, followed by a softmax output layer. All 3D convolutional kernels have a size of 3 × 3 × 3. Conv 1 has a stride of 1 × 2 × 2, while the remaining convolutional layers have a stride of 1 in both the spatial and temporal dimensions. The number of filters per layer is denoted in each box in Figure 3. The 3D pooling layers are labelled pool1 to pool4. The pooling kernel of pool1 is 1 × 2 × 2, of pool2 is 2 × 2 × 2, of pool3 is 3 × 2 × 2 and of pool4 is 2 × 2 × 2. The stride of each pooling layer is 2 × 2 × 2, except for pool1, which is 1 × 2 × 2. The output of the final pooling layer is reshaped into a 9 × 384 tensor and L2-normalized. This is then passed into an LSTM and a fully connected layer, each with 512 output units. The size of the output of the final layer corresponds to the number of classes being trained on.
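The sketch below mirrors this description in Keras. The filter counts and the exact interleaving of convolutional and pooling layers are given only in Figure 3, so those used here are placeholders; the per-time-step flattening is written generically rather than hard-coding the 9 × 384 reshape (which depends on those filter counts), and the activations are likewise assumed.

```python
# A hedged sketch of the C3D Superlite architecture in Keras.
# Filter counts and the conv/pool interleaving are assumptions standing in for Figure 3;
# the kernel sizes and strides follow the text.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 6  # assumption: matches the jester classes used for training

def build_c3d_superlite(input_shape=(30, 112, 112, 1)):
    inputs = layers.Input(shape=input_shape)

    # Conv 1 has stride 1x2x2; all other convolutions have stride 1 (3x3x3 kernels).
    x = layers.Conv3D(16, 3, strides=(1, 2, 2), padding="same", activation="relu")(inputs)
    x = layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2))(x)   # pool1
    x = layers.Conv3D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2))(x)   # pool2
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(64, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=(3, 2, 2), strides=(2, 2, 2))(x)   # pool3
    x = layers.Conv3D(96, 3, padding="same", activation="relu")(x)
    x = layers.Conv3D(96, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling3D(pool_size=(2, 2, 2), strides=(2, 2, 2))(x)   # pool4

    # Flatten each remaining time step into a feature vector (9 x 384 in the original,
    # which depends on the Figure 3 filter counts), then L2-normalize.
    t, h, w, c = x.shape[1:]
    x = layers.Reshape((t, h * w * c))(x)
    x = layers.Lambda(lambda z: tf.math.l2_normalize(z, axis=-1))(x)

    x = layers.LSTM(512)(x)                          # global temporal modelling
    x = layers.Dense(512, activation="relu")(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_c3d_superlite()
model.summary()
```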

Implementation

Dataset

The dataset we used, the 20bn-jester dataset, is a large collection of labelled video clips that show humans performing pre-defined hand gestures in front of a laptop. There are 118,562 videos in the training set and 14,787 videos in the validation set.

Figure 4. Videos from the 20bn-jester dataset. [Source]
Figure 5. A preprocessed example video using our pipeline.

Training

We use Adaptive Moment Estimation (Adam) to train our network. Training samples are generated from the videos in the training set with a preprocessing and augmentation pipeline: converting RGB frames to grayscale, unifying the number of frames, resizing each frame to 112 × 112 and performing mean subtraction on each sample. The frame count is unified by generating 30-frame clips about selected temporal positions; if a video is shorter than 30 frames, it is looped as many times as necessary.
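A rough sketch of this preprocessing pipeline is shown below; the function name and the choice of temporal position are illustrative rather than the exact implementation in our repository.

```python
# Illustrative preprocessing sketch: grayscale, 30-frame clip, 112x112 resize,
# per-sample mean subtraction. Uses OpenCV and NumPy.
import cv2
import numpy as np

CLIP_LEN, SIZE = 30, 112

def preprocess_frames(frames):
    """frames: non-empty list of RGB frames, each a (H, W, 3) uint8 array."""
    # Loop short videos until at least CLIP_LEN frames are available.
    while len(frames) < CLIP_LEN:
        frames = frames + frames
    # Take a 30-frame clip about a selected temporal position (here: the start).
    clip = frames[:CLIP_LEN]

    processed = []
    for frame in clip:
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)   # RGB -> grayscale
        gray = cv2.resize(gray, (SIZE, SIZE))            # 112 x 112
        processed.append(gray.astype(np.float32))

    sample = np.stack(processed)[..., np.newaxis]        # (30, 112, 112, 1)
    return sample - sample.mean()                        # mean subtraction
```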

To train the network, we employ Sparse Categorical Cross Entropy as the loss with a mini-batch size of 32 on 1 CPU, a learning rate of 0.01 and no momentum. Since the goal of training the hand gesture recognition system was to deploy it in applications, the model was trained on the classes ‘Swiping Up’, ‘Swiping Down’, ‘Swiping Left’, ‘Swiping Right’, ‘No gesture’ and ‘Doing other things’ from the 20bn-jester training set, which have the widest potential application in gesture-controlled systems. 6000 training videos were used with 1000 videos per class, along with 750 validation videos from the 20bn-jester validation set with 150 videos per class. In preliminary experiments on the ActivityNet dataset, a large learning rate and batch size were important for achieving good recognition performance.
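In Keras, this training setup looks roughly like the sketch below (a hedged example, not the project's exact script; `model` is one of the networks above, and `train_x`, `train_y`, `val_x`, `val_y` and `EPOCHS` are placeholders for the preprocessed clips, integer labels and epoch count):

```python
# Training sketch: Adam at lr=0.01, sparse categorical cross-entropy, batch size 32.
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    metrics=["accuracy"],
)

model.fit(
    train_x, train_y,                 # placeholder preprocessed clips and integer labels
    validation_data=(val_x, val_y),   # placeholder validation split
    batch_size=32,
    epochs=EPOCHS,                    # placeholder; not stated in the text
)
```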

Discussion

The team recognizes gestures in videos using the trained model, adopting a sliding-window approach to generate input clips: each video is split into non-overlapping 30-frame clips. Each clip is converted from RGB to grayscale and resized to satisfy the 30 × 112 × 112 × 1 input shape of the network. The trained model estimates the class probabilities of each clip, and the best predicted class is returned to recognize the hand gesture in the video.
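A sketch of this inference loop is shown below; the names are illustrative, `preprocess_frames` is the preprocessing sketch above applied per window, and taking the clip with the highest single-class probability is one reasonable reading of "the best predicted class".

```python
# Sliding-window inference sketch: split a video into non-overlapping 30-frame
# clips, preprocess each to (30, 112, 112, 1), and return the best predicted class.
import numpy as np

def recognize(video_frames, model, class_names, clip_len=30):
    """video_frames: non-empty list of RGB frames for one video."""
    clip_probs = []
    for start in range(0, max(len(video_frames), 1), clip_len):
        window = video_frames[start:start + clip_len]
        clip = preprocess_frames(window)                  # (30, 112, 112, 1); loops short windows
        probs = model.predict(clip[np.newaxis, ...])[0]   # class probabilities for this clip
        clip_probs.append(probs)
    # Best predicted class across clips (here: the highest single-clip probability).
    best = max(clip_probs, key=np.max)
    return class_names[int(np.argmax(best))]
```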

Results

Between Conv3D and C3D Superlite, the results indicate that C3D Superlite produced much more balanced results and generalized better than Conv3D, with a 5% higher validation accuracy and a 2 FPS faster inference speed. These results can be seen in Table 1 below.

Table 1. Comparing Conv3D with C3D Superlite.

A more detailed breakdown of the C3D Superlite training and validation results can be seen in Table 2 below. The weights of the trained model are publicly available in our GitHub repository.

Table 2. C3D Superlite Training Metrics.

Accuracy, Loss, and Weight Activations

Figure 6. The training (in orange) and validation (in blue) accuracy of the c3d_superlite model.
Figure 7. The training (in orange) and validation (in blue) loss of the c3d_superlite model.
Figure 8. Histogram of Weight Activations In Each Layer

Works Cited

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[2] C. Feichtenhofer, A. Pinz, and R. P. Wildes. Spatiotemporal residual networks for video action recognition. In Advances in Neural Information Processing Systems (NIPS), pages 3468–3476, 2016.

[3] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.

[4] B. Zhou, A. Andonian, and A. Torralba. Temporal relational reasoning in videos. arXiv preprint arXiv:1711.08496, 2017.

[5] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.

Written by University of Toronto Machine Intelligence Team

UTMIST’s Technical Writing Team publishes articles on topics within machine learning to our official publication: https://medium.com/demistify