Analysis of Video Classification With Convolutional Neural Networks
Bhagya Laxmi Behera, Mihrnarayan Mohanty
Convolutional Neural Networks (CNNs) have been established as a powerful class of models for image and video recognition problems. The most popular CNN architectures for object detection and object-category classification from images, such as AlexNet, GoogLeNet, and ResNet, were trained on the ImageNet dataset. A large part of the recent improvement in video classification builds on work done by the image classification community: deep CNNs produce results competitive with handcrafted motion features. This paper studies multiple approaches for extending the connectivity of a CNN in the time domain to take advantage of local spatio-temporal information, and suggests a multi-resolution, foveated architecture as a promising way of speeding up training. In the context of large-scale video processing, training CNNs directly on video frames is extremely time consuming due to the large number of frames involved. This paper proposes to avoid that cost by training CNNs on either YouTube videos or Flickr images and then using the networks' outputs as features for other, higher-level classifiers. Finally, the paper presents an adaptive method that determines the temporal size of the network input from optical flow energy, and develops a volumetric pyramid pooling layer to handle input clips of arbitrary sizes.
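The time-domain connectivity extensions mentioned above can be illustrated with the simplest variant, early fusion, where a clip's frames are stacked along the channel axis so the first convolutional filters see all frames at once. This is a minimal sketch assuming channels-last `(T, H, W, C)` clips; late and slow fusion, which combine frames deeper in the network, are not shown:

```python
import numpy as np

def early_fusion_input(clip):
    """Early-fusion sketch: extend a CNN's connectivity in the time
    domain by folding a clip's T frames into the channel axis, so the
    first convolutional layer operates on all frames simultaneously."""
    t, h, w, c = clip.shape
    # (T, H, W, C) -> (H, W, T, C) -> (H, W, T*C): time folded into channels.
    return clip.transpose(1, 2, 0, 3).reshape(h, w, t * c)

# A 2-frame RGB clip becomes a single 6-channel "image" that any
# ordinary 2D CNN input layer can consume.
clip = np.random.rand(2, 32, 32, 3)
fused = early_fusion_input(clip)
```

A 2D convolution over the fused input then mixes information across all frames in its very first layer, which is what "extending connectivity in the time domain" means in the early-fusion case.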
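The multi-resolution, foveated architecture can be sketched as two equally sized input streams: a downsampled context view of the whole frame and a full-resolution centre crop (the "fovea"). The 89-pixel stream size and the stride-based downsampling below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def foveated_streams(frame, out_size=89):
    """Split one frame into the two streams of a foveated architecture:
    - context stream: the whole frame, downsampled to out_size x out_size
    - fovea stream:   an out_size x out_size centre crop at full resolution
    """
    h, w, c = frame.shape
    # Fovea: centre crop at the original resolution.
    top, left = (h - out_size) // 2, (w - out_size) // 2
    fovea = frame[top:top + out_size, left:left + out_size]
    # Context: naive index-based downsampling of the full frame
    # (a real pipeline would use proper area interpolation).
    ys = np.linspace(0, h - 1, out_size).astype(int)
    xs = np.linspace(0, w - 1, out_size).astype(int)
    context = frame[np.ix_(ys, xs)]
    return fovea, context

frame = np.random.rand(178, 178, 3)
fovea, context = foveated_streams(frame)
```

Because both streams have the same spatial size, they can feed two copies of one smaller network, which is where the claimed training speed-up comes from: each stream processes roughly a quarter of the original pixels.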
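The adaptive temporal size can be sketched as growing the input clip until accumulated motion energy passes a budget, so fast motion yields short clips and slow motion long ones. Plain frame differencing stands in for the optical flow energy the paper uses, and the budget and length bounds are illustrative assumptions:

```python
import numpy as np

def motion_energy(frames):
    """Mean absolute frame difference per step; a cheap stand-in for
    optical flow energy (assumption: the exact energy definition is
    not reproduced here)."""
    diffs = np.abs(np.diff(frames.astype(float), axis=0))
    return diffs.mean(axis=(1, 2, 3))

def adaptive_temporal_size(frames, budget=0.95, min_len=4, max_len=16):
    """Grow the clip until cumulative motion energy reaches `budget`,
    then clamp the resulting length to [min_len, max_len]."""
    cum = np.cumsum(motion_energy(frames))
    # Index i means differences 0..i were needed, i.e. i + 2 frames.
    length = int(np.searchsorted(cum, budget)) + 2
    return int(np.clip(min(length, len(frames)), min_len, max_len))

# A clip whose brightness rises by 0.1 per frame: energy ~0.1 per step,
# so the 0.95 budget is met after 10 steps, giving an 11-frame input.
clip = np.stack([np.full((8, 8, 1), 0.1 * t) for t in range(20)])
```

The chosen length can then be passed to whatever sampler assembles the network's input volume.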
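The volumetric pyramid pooling layer can be sketched as max-pooling a T x H x W feature volume into a pyramid of fixed bin counts, so clips of arbitrary size map to a fixed-length vector. The level set `(1, 2)` and the use of max-pooling are illustrative assumptions; dimensions are assumed to be at least as large as the finest level:

```python
import numpy as np

def volumetric_pyramid_pool(volume, levels=(1, 2)):
    """Pool a T x H x W volume into n x n x n bins for each level n,
    keeping the max of each bin. The output length is
    sum(n**3 for n in levels) regardless of the input shape."""
    t, h, w = volume.shape
    feats = []
    for n in levels:
        # Bin edges computed per axis, so any input size splits into
        # exactly n (roughly equal) bins along each dimension.
        tb = np.linspace(0, t, n + 1).astype(int)
        hb = np.linspace(0, h, n + 1).astype(int)
        wb = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    block = volume[tb[i]:tb[i + 1],
                                   hb[j]:hb[j + 1],
                                   wb[k]:wb[k + 1]]
                    feats.append(block.max())
    return np.array(feats)

# Two clips of different temporal and spatial extents produce
# feature vectors of identical length (1 + 8 = 9 values here).
fa = volumetric_pyramid_pool(np.random.rand(6, 10, 12))
fb = volumetric_pyramid_pool(np.random.rand(9, 20, 7))
```

Placing such a layer between the convolutional stack and the fully connected layers is what lets the network accept input clips of arbitrary sizes.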