The Kinetics human action video dataset is a large-scale video action recognition dataset released by Google DeepMind. It contains around 300,000 trimmed videos covering 400 human action classes. In 2017 it served as the trimmed video classification track of the ActivityNet challenge. During our participation in the challenge, we confirmed that our TSN framework (published in ECCV 2016) works well on Kinetics. Using the Inception V3 architecture, our single two-stream model reaches 76.6% top-1 accuracy on Kinetics. This result is achieved by sampling only 25 snippets from each video, whereas DeepMind's I3D models use all video frames at test time and reach 74.1%.
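The snippet-based testing scheme above can be sketched as follows. This is a minimal illustration, not the released code: the even-spacing sampling follows the TSN testing protocol, while the stream fusion weight is an assumption for demonstration.

```python
import numpy as np

def sample_snippet_indices(num_frames, num_snippets=25):
    # TSN-style test sampling: pick frame indices evenly spaced
    # across the whole video, instead of using every frame.
    ticks = np.linspace(0, num_frames - 1, num_snippets)
    return ticks.round().astype(int)

def fuse_two_stream(rgb_scores, flow_scores, flow_weight=1.5):
    # rgb_scores / flow_scores: (num_snippets, num_classes) arrays of
    # per-snippet class scores from the spatial and temporal streams.
    # Average over snippets within each stream, then combine the streams
    # with a weighted sum (the weight here is a hypothetical choice).
    rgb = rgb_scores.mean(axis=0)
    flow = flow_scores.mean(axis=0)
    return rgb + flow_weight * flow

# Example: a 300-frame video, 400 Kinetics classes.
idx = sample_snippet_indices(300)          # 25 indices from 0 to 299
rgb = np.random.rand(25, 400)
flow = np.random.rand(25, 400)
pred = fuse_two_stream(rgb, flow).argmax() # predicted class id
```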
We have also verified that TSN models trained on Kinetics provide excellent pretraining for related tasks such as untrimmed video classification and temporal action detection (SSN, ICCV 2017).
Due to the huge volume of Kinetics, training action recognition models on it is computationally demanding for academia. We believe the benefits of Kinetics should not be limited to labs or companies with large GPU clusters, so we are releasing our TSN action recognition models trained on the Kinetics dataset. For reference, we also list a performance comparison of Kinetics-pretrained and ImageNet-pretrained models on two action understanding tasks, i.e. untrimmed video classification and temporal action detection with SSN.
Model weights and experimental results can be found on the [project website].
Performance figures:

- Kinetics Action Recognition
- Temporal Action Detection with SSN