Event

PhD Defense: From 2D Dense to 3D Sparse Trajectories for Human Action Detection and Recognition

  • Speaker: Konstantinos Papadopoulos

  • Venue

    Please register via the link below to get access:

    https://unilu.webex.com/unilu/onstage/g.php?MTID=e5041ea5e136419501fe76b6058c396f5

Members of the defense committee:

  • Chairman: Assist. Prof. Radu State, University of Luxembourg
  • Deputy Chairman: Prof. Dr Björn Ottersten, University of Luxembourg
  • Supervisor: Dr Djamila Aouada, University of Luxembourg
  • Member: Prof. Stefano Berretti, University of Florence, Italy
  • Member: Dr François Bremond, INRIA, Sophia Antipolis, Nice, France

Abstract

Human action recognition has been an active research topic in computer vision, with applications such as video surveillance, home-based rehabilitation, and human-computer interaction. In the literature, trajectories have been widely employed to model motion, given their effectiveness. Several variants of trajectory-based representations exist. Among the most successful are dense trajectories, commonly extracted from an RGB stream using optical flow, and sparse trajectories built from 2D or 3D skeleton joints, usually provided by 3D sensors. Although both dense and sparse trajectory-based approaches have shown strong performance, each presents its own shortcomings. Despite their ability to track subtle motion with precision, dense trajectories are sensitive to noise and irrelevant background motion, and they lack locality awareness. Furthermore, because of their 2D nature, dense trajectories perform poorly in the presence of radial motion. Sparse trajectories, on the other hand, form a high-level, compact representation of human motion that is widely adopted in action recognition. However, they are barely applicable in real-life scenarios due to the limitations of 3D sensors, such as close-range requirements and sensitivity to outdoor illumination.

In this thesis, we propose to overcome these issues by exploring and extending both representations, thus going from 2D dense to 3D sparse trajectories. In the first part of the thesis, we combine the dense and sparse representations. First, we introduce Localized Trajectories, which endow dense trajectories with local description power by clustering motion trajectories around human body joints and then encoding them using local Bag-of-Words. We also revisit action detection by exploiting dense trajectories and skeleton features in an alternative way. Moreover, for a better description of radial motion, we extend Localized Trajectories to 3D by computing the scene flow from the depth modality.
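
To make the Localized Trajectories idea concrete, the following is a minimal sketch: each dense trajectory is assigned to its nearest skeleton joint, and a Bag-of-Words histogram is then accumulated per joint. The array shapes, the codebook size k, and the function name are illustrative assumptions, not the thesis implementation.

```python
# Hypothetical sketch of localized Bag-of-Words over dense trajectories.
import numpy as np
from sklearn.cluster import KMeans

def localized_bow(traj_descriptors, traj_positions, joint_positions, k=64):
    """traj_descriptors: (N, D) descriptors of N dense trajectories.
    traj_positions:   (N, 2) mean 2D position of each trajectory.
    joint_positions:  (J, 2) 2D skeleton joint locations.
    Returns a (J, k) matrix: one k-bin BoW histogram per joint."""
    # Assign every trajectory to its nearest joint (locality awareness).
    dists = np.linalg.norm(
        traj_positions[:, None, :] - joint_positions[None, :, :], axis=2)
    nearest_joint = dists.argmin(axis=1)                    # (N,)

    # Learn a codebook of visual words over all trajectory descriptors.
    codebook = KMeans(n_clusters=k, n_init=10).fit(traj_descriptors)
    words = codebook.predict(traj_descriptors)              # (N,)

    # Accumulate a local histogram of visual words per joint.
    hist = np.zeros((joint_positions.shape[0], k))
    for j, w in zip(nearest_joint, words):
        hist[j, w] += 1

    # L1-normalize each non-empty joint histogram.
    sums = hist.sum(axis=1, keepdims=True)
    return np.divide(hist, sums, out=np.zeros_like(hist), where=sums > 0)
```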

In the second part of the thesis, we focus on representations based purely on 3D sparse trajectories. To overcome the limitations of 3D sensors, we exploit advances in 3D pose estimation from a single RGB camera to generate synthetic sparse trajectories. Instead of relying on a traditional skeleton alignment, virtual viewpoints are used to augment the viewpoint variability of the training data. Nevertheless, the estimated 3D skeletons naturally contain more noise than those acquired with 3D sensors. For that reason, we introduce a network that implicitly smooths skeleton joint trajectories in an end-to-end manner. The successful Spatial Temporal Graph Convolutional Network (ST-GCN), which effectively exploits the graph structure of skeleton sequences, is jointly used for recognizing the actions. However, raw skeleton features are not informative enough for such networks, and important temporal dependencies are ignored. Therefore, we extend the ST-GCN with two novel modules. The first learns appropriate vertex features by encoding raw skeleton data into a new feature space. The second uses a hierarchical dilated convolutional network to capture both short-term and long-term temporal dependencies. Extensive experiments and analyses validate all of our contributions, showing their effectiveness with respect to the state of the art.
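
A minimal sketch of the two extensions described above, under assumed tensor shapes: a small embedding that lifts raw joint coordinates into a richer vertex-feature space, and a stack of temporal convolutions whose dilation doubles at each level, so the receptive field covers both short- and long-term motion. All class names, channel sizes, and level counts here are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class VertexEmbedding(nn.Module):
    """Encode raw per-joint coordinates (e.g. x, y, z) into C features."""
    def __init__(self, in_channels=3, out_channels=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=1))

    def forward(self, x):          # x: (batch, 3, frames, joints)
        return self.mlp(x)         # (batch, C, frames, joints)

class HierarchicalDilatedTCN(nn.Module):
    """Temporal convolutions with dilations 1, 2, 4, ... at each level."""
    def __init__(self, channels=64, levels=3, kernel_size=3):
        super().__init__()
        layers = []
        for i in range(levels):
            d = 2 ** i             # dilation doubles at each level
            pad = d * (kernel_size - 1) // 2
            layers += [nn.Conv2d(channels, channels, (kernel_size, 1),
                                 padding=(pad, 0), dilation=(d, 1)),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):          # x: (batch, C, frames, joints)
        return self.net(x)         # same shape, larger temporal context

# Example: 2 sequences, 100 frames, 25 joints with 3D coordinates.
x = torch.randn(2, 3, 100, 25)
feats = HierarchicalDilatedTCN()(VertexEmbedding()(x))
print(feats.shape)                 # torch.Size([2, 64, 100, 25])
```

Convolving only along the frame axis (kernel size 1 over joints) leaves the graph structure to the ST-GCN's spatial layers, while the doubling dilations grow the temporal receptive field exponentially with depth.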