Human-machine interaction can make many daily activities more convenient. The proliferation of smart devices has driven the development of smart systems that provide intelligent, personalized device control. The first step in controlling any device is observation: by understanding the surrounding environment and human activity, a smart system can physically control a device. Human activity recognition (HAR) is therefore essential in many smart applications, such as self-driving cars, human-robot interaction, and automatic systems like infrared (IR) taps. Human-centric systems must meet real-time requirements to perform physical tasks, so anticipating human actions is essential for human-machine interaction. IR taps suffer from delay because the proximity sensor signals the solenoid valve only when the user's hands are directly below the tap; the hardware and electronic delays cause inconvenience in use and waste water. In this thesis, an alternative control scheme based on deep-learning action anticipation is proposed.

Humans interact with taps for various tasks, such as washing their hands or face and brushing their teeth, to name a few. We focus on a small subset of these activities: those carried out sequentially during the Islamic cleansing ritual called Wudu. The skeleton modality is widely used in HAR because it provides abstract information that is scale-invariant and robust to imaging variations. We used depth cameras to obtain accurate 3D human skeletons of users performing Wudu, and the sequences were manually annotated with ten atomic action classes. This thesis investigates different deep-learning networks with architectures optimized for real-time action anticipation, mainly based on the Spatial-Temporal Graph Convolutional Network.
Building on this, we propose a Gated Recurrent Unit (GRU) model with a Spatial-Temporal Graph Convolutional Network (ST-GCN) backbone that extracts local temporal features. The GRU processes these local temporal latent features sequentially to predict future actions. The proposed models achieved 94.14% recall on the binary classification task of turning the water tap on and off, and recall ranging from 81.58% to 89.08% on multiclass classification.
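The pipeline shape described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the GRU cell is written out by hand in numpy, the latent features stand in for the output of an ST-GCN backbone, and all dimensions, weights, and function names are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One GRU step: x is the latent feature for the current temporal
    window (standing in for an ST-GCN backbone output), h is the
    previous hidden state."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1 - z) * h + z * h_cand

def anticipate(latent_seq, params, W_out):
    """Run the GRU over the sequence of local temporal features and
    project the final hidden state to future-action class scores."""
    h = np.zeros(params[1].shape[0])
    for x in latent_seq:
        h = gru_step(x, h, params)
    return h @ W_out

# Illustrative sizes: 16-d latent features, 8-d hidden state,
# 10 atomic action classes (as annotated in the dataset).
rng = np.random.default_rng(0)
d_in, d_h, n_classes = 16, 8, 10
params = (rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)),
          rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)))
W_out = rng.normal(size=(d_h, n_classes))

latent_seq = rng.normal(size=(5, d_in))   # 5 local temporal windows
scores = anticipate(latent_seq, params, W_out)
print(scores.shape)  # (10,) — one score per future action class
```

Processing the backbone features sequentially this way lets the model emit a prediction after every temporal window, which is what makes anticipation (rather than post-hoc recognition) possible in real time.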
KAUST Research Repository