We present a pipeline for recognizing dynamic freehand gestures on mobile devices based on depth information from a single Time-of-Flight sensor. Hand gestures are recorded with a mobile 3D sensor, transformed frame by frame into an appropriate 3D descriptor and fed into a deep LSTM network for recognition purposes. As a recurrent neural model, LSTM is well suited for classifying explicitly time-dependent data such as hand gestures. For training and testing purposes, we create a small database of four hand gesture classes, each comprising 40 × 150 3D frames. We conduct experiments concerning execution speed on a mobile device, generalization capability as a function of network topology, and classification ability ‘ahead of time’, i.e., when the gesture is not yet completed. Recognition rates are high (>95%) and can be sustained in real time, as a single classification step requires less than 1 ms of computation time, which makes freehand gestures practical for mobile systems.
We present a system for 3D hand gesture recognition based on low-cost time-of-flight (ToF) sensors intended for outdoor use in automotive human-machine interaction. As signal quality is impaired compared to Kinect-type sensors, we study several ways to improve performance when a large number of gesture classes is involved. Our system fuses data from two ToF sensors; the fused data are used to build up a large database and subsequently to train a multilayer perceptron (MLP). We demonstrate that we are able to reliably classify a set of ten hand gestures in real time, and we describe the setup of the system, the utilised methods and possible application scenarios.
In this contribution we present a novel approach to transform data from time-of-flight (ToF) sensors so that it can be interpreted by Convolutional Neural Networks (CNNs). As ToF data tends to be overly noisy depending on various factors such as illumination, reflection coefficient and distance, the need for a robust algorithmic approach becomes evident. By spanning a three-dimensional grid of fixed size around each point cloud, we are able to transform three-dimensional input into a form that CNNs can process. This simple and effective neighborhood-preserving methodology demonstrates that CNNs are indeed able to extract the relevant information and learn a set of filters, enabling them to differentiate a complex set of ten different gestures obtained from 20 different individuals and containing 600,000 samples overall. Our 20-fold cross-validation shows the generalization performance of the network, achieving an accuracy of up to 98.5% on validation sets comprising 20,000 data samples. The real-time applicability of our system is demonstrated via an interactive validation on an infotainment system running at up to 40 fps on an iPad in the vehicle interior.
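The grid transformation described in this abstract can be illustrated with a minimal sketch. The abstract does not state what each grid cell stores or how large the grid is, so the following rests on assumptions: the function name `voxelize`, the default `grid_size`, the use of the cloud's bounding box as the grid extent, and point counts as cell values are all hypothetical choices, not the authors' method.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def voxelize(points: Sequence[Point], grid_size: int = 8) -> List[List[List[int]]]:
    """Map a point cloud to a grid_size^3 grid of point counts.

    Assumption: the grid spans the axis-aligned bounding box of the cloud.
    """
    if not points:
        raise ValueError("point cloud is empty")
    mins = [min(p[a] for p in points) for a in range(3)]
    maxs = [max(p[a] for p in points) for a in range(3)]
    grid = [[[0] * grid_size for _ in range(grid_size)] for _ in range(grid_size)]
    for p in points:
        idx = []
        for a in range(3):
            extent = maxs[a] - mins[a]
            # Degenerate axis (all points share this coordinate): use cell 0.
            frac = (p[a] - mins[a]) / extent if extent > 0 else 0.0
            # Clamp so that the maximum coordinate falls into the last cell.
            idx.append(min(int(frac * grid_size), grid_size - 1))
        grid[idx[0]][idx[1]][idx[2]] += 1
    return grid
```

Because neighboring points land in the same or adjacent cells, the resulting fixed-size volume preserves local neighborhoods and has the regular layout that convolutional filters expect.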
PROPRE is a generic and modular neural learning paradigm that autonomously extracts meaningful concepts from multimodal data flows, driven by predictability across modalities, in an unsupervised, incremental and online way. For that purpose, PROPRE combines projection and prediction. Firstly, each data flow is topologically projected with a self-organizing map, largely inspired by the Kohonen model. Secondly, each projection is predicted from the activities of the other maps by means of linear regressions. The main originality of PROPRE is the use of a simple and generic predictability measure that compares predicted and real activities for each modal stream. This measure drives the corresponding projection learning to favor the mapping of stimuli that are predictable across modalities at the system level (i.e. stimuli whose predictability measure exceeds some threshold). This predictability measure acts as a self-evaluation module that tends to bias the representations extracted by the system so as to improve their correlations across modalities. We already showed that this modulation mechanism is able to bootstrap representation extraction from previously learned representations with artificial multimodal data related to basic robotic behaviors [1] and improves the performance of the system for classification of visual data within a supervised learning context [2]. In this article, we improve the self-evaluation module of PROPRE by introducing a sliding threshold, and apply it to the unsupervised classification of gestures captured by two time-of-flight (ToF) cameras. In this context, we illustrate that the modulation mechanism is still useful, although less efficient than purely supervised learning.
Touch versus mid-air gesture interfaces in road scenarios - measuring driver performance degradation
(2016)
We present a study comparing the degradation of the driver's performance during touch gesture vs mid-air gesture use for infotainment system control. To this end, 17 participants were asked to perform the Lane Change Test. This requires each participant to steer a vehicle in a simulated driving environment while interacting with an infotainment system via touch and mid-air gestures. The decrease in performance is measured as the deviation from an optimal baseline. The study finds comparable deviations from the baseline for the secondary task of infotainment interaction for both interaction variants. This is significant because all participants are experienced in touch interaction but have had no experience at all with mid-air gesture interaction, which favors mid-air gestures in the long-term scenario.
We present a publicly available benchmark database for the problem of hand posture recognition from noisy depth data and fused RGB-D data obtained from low-cost time-of-flight (ToF) sensors. The database is the most extensive database of this kind, containing over a million data samples (point clouds) recorded from 35 different individuals for ten different static hand postures. This captures a great amount of variance: in addition to person-related factors, scaling, translation and rotation are explicitly represented. Benchmark results achieved with a standard classification algorithm are computed by cross-validation both over samples and over persons, the latter implying training on all persons but one and testing on the remaining one. An important result using this database is that cross-validation performance over samples (which is the standard procedure in machine learning) is systematically higher than cross-validation performance over persons, which is to our mind the true application-relevant measure of generalization performance.
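Cross-validation over persons, as described in this abstract, can be sketched as follows. This is a minimal illustration under assumptions: the function name `leave_one_person_out` and the representation of the data set as a flat list of person identifiers (one per sample) are hypothetical and not taken from the database's actual tooling.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_person_out(
    person_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_person, train_indices, test_indices), one fold per person.

    person_ids[i] is the (hypothetical) identifier of the person who
    produced sample i.
    """
    for held_out in sorted(set(person_ids)):
        # Train on every sample that does not belong to the held-out person.
        train = [i for i, p in enumerate(person_ids) if p != held_out]
        # Test only on samples of the held-out person.
        test = [i for i, p in enumerate(person_ids) if p == held_out]
        yield held_out, train, test
```

In contrast to a split over samples, no person contributes to both the training and the test set of a fold, so the score reflects performance on previously unseen users.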
Given the success of convolutional neural networks (CNNs) during recent years in numerous object recognition tasks, it seems logical to further extend their applicability to the treatment of three-dimensional data such as point clouds provided by depth sensors. To this end, we present an approach that exploits the CNN’s ability to generate features automatically and combines it with a novel 3D feature computation technique that preserves local information contained in the data. Experiments are conducted on a large data set of 600,000 samples of hand postures obtained via time-of-flight (ToF) sensors from 20 different persons, after an extensive parameter search to optimize the network structure. Generalization performance, measured by a leave-one-person-out scheme, exceeds that of any other method presented for this specific task, bringing the error for some persons down to 1.5%.
We present a novel method to perform multi-class pattern classification with neural networks and test it on a challenging 3D hand gesture recognition problem. Our method consists of a standard one-against-all (OAA) classification, followed by another network layer classifying the resulting class scores, possibly augmented by the original raw input vector. This allows the network to disambiguate hard-to-separate classes, as the distribution of class scores carries considerable information as well and is in fact often used for assessing the confidence of a decision. We show that with this approach we are able to significantly boost our results, overall as well as for particularly difficult cases, on the hard 10-class gesture classification task.
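The two-stage structure described in this abstract can be sketched as a forward pass. This is only an illustration under assumptions: both stages are reduced to single linear layers with hand-supplied weights, and the names `linear_layer` and `two_stage_predict` are hypothetical. The abstract does not specify the network sizes or the training procedure, and none is shown here.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def linear_layer(weights: Matrix, biases: Sequence[float], x: Sequence[float]) -> List[float]:
    """Compute W x + b for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def two_stage_predict(
    x: Sequence[float],
    oaa_w: Matrix, oaa_b: Sequence[float],
    top_w: Matrix, top_b: Sequence[float],
    use_raw_input: bool = True,
) -> int:
    """Return the predicted class index.

    Stage 1: one-against-all scores, one sigmoid output per class.
    Stage 2: a further layer classifies those scores, optionally
    augmented by the raw input vector.
    """
    scores = [sigmoid(z) for z in linear_layer(oaa_w, oaa_b, x)]
    features = scores + list(x) if use_raw_input else scores
    logits = linear_layer(top_w, top_b, features)
    return max(range(len(logits)), key=logits.__getitem__)
```

The second stage sees the whole score vector rather than only its maximum, which is what lets it use the distribution of scores to separate classes that the OAA stage confuses.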
We present a lightweight, real-time capable 3D gesture recognition system on mobile devices for improved Human-Machine Interaction. We utilize time-of-flight data coming from a single sensor and implement the whole gesture recognition pipeline on two different devices, outlining the potential of integrating these sensors into mobile devices. The main components are responsible for cropping the data to the essentials, calculating meaningful features, training and classifying via neural networks, and realizing a GUI on the device. With our system we achieve recognition rates of up to 98% on a 10-gesture set with frame rates reaching 20 Hz, more than sufficient for real-time applications.