Detecting sign language with a smart glove

Communicate gestures and hand motions to people or robots by wearing a glove with machine learning and soft sensors


Collaborators: Josie Hughes, Matteo D’Aria, Marco de Fazio, Daniela Rus

Across cultures and individuals, nonverbal aspects of communication can be as important as words themselves. They can augment and enrich the spoken word, modify subtle meanings, or even replace words altogether, such as when using sign language. And yet these physical aspects can be hard for current robots or computers to detect, making them less intuitive and less accessible.

Taking a step towards smarter wearable devices that can understand more about a person’s actions, this project presents a smart glove with soft pose sensors, motion sensors, and on-board machine learning. It represents a self-contained wearable system that can sense hand pose and motion, and detect sign language in real time.

For more information, check out the virtual presentation below.

The system is based on a commercially available conductive glove and embeds machine learning in a small wearable microcontroller. This aims to help address remaining challenges in smart wearables, including scalable, accessible fabrication and real-time processing of high-dimensional streaming data.

The self-contained wearable system is based on a commercially available conductive knit glove. Wires are added to the back to create strain sensors. The board contains an accelerometer, and performs all signal processing, real-time neural network evaluations, and communications.
Photos by Joseph DelPreto, MIT CSAIL

Sensorizing a glove: Soft pose sensing using a conductive knit

We use a commercially available knitted conductive glove designed to work with capacitive touch screens – the Original Sport glove by Agloves. Due to the silver threads within the knit, the glove is electrically conductive, and its resistance changes when the material is stretched, such as when a finger bends, which provides information about hand motion. Adding electrode connections to this off-the-shelf glove enables rapid fabrication of the sensorized system.

To enable hand pose identification, we use this conductive glove to form strain sensors spanning all joints of the hand. A connection is made by simply weaving approximately 2 cm of stripped wire through the knit in a rough loop. Measuring the resistance between two such connections forms a strain sensor due to the knit’s strain-sensitive response. To simplify processing while still detecting motions of all joints, we only consider the resistance between each point and a common ground. Connections are placed on each finger segment to maximize the information content.

Wires attached to the microcontroller are long enough to not hinder finger motion. In the future, they could also be hidden and protected by a non-conductive glove layer, such as wearing a latex glove over the smart glove.

Wires are connected to 17 points on the back of the hand to form strain sensors on each finger segment and the hand, including a common ground. An accelerometer is also on the PCB on the back of the hand. (a) depicts the locations schematically, while (b) shows the connection method.
Photos by Joseph DelPreto, MIT CSAIL
The response of the glove material was characterized using 80 cycles of straining either along or across the knit. Adapted from this paper.

Electronics

The glove electronics incorporate the strain sensor read-outs, an ST microcontroller, and a 3-axis accelerometer. At the heart of the board is an STM32H7 microcontroller, which performs data acquisition, signal processing, and real-time neural network evaluation. Each of the 16 strain sensors is connected in turn to a constant current source to measure its resistance. An on-board neural network processes these readings to predict gestures. The board also includes Bluetooth for wirelessly communicating results, although the current version uses USB.
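As a rough illustration of the measurement principle (not the actual firmware), the sketch below converts one multiplexed sweep of ADC readings into per-channel resistance estimates via Ohm's law; the source current, ADC resolution, and reference voltage are assumed values for the example.

```python
import numpy as np

# Assumed example values; the real source current and ADC configuration
# are not specified in this write-up.
SOURCE_CURRENT_A = 100e-6   # constant-current source value (assumed)
ADC_MAX_COUNTS = 65535      # 16-bit ADC (assumed)
ADC_REF_VOLTS = 3.3         # ADC reference voltage (assumed)

def read_strain_resistances(adc_counts):
    """Convert one sweep of 16 ADC readings (one per strain channel)
    into resistance estimates using Ohm's law: R = V / I."""
    volts = np.asarray(adc_counts, dtype=float) * ADC_REF_VOLTS / ADC_MAX_COUNTS
    return volts / SOURCE_CURRENT_A  # ohms, one value per channel
```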

Gesture vocabulary: American Sign Language (ASL)

The chosen vocabulary consists of poses and gestures representing 24 letters, words, or phrases of ASL. This highlights the capabilities of the sensorized glove to detect both static and dynamic gestures, and demonstrates a potential application that could make human interactions more natural by translating ASL into text or speech in real time.

A vocabulary of 24 ASL letters and words was selected, which requires identifying a range of poses and dynamic motions. This yields an informative corpus for evaluating the combination of strain-based pose information and accelerometer-based motion information featured by the embedded glove system.
Photos by Joseph DelPreto, MIT CSAIL

ASL naturally showcases the necessity of detecting both hand pose and motion. Certain pairs of signs, like I and J or A and Sorry, have the same pose but different dynamics. Other sets, like Eat, Home, and Thank You, have subtle differences in hand poses, orientations, or motion directions. Some gestures such as Please or Yes are periodic motions that could have varying numbers of repetitions. Most of the letters are static poses without motion. Altogether, the chosen vocabulary probes the system’s ability to combine pose and motion information for multi-class gesture detection.

Training a neural network

Data was recorded continuously throughout 7 sessions of using the glove, and each of the 24 signs was made 10 times during each session. To create training data for the classifier, the recorded data is segmented into 2-second labeled examples, conditioned, and transformed into feature vectors. Data augmentation is also used to improve the robustness and accuracy of the trained network, especially when applied to real-time streaming data.
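As a minimal sketch of this segmentation step, the example below cuts a continuous recording into 2-second labeled windows around each cue. The 100 Hz rate is implied by the smoothing description below (0.1 seconds corresponding to 10 timesteps), and centering the window on the cue time is an assumption.

```python
SAMPLING_HZ = 100   # implied by "0.1 seconds (10 timesteps)" below
WINDOW_S = 2.0

def segment_examples(data, cue_times_s, labels):
    """Cut a continuous recording, shaped (timesteps, channels), into
    2-second labeled examples centered on each cue time. Centering the
    window on the cue is an assumption about how the windows were placed."""
    half = int(WINDOW_S * SAMPLING_HZ // 2)
    examples = []
    for cue_s, label in zip(cue_times_s, labels):
        center = int(cue_s * SAMPLING_HZ)
        if center - half >= 0 and center + half <= len(data):
            examples.append((data[center - half:center + half], label))
    return examples
```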

The data is processed to smooth and normalize the signals, which helps emphasize useful information rather than noise. Each strain channel is smoothed by a moving mean with a trailing window spanning 0.1 seconds (10 timesteps) to remove high-frequency noise and outliers. Then, to make the classifier robust to short-term or long-term drift in the strain sensors while also avoiding calibration routines, the strain values are dynamically normalized on a rolling basis. For each 2-second window, the minimum and maximum values across all strain channels are computed, and then all values are shifted down by this minimum and scaled by this range. The strain values in each window will thus lie between 0 and 1. Jointly shifting and scaling all channels by the same factor preserves the relative magnitudes between channels. Computing these offsets and factors on a rolling basis accommodates drift within an experiment or across days due to effects such as the glove’s hysteresis or fit on the hand, while avoiding tuned factors or dedicated calibration periods.
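The following is a minimal sketch of this conditioning for a single 2-second window, assuming the data is a NumPy array shaped (timesteps, channels); the handling of the first few samples and the function name are illustrative.

```python
import numpy as np

def condition_strain_window(strain_window, smooth_samples=10):
    """Smooth and normalize one 2-second window of strain data,
    shaped (timesteps, channels).

    1) Trailing moving-mean smoothing over ~0.1 s (10 samples here).
    2) Rolling normalization: shift by the window's global minimum and
       scale by its global range, so values fall in [0, 1] while the
       relative magnitudes between channels are preserved.
    """
    x = np.asarray(strain_window, dtype=float)

    # Trailing moving mean: each sample is averaged with the preceding ones.
    smoothed = np.empty_like(x)
    for t in range(x.shape[0]):
        start = max(0, t - smooth_samples + 1)
        smoothed[t] = x[start:t + 1].mean(axis=0)

    # Joint minimum and range across all channels in this window.
    lo, hi = smoothed.min(), smoothed.max()
    span = (hi - lo) if hi > lo else 1.0
    return (smoothed - lo) / span
```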

Data augmentation was also used to make the neural network more robust to people making gestures at different times or with different speeds. Time-shifted synthetic examples shift training examples left and right to encourage the network to accommodate examples that are not perfectly centered in its classification window. Time-scaled synthetic examples compress or dilate time to simulate slower or faster gestures.
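A minimal sketch of these two augmentations is shown below; the shift and scale ranges are illustrative choices rather than the values used in the original experiments.

```python
import numpy as np

def augment_example(window, max_shift=20, scale_range=(0.8, 1.2), rng=None):
    """Generate a time-shifted and a time-scaled copy of one labeled window,
    shaped (timesteps, channels). The shift range and scale range here are
    illustrative values only."""
    if rng is None:
        rng = np.random.default_rng()
    window = np.asarray(window, dtype=float)
    t = window.shape[0]

    # Time shift: roll the example left or right, repeating the edge sample
    # so the window stays the same length.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.roll(window, shift, axis=0)
    if shift > 0:
        shifted[:shift] = window[0]
    elif shift < 0:
        shifted[shift:] = window[-1]

    # Time scaling: resample the window around its center so the gesture
    # appears slower (scale > 1) or faster (scale < 1); out-of-range times
    # are clamped to the window edges.
    scale = rng.uniform(*scale_range)
    center = (t - 1) / 2.0
    query = np.clip(center + (np.arange(t) - center) / scale, 0, t - 1)
    scaled = np.empty_like(window)
    for ch in range(window.shape[1]):
        scaled[:, ch] = np.interp(query, np.arange(t), window[:, ch])

    return shifted, scaled
```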

The network is a long short-term memory (LSTM) recurrent neural network. Since LSTMs have feedback connections for processing sequences of data, they are well-suited to our task of classifying poses and motions. The network accepts a 200 × 19 feature matrix representing a sequence of strain and accelerometer readings, followed by a single LSTM layer of size 100, a 20% dropout layer, and a dense output layer with softmax activations. The output has 25 classes: the 24 letters and words, plus a baseline class representing that no gesture is being made.
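A sketch of this architecture in Keras is shown below; the library choice, optimizer, and loss are assumptions for illustration (the trained network was ultimately deployed on the microcontroller).

```python
import tensorflow as tf

NUM_TIMESTEPS, NUM_CHANNELS, NUM_CLASSES = 200, 19, 25

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_TIMESTEPS, NUM_CHANNELS)),  # 16 strain + 3 accelerometer channels
    tf.keras.layers.LSTM(100),                             # single LSTM layer of size 100
    tf.keras.layers.Dropout(0.2),                          # 20% dropout
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # 24 gestures + baseline class
])

# Optimizer and loss are assumed; labels are integer class indices.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```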

To train and evaluate the network, we use a leave-one-experiment-out 7-fold cross-validation strategy. All examples from one experiment are used as the test set, such that the network is tested on data from an episode of wearing the glove that did not influence the training at all. Using each experiment as a test set instead of using randomized k-fold cross validation helps avoid data leakage between training and testing sets, since data within a session is likely correlated along such aspects as user behavior or glove properties. The selected procedure aims for a more robust evaluation by simulating the performance that would be expected on a new day of using the glove without network retraining.
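A minimal sketch of this split, assuming NumPy arrays of examples, labels, and per-example experiment indices:

```python
import numpy as np

def leave_one_experiment_out(features, labels, experiment_ids):
    """Yield (train, test) splits in which each recording session serves
    once as the held-out test set while the remaining sessions form the
    training set. Argument names are illustrative."""
    experiment_ids = np.asarray(experiment_ids)
    for held_out in np.unique(experiment_ids):
        test_mask = experiment_ids == held_out
        train_mask = ~test_mask
        yield ((features[train_mask], labels[train_mask]),
               (features[test_mask], labels[test_mask]))
```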

Detect sign language!

Segmented examples: When tested on sessions of using the glove that the network had never seen during training, accuracy averaged 96.3% on segmented examples. Removing either type of data augmentation decreased the accuracy, as did removing either strain sensing or acceleration sensing. This indicates that the network learned to use both pose and motion as desired, and that data augmentation improved reliability.

Streaming classifications: In addition to testing on segmented examples, the network also operated continuously on streaming data at 5 Hz. When a user makes a gesture, the streaming predictions would ideally create a single pulse of correct labels lasting one or more timesteps. In reality, though, the network might output 0, 1, or many predictions while the person is making a single gesture. To assess this, we compare rising edges of the predicted label sequence with the sequence of ground-truth cues. Averaging across all holdout experiments, the filtered network predictions contained a single, correct rising-edge prediction during 91.2% of the cued windows. Multiple predictions, all for the correct gesture, were made during 1.6% of the cues. There were no trials in which only incorrect predictions were made, although 4.1% of the trials had both correct and incorrect predictions. The remaining 3.1% of trials were missed altogether.
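As a rough sketch of how this rising-edge comparison could be implemented (the exact prediction filtering and cue-window bookkeeping are assumptions):

```python
import numpy as np

BASELINE = 0  # assumed index of the "no gesture" class

def rising_edges(predicted_labels):
    """Return (timestep, label) pairs where the streaming prediction
    changes to a new non-baseline label (a rising edge)."""
    edges, prev = [], BASELINE
    for t, label in enumerate(np.asarray(predicted_labels)):
        if label != prev and label != BASELINE:
            edges.append((t, label))
        prev = label
    return edges

def score_cue(edges, cue_start, cue_end, cue_label):
    """Classify one cued window by the rising edges that fall inside it:
    a single correct edge, multiple correct edges, a mix of correct and
    incorrect edges, only incorrect edges, or a miss."""
    inside = [label for t, label in edges if cue_start <= t < cue_end]
    if not inside:
        return "missed"
    correct = [label for label in inside if label == cue_label]
    if len(correct) == len(inside):
        return "single correct" if len(inside) == 1 else "multiple correct"
    return "mixed" if correct else "only incorrect"
```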

The results are promising for successful real-time gesture detection, and indicate that the network was successfully deployed on a small wearable microcontroller to create a self-contained smart wearable system. Future experiments can further investigate the performance using more users, longer durations, and more gestures.

Conference Presentation: International Conference on Intelligent Robots and Systems (IROS 2022)

Publications

  • J. DelPreto, J. Hughes, M. D’Aria, M. de Fazio, and D. Rus, “A Wearable Smart Glove and Its Application of Pose and Gesture Detection to Sign Language Classification,” IEEE Robotics and Automation Letters (RA-L), vol. 7, iss. 4, 2022. doi:10.1109/LRA.2022.3191232

    Advances in soft sensors coupled with machine learning are enabling increasingly capable wearable systems. Since hand motion in particular can convey useful information for developing intuitive interfaces, glove-based systems can have a significant impact on many application areas. A key remaining challenge for wearables is to capture, process, and analyze data from the high-degree-of-freedom hand in real time. We propose using a commercially available conductive knit to create an unobtrusive network of resistive sensors that spans all hand joints, coupling this with an accelerometer, and deploying machine learning on a low-profile microcontroller to process and classify data. This yields a self-contained wearable device with rich sensing capabilities for hand pose and orientation, low fabrication time, and embedded activity prediction. To demonstrate its capabilities, we use it to detect static poses and dynamic gestures from American Sign Language (ASL). By pre-training a long short-term memory (LSTM) neural network and using tools to deploy it in an embedded context, the glove and an ST microcontroller can classify 12 ASL letters and 12 ASL words in real time. Using a leave-one-experiment-out cross validation methodology, networks successfully classify 96.3% of segmented examples and generate correct rolling predictions during 92.8% of real-time streaming trials.

    @article{delpretoHughes2022smartGlove,
    title={A Wearable Smart Glove and Its Application of Pose and Gesture Detection to Sign Language Classification},
    author={DelPreto, Joseph and Hughes, Josie and D'Aria, Matteo and de Fazio, Marco and Rus, Daniela},
    journal={IEEE Robotics and Automation Letters (RA-L)},
    organization={IEEE},
    year={2022},
    month={October},
    volume={7},
    number={4},
    doi={10.1109/LRA.2022.3191232},
    url={https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9830849},
    abstract={Advances in soft sensors coupled with machine learning are enabling increasingly capable wearable systems. Since hand motion in particular can convey useful information for developing intuitive interfaces, glove-based systems can have a significant impact on many application areas. A key remaining challenge for wearables is to capture, process, and analyze data from the high-degree-of-freedom hand in real time. We propose using a commercially available conductive knit to create an unobtrusive network of resistive sensors that spans all hand joints, coupling this with an accelerometer, and deploying machine learning on a low-profile microcontroller to process and classify data. This yields a self-contained wearable device with rich sensing capabilities for hand pose and orientation, low fabrication time, and embedded activity prediction. To demonstrate its capabilities, we use it to detect static poses and dynamic gestures from American Sign Language (ASL). By pre-training a long short-term memory (LSTM) neural network and using tools to deploy it in an embedded context, the glove and an ST microcontroller can classify 12 ASL letters and 12 ASL words in real time. Using a leave-one-experiment-out cross validation methodology, networks successfully classify 96.3% of segmented examples and generate correct rolling predictions during 92.8% of real-time streaming trials.}
    }
