A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction - UTU Research Information System

A2 Refereed review article in a scientific journal

A review on deep learning for vision-based hand detection, hand segmentation and hand gesture recognition in human-robot interaction

Authors: Jalayer, Reza; Jalayer, Masoud; Orsenigo, Carlotta; Tomizuka, Masayoshi

Publisher: PERGAMON-ELSEVIER SCIENCE LTD

Publication year: 2026

Journal: Robotics and Computer-Integrated Manufacturing

Article number: 103110

Volume: 97

ISSN: 0736-5845

eISSN: 1879-2537

DOI: https://doi.org/10.1016/j.rcim.2025.103110

Open access status at the time of recording: Openly available

Open access status of the publication channel: Partially open publication channel

Web address: https://doi.org/10.1016/j.rcim.2025.103110

Self-archived copy's web address: https://research.utu.fi/converis/portal/detail/Publication/500332936

License of the self-archived copy: CC BY

Version of the self-archived publication: Publisher's version

Abstract

Hand-based analysis, including hand detection, segmentation, and gesture recognition, plays a pivotal role in enabling natural and intuitive human-robot interaction (HRI). Recent advances in vision-based deep learning (DL) have significantly improved robots' ability to interpret hand cues across diverse settings. However, previous reviews have not addressed all three tasks collectively or focused on recent DL architectures. Filling this gap, we review recent studies at the intersection of DL and hand-based interaction in HRI. We structure the literature around three core tasks, i.e. hand detection, segmentation, and gesture recognition, highlighting DL models, dataset characteristics, evaluation metrics, and key challenges for each. We further examine the application of these models across industrial, assistive, social, aerial, and space robotics domains. We identify the dominant role of Convolutional and Recurrent Neural Networks (CNNs and RNNs), as well as emerging approaches such as attention-based models (Transformers), uncertainty-aware models, Graph Neural Networks (GNNs), and foundation models, i.e. Vision-Language Models (VLMs) and Large Language Models (LLMs). Our analysis reveals gaps, including the scarcity of HRI-specific datasets, underrepresentation of multi-hand and multi-user scenarios, limited use of RGBD and multi-modal inputs, weak cross-dataset generalization, and inconsistent real-time benchmarking. Dynamic and long-range gestures, multi-view setups, and context-aware understanding also remain relatively underexplored. Despite these limitations, promising directions have emerged, such as multi-modal fusion, use of foundation models for intent reasoning, and the development of lightweight architectures for deployment. This review offers a consolidated foundation to support future research on robust and context-aware DL systems for hand-centric HRI.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

1-s2.0-S0736584525001644-main.pdf

Funding information in the publication:
The present study has been developed within the HumanTech Project, which is financed by the Italian Ministry of University and Research (MUR) for the 2023–2027 period as part of the ministerial initiative “Departments of Excellence” (L. 232/2016).