HJFormer: Hand Joints Directed Transformer for 3D Hand Pose Estimation
MUST JOURNAL OF RESEARCH AND DEVELOPMENT,
Vol. 2 No. 4 (2023)
Abstract
Hand pose estimation (HPE) is a challenging structured data sequence modelling task. Most existing HPE methods consider only individual joint regression heatmaps, which require a strong hand detector to localise and recognise the key hand joints. This causes 3D pose estimation to fail in difficult cases such as overlapping joints and fast-changing poses, because hand detectors cannot exploit fine-grained hand joint priors during pose estimation. The detector also adds a computational burden to the system. This paper presents HJFormer, which coherently exploits high-order joints and their relevance to improve performance. HJFormer eliminates the need to fit a hand detector at the input pipeline, a step that requires both carefully designed fitting functions and complex algorithms. Instead, a modernised CNN such as a Residual Network (ResNet) serves as the backbone of the transformer-based architecture, speeding it up by encoding each input depth image into an individual hand frame embedding. To recover full hand poses, a transformer module is trained to learn temporal joint features across frames and then uses this information to predict the 2D hand joint locations. The estimated 2D joints and their mutual dependencies are then regressed to 3D poses by a regression layer to obtain the final hand pose. We compare our method with other CNN-based 3D hand pose estimation methods trained on the ICVL, NYU, and MSRA datasets. The results show that the proposed method achieves superior performance, with an average proportion of about 92% of correct frames at an error threshold of 40 mm and an average error distance of about 7.5 mm across the ICVL, MSRA, and NYU datasets. This suggests that our method can achieve state-of-the-art accuracy using less computational power than traditional hand detector-based systems.
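The abstract reports two figures, an average error distance and a proportion of correct frames at a 40 mm threshold, without defining them. The sketch below shows one common way such metrics are computed, under two assumptions that are not stated in the abstract: the error distance is the mean per-joint Euclidean distance in millimetres, and a frame counts as correct when its worst joint error is within the threshold. The function names and the toy data are hypothetical and only illustrate the arithmetic.

```python
import math

def joint_errors(pred, truth):
    """Per-joint Euclidean distances (mm) for a single frame."""
    return [math.dist(p, t) for p, t in zip(pred, truth)]

def mean_joint_error(pred_frames, truth_frames):
    """Mean distance over all joints of all frames (assumed definition)."""
    errors = [e
              for p, t in zip(pred_frames, truth_frames)
              for e in joint_errors(p, t)]
    return sum(errors) / len(errors)

def fraction_correct_frames(pred_frames, truth_frames, threshold_mm=40.0):
    """Share of frames whose worst joint error is within the threshold
    (assumed definition of a 'correct frame')."""
    good = sum(1
               for p, t in zip(pred_frames, truth_frames)
               if max(joint_errors(p, t)) <= threshold_mm)
    return good / len(pred_frames)

# Toy data: 2 frames, 2 joints each, coordinates in mm (not from the paper).
truth = [[(0, 0, 0), (10, 0, 0)],
         [(0, 0, 0), (10, 0, 0)]]
pred = [[(3, 4, 0), (10, 0, 0)],    # joint errors: 5 mm and 0 mm
        [(0, 0, 50), (10, 0, 0)]]   # joint errors: 50 mm and 0 mm

print(mean_joint_error(pred, truth))         # 13.75
print(fraction_correct_frames(pred, truth))  # 0.5
```

Under the worst-joint convention a single badly placed joint fails the whole frame, so the second toy frame is rejected even though its other joint is exact.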