Figure 1: Evaluated tasks. For each scenario, the goal is to infer the 3D locations of the 21 hand joints from a depth image. In the single-frame pose estimation task (left) and the interaction task (right), each frame is annotated with a bounding box. In the tracking task (middle), only the first frame of each sequence is fully annotated.
Related work. Public benchmarks and challenges in other areas, such as ImageNet [35] for image classification and object detection, PASCAL [9] for semantic and object segmentation, and the VOT challenge [19] for object tracking, have been instrumental in driving progress in their respective fields. In the area of hand tracking, the 2007 review by Erol et al. [8] proposed a taxonomy of approaches. Learning-based approaches have proven effective for single-frame pose estimation, optionally in combination with hand model fitting for higher precision, e.g., [50]. The review by Supancic et al. [44] compared 13 methods on a new dataset and concluded that deep models are well suited to pose estimation. It also highlighted the need for large-scale training sets to train models that generalize well. In this paper we extend the scope of previous analyses by comparing deep learning methods on a large-scale dataset and carrying out a fine-grained analysis of error sources and design choices.
2. Evaluation tasks
We evaluate three different tasks on a dataset containing
over a million annotated images using standardized evalu-
ation protocols. Benchmark images are sampled from two
datasets: BigHand2.2M [59] and First-Person Hand Action
dataset (FHAD) [10]. Images from BigHand2.2M cover a
large range of hand viewpoints (including third-person and
first-person views), articulated poses, and hand shapes. Se-
quences from the FHAD dataset are used to evaluate pose
estimation during hand-object interaction. Both datasets
contain 640 × 480-pixel depth maps with 21 joint anno-
tations, obtained from magnetic sensors and inverse kine-
matics. The 2D bounding boxes have an average diagonal
length of 162.4 pixels with a standard deviation of 40.7 pix-
els. The evaluation tasks are 3D single hand pose estima-
tion, i.e., estimating the 3D locations of 21 joints, from (1)
individual frames, (2) video sequences, given the pose in
the first frame, and (3) frames with object interaction, e.g.,
with a juice bottle, a salt shaker, or a milk carton. See Fig-
ure 1 for an overview. Bounding boxes are provided as input
for tasks (1) and (3). The training data is sampled from the BigHand2.2M dataset; only the interaction task uses test data from the FHAD dataset. See Table 1 for dataset sizes
and the number of total and unseen subjects for each task.
Number of            Train    Test (single)   Test (track)   Test (interact)
frames               957K     295K            294K           2,965
subjects (unseen)    5        10 (5)          10 (5)         2 (0)

Table 1: Dataset sizes and number of subjects.
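To make the per-frame data format concrete, the following minimal sketch shows one way a benchmark sample could be represented in Python with NumPy; the class name, field names, and the millimeter depth unit are illustrative assumptions and do not correspond to the released file format.

import numpy as np
from dataclasses import dataclass
from typing import Optional

@dataclass
class HandFrame:
    """One benchmark frame (illustrative structure, not the released format)."""
    depth: np.ndarray                # (480, 640) depth map, assumed to be in millimeters
    bbox: Optional[np.ndarray]       # (4,) 2D hand bounding box; given as input for tasks (1) and (3)
    joints_3d: Optional[np.ndarray]  # (21, 3) ground-truth joint locations; at test time the tracking
                                     # task provides these only for the first frame of each sequence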
3. Evaluated methods
We evaluate the top 10 of the 17 participating methods [58]. Table 2 lists the methods together with some of their key properties. We also indirectly evaluate DeepPrior [29] and REN [15], which are components of rvhand [1], as well as DeepModel [61], which is the backbone of LSL [20]. We group the methods according to the following design choices.
2D CNN vs. 3D CNN. 2D CNNs have been popular for 3D hand pose estimation [1, 3, 14, 15, 20, 21, 23, 29, 57, 61]. Common pre-processing steps include cropping and resizing the hand region and normalizing the depth values to [-1, 1]. Recently, several methods have used a 3D CNN [12, 24, 56], where the input can be a 3D voxel grid [24, 56] or a projective D-TSDF volume [12]. Ge et al. [13] project the depth image onto three orthogonal planes, train a 2D CNN for each projection, and fuse the results. In [12], they propose a 3D CNN by replacing the 2D projections with a 3D volumetric representation (projective D-TSDF volumes [40]). In the HIM2017 challenge [58], they apply a 3D deep learning method [11] that takes 3D points and surface normals as input. Moon et al. [24] propose a 3D CNN that estimates per-voxel likelihoods for each hand joint. NAIST RV [56] proposes a 3D CNN with a hierarchical branch structure, where the input is a 50³-voxel grid.
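To illustrate the pre-processing steps mentioned above, the sketch below crops the hand region given a bounding box and maps depth values to [-1, 1] for a 2D CNN input, and scatters the hand points into a 50³ occupancy grid for a 3D CNN input. This is a minimal sketch assuming depth in millimeters; the bounding-box convention, crop resolution, and cube size are illustrative choices, not taken from any of the evaluated methods.

import numpy as np

def crop_and_normalize(depth, bbox, crop_size=128, depth_range_mm=300.0):
    """Crop the hand region and normalize depth to [-1, 1] (assumed conventions:
    depth is an HxW array in millimeters, bbox = (x_min, y_min, x_max, y_max) in pixels)."""
    x0, y0, x1, y1 = [int(round(v)) for v in bbox]
    patch = depth[y0:y1, x0:x1].astype(np.float32)
    # Center depth around the median hand depth and clip to a fixed range around it.
    center_d = np.median(patch[patch > 0]) if np.any(patch > 0) else 0.0
    patch = np.clip(patch - center_d, -depth_range_mm, depth_range_mm)
    patch /= depth_range_mm                      # values now lie in [-1, 1]
    # Simple nearest-neighbor resize to the network input resolution.
    ys = np.linspace(0, patch.shape[0] - 1, crop_size).astype(int)
    xs = np.linspace(0, patch.shape[1] - 1, crop_size).astype(int)
    return patch[np.ix_(ys, xs)]

def voxelize(points_mm, center_mm, grid=50, cube_side_mm=300.0):
    """Scatter 3D hand points (N, 3) into a grid^3 occupancy volume centered on the hand."""
    rel = (points_mm - center_mm) / cube_side_mm          # roughly in [-0.5, 0.5]
    idx = np.floor((rel + 0.5) * grid).astype(int)
    keep = np.all((idx >= 0) & (idx < grid), axis=1)
    volume = np.zeros((grid, grid, grid), dtype=np.float32)
    volume[tuple(idx[keep].T)] = 1.0
    return volume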
Detection-based vs. Regression-based. Detection-
based methods [23, 24] produce a probability density map
for each joint. RCN-3D [23] is an RCN+ network [17], based on Recombinator Networks (RCN) [18], with 17 layers and 64 output feature maps in all layers except the last, which outputs a probability density map for each of the 21 joints. V2V-PoseNet [24] uses a 3D CNN to estimate the per-voxel likelihood of each joint, and a separate CNN to estimate the hand's center of mass from the cropped depth
map. For training, 3D likelihood volumes are generated by
placing normal distributions at the locations of hand joints.
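As a sketch of how such training targets can be generated, the snippet below places a normal distribution at each joint location inside a per-joint 3D likelihood volume; the grid resolution and standard deviation are assumed values, not those used by V2V-PoseNet.

import numpy as np

def joint_likelihood_volumes(joints_vox, grid=44, sigma=1.7):
    """Per-joint Gaussian likelihood volumes of shape (J, grid, grid, grid).
    joints_vox holds (J, 3) joint coordinates (x, y, z) already expressed in voxel units;
    the volume axes are ordered (z, y, x)."""
    axis = np.arange(grid, dtype=np.float32)
    zz, yy, xx = np.meshgrid(axis, axis, axis, indexing="ij")
    volumes = np.empty((len(joints_vox), grid, grid, grid), dtype=np.float32)
    for j, (x, y, z) in enumerate(joints_vox):
        sq_dist = (xx - x) ** 2 + (yy - y) ** 2 + (zz - z) ** 2
        volumes[j] = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return volumes

At test time, detection-based methods typically read off each joint location as the (soft-)argmax of the corresponding predicted map or volume.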
Regression-based methods [1, 3, 11, 14, 20, 21, 29, 56]