5th International Workshop on

Observing and Understanding Hands in Action

New! A paper summarising and analysing the results of the HANDS 2019 challenge was published at ECCV'20 and is now available here. If you wish to reference this challenge, please find the BibTeX at the end of the page!

Abstract

We present the HANDS19 Challenge, a public competition designed to evaluate the task of 3D hand pose estimation in both depth and colour modalities, in the presence and absence of objects. The main goals of this challenge are to assess the performance of state-of-the-art approaches in terms of interpolation and extrapolation capabilities along the four main axes of hand variation (shapes, articulations, viewpoints, objects), and the use of synthetic data to fill the gaps of current datasets along these axes. The challenge builds on the recent BigHand2.2M, F-PHAB and HO-3D datasets, which were designed to exhaustively cover multiple hand shapes, viewpoints, articulations and both self-occlusion and occlusion from objects, using both depth and RGB cameras. Despite being the most exhaustive available datasets for their respective tasks, they lack full coverage of hand variability. In order to fill these gaps, parameters of a fitted hand model (MANO) and a toolkit to synthesize data are provided to participants. Training and test splits are carefully designed to study the interpolation and extrapolation capabilities of participants' techniques along the aforementioned axes and the potential benefit of using such synthetic data. The challenge consists of a standardized dataset, an evaluation protocol for three different tasks and a public competition. Participating methods will be analyzed and ranked according to their performance along the mentioned axes. Winners and prizes will be announced and awarded during the workshop, and results will be disseminated in a subsequent challenge publication.

Challenge overview

In each task, the aim is to predict the 3D locations of the 21 hand joints for each given image (details on the annotation format below). During training, both hand pose annotations and fitted MANO parameters are provided for each image. At inference time, only depth/RGB images and hand bounding boxes are provided.

  • Task 1: Depth-Based 3D Hand Pose Estimation. This task builds on the BigHand2.2M dataset in a format similar to the HANDS 2017 challenge. Some hand shapes, articulations and viewpoints are strategically excluded from the training set in order to measure the interpolation and extrapolation capabilities of submissions. No objects appear in this task. Hands appear in both third-person and egocentric viewpoints.

  • Task 2: Depth-Based 3D Hand Pose Estimation while Interacting with Objects. This task builds on the F-PHAB dataset. Objects appear being manipulated by a subject from an egocentric viewpoint. Some hand shapes and objects are strategically excluded from the training set in order to measure the interpolation and extrapolation capabilities of submissions.

  • Task 3: RGB-Based 3D Hand Pose Estimation while Interacting with Objects. This task builds on the HO-3D dataset. Objects appear being manipulated by a subject from a third-person viewpoint. Some hand shapes and objects are strategically excluded from the training set in order to measure the interpolation and extrapolation capabilities of submissions.

Task 1: Depth-Based 3D Hand Pose Estimation

This task builds on the BigHand2.2M dataset in a format similar to the HANDS 2017 challenge. Hands appear in both third-person and egocentric viewpoints. No objects are present in this task.

  • Training set: Contains images from 5 different subjects. Some hand articulations and viewpoints are strategically excluded.

  • Test set: Contains images from 10 different subjects. 5 subjects overlap with the training set. Exhaustive coverage of viewpoints and articulations.

  • The following performance scores (reported as mean joint error; a sketch of this metric is given at the end of this list) will be evaluated:

    • Interpolation (INTERP.): performance on test samples whose shapes, viewpoints and articulations are present in the training set.

    • Extrapolation:

      • Total (EXTRAP.): performance on test samples that have hand shapes, viewpoints and articulations not present in the training set.

      • Shape (SHAPE): performance on test samples that have hand shapes not present in the training set. Viewpoints and articulations are present in the training set.

      • Articulation (ARTIC.): performance on test samples that have articulations not present in the training set. Shapes and viewpoints are present in the training set.

      • Viewpoint (VIEWP.): performance on test samples that have viewpoints not present in the training set. Shapes and articulations are present in the training set. Viewpoint is defined as the elevation and azimuth angles of the hand with respect to the camera. Both angles are analyzed independently.

  • Images are captured with an Intel RealSense SR300 camera at 640 × 480 pixel resolution.

  • Use of training data from HANDS 2017 challenge is not allowed as some images may overlap with the test set.

  • Use of other labelled datasets (either real or synthetic) is not allowed. Use of fitted MANO model for synthesizing data is encouraged. Use of external unlabelled data is allowed (self-supervised and unsupervised methods).

  • Task 1 Submission website
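
A minimal sketch of the mean joint error metric used for all scores above: the Euclidean distance between predicted and ground-truth joints, averaged over the 21 joints and over all frames of the corresponding test subset. The function and array names below are our own and are not part of the official evaluation code.

    import numpy as np

    def mean_joint_error(pred, gt):
        """Mean Euclidean distance (mm) between predicted and ground-truth joints.

        pred, gt: arrays of shape (num_frames, 21, 3) in real-world coordinates (mm).
        """
        # Per-joint Euclidean distance, averaged over joints and frames.
        return np.linalg.norm(pred - gt, axis=-1).mean()

    # Illustration with random values only:
    pred = np.random.rand(100, 21, 3) * 100.0
    gt = np.random.rand(100, 21, 3) * 100.0
    print("mean joint error (mm): %.2f" % mean_joint_error(pred, gt))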

Task 2: Depth-Based 3D Hand Pose Estimation while Interacting with Objects

This task builds on F-PHAB dataset with the following characteristics:

  • Hands appear in egocentric viewpoint and interacting with different objects.

  • Training set: Contains images from 4 different subjects performing 45 different actions involving 26 different objects. These frames come from the original F-PHAB dataset release.

  • Test set: Contains images from 4 different subjects performing 71 different actions involving 37 different objects. Some actions (45), objects (26) and subjects (2) overlap with the training set. Some frames appear in the original F-PHAB dataset release and some are unreleased (new) frames.

  • The following performance scores (as mean joint error) will be evaluated:

    • Interpolation (INTERP.): performance on test samples that have hand shape and actions present in the training set.

    • Extrapolation:

      • Total (EXTRAP.): performance on test samples that have hand shapes and actions not present in the training set.

      • Shape (SHAPE): performance on test samples that have hand shapes not present in the training set. Actions are present in the training set.

      • Object (OBJECT): performance on test samples involving actions not present in the training set. Hand shapes are present in the training set.

  • Use of object and MANO models for synthesizing data is encouraged. 6D object pose is available (in training) for a subset of 10 actions.

  • Images are captured with an Intel RealSense SR300 camera at 640 × 480 pixel resolution.

  • Use of the full F-PHAB dataset as training data is not allowed, as some images may overlap with the test set.

  • Use of other labelled datasets (either real or synthetic) is not allowed. Use of fitted MANO model for synthesizing data is encouraged. Use of external unlabelled data is allowed (self-supervised and unsupervised methods).

Task 3: RGB-Based 3D Hand Pose Estimation while Interacting with Objects

This task builds on HO-3D dataset with the following characteristics:

  • Hands appear in a third-person viewpoint, interacting with different objects in continuous sequences. The subject moves the object in the scene, changing the relative viewpoint between the hand and the camera, while the relative position between the hand and the object does not change during the sequence.

  • Training set: Contains images from 3 different subjects manipulating 4 different objects. In total there are 12 sequences.

  • Test set: Contains images from 5 different subjects (2 appear in the training set) manipulating 6 different objects (4 appear in the training set). In total there are 5 full sequences and some frames sampled from the training sequences.

  • The following performance scores (as mean joint error) will be evaluated:

    • Interpolation (INTERP.): performance on test frames sampled from the training sequences (these particular frames are not present in the training set).

    • Extrapolation:

      • Total (EXTRAP.): performance on test samples that have hand shapes and objects not present in the training set.

      • Shape (SHAPE): performance on test samples that have hand shapes not present in the training set. Objects are present in the training set.

      • Object (OBJECT): performance on test samples involving objects not present in the training set. Hand shapes are present in the training set.

  • Use of object and MANO models for synthesizing data is encouraged. 6D object pose is available (in training) for all images.

  • Both RGB and depth images are available for training. Only RGB images are available in the test set.

  • Images are captured with an Intel RealSense SR300 camera at 640 × 480 pixel resolution.

  • Use of other labelled datasets (either real or synthetic) is not allowed. Use of fitted MANO model for synthesizing data is encouraged. Use of external unlabelled data is allowed (self-supervised and unsupervised methods).

Synthetic data using MANO model

(Videos: output_task1.avi and output_task2.avi, examples of synthetic data for Task 1 and Task 2.)

Participants are encouraged to generate and use synthetic data for the tasks. For this purpose, we make use of the MANO model, a parametric deformable 3D hand model. Participants are required to obtain the right-hand MANO model (MANO_RIGHT.pkl) by registering on the MANO page.

  • Parameters of the model and tools (Python scripts) to create synthetic representations of the real data are provided in each task for the participants.

  • For Task 1 and Task 2, the MANO parameters are learned by fitting the model to the provided real training images in an optimization framework.

    • Our fitting method uses gradient-based optimization. Starting from a canonical pose of the MANO model, all model parameters (camera, global rotation, shape and articulation) are updated iteratively. The loss function consists of i) the Euclidean distance between the 3D ground-truth skeletons of the real data and the 3D skeletons of the MANO model under the current parameter setting (the MCP joints are placed slightly differently in the ground-truth annotation and the MANO model; we did not account for this in our optimization); ii) a shape regularizer enforcing the shape parameters to stay close to the MANO model's mean values, normalized to 0, to maximize the shape likelihood; and iii) a Laplacian regularizer to obtain smooth mesh surfaces. A schematic sketch of this optimization is given after this list.

    • Fitting quality: We measure the 2D and 3D MSE joint errors between the fitted model and the ground-truth annotations. On average, we measured 6.44 pixel and 11.02 mm errors for Task 1, and 9.06 pixel and 11.39 mm for Task 2. Participants are encouraged to improve on these errors!

  • For the fitting method used in Task 3, please refer to the HO-3D dataset.
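
The sketch below illustrates the kind of per-frame fitting loop described above: a data term on the 3D joints plus a shape regularizer, minimized with a gradient-based optimizer. The MANO forward pass is replaced by a stub (mano_joints) so the example stays self-contained; in practice a differentiable MANO implementation would be used, and the Laplacian mesh regularizer (omitted here) would be added on the vertices. All names and hyperparameters are our own assumptions, not the official fitting code.

    import torch

    def mano_joints(cam, quat, artic, shape):
        # Stand-in for a differentiable MANO layer returning (21, 3) joints in mm.
        # Replace with a real MANO implementation; this stub only keeps the
        # example runnable and differentiable.
        params = torch.cat([cam, quat, artic, shape])
        return params.sum() * torch.ones(21, 3)

    def fit_frame(gt_joints, num_iters=200, shape_weight=1e-3):
        cam = torch.zeros(4, requires_grad=True)                        # scale + translation
        quat = torch.tensor([1.0, 0.0, 0.0, 0.0], requires_grad=True)   # global rotation
        artic = torch.zeros(45, requires_grad=True)                     # articulation
        shape = torch.zeros(10, requires_grad=True)                     # shape (0 = MANO mean)
        opt = torch.optim.Adam([cam, quat, artic, shape], lr=1e-2)
        for _ in range(num_iters):
            opt.zero_grad()
            joints = mano_joints(cam, quat, artic, shape)
            # i) Euclidean distance between model joints and ground-truth skeleton.
            data_term = ((joints - gt_joints) ** 2).sum(dim=-1).sqrt().mean()
            # ii) Shape regularizer keeping shape parameters close to the mean (0).
            shape_term = shape_weight * (shape ** 2).sum()
            loss = data_term + shape_term  # iii) + Laplacian term on the mesh (omitted)
            loss.backward()
            opt.step()
        return cam.detach(), quat.detach(), artic.detach(), shape.detach()

    gt = torch.rand(21, 3) * 100.0  # dummy ground-truth skeleton in mm
    fit_frame(gt)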

Visualization Toolbox

A public repository is provided for getting started. Participants can use the source codes for visualization of the dataset with annotations and renderings of the MANO parameters. You can access the toolbox here.

Participation

  • Submission deadline is 9th October 2019 (extended)

  • To participate and download the dataset please fill this form and accept the terms and conditions.

  • Submissions are processed through CodaLab (read instructions below):

  • In order for participants to be eligible for competition prizes and be included in the official rankings (to be presented during the workshop and in subsequent publications), information about their submission must be provided to the organizers. Information may include, but is not limited to, details on their method, their use of synthetic and real data, and architecture and training details. Check the previous challenge publication to get an idea of the information needed.

  • Winning teams may be asked to provide their source code so that their results can be reproduced; if requested by the participants, the code will be handled under strict confidentiality rules.

  • For each submission, participants must keep the parameters of their method constant across all testing data for a given task.

  • Use of object models and MANO parameters is encouraged to synthesize new data. Note that the use of such techniques must be reported in detail when submitting in order to help comparison among submissions.

Joint annotation model and data format

  • The annotation file is a text file; each line is the annotation for one frame and is formatted as follows:

    • Each line has 64 items; the first item is the frame name.

    • The remaining 63 items are the [x y z] values of the 21 joints in real-world coordinates (mm).

  • Please note that the annotations of Task 3 (HO-3D dataset) assume the OpenGL coordinate system, i.e., hands/objects lie along the negative z-axis in a right-handed coordinate system with its origin at the camera optical center.

  • Depth values in Task 1 and Task 2 images are in millimetre scale, whereas the depth images of Task 3 are in metre scale.

The joints are ordered in this way: [Wrist, TMCP, IMCP, MMCP, RMCP, PMCP, TPIP, TDIP, TTIP, IPIP, IDIP, ITIP, MPIP, MDIP, MTIP, RPIP, RDIP, RTIP, PPIP, PDIP, PTIP], where 'T', 'I', 'M', 'R', 'P' denote the 'Thumb', 'Index', 'Middle', 'Ring' and 'Pinky' fingers, and 'MCP', 'PIP', 'DIP', 'TIP' denote the joints along each finger, as shown in the following figure:
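
A minimal sketch for reading such an annotation file into per-frame (21, 3) joint arrays, using the joint order above. Whitespace-separated fields are assumed from the format description; the function name is our own.

    import numpy as np

    JOINT_NAMES = [
        "Wrist", "TMCP", "IMCP", "MMCP", "RMCP", "PMCP",
        "TPIP", "TDIP", "TTIP", "IPIP", "IDIP", "ITIP",
        "MPIP", "MDIP", "MTIP", "RPIP", "RDIP", "RTIP",
        "PPIP", "PDIP", "PTIP",
    ]

    def load_joint_annotations(path):
        """Return {frame_name: (21, 3) array of [x y z] joints in mm}."""
        annotations = {}
        with open(path) as f:
            for line in f:
                items = line.split()
                if len(items) != 64:  # frame name + 21 joints * 3 coordinates
                    continue
                frame = items[0]
                joints = np.array(items[1:], dtype=np.float32).reshape(21, 3)
                annotations[frame] = joints
        return annotations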

MANO annotation model, data format and rendering:

  • Task 1 and Task 2: The annotation file is a tab-delimited text file; each line consists of the MANO parameters for one frame and is formatted as follows:

    • Each line has 64 items; the first item is the frame name.

    • The remaining 63 items represent the MANO parameters c, q, a and s (a minimal parsing sketch is given after this list):

        • c: 4 dimensional camera parameters (scale + translation along three camera axes),

        • q: 4 dimensional quaternion representation for the global rotation,

        • a: 45 dimensional articulation parameters,

        • s: 10 dimensional shape parameters.

  • Task 3: The annotation file is a tab-delimited text file; each line consists of the MANO parameters for one frame and is formatted as follows:

    • Each line has 62 items; the first item is the frame name.

    • The remaining 61 items represent the MANO parameters c, r, a and s:

        • c: 3 dimensional camera parameters (translation along three camera axes),

        • r: 3 dimensional Euler angle representation (in radians) for the global rotation,

        • a: 45 dimensional articulation parameters,

        • s: 10 dimensional shape parameters.

  • We provide python scripts for the participants to easily render the MANO model with the given parameters. Please refer to the inline source code comments under "src" for the usage details.

    • Usage: python src/visualize_task1_task2.py --task-id=1 --frame-id=0 --use-mano --mano-model-path=./MANO_RIGHT.pkl

python src/visualize_task3.py --frame-id=0 --mano-model-path=./MANO_RIGHT.pkl
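
A minimal sketch for splitting a MANO annotation line into its components, assuming the tab-delimited layout and the parameter ordering listed above (the slicing offsets follow the stated dimensionalities; verify against the provided scripts):

    import numpy as np

    def parse_mano_line_task12(line):
        """Task 1/2 line: frame name + 63 values = c(4) + q(4) + a(45) + s(10)."""
        items = line.rstrip("\n").split("\t")
        frame = items[0]
        vals = np.array(items[1:], dtype=np.float32)
        assert vals.size == 63
        cam, quat, artic, shape = vals[:4], vals[4:8], vals[8:53], vals[53:]
        return frame, cam, quat, artic, shape

    def parse_mano_line_task3(line):
        """Task 3 line: frame name + 61 values = c(3) + r(3) + a(45) + s(10)."""
        items = line.rstrip("\n").split("\t")
        frame = items[0]
        vals = np.array(items[1:], dtype=np.float32)
        assert vals.size == 61
        cam, euler, artic, shape = vals[:3], vals[3:6], vals[6:51], vals[51:]
        return frame, cam, euler, artic, shape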

6D Object Pose and Annotations:

  • The annotation file is a tab-delimited text file; each line consists of the object's global rotation and translation for one frame and is formatted as follows:

    • Each line has 8 items; the first item is the frame name and the second item is the object id.

    • The remaining 6 items represent the global rotation as Euler angles (yaw, pitch, roll) and the global translation along the (x, y, z) axes (a sketch of applying such a pose to the bounding-box corners is given at the end of this list).

  • We provide a corners file for each object model, containing the 8 corners of the object's 3D bounding box. You can further download the object meshes for the corresponding object ids in the training set from here.

  • Note that all annotations of the HO-3D dataset assume the OpenGL coordinate system, i.e., hands/objects lie along the negative z-axis in a right-handed coordinate system with its origin at the camera optical center.

  • You can visualize the object poses with: python src/visualize_task3.py --frame-id=0
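
As an illustration, the sketch below turns an Euler-angle rotation and a translation into a rigid transform and applies it to the provided bounding-box corners, yielding corners in the (OpenGL-convention) camera frame. The Euler composition order used here is an assumption; check src/visualize_task3.py for the convention actually used before relying on it.

    import numpy as np

    def euler_to_matrix(yaw, pitch, roll):
        """Rotation matrix from (yaw, pitch, roll) in radians.
        NOTE: the composition order Rz @ Ry @ Rx below is an assumption."""
        cy, sy = np.cos(yaw), np.sin(yaw)
        cp, sp = np.cos(pitch), np.sin(pitch)
        cr, sr = np.cos(roll), np.sin(roll)
        Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
        Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
        Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
        return Rz @ Ry @ Rx

    def transform_corners(corners, euler_angles, translation):
        """corners: (8, 3) object-space bounding-box corners.
        Returns the corners in camera coordinates (OpenGL convention, -z in front)."""
        R = euler_to_matrix(*euler_angles)
        return corners @ R.T + np.asarray(translation)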

Results submission and leaderboard

Submissions will be automatically processed using the CodaLab service. Participants can upload their results and obtain the different scores for their submission (i.e. interpolation and extrapolation). During the submission period, the leaderboard order will be based on the average joint error discussed above, computed on the total extrapolation test subset. However, the final awards and leaderboard will be disclosed during the workshop and will depend on a fine-grained analysis of performance on the different interpolation and extrapolation aspects mentioned above. The challenge organizers reserve the right to disqualify participants who do not follow the participation rules stated above.

    • Task 1 Codalab submission website.

    • Task 2 Codalab submission website.

    • Task 3 Codalab submission website.

How to prepare your submission

Submissions for each task are treated independently on CodaLab. You can choose to participate in all tasks or only in one; however, you only need to register once on CodaLab. If you want to try significantly different approaches, you can register multiple times.

  1. Generate a file containing the results of your algorithm in .txt format and name it ‘result.txt’. Your submission file must look like this. The format is the same as in the provided annotations discussed above (a minimal writer sketch is given after these steps).

  2. Compress your .txt file into .zip format. Make sure no extra directories are created within the zip; for instance, use something like: zip -j result.txt.zip result.txt. The -j flag prevents extra directories from being created within the zip. If you choose to participate in several tasks, you will have to generate a separate .zip file for each and submit it to its respective task.
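
A minimal sketch for writing and packaging a submission, assuming predictions is a mapping from frame name to a (21, 3) array in the same joint order and units as the training annotations. The field separator and number formatting below are our own assumptions, so compare the output against the example submission file above; the zipping step mirrors the zip -j command.

    import zipfile

    def write_submission(predictions, txt_path="result.txt", zip_path="result.txt.zip"):
        """predictions: dict mapping frame name -> (21, 3) array-like of [x y z] in mm."""
        with open(txt_path, "w") as f:
            for frame, joints in predictions.items():
                values = " ".join("%.4f" % v for row in joints for v in row)
                f.write("%s %s\n" % (frame, values))
        # Equivalent of `zip -j result.txt.zip result.txt`: no directories inside the zip.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            zf.write(txt_path, arcname="result.txt")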

How to submit your results:

  1. Register on the desired task on CodaLab. Use the same e-mail as the one used in the form to get accepted into the competition.

  2. Go to your Account settings > Competition settings and choose a ‘Team name’. The team name can be chosen to preserve anonymity.

  3. Go to the Participate tab > Submit/View Results

  4. Choose the task to submit results. Click the button ‘Submit’ and upload your generated .zip file. Here you can monitor the process of your submission. The submission won’t be complete until the tab STATUS is set to Finished. If you click the button ‘+’ you can access details of your submission such as your results and the output log (if your submission fails due to an error in the format, this will be useful). You may find this webpage useful.

Notes:

  • The submission will fail if you don’t follow exactly the instructions above. If you are absolutely sure that you followed the instructions and your submission still fails, please send us an e-mail and we will have a look at your submission.

  • It takes around 2 minutes to process one submission, although this depends on how busy the CodaLab machines are (and we can’t do anything about that).

  • For the final ranking and dissemination of results, we will only consider the latest submission by each team at the time the challenge submission closes. It is the responsibility of each team to make sure their latest submission is also their best.

Challenge awards (Updated 28/10/2019)

The challenge prizes were sponsored by Facebook Reality Labs. The winners of the HANDS 2019 challenge are the following teams:

  • Task 1 winners: Team Rokid Corporation Ltd. Zhaohui Zhang, Shipeng Xie and Mingxiu Chen

  • Task 1 runner-up: Team A2J. Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Taidong Yu, Cunlin Wu, Joey Tianyi Zhou, Junsong Yuan

  • Task 2 winner: Team NTIS. Marek Hrúz, Jakub Kanis and Zdeněk Krňoul.

  • Task 3 winner: ETH Zürich and NVIDIA. Adrian Spurr, Umar Iqbal and Pavlo Molchanov.

We thank everyone for their participation and look forward to seeing you at the next edition!

Citation

If you would like to reference this workshop or challenge, please consider citing the following ECCV 2020 paper:

@InProceedings{armagan_2020_ECCV,
  title     = {Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction},
  author    = {Armagan, Anil and Garcia-Hernando, Guillermo and Baek, Seungryul and Hampali, Shreyas and Rad, Mahdi and Zhang, Zhaohui and Xie, Shipeng and Chen, MingXiu and Zhang, Boshen and Xiong, Fu and others},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  year      = {2020}
}