Visual Manipulation Relationship Dataset


Perception and cognition play important roles in intelligent robot research. Before interacting with the environment, for example to grasp or manipulate objects, the robot first needs to understand and infer what to do and how to do it. In such situations, detecting targets and predicting manipulation relationships in real time ensure that the robot can complete tasks safely and reliably. Therefore, we collected a dataset named the Visual Manipulation Relationship Dataset (VMRD), which can be used to train a robot to perceive and understand its environment, locate its target, and find the proper order in which to grasp objects.


Dataset Introduction

The Visual Manipulation Relationship Dataset (VMRD) was collected and labeled using hundreds of objects from 31 categories. It contains 5185 images in total, including 17688 object instances and 51530 manipulation relationships. Some example images and the data distribution are shown below. Each object node includes the category, the bounding box location, the index of the current node, and the indexes of its parent nodes and child nodes.




Fig. 1. (a) Category and manipulation relationship distribution. (b) Some dataset examples.


Annotation Format

Objects and manipulation relationships in VMRD are labeled in XML format. Each annotation file contains several objects with their locations, indexes, fathers and children. The index of each object differs from that of every other object in the image, even if they belong to the same object category. Details are given below:

  1. "location" is defined as (xmin, ymin, xmax, ymax), the coordinates of the top-left and bottom-right vertexes of the object bounding box.

  2. "name" is a string giving the category of the object instance.

  3. "index" is the unique ID of the object instance within one image.

  4. "father" contains the indexes of all parent nodes of an object instance.

  5. "children" contains the indexes of all child nodes of an object instance.
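The fields above can be read with a standard XML parser. The sketch below is a minimal example; the exact tag layout (an "object" element holding "name", "index", "father", "children" and a "location" element with "xmin"/"ymin"/"xmax"/"ymax" children) is an assumption about the annotation files, not a confirmed schema.

```python
# Minimal sketch of parsing a VMRD-style XML annotation file.
# The tag names below mirror the field list above, but the exact
# file layout is an assumption.
import xml.etree.ElementTree as ET

def parse_annotation(xml_string):
    """Return a list of object dicts from one annotation file."""
    root = ET.fromstring(xml_string)
    objects = []
    for obj in root.findall("object"):
        loc = obj.find("location")
        objects.append({
            "name": obj.findtext("name"),
            "index": int(obj.findtext("index")),
            # Parent/child node indexes; empty text means no relation.
            "fathers": [int(i) for i in (obj.findtext("father") or "").split()],
            "children": [int(i) for i in (obj.findtext("children") or "").split()],
            "bbox": tuple(int(loc.findtext(k))
                          for k in ("xmin", "ymin", "xmax", "ymax")),
        })
    return objects
```

An object with an empty "father" list is a root of the manipulation relationship tree, i.e. it can be grasped first.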


Fig. 2. One example of our dataset.


Grasp Annotation Format

In our dataset, each grasp includes 5 dimensions, (x, y, w, h, θ), as shown in the following figure, where (x, y) is the coordinate of the grasp rectangle center, (w, h) are the width and height of the grasp rectangle, and θ is the rotation angle with respect to the horizontal axis.
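The 5-dimensional representation maps to a rotated rectangle in the image. The sketch below recovers the 4 vertexes from (x, y, w, h, θ); it assumes θ is given in radians and measures the rotation of the rectangle's width axis from the horizontal axis, which is an assumption about the convention used in the files.

```python
# Sketch: recover the 4 vertexes of a grasp rectangle from the
# (x, y, w, h, theta) representation. Assumes theta is in radians
# and rotates the width axis from the horizontal axis.
import math

def grasp_to_vertices(x, y, w, h, theta):
    """Return the 4 (x, y) vertexes of the rotated grasp rectangle."""
    dx, dy = math.cos(theta), math.sin(theta)   # unit vector along the width
    nx, ny = -dy, dx                            # unit vector along the height
    hw, hh = w / 2.0, h / 2.0
    # The first two vertexes share one long edge of the rectangle,
    # consistent with the vertex ordering described in the next section.
    return [
        (x - hw * dx - hh * nx, y - hw * dy - hh * ny),
        (x + hw * dx - hh * nx, y + hw * dy - hh * ny),
        (x + hw * dx + hh * nx, y + hw * dy + hh * ny),
        (x - hw * dx + hh * nx, y - hw * dy + hh * ny),
    ]
```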



However, in the grasp annotation files, grasps are labeled using the coordinates of the 4 vertexes: (x1, y1, x2, y2, x3, y3, x4, y4). The first two points of a rectangle define the line representing the orientation of the gripper plate, and all 4 vertexes are listed in counter-clockwise order. Besides, to indicate which object each grasp belongs to, the object index is also added to each annotation. The last dimension is "e" or "h", which indicates whether the grasp is easy or hard to execute (for example, execution may be hindered by other objects). In total, 4683 images are labeled with grasps, divided into a training set (4233 images) and a testing set (450 images).
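Given the vertex-based format described above, the (x, y, w, h, θ) form can be recovered from the first two points. The sketch below assumes one whitespace-separated grasp per line of the form "x1 y1 x2 y2 x3 y3 x4 y4 index e|h"; the exact field separator and ordering are assumptions about the files, not a confirmed specification.

```python
# Sketch: parse one grasp annotation line of the assumed form
# "x1 y1 x2 y2 x3 y3 x4 y4 index e|h" and recover (x, y, w, h, theta).
import math

def parse_grasp_line(line):
    fields = line.split()
    pts = [(float(fields[2 * i]), float(fields[2 * i + 1])) for i in range(4)]
    obj_index = int(fields[8])      # which object this grasp belongs to
    difficulty = fields[9]          # "e" (easy) or "h" (hard) to execute
    # The rectangle center is the mean of the 4 vertexes.
    x = sum(p[0] for p in pts) / 4.0
    y = sum(p[1] for p in pts) / 4.0
    # The first two points span one long edge (the gripper-plate line).
    w = math.hypot(pts[1][0] - pts[0][0], pts[1][1] - pts[0][1])
    h = math.hypot(pts[2][0] - pts[1][0], pts[2][1] - pts[1][1])
    theta = math.atan2(pts[1][1] - pts[0][1], pts[1][0] - pts[0][0])
    return (x, y, w, h, theta), obj_index, difficulty
```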


Downloading our dataset is free and open. Please cite the following paper:

H. Zhang, X. Lan*, X. Zhou, Z. Tian, Y. Zhang, and N. Zheng, "Visual Manipulation Relationship Network for Autonomous Robotics," IEEE-RAS 18th International Conference on Humanoid Robots (HUMANOIDS), Beijing, China, November 6-9, 2018.


The complete dataset can be downloaded at this link: 


V1 (5185 images without grasps):


V2 (4683 images with grasps):