Fast 3D Recognition and Pose Using the Viewpoint Feature Histogram

Radu Bogdan Rusu, Gary Bradski, Romain Thibaux, John Hsu
Willow Garage
68 Willow Rd., Menlo Park, CA 94025, USA
{rusu,bradski,thibaux,hsu}@willowgarage.com
Abstract—We present the Viewpoint Feature Histogram (VFH), a descriptor for 3D point cloud data that encodes geometry and viewpoint. We demonstrate experimentally on a set of 60 objects captured with stereo cameras that VFH can be used as a distinctive signature, allowing simultaneous recognition of the object and its pose. The pose is accurate enough for robot manipulation, and the computational cost is low enough for real time operation. VFH was designed to be robust to large surface noise and missing depth information in order to work reliably on stereo data.
I. INTRODUCTION
As part of a long term goal to develop reliable capabilities in the area of perception for mobile manipulation, we address a table top manipulation task involving objects that can be manipulated by one robot hand. Our robot is shown in Fig. 1. In order to manipulate an object, the robot must reliably identify it, as well as its 6 degree-of-freedom (6DOF) pose. This paper proposes a method to identify both at the same time, reliably and at high speed.
We make the following assumptions.
• Objects are rigid and relatively Lambertian. They can be shiny, but not reflective or transparent.
• Objects are in light clutter. They can be easily segmented in 3D and can be grabbed by the robot hand without obstruction.
• The item of interest can be grabbed directly, so it is not occluded.
• Items can be grasped even given an approximate pose.
The gripper on our robot can open to 9 cm and each grip is 2.5 cm wide, which allows an object 8.5 cm wide to be grasped when the pose is off by ±10 degrees.
Despite these assumptions, our problem has several properties that make the task difficult:
• The objects need not contain texture.
• Our dataset includes objects of very similar shapes, for example many slight variations of typical wine glasses.
• To be usable, the recognition accuracy must be very high, typically much higher than, say, for image retrieval tasks, since false positives have very high costs and so must be kept extremely rare.
• To interact usefully with humans, recognition cannot take more than a fraction of a second. This puts constraints on computation, but more importantly it precludes the use of accurate but slow 3D acquisition using lasers. Instead we rely on stereo data, which suffers from higher noise and missing data.

Fig. 1. A PR2 robot from Willow Garage, showing its grippers and stereo cameras.
Our focus is perception for mobile manipulation. Working on a mobile versus a stationary robot means that we can't depend on instrumenting the external world with active vision systems or special lighting, but we can put such devices on the robot. In our case, we use projected texture¹ to yield dense stereo depth maps at 30 Hz. We also cannot ensure environmental conditions. We may move from a sunlit room to a dim hallway into a room with no light at all. The projected texture gives us a fair amount of resilience to local lighting conditions as well.
¹Not structured light; this is random texture.
Although this paper focuses on 3D depth features, 2D imagery is clearly important, for example for shiny and transparent objects, or to distinguish items based on texture, such as telling apart a Coke can from a Diet Coke can. In our case, the textured light alternates with no light to allow for 2D imagery aligned with the texture based dense depth; however, adding 2D visual features will be studied in future work. Here, we look for an effective purely 3D feature.

Our philosophy is that one should use or design a recognition algorithm that fits one's engineering needs such as scalability, training speed, incremental training needs, and so on, and then find features that make the recognition performance of that architecture meet one's specifications. For reasons of online training, and because of large memory availability, we choose fast approximate K-Nearest Neighbors (K-NN) implemented in the FLANN library [1] as our recognition architecture. The key contribution of this paper is then the design of a new, computationally efficient 3D feature that yields object recognition and 6DOF pose.
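To make the recognition step concrete, the following is a minimal sketch of K-NN lookup over a database of stored signatures. The paper uses the approximate K-NN of FLANN [1]; scipy's exact k-d tree is used here as a stand-in with the same build/query pattern, and the function and variable names are ours, for illustration only.

```python
# Minimal sketch of the K-NN recognition step. The paper uses approximate
# K-NN from FLANN [1]; scipy's exact k-d tree is a stand-in with the same
# build/query pattern. Names are illustrative, not from the released code.
import numpy as np
from scipy.spatial import cKDTree

def build_index(signatures):
    """signatures: (num_training_views, descriptor_dim) float array."""
    return cKDTree(signatures)

def classify(index, labels, query, k=3):
    """Return the (object id, view id) labels of the k nearest signatures."""
    dists, idx = index.query(query, k=k)
    return [labels[i] for i in np.atleast_1d(idx)], np.atleast_1d(dists)
```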
The structure of this paper is as follows: related work is described in Section II. Next, we give a brief description of our system architecture in Section III. We discuss our surface normal and segmentation algorithm in Section IV, followed by a discussion of the Viewpoint Feature Histogram in Section V. Experimental setup and resulting computational and recognition performance are described in Section VI. Conclusions and future work are discussed in Section VII.
II. RELATED WORK
The problem that we are trying to solve requires global (3D object level) classification based on estimated features. This has been under investigation for a long time in various research fields, such as computer graphics, robotics, and pattern matching; see [2]–[4] for comprehensive reviews. We address the most relevant work below.
Some of the widely used 3D point feature extraction approaches include: spherical harmonic invariants [5], spin images [6], curvature maps [7], or more recently, Point Feature Histograms (PFH) [8], and conformal factors [9]. Spherical harmonic invariants and spin images have been successfully used for the problem of object recognition for densely sampled datasets, though their performance seems to degrade for noisier and sparser datasets [4]. Our stereo data is noisier and sparser than typical line scan data, which motivated the use of our new features. Conformal factors are based on conformal geometry, which is invariant to isometric transformations, and thus obtain good results on databases of watertight models. Their main drawback is that they can only be applied to manifold meshes, which can be problematic in stereo. Curvature maps and PFH descriptors have been studied in the context of local shape comparisons for data registration. A side study [10] applied the PFH descriptors to the problem of surface classification into 3D geometric primitives, although only for data acquired using precise laser sensors. A different point fingerprint representation using the projections of geodesic circles onto the tangent plane at a point p_i was proposed in [11] for the problem of surface registration. As the authors note, geodesic distances are more sensitive to surface sampling noise, and thus are unsuitable for real sensed data without a priori smoothing and reconstruction. A decomposition of objects into parts learned using spin images is presented in [12] for the problem of vehicle identification.
Methods relying on global features include descriptors such as Extended Gaussian Images (EGI) [13], eigen shapes [14], or shape distributions [15]. The latter samples statistics of the entire object and represents them as distributions of shape properties; however, they do not take into account how the features are distributed over the surface of the object. Eigen shapes show promising results, but they have limits on their discrimination ability since important higher order variances are discarded. EGIs describe objects based on the unit normal sphere, but have problems handling arbitrarily curved objects.
The work in [16] makes use of spin-image signatures and normal-based signatures to achieve classification rates over 90% with synthetic and CAD model datasets. The datasets used, however, are very different from the ones acquired using noisy 640×480 stereo cameras such as the ones used in our work. In addition, the authors do not provide timing information on the estimation and matching parts, which is critical for applications such as ours. A system for fully automatic 3D model-based object recognition and segmentation is presented in [17], with good recognition rates of over 95% for a database of 55 objects. Unfortunately, the computational performance of the proposed method is not suitable for real-time use, as the authors report the segmentation of an object model in a cluttered scene to take around 2 minutes. Moreover, the objects in the database are scanned using a high resolution Minolta scanner and their geometric shapes are very different. As shown in Section VI, the objects used in our experiments are much more similar in terms of geometry, so such a registration-based method would fail. In [18], the authors propose a system for recognizing 3D objects in photographs. The techniques presented can only be applied in the presence of texture information, and require a cumbersome generation of models in an offline step, which makes them unsuitable for our work.
As previously presented, our requirements are real-time object recognition and pose identification from noisy real-world datasets acquired using projected texture stereo cameras. Our 3D object classification is based on an extension of the recently proposed Fast Point Feature Histogram (FPFH) descriptors [8], which record the relative angular directions of surface normals with respect to one another. The FPFH performs well in classification applications and is robust to noise, but it is invariant to viewpoint.
This paper proposes a novel descriptor that encodes the viewpoint information and has two parts: (1) an extended FPFH descriptor whose estimation cost drops from the O(k·n) of FPFH to O(n), where n is the number of points in the point cloud and k is the number of points used in each local neighborhood; (2) a new signature that encodes important statistics between the viewpoint and the surface normals on the object. We call this new feature the Viewpoint Feature Histogram (VFH), as detailed below.
III. ARCHITECTURE
Our system architecture employs the following processing steps:
• Synchronized, calibrated and epipolar aligned left and right images of the scene are acquired.
• A dense depth map is computed from the stereo pair.
• Surface normals in the scene are calculated.
• Planes are identified and segmented out, and the remaining point clouds from non-planar objects are clustered in Euclidean space.
• The Viewpoint Feature Histogram (VFH) is calculated over large enough objects (here, objects having at least 100 points).
  – If there are multiple objects in a scene, they are processed front to back relative to the camera.
  – Occluded point clouds with less than 75% of the number of points of the frontal objects are noted but not identified.
• Fast approximate K-NN is used to classify the object and its view.
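The control flow over the segmented clusters can be summarized in a few lines. This is a schematic sketch only: it assumes the earlier stages (VFH estimation, classification) are supplied as callables, and the two thresholds are the ones quoted in the steps above.

```python
# Schematic control flow over segmented clusters, front to back. The stage
# callables (compute_vfh, classify) are assumed supplied by the caller; the
# thresholds are the ones quoted in the processing steps above.
def process_clusters(clusters, compute_vfh, classify,
                     min_points=100, occlusion_ratio=0.75):
    """clusters: list of (N_i, 3) point arrays, sorted front to back."""
    results = []
    frontal_size = len(clusters[0]) if clusters else 0
    for pts in clusters:
        if len(pts) < min_points:
            continue                                # too small for a VFH
        if len(pts) < occlusion_ratio * frontal_size:
            results.append(("occluded", None))      # noted, not identified
            continue
        results.append(classify(compute_vfh(pts)))  # object id + view/pose
    return results
```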
Some steps from the early processing pipeline are shown in Figure 2. Shown left to right, top to bottom in that figure are: a moderately complex scene with many different vertical and horizontal surfaces, the resulting depth map, the estimated surface normals, and the objects segmented from the planar surfaces in the scene.

Fig. 2. Early processing steps row wise, top to bottom: a scene, its depth map, surface normals, and segmentation into planes and outlier objects.
For computing 3D depth maps, we use 640x480 stereo with textured light. The texture flashes on only very briefly as the cameras take a picture, resulting in lights that look dim to the human eye but bright to the camera. Texture flashes only every other frame, so that raw imagery without texture can be gathered alternating with densely textured scenes. The stereo has a 38 degree field of view and is designed for close in manipulation tasks, thus the objects that we deal with are from 0.5 to 1.5 meters away. The stereo algorithm that we use was developed in [19] and uses the implementation in the OpenCV library [20] as described in detail in [21], running at 30 Hz.
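As a rough illustration of this stage, the sketch below computes a dense depth map from a rectified pair with OpenCV's block-matching stereo. The parameter values and the helper name are ours, not the tuned settings of the pipeline from [19]–[21].

```python
# Sketch of dense depth from a rectified stereo pair using OpenCV's block
# matcher. Parameter values are illustrative, not the paper's tuned ones.
# Inputs are 8-bit rectified grayscale images.
import cv2
import numpy as np

def dense_depth(left_gray, right_gray, focal_px, baseline_m,
                num_disparities=64, block_size=15):
    stereo = cv2.StereoBM_create(numDisparities=num_disparities,
                                 blockSize=block_size)
    # compute() returns fixed-point disparities scaled by 16
    disp = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
    valid = disp > 0
    depth = np.zeros_like(disp)
    depth[valid] = focal_px * baseline_m / disp[valid]   # Z = f * B / d
    return depth, valid
```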
IV. SURFACE NORMALS AND 3D SEGMENTATION
We employ segmentation prior to the actual feature estimation because in robotic manipulation scenarios we are only interested in certain precise parts of the environment, and thus computational resources can be saved by tackling only those parts. Here, we are looking to manipulate reachable objects that lie on horizontal surfaces. Therefore, our segmentation scheme proceeds by extracting these horizontal surfaces first.
Fig. 3. From left to right: raw point cloud dataset, planar and cluster segmentation, more complex segmentation.
Compared to our previous work [22], we have improved the planar segmentation algorithms by incorporating surface normals into the sample selection and model estimation steps. We also took care to carefully build SSE aligned data structures in memory for any computationally expensive operation. By rejecting candidates which do not support our constraints, our system can segment data at about 7 Hz, including normal estimation, on a regular Core2Duo laptop using a single core. To get frame rate performance (realtime), we use a voxelized data structure over the input point cloud and downsample with a leaf size of 0.5 cm. The surface normals are therefore estimated only for the downsampled result, but using the information in the original point cloud. The planar components are extracted using a RMSAC (Randomized MSAC) method that takes into account weighted averages of distances to the model together with the angle of the surface normals. We then select candidate table planes using a heuristic combining the number of inliers which support the planar model as well as their proximity to the camera viewpoint. This approach emphasizes the part of the space where the robot manipulators can reach and grasp the objects.
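A simplified sketch of the downsample-then-estimate step is given below: one representative point per 0.5 cm voxel, with normals computed by PCA over neighborhoods taken from the original cloud and oriented toward the camera. This is a plain numpy/scipy stand-in for the SSE optimized implementation; the search radius and function names are assumptions.

```python
# Sketch of voxel downsampling (0.5 cm leaf) followed by normal estimation
# on the downsampled points using neighborhoods from the *original* cloud,
# as described above. Stand-in for the SSE optimized code; the search
# radius is an assumption.
import numpy as np
from scipy.spatial import cKDTree

def voxel_downsample(points, leaf=0.005):
    """Keep one representative point per (leaf x leaf x leaf) voxel."""
    keys = np.floor(points / leaf).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    return points[np.sort(first)]

def estimate_normals(sample, original, radius=0.02, viewpoint=np.zeros(3)):
    tree = cKDTree(original)
    normals = np.zeros_like(sample)
    for i, p in enumerate(sample):
        nbrs = original[tree.query_ball_point(p, radius)]
        if len(nbrs) < 3:
            continue                        # not enough support for a plane
        # normal = eigenvector of the smallest covariance eigenvalue
        cov = np.cov((nbrs - nbrs.mean(axis=0)).T)
        _, vecs = np.linalg.eigh(cov)
        n = vecs[:, 0]
        if np.dot(n, viewpoint - p) < 0:    # orient normals toward the camera
            n = -n
        normals[i] = n
    return normals
```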
The segmentation of object candidates supported by the table surface is performed by looking at points whose projection falls inside the bounding 2D polygon for the table, and applying single-link clustering. The result of these processing steps is a set of Euclidean point clusters. This works to reliably segment objects that are separated by about half their minimum radius from each other. An example can be seen in Figure 3.
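A compact sketch of such single-link clustering, using a k-d tree for the fixed-radius neighbor searches, is shown below; the tolerance and minimum cluster size values are illustrative.

```python
# Sketch of single-link Euclidean clustering: points closer than `tolerance`
# end up in the same cluster. Tolerance and minimum size are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clusters(points, tolerance=0.02, min_size=100):
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        frontier, members = [seed], [seed]
        while frontier:
            idx = frontier.pop()
            for j in tree.query_ball_point(points[idx], tolerance):
                if j in unvisited:
                    unvisited.remove(j)
                    frontier.append(j)
                    members.append(j)
        if len(members) >= min_size:
            clusters.append(points[np.asarray(members)])
    return clusters
```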
To resolve further ambiguities with respect to the chosen candidate clusters, such as objects stacked on other planar objects (such as books), we repeat the mentioned step by treating each additional horizontal planar structure on top of the table candidates as a table itself and repeating the segmentation step (see results in Figure 3).
We emphasize that this segmentation step is of extreme importance for our application, because it allows our methods to achieve favorable computational performance by extracting only the regions of interest in a scene (i.e., objects that are to be manipulated, located on horizontal surfaces). In cases where our "light clutter" assumption does not hold and the geometric Euclidean clustering is prone to failure, a more sophisticated segmentation scheme based on texture properties could be implemented.
V. VIEWPOINT FEATURE HISTOGRAM
In order to accurately and robustly classify points with respect to their underlying surface, we borrow ideas from the recently proposed Point Feature Histogram (PFH) [10]. The PFH is a histogram that collects the pairwise pan, tilt and yaw angles between every pair of normals on a surface patch (see Figure 4). In detail, for a pair of 3D points (p_i, p_j) and their estimated surface normals (n_i, n_j), the set of normal angular deviations can be estimated as:
\[
\alpha = v \cdot n_j, \qquad
\phi = u \cdot \frac{p_j - p_i}{d}, \qquad
\theta = \arctan\left(w \cdot n_j,\; u \cdot n_j\right)
\tag{1}
\]
where u, v, w represent a Darboux frame coordinate system chosen at p_i, and d = ||p_j − p_i|| is the Euclidean distance between the two points. Then, the Point Feature Histogram at a patch of points P = {p_i}, i = 1…n, captures all the sets of (α, φ, θ) between all pairs of (p_i, p_j) from P, and bins the results in a histogram. The bottom left part of Figure 4 presents the selection of the Darboux frame and a graphical representation of the three angular features.
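In code, Eq. (1) reads as follows; this is a direct transcription using the frame convention shown in Figure 4 (u = n_i, v = (p_j − p_i) × u, w = u × v), with v normalized so the angles are well defined. Unit-length normals are assumed.

```python
# Direct transcription of Eq. (1) with the Darboux frame of Figure 4:
# u = n_i, v = (p_j - p_i) x u, w = u x v. Normals are assumed unit length.
import numpy as np

def pair_features(p_i, n_i, p_j, n_j):
    d = np.linalg.norm(p_j - p_i)
    u = n_i
    v = np.cross((p_j - p_i) / d, u)
    v /= np.linalg.norm(v) + 1e-12     # normalize so alpha is a true cosine
    w = np.cross(u, v)
    alpha = np.dot(v, n_j)
    phi = np.dot(u, (p_j - p_i) / d)
    theta = np.arctan2(np.dot(w, n_j), np.dot(u, n_j))
    return alpha, phi, theta
```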
Because all possible pairs of points are considered, the computational complexity of a PFH is O(n²) in the number of surface normals n. In order to make a more efficient algorithm, the Fast Point Feature Histogram [8] was developed. The FPFH measures the same angular features as PFH, but estimates the sets of values only between every point and its k nearest neighbors, followed by a reweighting of the resultant histogram of a point with the neighboring histograms, thus reducing the computational complexity to O(k·n).
Our past work [22] has shown that a global descriptor (GFPFH) can be constructed from the classification results of many local FPFH features, and used on a wide range of confusable objects (20 different types of glasses, bowls, mugs) in 500 scenes, achieving 96.69% on object class recognition. However, the categorized objects were only split into 4 distinct classes, which leaves the scaling problem open. Moreover, the GFPFH is susceptible to the errors of the local classification results, and is more cumbersome to estimate.
In any case, for manipulation, we require that the robot not only identifies objects, but also recognizes their 6DOF poses for grasping. FPFH is invariant both to object scale (distance) and object pose, and so cannot achieve the latter task.
In this work, we decided to leverage the strong recognition results of FPFH, but to add in viewpoint variance while retaining invariance to scale, since the dense stereo depth map gives us scale/distance directly. Our contribution to the problem of object recognition and pose identification is to extend the FPFH to be estimated for the entire object cluster (as seen in Figure 4), and to compute additional statistics between the viewpoint direction and the normals estimated at each point. To do this, we used the key idea of mixing the viewpoint direction directly into the relative normal angle calculation in the FPFH. Figure 6 presents this idea, with the new feature consisting of two parts: (1) a viewpoint direction component (see Figure 5) and (2) a surface shape component comprised of an extended FPFH (see Figure 4).
The viewpoint component is computed by collecting a histogram of the angles that the viewpoint direction makes with each normal. Note, we do not mean the view angle to each normal, as this would not be scale invariant; instead, we mean the angle between the central viewpoint direction translated to each normal. The second component measures the relative pan, tilt and yaw angles as described in [8], [10], but now measured between the viewpoint direction at the central point and each of the normals on the surface. We call the new assembled feature the Viewpoint Feature Histogram (VFH). Figure 6 presents the resultant assembled VFH for a random object.
Fig. 5. The Viewpoint Feature Histogram is created from the extended Fast Point Feature Histogram as seen in Figure 4, together with the statistics of the relative angles between each surface normal and the central viewpoint direction.
The computational complexity of VFH is O(n). In our experiments, we divided the viewpoint angles into 128 bins and the α, φ and θ angles into 45 bins each, for a total of 263 dimensions. The estimation of a VFH takes about 0.3 ms on average on a 2.23 GHz single core of a Core2Duo machine using optimized SSE instructions.
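Putting the pieces together, a sketch of the full signature might look as follows, reusing pair_features from the sketch above: the three shape angles between the centroid's frame and every point go into 45 bins each, and the cosine of the angle between the central viewpoint direction and each normal goes into 128 bins, for 3 × 45 + 128 = 263 dimensions. Using the mean normal as the centroid normal and normalizing the final histogram are our assumptions, not details specified in the text.

```python
# Sketch of VFH assembly, reusing pair_features() from the sketch above.
# Bin counts follow the text (45 x 3 + 128 = 263); using the mean normal at
# the centroid and normalizing the final histogram are our assumptions.
import numpy as np

def compute_vfh(points, normals, viewpoint, shape_bins=45, view_bins=128):
    c = points.mean(axis=0)
    n_c = normals.mean(axis=0)
    n_c /= np.linalg.norm(n_c)
    feats = np.array([pair_features(c, n_c, p, n)
                      for p, n in zip(points, normals)])
    alpha, phi, theta = feats[:, 0], feats[:, 1], feats[:, 2]
    # viewpoint component: angle between the central viewpoint direction
    # (translated to each point) and each surface normal -- scale invariant
    vdir = (viewpoint - c) / np.linalg.norm(viewpoint - c)
    cos_view = normals @ vdir
    hists = [np.histogram(alpha, bins=shape_bins, range=(-1, 1))[0],
             np.histogram(phi, bins=shape_bins, range=(-1, 1))[0],
             np.histogram(theta, bins=shape_bins, range=(-np.pi, np.pi))[0],
             np.histogram(cos_view, bins=view_bins, range=(-1, 1))[0]]
    vfh = np.concatenate(hists).astype(float)
    return vfh / max(vfh.sum(), 1.0)    # normalize the 263-dim signature
```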
Fig. 4. The extended Fast Point Feature Histogram collects the statistics of the relative angles between the surface normals at each point and the surface normal at the centroid of the object. The bottom left part of the figure describes the three angular features for an example pair of points, with the Darboux frame u = n_c, v = (p_5 − c) × u, w = u × v.
Fig. 6. An example of the resultant Viewpoint Feature Histogram for one of the objects used. Note the two concatenated components: the viewpoint component and the extended FPFH component.
VI. VALIDATION AND EXPERIMENTAL RESULTS

To evaluate our proposed descriptor and system architecture, we collected a large dataset consisting of over 60 IKEA kitchenware objects, as shown in Figure 8. These objects consisted of many kinds each of: wine glasses, tumblers, drinking glasses, mugs, bowls, and a couple of boxes. In each of these categories, many of the objects were distinguished only by subtle variations in shape, as can be seen for example in the confusions in Figure 10. We captured over 54000 scenes of these objects by spinning them on a turn table 180°² at each of 2 offsets on a platform that tilted 0, 8, 16, 22 and 30 degrees. Each 180° rotation was captured with about 90 images. The turn table is shown in Fig. 7. We additionally worked with a subset of 20 objects in 500 lightly cluttered scenes with varying arrangements of horizontal and vertical surfaces, using the same dataset provided in [22]. No pose information was available for this second dataset, so we ran experiments on it for object recognition only.

²We did not go 360 degrees so that we could keep the calibration box in view.

Fig. 7. The turn table used to collect views of objects with known orientation.
The complete source code used to generate our experimental results, together with both object databases, is available under a BSD open source license in our ROS repository at Willow Garage. We are currently taking steps towards creating a web page with complete tutorials on how to fully replicate the experiments presented herein.
Both the objects in the [22] dataset as well as the ones we acquired constitute valid examples of objects of daily use that our robot needs to be able to reliably identify and manipulate. While 60 objects is far from the number of objects the robot eventually needs to be able to recognize, it may be enough if we assume that the robot knows what