The last few years have seen rapid advances in 2D object recognition. We can now build systems that accurately recognize objects, localize them with 2D bounding boxes or masks, and predict 2D keypoint positions in cluttered, real-world images. Despite their impressive performance, these systems ignore one critical fact: that the world and the objects within it are 3D and extend beyond the XY image plane.

At the same time, there have been significant advances in 3D shape understanding with deep networks. A menagerie of network architectures has been developed for different 3D shape representations, such as voxels, pointclouds, and meshes; each representation carries its own benefits and drawbacks. However, this diverse and creative set of techniques has primarily been developed on synthetic benchmarks such as ShapeNet, which consists of rendered objects in isolation and is dramatically less complex than the natural-image benchmarks used for 2D object recognition, like ImageNet and COCO.
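To make the trade-offs between these representations concrete, the minimal sketch below shows the data each one stores; the array names and shapes are our own illustration, not any particular library's API:

```python
import numpy as np

# Voxels: a dense occupancy grid; memory grows cubically with resolution.
voxels = np.zeros((32, 32, 32), dtype=bool)   # True where the shape is occupied

# Pointcloud: an unordered set of N surface samples; no connectivity.
points = np.random.rand(1000, 3)              # (N, 3) array of xyz positions

# Mesh: vertices plus triangular faces; encodes an explicit, connected surface.
verts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])  # indices into verts
```

Voxels are easy to process with 3D convolutions but coarse; pointclouds are compact but surface-free; meshes capture fine surface detail but require handling irregular connectivity.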

We believe that the time is ripe for these hitherto distinct research directions to be combined. We should strive to build systems that (like current methods for 2D perception) can operate on unconstrained real-world images with many objects, occlusion, and diverse lighting conditions but that (like current methods for 3D shape prediction) do not ignore the rich 3D structure of the world.

In this paper we take an initial step toward this goal. We draw on state-of-the-art methods for 2D perception and 3D shape prediction to build a system that takes as input a real-world RGB image, detects the objects in the image, and for each detected object outputs a category label, a segmentation mask, and a 3D triangle mesh giving its full 3D shape.

Our method, called Mesh R-CNN, builds on the state-of-the-art Mask R-CNN system for 2D recognition, augmenting it with a mesh prediction branch that outputs high-resolution triangle meshes.
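The following sketch illustrates how such a per-region mesh branch could sit alongside the standard box and mask heads. The class and the component names are illustrative assumptions for exposition, not the authors' released implementation; only the data flow is shown.

```python
import torch.nn as nn

class MeshRCNNSketch(nn.Module):
    """Hypothetical sketch: Mask R-CNN heads plus a mesh prediction branch.

    `backbone`, `rpn`, `box_head`, `mask_head`, and `mesh_head` are stand-ins
    for the corresponding network components, passed in by the caller.
    """

    def __init__(self, backbone, rpn, box_head, mask_head, mesh_head):
        super().__init__()
        self.backbone = backbone    # e.g. a ResNet-FPN feature extractor
        self.rpn = rpn              # region proposal network
        self.box_head = box_head    # classifies and refines each proposal
        self.mask_head = mask_head  # predicts a 2D segmentation mask per box
        self.mesh_head = mesh_head  # new branch: predicts a triangle mesh per box

    def forward(self, image):
        features = self.backbone(image)
        proposals = self.rpn(features)
        # Per-region features (e.g. via RoIAlign) feed all three heads.
        labels, boxes = self.box_head(features, proposals)
        masks = self.mask_head(features, boxes)
        meshes = self.mesh_head(features, boxes)  # (verts, faces) per detection
        return labels, boxes, masks, meshes
```

The key design point is that the mesh branch consumes the same per-region features as the box and mask heads, so 3D shape prediction inherits the detection pipeline's ability to handle cluttered, multi-object images.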
