At this year’s CVPR conference, Facebook AI is pushing the state of the art forward in many important areas of computer vision (CV), including core segmentation tasks, architecture search, transfer learning, and multimodal learning.

We’re also sharing details on several notable papers that propose new ways to reason about the 3D objects shown in regular 2D images. This work could help unlock virtual and augmented reality innovations, among other future experiences. The full list of abstracts and details of our CVPR participation is included below.

Novel views of complex, real-world scenes from only a single image

We’ve built SynSin, a state-of-the-art, end-to-end model that can take a single RGB image and then generate a new image of the same scene from a different viewpoint, without any 3D supervision. Our system works by predicting a 3D point cloud, which is projected onto the new view using a novel differentiable renderer built with PyTorch3D. The rendered point cloud is then passed to a generative adversarial network (GAN) to synthesize the output image. Other current methods often rely on dense voxel grids, which have shown promise on synthetic scenes of single objects but haven’t scaled to complex real-world scenes.
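To make the projection step concrete, here is a minimal sketch of how per-pixel features and predicted depths could be splatted into a new view with PyTorch3D’s point-cloud renderer. The `render_novel_view` function, the normalized-grid unprojection, and the feature and depth networks it assumes are illustrative simplifications, not SynSin’s exact implementation:

```python
import torch
from pytorch3d.structures import Pointclouds
from pytorch3d.renderer import (
    AlphaCompositor,
    PerspectiveCameras,
    PointsRasterizationSettings,
    PointsRasterizer,
    PointsRenderer,
)

def render_novel_view(feats, depth, R, T, image_size=256):
    """Splat per-pixel features into 3D and render them from a new camera.

    feats: (B, C, H, W) features predicted from the input image
    depth: (B, 1, H, W) predicted per-pixel depth
    R, T:  (B, 3, 3) rotation and (B, 3) translation of the target camera
    """
    B, C, H, W = feats.shape
    device = feats.device

    # Back-project every pixel to a 3D point using its predicted depth.
    # (A full system would use the source camera intrinsics here; this
    # normalized-grid unprojection is a deliberate simplification.)
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=device),
        torch.linspace(-1, 1, W, device=device),
        indexing="ij",
    )
    xy = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2).expand(B, -1, -1)
    z = depth.reshape(B, -1, 1)
    points = torch.cat([xy * z, z], dim=-1)            # (B, H*W, 3)
    colors = feats.reshape(B, C, -1).transpose(1, 2)   # (B, H*W, C)

    # Differentiable point-cloud rendering from the target viewpoint.
    renderer = PointsRenderer(
        rasterizer=PointsRasterizer(
            cameras=PerspectiveCameras(R=R, T=T, device=device),
            raster_settings=PointsRasterizationSettings(
                image_size=image_size, radius=0.01, points_per_pixel=8
            ),
        ),
        compositor=AlphaCompositor(),
    )
    # The rendered feature image is what the GAN generator then refines.
    return renderer(Pointclouds(points=points, features=colors))
```

Because every step above is differentiable, gradients from the GAN’s image loss can flow back through the renderer into the depth and feature predictions, which is what lets the whole pipeline train without 3D supervision.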

With the flexibility of point clouds, SynSin not only scales to these complex scenes but also generalizes to varying resolutions more efficiently than alternatives such as voxel grids. SynSin’s high efficiency could help us explore a wide range of applications, such as generating better 3D photos and 360-degree videos.

Read the full research paper HERE.

Reconstructing 3D human figures from a single image at an unprecedented level of detail and quality

We’ve developed a novel method for generating 3D reconstructions of people from 2D images with state-of-the-art quality and detail. Using high-resolution photos as input, it captures highly intricate details such as fingers, facial features, and clothing folds, which previous techniques could not do without additional processing.

To achieve this, we built on the memory-efficient Pixel-Aligned Implicit Function (PIFu) method and created a hierarchical, multilevel neural network architecture that processes both global context and local details for high-resolution 3D reconstruction. The first-level network captures the global 3D structure of the human body from a lower-resolution input image, similar to the original PIFu method. The second level is a lightweight network that takes the higher, 1K-resolution input image and analyzes local details. Because the second level has access to the global 3D information from the first, our system can leverage local and global information together for efficient, high-resolution 3D human reconstruction.
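As a rough illustration of the pixel-aligned, two-level idea, the sketch below samples CNN features at the 2D projections of 3D query points and feeds them to small occupancy MLPs. The class and function names are hypothetical, and for simplicity the fine head consumes the coarse occupancy prediction as its global context, whereas PIFuHD actually passes the first level’s intermediate 3D-aware features:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def pixel_aligned_features(feat_map, xy):
    """Sample image features at the 2D projections of 3D query points.

    feat_map: (B, C, H, W) CNN feature map
    xy:       (B, N, 2) projected point coordinates in [-1, 1]
    returns:  (B, N, C) one feature vector per query point
    """
    samp = F.grid_sample(feat_map, xy.unsqueeze(2), align_corners=True)
    return samp.squeeze(-1).transpose(1, 2)

class ImplicitHead(nn.Module):
    """MLP mapping (pixel-aligned features, depth) -> occupancy in [0, 1]."""

    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, feats, z):
        # z is the query point's depth along the camera ray.
        return self.net(torch.cat([feats, z], dim=-1))

class TwoLevelPIFu(nn.Module):
    def __init__(self, coarse_cnn, fine_cnn, c_coarse, c_fine):
        super().__init__()
        self.coarse_cnn, self.fine_cnn = coarse_cnn, fine_cnn
        self.coarse_head = ImplicitHead(c_coarse + 1)
        # The fine head sees local features *plus* the coarse prediction,
        # which carries the global 3D context from the first level.
        self.fine_head = ImplicitHead(c_fine + 1 + 1)

    def forward(self, img_lo, img_hi, xy, z):
        f_lo = pixel_aligned_features(self.coarse_cnn(img_lo), xy)
        occ_coarse = self.coarse_head(f_lo, z)
        f_hi = pixel_aligned_features(self.fine_cnn(img_hi), xy)
        occ_fine = self.fine_head(torch.cat([f_hi, occ_coarse], dim=-1), z)
        return occ_coarse, occ_fine
```

Because occupancy is queried point by point rather than stored in a dense voxel grid, memory use grows with the number of query points, not with the cube of the output resolution, which is what makes the 1K-resolution second level tractable.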

Such high-quality, finely detailed 3D reconstructions could help enhance important applications like creating more realistic virtual reality experiences.

Read the full research paper HERE.

“Wish you were here”: Context-aware human generation

We’ve built a new system that can take a person from one photo and add him or her to a different image while maintaining the quality and semantic context of the scene. Given an image containing other people, it adjusts the source image so that the added person’s pose matches the new context. This is a challenging application domain because it is particularly easy to spot discrepancies between the newly added person and those already in the generated image. Unlike previous work on adding people to existing images, our method can be applied to a wide variety of poses, viewpoints, scales, and severe occlusions.

Our method involves three subnetworks. The first generates the structure of the novel person, the second renders a realistic person given that generated structure and the appearance of the person in the source photo, and the third refines the rendered face. We’ve demonstrated high-resolution outputs in various experiments. We’ve also evaluated each network individually, demonstrating state-of-the-art results on pose-transfer benchmarks as well as in other possible applications, such as drawing a person or replacing a person’s hair, shirt, or pants.
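A high-level sketch of how the three stages could be wired together is shown below; the module names and interfaces are assumptions for illustration, not the paper’s actual architectures:

```python
import torch.nn as nn

class ContextAwarePersonInsertion(nn.Module):
    """Sketch of the three-stage pipeline; module internals are placeholders."""

    def __init__(self, structure_net, render_net, face_refiner):
        super().__init__()
        self.structure_net = structure_net  # stage 1: target structure/pose
        self.render_net = render_net        # stage 2: realistic rendering
        self.face_refiner = face_refiner    # stage 3: face refinement

    def forward(self, scene_img, person_img, face_box):
        # 1. Generate a structure (e.g., a pose map) for the novel person
        #    that is consistent with the people already in the scene.
        structure = self.structure_net(scene_img)
        # 2. Render a realistic person from that structure, conditioned on
        #    the source person's appearance.
        rendered = self.render_net(structure, person_img)
        # 3. Refine the face crop, where discrepancies are easiest to spot,
        #    and paste the result back into the rendered image.
        x0, y0, x1, y1 = face_box
        out = rendered.clone()
        out[..., y0:y1, x0:x1] = self.face_refiner(rendered[..., y0:y1, x0:x1])
        return out
```

Splitting the problem this way lets each subnetwork be trained and evaluated on its own, which is how the individual components can also be reused for tasks like pose transfer or swapping a person’s clothing.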

With the recent increased interest in remote events and interactions across locations, our research could make it easier for people to collaborate more naturally when using video tools or inspire new AR experiences.

Read the full paper HERE.

Source: Facebook AI
