Meta Sapiens Models: Enhancing Visual Data Understanding

By: Tamer Karam | Aug. 23, 2024


Meta has introduced a new family of vision models, the Sapiens models. They are designed for four key human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. The models range from 300 million to 2 billion parameters and use a vision transformer architecture in which all tasks share a single encoder while each task has its own decoder head.
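To make the shared-encoder design concrete, here is a minimal, illustrative PyTorch sketch: one ViT-style encoder feeds four lightweight decoder heads, one per task. The layer sizes, head designs, and keypoint/part counts are placeholder assumptions for illustration, not the actual Sapiens configuration.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Toy ViT-style encoder: patch embedding + transformer blocks (sizes are illustrative)."""
    def __init__(self, img_size=256, patch=16, dim=384, depth=4, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_tokens = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)
        self.grid = img_size // patch

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.blocks(tokens + self.pos)
        # Reshape tokens back into a spatial feature map for dense-prediction heads.
        return tokens.transpose(1, 2).reshape(x.shape[0], -1, self.grid, self.grid)

def dense_head(dim, out_channels):
    """Minimal per-task decoder: upsample + 1x1 projection to the task's output channels."""
    return nn.Sequential(
        nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        nn.Conv2d(dim, out_channels, kernel_size=1),
    )

class MultiTaskModel(nn.Module):
    def __init__(self, dim=384, num_keypoints=17, num_parts=20):
        super().__init__()
        self.encoder = SharedEncoder(dim=dim)             # shared across all tasks
        self.pose_head = dense_head(dim, num_keypoints)   # keypoint heatmaps
        self.seg_head = dense_head(dim, num_parts)        # body-part logits
        self.depth_head = dense_head(dim, 1)              # per-pixel depth
        self.normal_head = dense_head(dim, 3)             # per-pixel surface normal

    def forward(self, x):
        feats = self.encoder(x)
        return {
            "pose": self.pose_head(feats),
            "seg": self.seg_head(feats),
            "depth": self.depth_head(feats),
            "normal": self.normal_head(feats),
        }

model = MultiTaskModel()
out = model(torch.randn(1, 3, 256, 256))
print({k: tuple(v.shape) for k, v in out.items()})
```

Running the snippet prints the output shapes of all four heads, showing how a single forward pass through the shared encoder serves every task.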

These tasks are fundamental for a wide range of computer vision applications, enhancing our ability to interpret and interact with visual data. Here is a brief description of each task (a small post-processing sketch follows the list):

  1. 2D Pose Estimation: This task involves detecting and locating key points on a human body in a 2D image. These key points typically correspond to joints like elbows, knees, and shoulders, helping to understand the person’s posture and movements.
  2. Body-Part Segmentation: This task segments an image into different body parts, such as the head, torso, arms, and legs. Each pixel in the image is classified as belonging to a specific body part, which is useful for applications like virtual try-ons and medical imaging.
  3. Depth Estimation: This task estimates the distance of each pixel in an image from the camera, effectively creating a 3D representation from a 2D image. It’s crucial for applications like augmented reality and autonomous driving, where understanding the spatial layout is important.
  4. Surface Normal Prediction: This task predicts the orientation of surfaces in an image. Each pixel is assigned a normal vector, which indicates the direction the surface is facing. This information is valuable for 3D reconstruction and understanding the geometry of objects in the scene.
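To make the output formats concrete, here is a small sketch of typical post-processing for two of these tasks, assuming the model emits per-keypoint heatmaps and a three-channel normal map. The tensor names and shapes are illustrative assumptions, not the exact Sapiens outputs.

```python
import torch
import torch.nn.functional as F

# Illustrative model outputs (batch of 1); these shapes are assumptions,
# not the exact Sapiens output format.
heatmaps = torch.rand(1, 17, 64, 64)   # one heatmap per keypoint (pose estimation)
normals = torch.randn(1, 3, 64, 64)    # raw 3-channel surface-normal prediction

# 2D pose: take the argmax of each heatmap as that keypoint's location.
b, k, h, w = heatmaps.shape
flat = heatmaps.view(b, k, -1)
idx = flat.argmax(dim=-1)                          # (B, K) flattened indices
ys, xs = idx // w, idx % w                         # grid coordinates of each keypoint
keypoints = torch.stack([xs, ys], dim=-1).float()  # (B, K, 2) in heatmap pixels

# Surface normals: normalize each pixel's 3-vector to unit length so it
# represents a direction the surface is facing.
unit_normals = F.normalize(normals, dim=1)

print(keypoints.shape, unit_normals.shape)
```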

State-of-the-Art Results

The Sapiens models significantly improve the state of the art on these tasks (simplified versions of the evaluation metrics are sketched in code after the list):

  • 7.6 mAP improvement on Humans-5K for pose estimation
  • 17.1 mIoU improvement on Humans-2K for body-part segmentation
  • 22.4% relative RMSE improvement on Hi4D for depth estimation
  • 53.5% relative angular error improvement on THuman2 for surface normal prediction
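For reference, simplified versions of these metrics can be computed as below. This is a generic sketch of mean IoU, RMSE, and mean angular error, not the exact evaluation protocol used in the paper; pose mAP is omitted because it involves a more elaborate keypoint-similarity matching procedure.

```python
import torch

def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union across body-part classes (segmentation)."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_labels == c, gt_labels == c
        inter = (pred_c & gt_c).sum().float()
        union = (pred_c | gt_c).sum().float()
        if union > 0:
            ious.append(inter / union)
    return torch.stack(ious).mean()

def rmse(pred_depth, gt_depth):
    """Root-mean-square error between predicted and ground-truth depth."""
    return torch.sqrt(((pred_depth - gt_depth) ** 2).mean())

def mean_angular_error_deg(pred_normals, gt_normals, eps=1e-8):
    """Mean per-pixel angle (degrees) between predicted and ground-truth normals."""
    pred = pred_normals / (pred_normals.norm(dim=1, keepdim=True) + eps)
    gt = gt_normals / (gt_normals.norm(dim=1, keepdim=True) + eps)
    cos = (pred * gt).sum(dim=1).clamp(-1.0, 1.0)
    return torch.rad2deg(torch.acos(cos)).mean()

# Quick smoke test with random tensors (shapes illustrative only).
print(float(mean_iou(torch.randint(0, 20, (64, 64)), torch.randint(0, 20, (64, 64)), 20)))
print(float(rmse(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64))))
print(float(mean_angular_error_deg(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))))
```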

The researchers attribute the state-of-the-art performance of the models to:

  1. Large-scale pretraining on a curated dataset: Pretraining on a vast and carefully curated dataset of human images allows the models to learn a wide range of human-centric features and patterns. This extensive pretraining helps the models generalize better to diverse and real-world scenarios, even when labeled data is scarce or synthetic.
  2. Scaled high-resolution and high-capacity vision transformer backbones: Using vision transformers with a high input resolution (1024 pixels) and a large number of parameters enhances the models’ ability to capture fine details and complex structures in images. This scaling lets the models handle high-resolution inputs effectively, leading to more accurate and detailed predictions across tasks (see the token-count sketch after this list).
  3. High-quality annotations on augmented studio and synthetic data: High-quality annotations provide precise and reliable ground truth for training the models. By using augmented studio and synthetic data, researchers can generate diverse and challenging scenarios that improve the models’ robustness and performance. This approach ensures that the models are well-equipped to handle real-world data with high accuracy.
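As a back-of-the-envelope illustration of point 2, the snippet below shows how the patch-token count of a ViT-style backbone grows with input resolution. A 16-pixel patch size is assumed for illustration and may not match the actual Sapiens configuration.

```python
# Rough token-count arithmetic for a ViT-style backbone.
# A 16 px patch size is an assumption for illustration.
def num_tokens(img_size, patch_size=16):
    side = img_size // patch_size
    return side * side

for res in (224, 512, 1024):
    print(f"{res}px input -> {num_tokens(res)} patch tokens")
# 224px -> 196 tokens, 512px -> 1024 tokens, 1024px -> 4096 tokens:
# self-attention cost grows quadratically with the token count, which is
# why scaling to 1024-pixel inputs is a significant engineering choice.
```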

Sapiens represents a significant advance in human-centric visual data understanding. As open-source models, they can be fine-tuned for a multitude of downstream tasks, providing a high-quality vision backbone to build on (a minimal fine-tuning sketch is shown below). This could accelerate the development of superior human-centric vision models, fostering innovation and progress in the field.
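As a rough sketch of what such fine-tuning could look like, the snippet below freezes the pretrained encoder and trains only a new decoder head, reusing the toy SharedEncoder and dense_head definitions from the earlier architecture sketch. This is a generic transfer-learning pattern, not the official Sapiens fine-tuning recipe, and the data, head, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Hypothetical fine-tuning pattern: keep the (pretrained) shared encoder frozen
# and optimize only a new decoder head for a downstream dense task.
encoder = SharedEncoder()            # in practice, loaded with pretrained weights
for p in encoder.parameters():
    p.requires_grad = False

new_head = dense_head(384, out_channels=5)   # e.g. a hypothetical 5-class dense task
optimizer = torch.optim.AdamW(new_head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

images = torch.randn(2, 3, 256, 256)          # dummy batch
targets = torch.randint(0, 5, (2, 64, 64))    # dummy dense labels at head resolution

with torch.no_grad():
    feats = encoder(images)          # frozen backbone features
logits = new_head(feats)
loss = criterion(logits, targets)
loss.backward()                      # gradients flow only into the new head
optimizer.step()
print(float(loss))
```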

