The remarkable success of large-scale pretraining followed by task-specific fine-tuning has established this approach as standard practice in language modeling. Similarly, computer vision methods are progressively embracing extensive data scales for pretraining. The emergence of huge datasets such as LAION-5B, Instagram-3.5B, JFT-300M, LVD-142M, Visual Genome, and YFCC100M has enabled the exploration of a knowledge corpus well beyond the scope of traditional benchmarks. Salient work in this domain includes DINOv2, MAWS, and AIM. DINOv2 achieves state-of-the-art performance in generating self-supervised features by scaling the contrastive iBOT method on the LVD-142M dataset. MAWS studies the scaling of masked autoencoders (MAE) on billions of images. AIM explores the scalability of autoregressive visual pretraining for vision transformers. In contrast to these methods, which mainly focus on general image pretraining or zero-shot image classification, Sapiens takes a distinctly human-centric approach: Sapiens' models leverage a vast collection of human images for pretraining and are subsequently fine-tuned for a range of human-related tasks. The pursuit of large-scale 3D human digitization remains a pivotal goal in computer vision.
Significant progress has been made within controlled or studio environments, yet challenges persist in extending these methods to unconstrained environments. To address these challenges, it is crucial to develop versatile models capable of multiple fundamental tasks, such as keypoint estimation, body-part segmentation, depth estimation, and surface normal prediction from images in natural settings. In this work, Sapiens aims to develop models for these essential human vision tasks that generalize to in-the-wild settings. Currently, the largest publicly accessible language models contain upwards of 100B parameters, while the more commonly used language models contain around 7B parameters. In contrast, Vision Transformers (ViT), despite sharing a similar architecture, have not been scaled to this extent successfully. While there are notable endeavors in this direction, including the development of a dense ViT-4B trained on both text and images, and the formulation of techniques for the stable training of a ViT-22B, commonly used vision backbones still range between 300M and 600M parameters and are primarily pretrained at an image resolution of about 224 pixels. Similarly, existing transformer-based image generation models, such as DiT, use fewer than 700M parameters and operate on a highly compressed latent space. To address this gap, Sapiens introduces a family of large, high-resolution ViT models pretrained natively at a 1024-pixel image resolution on hundreds of millions of human images.
Sapiens presents a family of models for four fundamental human-centric vision tasks: 2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction. Sapiens models natively support 1K high-resolution inference and are easy to adapt to individual tasks by simply fine-tuning models pretrained on over 300 million in-the-wild human images. Sapiens observes that, given the same computational budget, self-supervised pretraining on a curated dataset of human images significantly boosts performance across a diverse set of human-centric tasks. The resulting models exhibit remarkable generalization to in-the-wild data, even when labeled data is scarce or entirely synthetic. The straightforward model design also brings scalability: model performance across tasks improves as the number of parameters scales from 0.3 to 2 billion. Sapiens consistently surpasses existing baselines across various human-centric benchmarks, achieving significant improvements over prior state-of-the-art results: +7.6 mAP on Humans-5K (pose), +17.1 mIoU on Humans-2K (part segmentation), 22.4% relative RMSE improvement on Hi4D (depth), and 53.5% relative angular error improvement on THuman2 (normal).
Recent years have witnessed remarkable strides toward generating photorealistic humans in 2D and 3D. The success of these methods is largely attributable to the robust estimation of various assets such as 2D keypoints, fine-grained body-part segmentation, depth, and surface normals. Nevertheless, robust and accurate estimation of these assets remains an active research area, and the complicated systems used to boost performance on individual tasks often hinder wider adoption. Furthermore, obtaining accurate ground-truth annotation in the wild is notoriously difficult to scale. Sapiens' goal is to provide a unified framework and models to infer these assets in the wild, unlocking a wide range of human-centric applications for everyone.
Sapiens argues that such human-centric models should satisfy three criteria: generalization, broad applicability, and high fidelity. Generalization ensures robustness to unseen conditions, enabling the model to perform consistently across varied environments. Broad applicability indicates the versatility of the model, making it suitable for a wide range of tasks with minimal modifications. High fidelity denotes the ability of the model to produce precise, high-resolution outputs, essential for faithful human generation tasks. This paper details the development of models that embody these attributes, collectively called Sapiens.
Following these insights, Sapiens leverages large datasets and scalable model architectures, both key to generalization. For broader applicability, Sapiens adopts the pretrain-then-finetune approach, enabling post-pretraining adaptation to specific tasks with minimal adjustments. This approach raises a critical question: what kind of data is most effective for pretraining? Given computational limits, should the emphasis be on collecting as many human images as possible, or is it preferable to pretrain on a less curated set to better reflect real-world variability? Existing methods often overlook the pretraining data distribution in the context of downstream tasks. To study the influence of the pretraining data distribution on human-specific tasks, Sapiens collects the Humans-300M dataset, featuring 300 million diverse human images. These unlabeled images are used to pretrain a family of vision transformers from scratch, with parameter counts ranging from 300M to 2B.
Among various self-supervision methods for learning general-purpose visual features from large datasets, Sapiens chooses the masked-autoencoder (MAE) approach for its simplicity and efficiency in pretraining. Because MAE requires only a single pass per image, unlike contrastive or multi-inference strategies, it allows processing a larger volume of images with the same computational resources. For higher fidelity, in contrast to prior methods, Sapiens increases the native input resolution of its pretraining to 1024 pixels, resulting in roughly a 4× increase in FLOPs compared to the largest existing vision backbone. Each model is pretrained on 1.2 trillion tokens. For fine-tuning on human-centric tasks, Sapiens uses a consistent encoder-decoder architecture. The encoder is initialized with weights from pretraining, while the decoder, a lightweight task-specific head, is initialized randomly. Both components are then fine-tuned end-to-end. Sapiens focuses on four key tasks: 2D pose estimation, body-part segmentation, depth, and normal estimation, as demonstrated in the following figure.
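As an illustration of this encoder-decoder fine-tuning recipe, the sketch below pairs a toy stand-in for the pretrained encoder with a lightweight, randomly initialized head; the module names, checkpoint path, and head design are illustrative assumptions, not Sapiens' actual implementation.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Minimal stand-in for the pretrained ViT encoder (patchify + transformer blocks)."""
    def __init__(self, dim=1024, depth=2, heads=16, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        feat = self.patch_embed(x)                    # (B, dim, H/16, W/16)
        h, w = feat.shape[-2:]
        tokens = self.blocks(feat.flatten(2).transpose(1, 2))
        return tokens.transpose(1, 2).reshape(x.shape[0], -1, h, w)

class TaskModel(nn.Module):
    """Pretrained encoder plus a lightweight, randomly initialized task head."""
    def __init__(self, encoder, dim=1024, out_channels=308):
        super().__init__()
        self.encoder = encoder                        # weights would come from MAE pretraining
        self.head = nn.Sequential(                    # task-specific decoder, random init
            nn.ConvTranspose2d(dim, 256, kernel_size=2, stride=2),
            nn.GELU(),
            nn.Conv2d(256, out_channels, kernel_size=1),
        )

    def forward(self, x):
        return self.head(self.encoder(x))

encoder = ToyEncoder()
# encoder.load_state_dict(torch.load("sapiens_mae_encoder.pth"))  # hypothetical checkpoint path
model = TaskModel(encoder, out_channels=308)          # e.g. one heatmap per keypoint
out = model(torch.randn(1, 3, 256, 256))              # Sapiens natively uses 1024 x 1024
print(out.shape)                                      # torch.Size([1, 308, 32, 32])
```

Both the encoder and the head are then optimized jointly, so the pretrained features adapt to the task while the head learns its output space from scratch.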
Consistent with prior studies, Sapiens affirms the critical impact of label quality on a model's in-the-wild performance. Public benchmarks often contain noisy labels, providing inconsistent supervisory signals during fine-tuning. At the same time, it is important to use fine-grained and precise annotations that align closely with Sapiens' primary goal of 3D human digitization. To this end, Sapiens proposes a substantially denser set of 2D whole-body keypoints for pose estimation and a detailed class vocabulary for body-part segmentation, surpassing the scope of previous datasets. Specifically, Sapiens introduces a comprehensive collection of 308 keypoints encompassing the body, hands, feet, surface, and face. Moreover, Sapiens expands the segmentation class vocabulary to 28 classes, covering body parts such as the hair, tongue, teeth, upper/lower lip, and torso. To ensure the quality and consistency of annotations along with a high degree of automation, Sapiens uses a multi-view capture setup to collect pose and segmentation annotations. Sapiens also uses human-centric synthetic data for depth and normal estimation, leveraging 600 detailed scans from RenderPeople to generate high-resolution depth maps and surface normals. Sapiens demonstrates that the combination of domain-specific large-scale pretraining with limited, yet high-quality, annotations leads to robust in-the-wild generalization. Overall, Sapiens' method demonstrates an effective strategy for developing highly precise discriminative models capable of performing in real-world scenarios without the need to collect a costly and diverse set of annotations.

Sapiens: Method and Architecture
Sapiens follows the masked-autoencoder (MAE) approach for pretraining. The model is trained to reconstruct the original human image given a partial observation of it. Like all autoencoders, Sapiens' model has an encoder that maps the visible portion of the image to a latent representation and a decoder that reconstructs the original image from this latent representation. The pretraining dataset consists of both single- and multi-human images, with each image resized to a fixed size with a square aspect ratio. Similar to ViT, the image is divided into regular non-overlapping patches of a fixed patch size. A subset of these patches is randomly selected and masked, leaving the rest visible. The proportion of masked patches, known as the masking ratio, remains fixed throughout training.
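A minimal sketch of the patchify-and-mask step described above is given below; the helper is illustrative, and the 0.75 masking ratio is a common MAE default rather than a value stated in this article.

```python
import torch

def random_masking(images, patch=16, mask_ratio=0.75):
    """Split images into non-overlapping patches and randomly mask a fixed ratio.
    images: (B, 3, H, W) with H and W divisible by `patch`."""
    B, C, H, W = images.shape
    # Patchify: (B, N, patch*patch*C) where N = (H/patch) * (W/patch)
    patches = images.unfold(2, patch, patch).unfold(3, patch, patch)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    N = patches.shape[1]
    n_keep = int(N * (1 - mask_ratio))
    # Per-image random permutation of patch indices; keep the first n_keep as visible.
    noise = torch.rand(B, N)
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)      # 1 = masked, 0 = visible
    return visible, mask, ids_keep

imgs = torch.randn(2, 3, 1024, 1024)     # fixed square size, as in pretraining
visible, mask, _ = random_masking(imgs, patch=16, mask_ratio=0.75)
print(visible.shape, mask.sum(dim=1))    # (2, 1024, 768); 3072 of 4096 patches masked
```

The encoder only ever sees the visible patches, which is what makes the single-pass MAE objective cheap relative to contrastive schemes.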
Sapiens' models exhibit generalization across a wide range of image characteristics, including scale, crop, the age and ethnicity of subjects, and the number of subjects. Each patch token in the model accounts for 0.02% of the image area, compared to 0.4% in standard ViTs, a 16× reduction that provides fine-grained inter-token reasoning. Even with an increased masking ratio of 95%, Sapiens' model achieves plausible reconstructions of human anatomy on held-out samples. The reconstructions of the pretrained Sapiens model on unseen human images are shown in the following image.
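The token-granularity numbers follow directly from the resolution and patch size; a quick check, assuming a 224-pixel baseline ViT with 14-pixel patches:

```python
# Fraction of the image area covered by a single patch token.
sapiens_frac = 16**2 / 1024**2        # 1024-px input, 16-px patches
baseline_frac = 14**2 / 224**2        # typical 224-px ViT with 14-px patches (assumed baseline)
print(f"{sapiens_frac:.4%}")          # ~0.0244% of the image per token
print(f"{baseline_frac:.4%}")         # ~0.3906% of the image per token
print(baseline_frac / sapiens_frac)   # ~16x finer granularity
```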

Moreover, Sapiens uses a large proprietary dataset for pretraining, consisting of roughly 1 billion in-the-wild images focused exclusively on humans. Preprocessing involves discarding images with watermarks, text, artistic depictions, or unnatural elements. Sapiens then uses an off-the-shelf person bounding-box detector to filter the images, retaining those with a detection score above 0.9 and bounding-box dimensions exceeding 300 pixels. Over 248 million images in the dataset contain multiple subjects.
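A sketch of this detector-based filtering rule is shown below; only the 0.9 score threshold and 300-pixel box size come from the article, while the function and its input format are hypothetical.

```python
def keep_image(detections, min_score=0.9, min_box_px=300):
    """Retain an image if at least one detected person passes both thresholds.
    `detections` is a list of (score, (x1, y1, x2, y2)) tuples from any
    off-the-shelf person detector."""
    for score, (x1, y1, x2, y2) in detections:
        if score > min_score and (x2 - x1) > min_box_px and (y2 - y1) > min_box_px:
            return True
    return False

# Example: one confident, sufficiently large person box -> the image is kept.
print(keep_image([(0.97, (100, 50, 520, 900)), (0.42, (0, 0, 80, 120))]))  # True
```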
2D Pose Estimation
The Sapiens framework fine-tunes the encoder and decoder of the pose model P across multiple skeletons, including K = 17 [67], K = 133 [55], and a new highly detailed skeleton with K = 308, as shown in the following figure.

Compared to existing formats with at most 68 facial keypoints, Sapiens' annotations consist of 243 facial keypoints, including representative points around the eyes, lips, nose, and ears. This design is tailored to meticulously capture the nuanced details of facial expressions in the real world. With these keypoints, the Sapiens framework manually annotated 1 million images at 4K resolution from an indoor capture setup. Similar to the other tasks, Sapiens sets the decoder output channels of the surface normal estimator N to 3, corresponding to the xyz components of the normal vector at each pixel. The generated synthetic data is also used as supervision for surface normal estimation.
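Since each task reuses the same encoder and differs mainly in the number of decoder output channels, a schematic of the per-task head configuration might look like the following; the head architecture itself is an illustrative assumption.

```python
import torch.nn as nn

# Illustrative task heads sharing one pretrained encoder; only the number of
# output channels changes per task (channel counts as described in the article).
TASK_CHANNELS = {
    "pose_308": 308,   # one heatmap per keypoint (17- and 133-point skeletons also supported)
    "part_seg": 28,    # per-pixel logits over the 28 body-part classes
    "depth": 1,        # per-pixel depth value
    "normal": 3,       # xyz components of the surface normal per pixel
}

def make_head(embed_dim: int, task: str) -> nn.Module:
    """Lightweight, randomly initialized decoder head for one task (assumed design)."""
    return nn.Sequential(
        nn.ConvTranspose2d(embed_dim, 256, kernel_size=4, stride=4),
        nn.GELU(),
        nn.Conv2d(256, TASK_CHANNELS[task], kernel_size=1),
    )

heads = {task: make_head(1024, task) for task in TASK_CHANNELS}
print({task: head[-1].out_channels for task, head in heads.items()})
```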

Sapiens: Experiments and Results
Sapiens-2B is pretrained using 1024 A100 GPUs for 18 days with PyTorch. Sapiens uses the AdamW optimizer for all experiments. The learning-rate schedule includes a brief linear warm-up, followed by cosine annealing for pretraining and linear decay for fine-tuning. All models are pretrained from scratch at a resolution of 1024 × 1024 with a patch size of 16. For fine-tuning, the input image is resized to a 4:3 ratio, i.e., 1024 × 768. Sapiens applies standard augmentations such as cropping, scaling, flipping, and photometric distortions. A random background from non-human COCO images is added for the segmentation, depth, and normal prediction tasks. Importantly, Sapiens uses differential learning rates to preserve generalization, with lower learning rates for early layers and progressively higher rates for later layers. The layer-wise learning-rate decay is set to 0.85, with a weight decay of 0.1 for the encoder.
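The layer-wise learning-rate decay can be sketched as follows, assuming the encoder exposes an ordered list of transformer blocks; the base learning rate and warm-up length are placeholders, while the 0.85 decay and 0.1 weight decay are the values quoted above.

```python
import torch

def layerwise_param_groups(encoder_blocks, base_lr=1e-4, decay=0.85, weight_decay=0.1):
    """Earlier (lower) layers get smaller learning rates: lr_i = base_lr * decay**(L - 1 - i)."""
    L = len(encoder_blocks)
    groups = []
    for i, block in enumerate(encoder_blocks):
        groups.append({
            "params": list(block.parameters()),
            "lr": base_lr * (decay ** (L - 1 - i)),
            "weight_decay": weight_decay,
        })
    return groups

# Usage with a hypothetical `encoder.blocks` (an nn.ModuleList of transformer blocks):
# optimizer = torch.optim.AdamW(layerwise_param_groups(encoder.blocks), lr=1e-4)
# Warm-up followed by cosine annealing, mirroring the schedule described above:
# warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=1000)
# cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
# scheduler = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine], milestones=[1000])
```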
The design specifications of Sapiens are detailed in the following table. In line with prior scaling practice, Sapiens prioritizes scaling models by width rather than depth. Notably, the Sapiens-0.3B model, while architecturally similar to the traditional ViT-Large, requires roughly twenty times more FLOPs due to its higher resolution.

Sapiens is fine-tuned for face, body, feet, and hand (K = 308) pose estimation using high-fidelity annotations. For training, Sapiens uses the train set with 1M images, and for evaluation, it uses the test set, named Humans-5K, with 5K images. The evaluation follows a top-down approach, where Sapiens uses an off-the-shelf detector to obtain bounding boxes and then performs single-human pose inference. Table 3 compares Sapiens models with existing methods for whole-body pose estimation. All methods are evaluated on the 114 keypoints common to Sapiens' 308-keypoint vocabulary and the 133-keypoint vocabulary of COCO-WholeBody. Sapiens-0.6B surpasses the previous state-of-the-art, DWPose-l, by +2.8 AP. Unlike DWPose, which uses a complex student-teacher framework with feature distillation tailored to the task, Sapiens adopts a general encoder-decoder architecture with large-scale human-centric pretraining.
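Evaluating on the shared subset of two keypoint vocabularies can be sketched as below; the keypoint names and array shapes are illustrative, not the actual Sapiens evaluation code.

```python
import numpy as np

def common_keypoint_subset(preds, gts, pred_names, gt_names):
    """Restrict evaluation to keypoints shared between two vocabularies (e.g. the 114
    points common to the 308-point format and 133-point COCO-WholeBody).
    preds: (N, K_pred, 2), gts: (N, K_gt, 2); names are per-vocabulary keypoint labels."""
    common = [n for n in pred_names if n in set(gt_names)]
    p_idx = [pred_names.index(n) for n in common]
    g_idx = [gt_names.index(n) for n in common]
    return preds[:, p_idx], gts[:, g_idx]

# Toy example with 3-point vocabularies sharing two keypoint names.
preds, gts = np.random.rand(5, 3, 2), np.random.rand(5, 3, 2)
p, g = common_keypoint_subset(preds, gts, ["nose", "l_eye", "chin"], ["nose", "l_eye", "neck"])
print(p.shape, g.shape)   # (5, 2, 2) (5, 2, 2)
```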
Interestingly, even with the same parameter count, Sapiens models exhibit superior performance compared to their counterparts. For instance, Sapiens-0.3B exceeds VitPose+-L by +5.6 AP, and Sapiens-0.6B outperforms VitPose+-H by +7.9 AP. Within the Sapiens family, the results indicate a direct correlation between model size and performance. Sapiens-2B sets a new state-of-the-art with 61.1 AP, a significant improvement of +7.6 AP over the prior art. Despite being fine-tuned with annotations from an indoor capture studio, Sapiens demonstrates robust generalization to real-world scenarios, as shown in the following figure.

Sapiens is fine-tuned and evaluated using a segmentation vocabulary of 28 classes. The train set consists of 100K images, while the test set, Humans-2K, consists of 2K images. Sapiens is compared with existing body-part segmentation methods fine-tuned on the same train set, using each method's suggested pretrained checkpoint as initialization. Similar to pose estimation, Sapiens shows strong generalization in segmentation, as demonstrated in the following table.

Interestingly, the smallest model, Sapiens-0.3B, outperforms existing state-of-the-art segmentation methods such as Mask2Former and DeepLabV3+ by 12.6 mIoU, owing to its higher resolution and large-scale human-centric pretraining. Moreover, increasing the model size further improves segmentation performance. Sapiens-2B achieves the best performance, with 81.2 mIoU and 89.4 mAcc on the test set. The following figure shows qualitative results of the Sapiens models.
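For reference, the mIoU and mAcc metrics reported here follow the standard definitions, sketched below with a generic implementation rather than Sapiens' exact evaluation code.

```python
import numpy as np

def miou_macc(pred, gt, num_classes=28, ignore_index=255):
    """Mean IoU and mean class accuracy from per-pixel predictions.
    pred, gt: integer label maps of identical shape."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious, accs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        if g.sum() == 0:
            continue                      # skip classes absent from the ground truth
        inter, union = (p & g).sum(), (p | g).sum()
        ious.append(inter / union)
        accs.append(inter / g.sum())
    return float(np.mean(ious)), float(np.mean(accs))

# Toy check on random 28-class label maps.
pred = np.random.randint(0, 28, (512, 512))
gt = np.random.randint(0, 28, (512, 512))
print(miou_macc(pred, gt))
```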

Conclusion
Sapiens represents a significant step toward advancing human-centric vision models into the realm of foundation models. Sapiens models exhibit strong generalization capabilities across a wide range of human-centric tasks. The state-of-the-art performance is attributed to: (i) large-scale pretraining on a curated dataset specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data. Sapiens models have the potential to become a key building block for a multitude of downstream tasks and to provide access to high-quality vision backbones to a significantly wider part of the community.
