Estimating Facial Attractiveness Prediction for Livestreams


So far, Facial Attractiveness Prediction (FAP) has primarily been studied within the context of psychological research, the beauty and cosmetics industry, and cosmetic surgery. It is a difficult field of study, since standards of beauty tend to be national rather than global.

This means that no single effective AI-based dataset is viable, since the mean averages obtained from sampling faces/ratings from all cultures would either be very biased (where more populous nations would gain extra traction), or else applicable to no one (where the mean average of multiple races/ratings would equate to no actual race).

Instead, the challenge is to develop approaches and workflows into which country- or culture-specific data can be fed, to enable the development of effective per-region FAP models.

The use cases for FAP in beauty and psychological research are quite marginal, or else industry-specific; therefore many of the datasets curated to date contain only limited data, or have not been published at all.

The online attractiveness predictors that are readily available, mostly geared toward western audiences, do not necessarily represent the state of the art in FAP, which currently seems dominated by east Asian research (primarily China), and by corresponding east Asian datasets.

Source: https://www.semanticscholar.org/paper/Asian-Female-Facial-Beauty-Prediction-Using-Deep-Zhai-Huang/59776a6fb0642de5338a3dd9bac112194906bf30

Broader industrial uses for beauty estimation include online dating apps, and generative AI systems designed to 'touch up' real avatar images of people (since such applications require a quantized standard of beauty as a metric of effectiveness).

Drawing Faces

Attractive individuals continue to be a valuable asset in advertising and influence-building, making the financial incentives in these sectors a clear opportunity for advancing state-of-the-art FAP datasets and frameworks.

For instance, an AI model trained with real-world data to assess and rate facial beauty could potentially identify events or individuals with high potential for advertising impact. This capability would be especially relevant in live video streaming contexts, where metrics such as 'followers' and 'likes' currently serve only as indicators of a person's (or a facial type's) ability to captivate an audience.

This is a superficial metric, of course, and voice, presentation and viewpoint also play a significant role in audience-gathering. Therefore the curation of FAP datasets requires human oversight, as well as the ability to distinguish facial from 'specious' attractiveness (without which, out-of-domain influencers such as Alex Jones could end up affecting the average FAP curve for a collection designed solely to estimate facial beauty).

LiveBeauty

To address the shortage of FAP datasets, researchers from China are offering the first large-scale FAP dataset, containing 10,000 face images, together with 200,000 human annotations estimating facial beauty.

Samples from the new LiveBeauty dataset. Source: https://arxiv.org/pdf/2501.02509


Entitled LiveBeauty, the dataset features 10,000 different identities, all captured from (unspecified) live streaming platforms in March of 2024.

The authors also present FPEM, a novel multi-modal FAP method. FPEM integrates holistic facial prior knowledge and multi-modal aesthetic semantic features via a Personalized Attractiveness Prior Module (PAPM), a Multi-modal Attractiveness Encoder Module (MAEM), and a Cross-Modal Fusion Module (CMFM).

The paper contends that FPEM achieves state-of-the-art performance on the new LiveBeauty dataset, as well as on other FAP datasets. The authors note that the research has potential applications for enhancing video quality, content recommendation, and facial retouching in live streaming.

The authors also promise to make the dataset available 'soon', though it should be conceded that any licensing restrictions inherent in the source domain seem likely to pass on to the majority of applicable projects that might make use of the work.

The new paper comes from ten researchers across the Alibaba Group and Shanghai Jiao Tong University.

Method and Data

From each 10-hour broadcast on the live streaming platforms, the researchers culled one image per hour for the first three hours. The broadcasts with the highest page views were selected.

The collected data was then subject to several pre-processing stages. The first of these is face detection, which uses the 2018 CPU-based FaceBoxes detection model to generate a bounding box around the facial lineaments. The pipeline ensures that the bounding box's shorter side exceeds 90 pixels, avoiding small or unclear face regions.

The second step is blur filtering, which is applied to the face region by using the variance of the Laplacian operator on the luminance (Y) channel of the facial crop. This variance must be greater than 10, which helps to filter out blurred images.
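
A minimal sketch of how this sharpness check could be implemented with OpenCV is shown below; the threshold of 10 comes from the description above, while the conversion route to the Y channel and the function name are illustrative assumptions.

```python
import cv2

def is_sharp_enough(face_crop_bgr, threshold=10.0):
    """Reject blurred face crops via the variance of the Laplacian
    of the luminance (Y) channel, as described above."""
    # Convert the BGR crop to YCrCb and keep only the luminance (Y) channel
    y_channel = cv2.cvtColor(face_crop_bgr, cv2.COLOR_BGR2YCrCb)[:, :, 0]
    # Variance of the Laplacian response: low values indicate blur
    variance = cv2.Laplacian(y_channel, cv2.CV_64F).var()
    return variance > threshold

# Example usage: is_sharp_enough(cv2.imread("face_crop.jpg"))
```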

The third step is face pose estimation, which uses the 2021 3DDFA-V2 pose estimation model:

Examples from the 3DDFA-V2 estimation model. Source: https://arxiv.org/pdf/2009.09960


Here the workflow ensures that the pitch angle of the cropped face is no greater than 20 degrees, and the yaw angle no greater than 15 degrees, which excludes faces with extreme poses.

The fourth step is face proportion filtering, which also uses the segmentation capabilities of the 3DDFA-V2 model, ensuring that the cropped face region occupies more than 60% of the image, excluding images where the face is not prominent, i.e., small in the overall picture.
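
The pose and proportion checks from the third and fourth steps reduce to simple threshold comparisons. A minimal sketch follows, assuming that pitch and yaw angles (in degrees) and the face-region area are already available from a pose estimator such as 3DDFA-V2; the function and parameter names are illustrative.

```python
def passes_pose_and_proportion(pitch_deg, yaw_deg, face_area, image_area,
                               max_pitch=20.0, max_yaw=15.0, min_ratio=0.60):
    """Apply the pose and face-proportion filters described above.

    pitch_deg, yaw_deg : head pose angles in degrees (e.g. from 3DDFA-V2)
    face_area          : pixel area of the cropped face region
    image_area         : pixel area of the full frame
    """
    pose_ok = abs(pitch_deg) <= max_pitch and abs(yaw_deg) <= max_yaw
    proportion_ok = (face_area / image_area) > min_ratio
    return pose_ok and proportion_ok
```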

Finally, the fifth step is duplicate identity removal, which uses an (unattributed) state-of-the-art face recognition model to handle cases where the same identity appears in more than one of the three images collected for a 10-hour video.
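
Since the paper does not name the face recognition model, the following sketch assumes only that it produces one embedding vector per image; the similarity threshold and helper name are illustrative assumptions.

```python
import numpy as np

def deduplicate_by_identity(embeddings, similarity_threshold=0.6):
    """Keep at most one image per identity, given face embeddings from any
    face recognition model (one vector per image).

    The cosine-similarity threshold is an assumption; the paper does not
    specify one. Returns the indices of images to keep."""
    kept, kept_vecs = [], []
    for idx, emb in enumerate(embeddings):
        vec = emb / np.linalg.norm(emb)
        # Keep the image only if it is dissimilar to every identity kept so far
        if all(float(vec @ kv) < similarity_threshold for kv in kept_vecs):
            kept.append(idx)
            kept_vecs.append(vec)
    return kept
```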

Human Evaluation and Annotation

Twenty annotators were recruited, consisting of six males and 14 females, reflecting the demographics of the live platform used. Faces were displayed on the 6.7-inch screen of an iPhone 14 Pro Max, under consistent laboratory conditions.

Evaluation was split across 200 sessions, each of which employed 50 images. Subjects were asked to rate the facial attractiveness of the samples on a scale of 1-5, with a five-minute break enforced between each session, and with all subjects participating in all sessions.

Therefore the entirety of the 10,000 images was evaluated across twenty human subjects, arriving at 200,000 annotations.

Evaluation and Pre-Processing

First, subject post-screening was performed using outlier ratio and Spearman's Rank Correlation Coefficient (SROCC). Subjects whose ratings had an SROCC lower than 0.75, or an outlier ratio greater than 2%, were deemed unreliable and were removed, with 20 subjects finally retained.

A Mean Opinion Score (MOS) was then computed for each face image by averaging the scores obtained from the valid subjects. The MOS serves as the ground-truth attractiveness label for each image.
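
A minimal sketch of this screening and MOS computation is shown below; the exact definition of the outlier ratio is not given in the article, so the two-standard-deviation rule used here is an assumption.

```python
import numpy as np
from scipy.stats import spearmanr

def screen_subjects_and_compute_mos(ratings, srocc_min=0.75, outlier_max=0.02):
    """ratings: array of shape (num_subjects, num_images) with scores 1-5.

    A subject is kept if the Spearman correlation between their scores and
    the mean of the remaining subjects is at least srocc_min, and if the
    fraction of their scores lying more than two standard deviations from
    the per-image mean (the 'outlier ratio', an assumption here) is at most
    outlier_max. The MOS is the mean of the retained subjects' scores."""
    ratings = np.asarray(ratings, dtype=float)
    valid = []
    for s in range(ratings.shape[0]):
        others_mean = np.delete(ratings, s, axis=0).mean(axis=0)
        srocc, _ = spearmanr(ratings[s], others_mean)
        mu, sigma = ratings.mean(axis=0), ratings.std(axis=0) + 1e-8
        outlier_ratio = np.mean(np.abs(ratings[s] - mu) > 2 * sigma)
        if srocc >= srocc_min and outlier_ratio <= outlier_max:
            valid.append(s)
    mos = ratings[valid].mean(axis=0)  # ground-truth label per image
    return valid, mos
```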

Finally, analysis of the MOS distributions for all samples, as well as for the female and male samples separately, indicated that they exhibited a Gaussian-style shape, which is consistent with real-world facial attractiveness distributions:

Examples of LiveBeauty MOS distributions.

Most people tend to have average facial attractiveness, with fewer individuals at the extremes of very low or very high attractiveness.

Further, analysis of skewness and kurtosis values showed that the distributions were characterized by thin tails and were concentrated around the average score for the collected live streaming videos.
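
These shape statistics can be computed directly from the MOS values; the following sketch simply uses SciPy's skewness and kurtosis functions and is not taken from the paper.

```python
from scipy.stats import skew, kurtosis

def describe_mos_distribution(mos_scores):
    """Summarize a set of MOS values: skewness near zero suggests a
    symmetric, Gaussian-like distribution, while kurtosis describes how
    the tails compare with those of a normal distribution."""
    return {
        "skewness": float(skew(mos_scores)),
        "excess_kurtosis": float(kurtosis(mos_scores)),  # Fisher definition: normal = 0
    }
```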

Architecture

A two-stage training strategy was used for the Facial Prior Enhanced Multi-modal model (FPEM) and the Hybrid Fusion Phase in LiveBeauty, split across four modules: a Personalized Attractiveness Prior Module (PAPM), a Multi-modal Attractiveness Encoder Module (MAEM), a Cross-Modal Fusion Module (CMFM), and a Decision Fusion Module (DFM).

Conceptual schema for LiveBeauty's training pipeline.

The PAPM module takes an image as input and extracts multi-scale visual features using a Swin Transformer, and also extracts face-aware features using a pretrained FaceNet model. These features are then combined using a cross-attention block to create a personalized 'attractiveness' feature.
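
A conceptual PyTorch sketch of this kind of cross-attention fusion is shown below; the dimensions, module layout, and class name are assumptions, not the paper's exact PAPM architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Illustrative fusion of Swin Transformer visual tokens (queries) with
    FaceNet-style identity features (keys/values); not the paper's exact PAPM."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, face_tokens):
        # visual_tokens: (B, N, dim) from the image backbone
        # face_tokens:   (B, M, dim) projected face-recognition features
        fused, _ = self.attn(query=visual_tokens, key=face_tokens, value=face_tokens)
        # Residual connection keeps the original visual information
        return self.norm(visual_tokens + fused)
```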

Also in the Preliminary Training Phase, MAEM uses an image and text descriptions of attractiveness, leveraging CLIP to extract multi-modal aesthetic semantic features.

The templated text descriptions follow a fixed prompt format into which one of several attractiveness-level words is inserted. The method estimates the cosine similarity between the textual and visual embeddings to arrive at an attractiveness level probability.
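
A minimal sketch of this CLIP-based scoring is shown below, using the Hugging Face transformers implementation; the prompt wording, the level names, and the checkpoint are illustrative assumptions rather than the paper's exact templates.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative attractiveness-level prompts; the paper's exact wording and
# level names are not reproduced here.
levels = ["bad", "poor", "fair", "good", "perfect"]
prompts = [f"a photo of a person with {lvl} facial attractiveness" for lvl in levels]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("face_crop.jpg")
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Scaled image-text cosine similarities, softmaxed into a probability
# distribution over the attractiveness levels
level_probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(levels, level_probs.squeeze(0).tolist())))
```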

In the Hybrid Fusion Phase, the CMFM refines the textual embeddings using the personalized attractiveness feature generated by the PAPM, thereby generating personalized textual embeddings. It then uses a similarity regression strategy to make a prediction.

Finally, the DFM combines the individual predictions from the PAPM, MAEM, and CMFM to produce a single, final attractiveness score, with the goal of achieving a robust consensus.
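
At its simplest, such a fusion can be a weighted combination of the three branch scores; the equal weights in the sketch below are an assumption, since the paper's DFM may weight or learn the combination differently.

```python
def fuse_predictions(papm_score, maem_score, cmfm_score, weights=(1/3, 1/3, 1/3)):
    """Combine the three branch predictions into one final attractiveness score.
    Equal weights are an assumption, not the paper's specified scheme."""
    w1, w2, w3 = weights
    return w1 * papm_score + w2 * maem_score + w3 * cmfm_score
```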

Loss Functions

For loss metrics, the PAPM is trained using an L1 loss, a measure of the absolute difference between the predicted attractiveness score and the actual (ground truth) attractiveness score.

The MAEM module uses a more complex loss function that combines a scoring loss (LS) with a merged ranking loss (LR). The ranking loss (LR) comprises a fidelity loss (LR1) and a two-direction ranking loss (LR2).

LR1 compares the relative attractiveness of image pairs, while LR2 ensures that the predicted probability distribution of attractiveness levels has a single peak and decreases in both directions away from it. This combined approach aims to optimize both the accurate scoring and the correct ranking of images based on attractiveness.

The CMFM and the DFM are trained using a simple L1 loss.
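
A sketch of an L1 score loss and one common formulation of a pairwise fidelity loss is shown below; the exact forms of LS, LR1 and LR2 used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def l1_score_loss(pred_scores, gt_scores):
    """Absolute difference between predicted and ground-truth MOS (the kind of
    L1 loss used for PAPM, CMFM and DFM in the scheme described above)."""
    return F.l1_loss(pred_scores, gt_scores)

def fidelity_ranking_loss(pred_i, pred_j, gt_i, gt_j, eps=1e-8):
    """One common form of pairwise fidelity loss: the predicted probability
    that image i is more attractive than image j should match the ordering
    implied by the ground-truth scores. This is an illustrative formulation,
    not necessarily the paper's exact LR1."""
    p_gt = (torch.sign(gt_i - gt_j) + 1) / 2      # 0, 0.5 or 1 from ground truth
    p_pred = torch.sigmoid(pred_i - pred_j)       # model's pairwise belief
    return 1 - torch.sqrt(p_pred * p_gt + eps) - torch.sqrt((1 - p_pred) * (1 - p_gt) + eps)
```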

Tests

In tests, the researchers pitted LiveBeauty against nine prior approaches: ComboNet; 2D-FAP; REX-INCEP; CNN-ER (featured in REX-INCEP); MEBeauty; AVA-MLSP; TANet; Dele-Trans; and EAT.

Baseline methods conforming to an Image Aesthetic Assessment (IAA) protocol were also tested. These were ViT-B; ResNeXt-50; and Inception-V3.

Besides LiveBeauty, the other datasets tested were SCUT-FBP5500 and MEBeauty. Below, the MOS distributions of these datasets are compared:

MOS distributions of the benchmark datasets.

Respectively, these guest datasets were split 60%-40% and 80%-20% for training and testing, in order to maintain consistency with their original protocols. LiveBeauty was split on a 90%-10% basis.

For model initialization in MAEM, ViT-B/16 and GPT-2 were used as the image and text encoders, respectively, initialized with settings from CLIP. For PAPM, Swin-T was used as a trainable image encoder, in accordance with SwinFace.

The AdamW optimizer was used, with a learning rate scheduler set with linear warm-up under a cosine annealing scheme. Learning rates differed across the training phases, but each phase used a batch size of 32, for 50 epochs.
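
A minimal sketch of this optimizer and scheduler setup in PyTorch follows; the learning rate, warm-up length and step counts are placeholders, since the article does not give the per-phase values.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, base_lr=1e-4, warmup_steps=500, total_steps=10_000):
    """AdamW with linear warm-up followed by cosine annealing. The learning
    rate, warm-up length and step counts here are placeholders, not the
    paper's per-phase settings."""
    optimizer = AdamW(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)           # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to zero

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```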

Results from tests

Results from tests on the three FAP datasets are shown above.

Ethical Considerations

Research into attractiveness is a potentially divisive pursuit, since in establishing supposedly empirical standards of beauty, such systems will tend to reinforce biases around age, race, and many other facets of computer vision research as it pertains to humans.

It could be argued that a FAP system is inherently prone to reinforce and perpetuate partial and biased perspectives on attractiveness. These judgments may arise from human-led annotations – often conducted on scales too limited for effective domain generalization – or from analyzing attention patterns in online environments such as streaming platforms, which are, arguably, far from being meritocratic.
