
Pioneering Advances in AI: The Role of Multimodal Large Language Models in Transforming Age and Gender Prediction

Multimodal Large Language Models (MLLMs) have evolved significantly, particularly models that combine language and vision modalities (LVMs). Interest is growing in applying MLLMs to computer vision tasks and in integrating them into complex pipelines.

Although some models, such as ShareGPT4V, perform well in data annotation tasks, their practical deployment is often hindered by high cost. Cost-effective specialized models like MiVOLO may therefore be preferable. A comparison between general-purpose MLLMs and specialized models like MiVOLO reveals significant differences in computational cost and speed for tasks such as labeling new data or filtering existing datasets.

Researchers from SaluteDevices have developed MiVOLOv2, an upgrade to the original model. The improved version outperforms not only its predecessor but also specialized baselines built on CNN architectures such as ResNet34 and GoogLeNet. The model is evaluated with standard metrics: Mean Absolute Error (MAE) for age estimation, accuracy for gender prediction, and the Cumulative Score at 5 (CS@5), the share of age predictions that fall within five years of the true age.
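To make the three metrics concrete, here is a minimal sketch in plain Python. The function names and the toy data are illustrative assumptions and are not taken from the MiVOLO codebase; CS@5 is computed under its usual definition, the percentage of predictions whose absolute age error is at most five years.

```python
from typing import Sequence


def mean_absolute_error(true_ages: Sequence[float], pred_ages: Sequence[float]) -> float:
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(pred_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(true_ages, pred_ages)) / len(true_ages)


def cumulative_score(true_ages: Sequence[float], pred_ages: Sequence[float],
                     threshold: float = 5.0) -> float:
    """Percentage of predictions with absolute error <= threshold (CS@5 by default)."""
    if len(true_ages) != len(pred_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for t, p in zip(true_ages, pred_ages) if abs(t - p) <= threshold)
    return 100.0 * hits / len(true_ages)


def gender_accuracy(true_labels: Sequence[str], pred_labels: Sequence[str]) -> float:
    """Percentage of gender predictions that match the ground-truth label."""
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
    return 100.0 * correct / len(true_labels)


# Toy data (hypothetical, for illustration only).
true_ages = [25, 40, 63, 8]
pred_ages = [27, 46, 60, 8]      # absolute errors: 2, 6, 3, 0
true_gender = ["F", "M", "M", "F"]
pred_gender = ["F", "M", "F", "F"]

mae = mean_absolute_error(true_ages, pred_ages)   # 2.75 years
cs5 = cumulative_score(true_ages, pred_ages)      # 75.0 percent
acc = gender_accuracy(true_gender, pred_gender)   # 75.0 percent
```

Lower MAE is better, while higher CS@5 and accuracy are better. CS@5 complements MAE because it is less sensitive to a few very large errors.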

Unlike other models, which make predictions from a prompt and an image of a body crop, MiVOLO takes both face and body crops as input and uses a transformer to estimate age and gender from them. The researchers also evaluated ChatGPT (ChatGPT4V) on facial attribute prediction and face recognition tasks. Even without task-specific training, it outperformed the specialized age-recognition model at age estimation, but performed worse at gender classification.

The training dataset for MiVOLOv2 was extended by 40%, bringing the total to 807,694 samples. Most of the added images are ones on which the initial version, MiVOLOv1, made significant errors. The researchers drew them primarily from production pipelines and open-source data such as LAION-5B. For evaluation, the LAGENDA dataset was preferred over IMDB because it reduces the risk that an MLLM answers correctly by recognizing a familiar person rather than by actually estimating age and gender.

In conclusion, the paper makes a case for MiVOLOv2 as a strong competitor to MLLMs in age and gender estimation. The research highlights the potential of MiVOLOv2 by analyzing its increased versatility over standard MLLMs in age estimation and its refined capability to process images of individuals. It also prompts a broader evaluation of what neural networks, including MLLMs such as LLaVA and ShareGPT4V, can do.
