Researchers from the University of Michigan have discovered something remarkable: prompting Large Language Models (LLMs) with gender-neutral or male roles can elicit better responses than prompting with female roles! By experimenting with different prompts such as “You are a lawyer,” “You are speaking to a father,” and “You are speaking to your girlfriend,” the team was able to determine which roles performed best.
Their study, published in a paper, found that specifying a role when prompting can significantly improve the performance of LLMs, by at least 20% compared to the control prompt. But when the roles were divided by gender, it became apparent that gender-neutral or male roles outperformed female roles.
What could be causing this disparity? The researchers couldn’t provide a definitive answer, but it is likely that biases in the training data sets are surfacing in the models’ outputs. Their tests also revealed some other interesting findings. For instance, prompting with an audience prompt yielded better results than prompting with an interpersonal role. Moreover, the “police” role produced its best results with FLAN-T5, whereas the “mentor” and “partner” roles worked well in both models. Surprisingly, the “helpful assistant” role, which is so effective in ChatGPT, ranked only somewhere between 35th and 55th on the list of best-performing roles.
These subtle differences clearly affect the accuracy of the outputs, and understanding why could help us improve LLMs. So let’s hope that some researchers with API credits to spare can replicate this research using ChatGPT. We would love to get confirmation of which roles work best in system prompts for GPT-4, and it’s likely that the results will be similarly skewed by gender. It’s an exciting prospect to explore!
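For anyone tempted to try a rough replication along these lines, a role prompt is typically supplied as the system message of a chat API call. Here is a minimal sketch using the OpenAI Python client; the model name, the sample question, and the particular roles are illustrative assumptions, not the paper’s exact setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative role prompts (not the paper's exact list)
roles = [
    "You are a helpful assistant.",           # ChatGPT's default persona
    "You are a lawyer.",                      # occupational role
    "You are speaking to a father.",          # audience-style prompt
    "You are speaking to your girlfriend.",   # interpersonal role
]

# A hypothetical test question; a real replication would use a full benchmark
question = "A train travels 120 km in 1.5 hours. What is its average speed in km/h?"

for role in roles:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; swap in GPT-4 for a closer replication
        messages=[
            {"role": "system", "content": role},   # the role prompt under test
            {"role": "user", "content": question},
        ],
        temperature=0,  # near-deterministic outputs make comparisons fairer
    )
    print(f"--- {role}\n{response.choices[0].message.content}\n")
```

A proper replication would loop over many benchmark questions per role and score the answers automatically, as the original study did, rather than eyeballing single responses.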