Large language models (LLMs), such as GPT-3, are powerful tools due to their versatility. They can perform a wide range of tasks, ranging from helping draft emails to assisting in cancer diagnosis. However, their wide applicability makes them challenging to evaluate systematically, as it would be impossible to create a benchmark dataset to test a…
