A team from Stanford and Duolingo has proposed a new way to control the proficiency level of text generated by large language models (LLMs), addressing limitations of current methods. Their CEFR-aligned language model (CALM) combines finetuning with proximal policy optimization (PPO) so that generated text matches a target level of the Common European Framework of Reference for Languages (CEFR). The work is a step toward making proficiency-controlled text generation more accessible and cost-effective.
Existing techniques each have drawbacks. Few-shot prompting, in which a handful of examples guide the model’s output, often incurs high computational cost and gives unsatisfactory results with open-source models. Supervised finetuning requires a large labelled dataset, which is not always available. Proximal policy optimization (PPO), a reinforcement learning technique that updates the model according to a reward signal, can be unstable and computationally heavy, which makes it hard to apply at scale.
CALM addresses these limitations and narrows the performance gap between proprietary models such as GPT-4 and open-source alternatives. The open-source models Llama-2-7B and Mistral-7B are first finetuned on TinyTolkien, a dataset of short stories at varying CEFR levels that was generated with effective GPT-4 prompting strategies. PPO training then aligns the models’ outputs with the intended proficiency level, and a sampling strategy selects the best output from multiple generations. The method relies on automated CEFR scoring based on linguistic features and uses reinforcement learning to reduce ControlError, a metric of how far the generated text deviates from the target proficiency level.
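To make the sampling step concrete, here is a minimal Python sketch of picking the best of several generations by ControlError. It rests on assumptions that the summary above does not confirm: ControlError is taken to be the squared distance between an estimated CEFR score and the target on a 1 to 6 scale (A1 = 1, C2 = 6), and `toy_scorer` is a hypothetical stand-in for the automated scorer, not the researchers' model.

```python
from typing import Callable, Sequence


def control_error(score: float, target: float) -> float:
    """Squared distance between an estimated CEFR score and the target.

    Assumed form of the metric, for illustration only.
    """
    return (score - target) ** 2


def pick_best(
    candidates: Sequence[str],
    target: float,
    scorer: Callable[[str], float],
) -> str:
    """Return the candidate whose estimated level is closest to the target."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda text: control_error(scorer(text), target))


def toy_scorer(text: str) -> float:
    """Hypothetical scorer: maps mean word length onto a 1-6 scale.

    A real scorer would use richer linguistic features.
    """
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words)
    return max(1.0, min(6.0, mean_len - 2.0))


if __name__ == "__main__":
    generations = [
        "The cat sat on the mat.",
        "Researchers investigated multilingual comprehension strategies.",
    ]
    # Target 1.0 stands for A1, target 6.0 for C2 under the assumed scale.
    print(pick_best(generations, 1.0, toy_scorer))
    print(pick_best(generations, 6.0, toy_scorer))
```

With top-k sampling, `candidates` would hold the k generations drawn from the model for one prompt. The extra generations cost more compute per request, so k trades control accuracy against cost.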
CALM was found to match GPT-4 on ControlError while significantly reducing cost. The evaluation metrics were ControlError, QualityScore, and computational cost, and the results were verified through both automatic scoring and a small-scale human study. For example, CALM with top-3 sampling achieved a ControlError of 0.15, outperforming other models.
In short, the researchers propose a method for controlling the proficiency level of LLM-generated text and offer a cost-effective way to produce it. The approach could enhance applications in education and language learning and make advanced AI tools accessible to a wider audience.