Microsoft researchers have introduced RUBICON, a new technique for evaluating conversational AI assistants. RUBICON assesses domain-specific human-AI conversations by generating candidate rubrics and selecting the set that best predicts conversation quality. Tested on 100 conversations between developers and a chat-based assistant for C# debugging, RUBICON outperformed alternative rubric sets, predicting conversation quality with high precision.
Conversational AI assistants such as GitHub Copilot Chat are difficult to evaluate because they depend on large language models and chat-based interfaces, which makes it hard to measure how effectively they actually help software developers. Earlier techniques for analyzing user satisfaction often missed domain-specific nuances, leading to inconsistent assessments. RUBICON addresses this by emphasizing context and task progression and by incorporating domain-specific signals and Gricean maxims, improving evaluation accuracy for domain-specific dialogues.
RUBICON estimates conversation quality by generating rubrics that characterize user satisfaction and dissatisfaction, learned from labeled conversations. The process has three steps: generating diverse candidate rubrics, selecting an optimized rubric set, and scoring conversations against it. Rubric generation uses a supervised extraction and summarization process, while selection optimizes for precision and coverage, using correctness and sharpness losses to pick the rubric subset that assesses conversation quality most accurately.
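The sketch below shows how such a selection step might look in Python. It is a simplified, hypothetical rendering rather than the published implementation: the function names (toy_judge, netsat, correctness_loss, sharpness_loss, select_rubrics), the score normalization, and the exhaustive subset search are assumptions made for illustration.

```python
# Hypothetical sketch of RUBICON-style rubric selection and scoring.
# Names and loss formulas here are illustrative, not taken from the paper.

from itertools import combinations
from statistics import mean
from typing import Callable, Sequence


def toy_judge(rubric: str, conversation: str) -> float:
    """Stand-in for an LLM-based judge: 1.0 if the rubric phrase appears in the
    conversation, else 0.0. Keeps the sketch runnable without an LLM."""
    return 1.0 if rubric.lower() in conversation.lower() else 0.0


def netsat(pos: Sequence[str], neg: Sequence[str], conversation: str,
           judge: Callable[[str, str], float]) -> float:
    """Score a conversation as mean positive evidence minus mean negative evidence."""
    pos_score = mean(judge(r, conversation) for r in pos) if pos else 0.0
    neg_score = mean(judge(r, conversation) for r in neg) if neg else 0.0
    return pos_score - neg_score


def correctness_loss(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Penalize scores whose sign disagrees with the human label (+1 good, -1 bad)."""
    return mean(max(0.0, -lab * s) for s, lab in zip(scores, labels))


def sharpness_loss(scores: Sequence[float]) -> float:
    """Penalize indecisive scores that hover near zero."""
    return mean(1.0 - abs(s) for s in scores)


def select_rubrics(pos_candidates, neg_candidates, conversations, labels,
                   judge=toy_judge, k=2, alpha=0.7):
    """Pick the rubric subset minimizing a weighted correctness + sharpness loss.
    Exhaustive search is used for clarity; a real system would search greedily."""
    best, best_loss = None, float("inf")
    for pos in combinations(pos_candidates, min(k, len(pos_candidates))):
        for neg in combinations(neg_candidates, min(k, len(neg_candidates))):
            scores = [netsat(pos, neg, c, judge) for c in conversations]
            loss = (alpha * correctness_loss(scores, labels)
                    + (1 - alpha) * sharpness_loss(scores))
            if loss < best_loss:
                best, best_loss = (list(pos), list(neg)), loss
    return best, best_loss
```

In the actual system, both the judge and the candidate rubrics come from a large language model, and the selected rubric set is then applied to unseen conversations to estimate their quality.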
However, the current version of RUBICON is not without limitations. The limited diversity of the test dataset may restrict its generalizability to other application domains. The subjective nature of manually assigned labels might threaten internal validity, even with high inter-annotator agreement. Converting Likert-scale responses onto a [0, 10] scale also rests on assumptions that may raise construct validity concerns, since the conversion depends on an automated scoring system. The researchers say that future work will explore alternative ways of calculating the NetSAT score to address these limitations.
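As a concrete illustration of the conversion that limitation refers to, the snippet below linearly rescales a 5-point Likert response onto [0, 10]. The mapping is an assumption for illustration; the paper's exact conversion may differ.

```python
# Illustrative only: assumes a 5-point Likert response (1 = very dissatisfied,
# 5 = very satisfied) rescaled linearly onto [0, 10].

def likert_to_score(response: int, points: int = 5) -> float:
    """Linearly map a 1..points Likert response onto the [0, 10] scale."""
    if not 1 <= response <= points:
        raise ValueError(f"Likert response must be between 1 and {points}")
    return 10.0 * (response - 1) / (points - 1)

# Example: likert_to_score(3) == 5.0, i.e. a neutral response lands mid-scale.
```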
Despite these challenges, RUBICON has shown promise in early deployment, demonstrating that it can distinguish effective from ineffective conversations and improve the quality of its rubrics. Its success paves the way for more accurate evaluation of conversational AI assistants, particularly in specialized application domains.