
Cleanlab presents the Trustworthy Language Model (TLM), a solution aimed at resolving the main obstacle to enterprise adoption of LLMs: their erratic outputs and hallucinations.

A recent Gartner poll found that while 55% of organizations are experimenting with generative AI, only 10% have deployed it in production. The main barrier to reaching production is the erroneous outputs, or ‘hallucinations’, produced by large language models (LLMs). These inaccuracies can cause serious problems in applications that demand accurate results, as when Air Canada’s chatbot gave incorrect information about refund policies or NYC’s “MyCity” chatbot provided wrong answers to legal queries.

Cleanlab’s answer to this problem is the Trustworthy Language Model (TLM). TLM assigns a trust score to each LLM response, allowing users to identify and filter out faulty outputs, which makes generative AI deployable in scenarios that were previously off-limits. Cleanlab’s benchmarks show TLM outperforming existing LLMs on accuracy.

By assigning a trustworthiness score to each output, TLM directly addresses hallucinations in LLMs. The model prioritizes minimizing false negatives: when a hallucination occurs, the trustworthiness score should be low, so that LLM-based applications can be deployed reliably.

Beyond scoring, the TLM API can serve as a drop-in replacement for existing LLMs: it offers a method that returns both a response and a trustworthiness score, enabling new kinds of applications. TLM can also improve response accuracy by internally generating multiple candidate responses and returning the one with the highest trustworthiness score. Its .get_trustworthiness_score() method can additionally score outputs from existing LLMs or human-generated data, as sketched below.
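The snippet below is a minimal sketch of this workflow using Cleanlab’s Python client. Only .get_trustworthiness_score() is named in the announcement; the package name, the Studio client, the prompt method, and the result keys are assumptions based on Cleanlab’s published client and may differ by version.

```python
# Minimal sketch of the TLM workflow. Names other than
# .get_trustworthiness_score() are assumptions, not confirmed by the article.
from cleanlab_studio import Studio  # assumed package: pip install cleanlab-studio

studio = Studio("<your API key>")
tlm = studio.TLM()

# Drop-in replacement for an LLM call: returns a response plus a trust score.
out = tlm.prompt("What is Air Canada's refund policy for bereavement fares?")
print(out["response"], out["trustworthiness_score"])

# Score an answer produced elsewhere (another LLM, or human-written data).
score = tlm.get_trustworthiness_score(
    prompt="What is the capital of Australia?",
    response="Sydney",  # a wrong answer like this should receive a low score
)
print(score)
```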

Cleanlab evaluated TLM against OpenAI’s GPT-4, focusing on response accuracy and on cost and time savings from error detection. Compared with self-evaluation and token-probability-based methods, TLM offers a more reliable assessment because it also accounts for epistemic uncertainty.

Additionally, TLM optimizes resource spending by flagging only low-scoring outputs for human review, keeping the decision-making process robust without reviewing everything. The tool has already delivered significant cost savings for the Berkeley Research Group.
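A human-review gate built on TLM scores might look like the following sketch. The 0.8 cutoff and the escalate_to_human helper are hypothetical; a real deployment would tune the threshold on its own data and wire the escalation into its review queue.

```python
# Hypothetical human-in-the-loop gate built on TLM trustworthiness scores.
TRUST_THRESHOLD = 0.8  # assumed cutoff; tune per application

def answer_or_escalate(tlm, prompt: str) -> str:
    out = tlm.prompt(prompt)  # assumed method returning response + score
    if out["trustworthiness_score"] >= TRUST_THRESHOLD:
        return out["response"]  # trusted: serve the answer automatically
    # Low trust: route to a person instead of serving a possible hallucination.
    return escalate_to_human(prompt, out)  # hypothetical review-queue hook
```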

In conclusion, Cleanlab’s TLM offers a comprehensive answer to the challenges organizations face in deploying LLM applications by attaching trustworthiness scores to outputs. It is a significant step forward for generative AI adoption, paving the way for broader use in enterprise settings.
