
The book summarization capabilities of Claude 3 Opus surpass those of all other LLMs.

A team of researchers from the University of Massachusetts Amherst, Adobe, Princeton University, and the Allen Institute for AI has carried out a study to assess the accuracy and quality of summaries produced by Large Language Models (LLMs) when summarizing book-length narratives. The purpose of the research was to observe how well AI models can summarize content over 100,000 tokens long – roughly the length of a typical full-length novel.

For this investigation, the researchers selected 26 books published in 2023 and 2024, ensuring that the books could not have appeared in the models' original training data and contaminated the results. Once the selected LLMs had summarized the books, GPT-4 was used to extract decontextualized claims from the summaries. These claims were then fact-checked by human annotators who had read the books in question.

The results of the study were compiled into a dataset, Faithfulness Annotations for Book-Length Summarization (FABLES), which contains 3,158 claim-level faithfulness annotations across all 26 narrative texts in the test.
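To illustrate how claim-level annotations like those in FABLES can be rolled up into the per-model faithfulness percentages reported below, here is a minimal sketch. The record layout and field names are assumptions for illustration, not the actual FABLES schema:

```python
from collections import defaultdict

def faithfulness_scores(annotations):
    """Aggregate claim-level labels into a per-model faithfulness rate.

    Each annotation is a dict with a 'model' name and a 'label' of either
    'faithful' or 'unfaithful' (a simplified, hypothetical schema).
    Returns the fraction of faithful claims for each model.
    """
    totals = defaultdict(int)
    faithful = defaultdict(int)
    for ann in annotations:
        totals[ann["model"]] += 1
        if ann["label"] == "faithful":
            faithful[ann["model"]] += 1
    return {m: faithful[m] / totals[m] for m in totals}

# Toy example with made-up labels (not real study data):
sample = [
    {"model": "claude-3-opus", "label": "faithful"},
    {"model": "claude-3-opus", "label": "faithful"},
    {"model": "gpt-4", "label": "faithful"},
    {"model": "gpt-4", "label": "unfaithful"},
]
print(faithfulness_scores(sample))  # {'claude-3-opus': 1.0, 'gpt-4': 0.5}
```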

Across the models tested, the results showed that Claude 3 Opus was the most accurate book-length summarizer by a considerable margin: more than 90% of its claims were confirmed as faithful. GPT-4 was the runner-up, with 78% of its claims verified as faithful. However, the study found that AI models generally struggled to accurately summarize events or states pertaining to character and relationship development within the books.

In addition to problems with the accuracy of events, the AI models also tended to leave out vital information in their summaries. More often than not, the models placed greater emphasis on content towards the end of the books, overlooking important content relayed at the start.

The researchers also tested AI models as fact-checkers. Human annotators are usually employed to verify the claims made in book summaries – a task that cost a total of $5,200 in this study. Simple fact retrieval proved to be one of Claude 3's key strengths; however, its ability to verify longer statements requiring a deeper understanding of context was less stable.

When presented with the extracted claims and prompted to verify them, the AI models did not perform as well as the human annotators, especially when it came to identifying unfaithful claims. While Claude 3 Opus was the standout claim verifier among the AI models, the researchers still deemed it too unreliable to deliver consistently accurate automatic ratings.
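One way to quantify this failure mode is to score a verifier's predictions against human labels separately for faithful and unfaithful claims: a model can look strong overall while rarely catching unfaithful ones. The labels below are illustrative, not the study's actual data:

```python
def verifier_recall(human_labels, model_labels):
    """Compute recall per class ('faithful' / 'unfaithful').

    Low recall on the 'unfaithful' class reflects the weakness described
    in the study: verifiers that seldom flag incorrect claims.
    """
    recall = {}
    for cls in ("faithful", "unfaithful"):
        idx = [i for i, h in enumerate(human_labels) if h == cls]
        hits = sum(1 for i in idx if model_labels[i] == cls)
        recall[cls] = hits / len(idx) if idx else 0.0
    return recall

# Toy example: the verifier confirms all faithful claims but misses
# half of the unfaithful ones (hypothetical labels).
human = ["faithful", "faithful", "faithful", "unfaithful", "unfaithful"]
model = ["faithful", "faithful", "faithful", "faithful", "unfaithful"]
print(verifier_recall(human, model))  # {'faithful': 1.0, 'unfaithful': 0.5}
```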

In concluding their study, the researchers determined that humans still retain the edge when it comes to deeply understanding and summarizing the intricacies of human relationships, plot developments, and character motivations across a lengthy narrative. However, significant strides in automated book summarization were evident in models such as Claude 3 Opus.
