Recent advances in large language models (LLMs), which have transformed fields like healthcare, translation, and code generation, are now being applied to the legal domain. Legal professionals routinely grapple with long, complex documents, underscoring the need for a dedicated LLM. To address this, researchers from several institutions, including Equall.ai, MICS, CentraleSupélec, and Université Paris-Saclay, have introduced SaulLM-7B, the first publicly available LLM specifically designed for legal text.
SaulLM-7B uses the Mistral 7B model, a highly efficient open-source LLM with 7 billion parameters, as its backbone and adapts it to the legal field. The adaptation combines continued pretraining on a curated 30-billion-token legal corpus with legal instruction fine-tuning on both generic and legal-specific instructions. The resulting model, SaulLM-7B-Instruct, can address legal queries and perform a range of legal tasks.
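For readers who want to experiment, the minimal sketch below shows how an instruction-tuned Mistral-family checkpoint such as SaulLM-7B-Instruct can be queried with Hugging Face Transformers. The Hub identifier `Equall/Saul-Instruct-v1` and the `[INST]` prompt format are assumptions based on the model's Mistral 7B lineage, not details confirmed in this announcement; check the official release for the exact ID.

```python
# Minimal sketch: querying SaulLM-7B-Instruct via Hugging Face Transformers.
# The Hub ID below is an assumption -- verify it against the official release.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Equall/Saul-Instruct-v1"  # assumed identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Mistral-style instruction format, assumed to carry over from the backbone.
prompt = "[INST] What is the difference between a tort and a breach of contract? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```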
To build the corpus, the researchers collected legal texts, primarily in English, from jurisdictions such as the U.S., Europe, and Australia. They combined existing datasets with data scraped from publicly accessible sources, yielding a 30-billion-token corpus. To uphold data quality, the team performed rigorous cleaning and deduplication, filtering out noise and removing duplicate entries.
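To make the deduplication step concrete, here is an illustrative hash-based sketch of exact deduplication. This is not the authors' pipeline, which is more elaborate; it only shows the general idea of dropping verbatim duplicates before pretraining.

```python
# Illustrative sketch of exact deduplication over a document collection.
# Not the authors' pipeline: real corpus cleaning also handles near-duplicates
# and noise filtering.
import hashlib

def dedup_exact(documents):
    """Yield documents whose normalized text has not been seen before."""
    seen = set()
    for doc in documents:
        # Normalize whitespace so trivially reformatted copies collide.
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

corpus = [
    "The court held that the contract was void.",
    "The  court held that the contract was void.",  # duplicate up to whitespace
    "The appeal was dismissed with costs.",
]
print(list(dedup_exact(corpus)))  # two unique documents remain
```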
Experimental results show that SaulLM-7B-Instruct has an advanced grasp of legal language and its application, outperforming non-legal models on the LegalBench-Instruct and Legal-MMLU benchmarks. While it excels on tasks that hinge on legal-specific knowledge, SaulLM-7B-Instruct leaves room for improvement on conclusion-drawing tasks that demand more deductive reasoning. Nonetheless, the model provides a robust foundation for building legal workflows and a clear target for further refinement.
In essence, SaulLM-7B, a decoder model dedicated to legal materials that achieves state-of-the-art performance among 7B models in the legal domain, holds considerable promise for the legal field. Its development combines continued pretraining on legal data with instruction fine-tuning on synthetic datasets. The researchers have also contributed to legal language processing by offering a cleaned version of LegalBench and introducing a new document set for perplexity measurement.
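As a rough illustration of how such a document set is used, the sketch below computes the perplexity of a causal LM on a single text. It assumes `model` and `tokenizer` are loaded as in the earlier snippet and is not the authors' evaluation code.

```python
# Sketch of perplexity measurement on a held-out legal text.
# Assumes `model` and `tokenizer` from the loading example above.
import math
import torch

def perplexity(model, tokenizer, text):
    """Return exp(mean negative log-likelihood) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Transformers shifts the labels internally and returns the mean
        # cross-entropy over predicted tokens.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

print(perplexity(model, tokenizer, "The plaintiff bears the burden of proof."))
```

Lower perplexity on in-domain documents is the usual signal that continued pretraining has adapted the model to legal language.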
SaulLM-7B represents a significant step forward in the use of AI in the legal field, raising expectations for what LLMs can achieve in the legal sector in the near future. The research team hopes this advance will spur further development and refinement of LLMs tailored to legal applications.