Large language models (LLMs), such as Meta’s Llama and EleutherAI’s Pythia, are generally trained on broad, domain-agnostic datasets. However, recent research indicates that incorporating domain-specific data into training can significantly enhance LLM performance. This principle was demonstrated by BloombergGPT, whose training corpus was roughly 51% domain-specific financial documents and which outperformed general-purpose LLMs on finance-specific tasks.
LLMs can acquire domain-specific knowledge in several ways: training from scratch (reportedly costing over $100 million in the case of GPT-4), continual pre-training (less costly but still requiring a substantial dataset), or instruction fine-tuning. Lighter-weight approaches such as instruction fine-tuning and Retrieval Augmented Generation (RAG) share the drawback of not fully imparting domain-specific knowledge: fine-tuning risks degrading performance on unrelated tasks, while RAG retrieves domain facts at inference time without the model ever acquiring the domain’s language style.
Continual pre-training is a cost-effective middle ground that enriches domain knowledge without massively inflating expenses. It involves further training an already functional open-domain LLM on domain-specific data only, and comes in two flavors: Domain-Adaptive Continual Pre-training (DACP), which continues training on the full domain corpus, and Task-Adaptive Continual Pre-training (TACP), which continues training on unlabeled data drawn from the target tasks themselves.
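To make this concrete, here is a minimal sketch of what a DACP-style training step could look like using the Hugging Face `transformers` library. The base model, corpus file name, and hyperparameters are illustrative assumptions, not the paper's actual setup.

```python
# Sketch of domain-adaptive continual pre-training (DACP): continue
# causal-LM training of an already pretrained open-domain model on
# domain-specific text only. Model name and file path are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

model_name = "EleutherAI/pythia-1b"            # any open-domain base LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token      # Pythia has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Domain corpus: plain-text documents, one example per line (assumed format).
dataset = load_dataset("text", data_files={"train": "financial_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dacp-checkpoints",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same loop serves both DACP and TACP; only the training corpus changes (the full domain corpus versus unlabeled task data).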
Further enhancements include Efficient Task-Similar DACP (ETS-DACP) and Efficient Task-Agnostic DACP (ETA-DACP). ETS-DACP selects pre-training data that closely matches the target tasks, requiring only unlabeled task examples. ETA-DACP, intended for scenarios where task data is unavailable or a broader domain model is desired, instead uses novelty and diversity metrics to extract the most valuable data samples from the domain corpus.
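As an illustration of the task-similar (ETS-DACP-style) selection idea, the sketch below ranks domain documents by embedding similarity to unlabeled task examples and keeps a small budget of the closest ones. The embedding model, helper function name, and budget are illustrative assumptions; an ETA-DACP-style variant would instead score samples with novelty and diversity metrics (e.g., perplexity under the base model and token-type entropy).

```python
# Hedged sketch of task-similar data selection: keep the fraction of domain
# documents whose embeddings lie closest to the centroid of the task examples.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedder

def select_task_similar(domain_docs, task_docs, budget=0.10):
    """Return the `budget` fraction of domain_docs most similar to the task centroid."""
    task_emb = embedder.encode(task_docs, normalize_embeddings=True)
    doc_emb = embedder.encode(domain_docs, normalize_embeddings=True)
    centroid = task_emb.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = doc_emb @ centroid                       # cosine similarity to centroid
    k = max(1, int(budget * len(domain_docs)))
    top = np.argsort(scores)[::-1][:k]                # highest-similarity documents
    return [domain_docs[i] for i in top]

# Usage: the selected subset is what gets fed to the continual pre-training
# loop sketched earlier, in place of the full domain corpus.
subset = select_task_similar(["domain doc one ...", "domain doc two ..."],
                             ["unlabeled task example ..."], budget=0.5)
```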
The study’s authors found that ETS-DACP, ETA-DACP-ent, and ETS-DACP-com outperformed standard DACP despite using only 10% of the data. They also found that ETS-DACP delivers the best performance on average, suggesting that unlabeled task data remains an effective lever for boosting task performance in LLMs.
Importantly, the study found that continual pre-training did not degrade performance on open-domain tasks. Domain-specific LLMs built this way could therefore offer productivity and efficiency gains in numerous fields, particularly for businesses with limited instruction-tuning data or a large number of downstream tasks.