Large Language Models (LLMs) are typically trained on broad swaths of data and demonstrate strong natural language understanding and generation. However, they often underperform in specialized domains because of shifts in vocabulary and context. To address this gap, researchers from NASA and IBM have collaborated to develop a model that covers multidisciplinary fields such as Earth sciences, astronomy, physics, astrophysics, heliophysics, planetary sciences, and biology. This new model, dubbed INDUS, offers broader coverage than current models like SCIBERT, BIOBERT, and SCHOLARBERT, each of which covers only some of these domains.
The INDUS suite comprises several types of models trained on carefully curated corpora from various sources. These include an encoder model designed to excel at natural language understanding tasks; a contrastive-learning-based general text embedding model, which improves information retrieval performance; and smaller model versions, created with knowledge distillation techniques, suited to applications that need lower latency or run under limited computational resources. The research team has also produced three new scientific benchmark datasets, CLIMATE-CHANGE NER, NASA-QA, and NASA-IR, to advance research in these interdisciplinary domains.
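Embedding models of this kind are commonly trained with an in-batch contrastive objective that pulls matching query/passage pairs together and pushes non-matching pairs apart. Below is a minimal sketch of such an InfoNCE-style loss; the function name, temperature value, and use of in-batch negatives are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """query_emb, passage_emb: (batch, dim) embeddings of aligned
    query/passage pairs; the other passages in the batch serve as
    negatives for each query. Temperature is an assumed value."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    # Similarity of every query against every passage in the batch.
    logits = q @ p.T / temperature
    # The matching passage for query i sits on the diagonal.
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)
```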
INDUS employs the byte-pair encoding (BPE) technique to create a specialized tokenizer, INDUSBPE, which handles the specialized terminology of the targeted scientific fields and thereby improves the models' comprehension of domain-specific language. Using it, the team pre-trained several encoder-only LLMs on carefully curated scientific corpora. They also developed sentence-embedding models by fine-tuning these pre-trained models with a contrastive learning objective.
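To make the tokenizer step concrete, here is a minimal sketch of training a BPE tokenizer on a domain corpus using the Hugging Face tokenizers library. The corpus filename, vocabulary size, and special-token set are illustrative assumptions; the paper's exact INDUSBPE configuration is not reproduced here.

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Start from an empty BPE model with a byte-level pre-tokenizer.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

# Learn merge rules from the domain corpus; BERT-style special tokens
# are assumed here since the suite includes encoder-only models.
trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # assumed size, for illustration only
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["scientific_corpus.txt"], trainer=trainer)  # hypothetical file

# Domain terms such as "heliophysics" are now more likely to survive
# as single tokens rather than being split into many generic subwords.
print(tokenizer.encode("heliophysics observations").tokens)
```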
To create smaller, more efficient versions of these models, the team applied knowledge distillation techniques that preserve much of the larger models' performance even in resource-constrained settings. The three new scientific benchmark datasets help expedite research in these interdisciplinary fields: CLIMATE-CHANGE NER targets named-entity recognition for climate-change-related entities; NASA-QA focuses on question answering over NASA-related themes; and NASA-IR is designed for information retrieval within NASA-related content.
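For readers unfamiliar with knowledge distillation, the standard formulation trains a small student to match a large teacher's softened output distribution alongside the ground-truth labels. The sketch below shows this generic loss; the temperature, loss weighting, and function name are assumptions for illustration, not the team's specific setup.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend a soft loss (match the teacher's distribution) with a hard
    loss (match the labels). teacher_logits is assumed to be computed
    under torch.no_grad(); temperature and alpha are assumed values."""
    # Soften both distributions; the T**2 factor keeps gradients at a
    # comparable scale across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```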
Evaluation results show strong performance on both the newly created benchmark tasks and established domain-specific benchmarks, where these models outperform domain-specific encoders like SCIBERT as well as general-purpose models like RoBERTa. Ultimately, INDUS represents a considerable step forward, giving professionals and researchers across scientific fields powerful tools for accurate and efficient natural language processing. The team's research is thoroughly documented in a publicly available paper.