Large Language Models (LLMs) have proven highly competent at generating and understanding natural language, thanks to the vast amounts of data they are trained on. Predominantly, these models are trained on general-purpose corpora such as Wikipedia and Common Crawl, which cover a broad range of text. However, they can struggle in specialized domains, where vocabulary and context shift away from general usage.
To address this issue, researchers from IBM and NASA have collaborated to develop a new model suited to specialized knowledge domains such as astronomy, biology, earth sciences, planetary sciences, astrophysics, and heliophysics. Unlike existing models such as SCIBERT, BIOBERT, and SCHOLARBERT, the new model, known as INDUS, spans all of these related scientific fields.
The INDUS suite comprises several distinct models suited to different use cases. It includes an encoder model, trained on domain-specific vocabulary and corpora, for natural language understanding tasks. It also includes a contrastive-learning-based general text embedding model, trained on datasets from multiple sources, to improve performance on information retrieval tasks. Finally, there are smaller versions of these models, created through knowledge distillation, for applications that require lower latency or must run with limited computational resources.
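To give a sense of how a contrastive-learning-based embedding model of this kind is typically trained, the sketch below shows an InfoNCE-style objective with in-batch negatives. The encoder name, pooling choice, temperature, and example sentence pairs are illustrative assumptions, not the actual INDUS training code.

```python
# Minimal sketch of contrastive training for sentence embeddings
# (InfoNCE with in-batch negatives). Encoder, pooling, and temperature
# are illustrative assumptions, not the INDUS implementation.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "roberta-base"  # placeholder; INDUS uses its own domain-pretrained encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(sentences):
    """Mean-pool the final hidden states into one L2-normalized vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
    return F.normalize(pooled, dim=-1)

def contrastive_loss(queries, positives, temperature=0.05):
    """Each query is pulled toward its paired positive; the other rows in
    the batch act as negatives (standard in-batch-negative InfoNCE)."""
    q = embed(queries)                       # (B, H)
    p = embed(positives)                     # (B, H)
    logits = q @ p.T / temperature           # (B, B) cosine similarities
    labels = torch.arange(q.size(0))         # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy query/passage pairs (hypothetical examples of domain text).
loss = contrastive_loss(
    ["What drives coronal mass ejections?",
     "How do exoplanets form?"],
    ["Coronal mass ejections are driven by magnetic reconnection in the corona.",
     "Exoplanets form by core accretion or gravitational instability in protoplanetary disks."],
)
loss.backward()  # gradients would then be used to update the encoder
```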
The group has also created three scientific benchmark datasets to support advances in these interdisciplinary research areas. These include a climate change entity recognition dataset (CLIMATE-CHANGE NER), an extractive question answering dataset on NASA-related topics (NASA-QA), and an information retrieval dataset focused on NASA-related content (NASA-IR).
To build the suite, the team developed INDUSBPE, a specialized tokenizer created with the byte-pair encoding (BPE) technique, which handles the specific terminology used in these scientific fields. They then used this tokenizer, together with curated scientific corpora, to pretrain several encoder-only LLMs, and created sentence-embedding models by fine-tuning the pretrained encoders with a contrastive learning objective.
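As an illustration of how a BPE tokenizer like INDUSBPE can be trained on a domain corpus, here is a minimal sketch using the Hugging Face tokenizers library; the file paths, vocabulary size, and special tokens are assumptions for the example, not the actual INDUSBPE configuration.

```python
# Minimal sketch of training a byte-pair-encoding (BPE) tokenizer on a
# scientific corpus with the Hugging Face `tokenizers` library.
# File paths, vocab size, and special tokens are illustrative assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=50_000,  # assumed size for the example
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Train on plain-text files drawn from the domain corpora (hypothetical paths).
tokenizer.train(files=["astrophysics_abstracts.txt", "earth_science_papers.txt"], trainer=trainer)
tokenizer.save("indusbpe_example.json")

# Domain terms are now more likely to stay as single tokens instead of
# being split into many generic subword pieces.
print(tokenizer.encode("heliophysics magnetosphere reconnection").tokens)
```

Training the tokenizer on domain text before pretraining the encoder is what lets scientific terms map onto compact, meaningful subword units rather than long sequences of generic fragments.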
Through knowledge-distillation techniques, the team also produced more compact, efficient versions of these models that perform well even under resource constraints. Alongside the models, they introduced the three scientific benchmark datasets mentioned above to accelerate research in these interdisciplinary fields.
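For context, knowledge distillation generally trains a small student model to match a larger teacher's output distribution. The snippet below is a generic logit-distillation sketch with a temperature-softened KL objective; the actual INDUS distillation recipe may differ from this simplified example.

```python
# Generic knowledge-distillation loss: a small student is trained to match
# the temperature-softened output distribution of a larger teacher.
# This is a simplified sketch, not the exact INDUS distillation recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.
    F.kl_div expects log-probabilities as its first argument."""
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2

# Toy usage: teacher logits are fixed (no gradient); student logits come
# from the smaller model being trained.
teacher_logits = torch.randn(8, 30_000)            # e.g. masked-token predictions
student_logits = torch.randn(8, 30_000, requires_grad=True)
loss = distillation_loss(student_logits, teacher_logits.detach())
loss.backward()
```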
In testing, the models performed strongly on both the newly created benchmark tasks and existing domain-specific benchmarks, surpassing domain-specific encoders like SCIBERT as well as general-purpose models like RoBERTa. In conclusion, INDUS represents a significant step forward, providing researchers and professionals across numerous scientific fields with capable tools for accurate and efficient natural language processing.