Nomic AI unveils Nomic Embed, an open-source, auditable, and high-performing text embedding model with an extended context length. The release addresses the limited openness and auditability of existing models such as OpenAI's text-embedding-ada-002. Nomic Embed is built with a multi-stage training pipeline based on contrastive learning and supports an 8192-token context length, with an emphasis on reproducibility and transparency.
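To make the release concrete, here is a minimal sketch of embedding text with the open checkpoint via the sentence-transformers library; the model identifier, task prefixes, and similarity computation follow common usage of the published model but are assumptions, not details taken from this announcement.

```python
# Minimal sketch: embedding documents and a query with Nomic Embed through
# sentence-transformers. The Hub identifier and task prefixes are assumptions
# about the published checkpoint, not details from this article.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

# The model is trained with task prefixes; documents and queries use different ones.
docs = ["search_document: Contrastive training produces general-purpose embeddings."]
query = ["search_query: how are text embeddings trained?"]

doc_emb = model.encode(docs, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)

# With normalized vectors, cosine similarity reduces to a dot product.
print(doc_emb @ query_emb.T)
```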
Existing long-context text embedding models, while state-of-the-art, suffer from their closed-source nature and the inaccessibility of their training data for auditing. Nomic Embed aims to solve these problems by offering an open-source, auditable, and efficient text embedding model.
Model development begins with training a BERT model with a 2048-token context length, known as nomic-bert-2048. The training recipe is inspired by MosaicBERT and includes rotary position embeddings, SwiGLU activations, DeepSpeed and FlashAttention, BF16 precision, an enlarged vocabulary, and a batch size of 4096. The model is then contrastively trained on roughly 235M text pairs and further fine-tuned on high-quality labeled datasets with hard-negative mining. A sketch of this kind of contrastive objective follows.
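The sketch below illustrates an in-batch contrastive (InfoNCE-style) objective over text pairs, the general form of loss used in contrastive training of embedding models; the temperature value and tensor names are illustrative assumptions, not figures from the Nomic Embed report.

```python
# Simplified sketch of an in-batch contrastive objective over paired texts.
# Temperature and dimensions are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """query_emb, doc_emb: (batch, dim) embeddings of paired texts."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    # Similarity of every query against every document in the batch;
    # the matching pair sits on the diagonal, all other entries act as in-batch negatives.
    logits = q @ d.T / temperature
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

# Example with random tensors standing in for encoder outputs.
loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```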
Nomic Embed's performance is validated on benchmarks including the Massive Text Embedding Benchmark (MTEB), the LoCo benchmark, and the Jina Long Context Benchmark. Nomic Embed not only surpasses closed-source models like OpenAI's text-embedding-ada-002 but also outperforms other open-source models on many of these benchmarks.
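For reference, a hedged sketch of running a single MTEB task with the `mteb` Python package is shown below; the chosen task and output folder are illustrative, the task prefixes are omitted for brevity, and a full benchmark run covers many more tasks than this.

```python
# Hedged sketch: scoring an embedding model on one MTEB task with the `mteb` package.
# Task name and output folder are illustrative; a full MTEB run spans many tasks,
# and production use of Nomic Embed would apply its task prefixes.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
evaluation = MTEB(tasks=["STSBenchmark"])
results = evaluation.run(model, output_folder="results/nomic-embed-text-v1")
```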
A testament to Nomic AI's commitment to openness in AI development is the release of the model weights, training code, and curated training data. This emphasis on transparency and reproducibility further solidifies the model's credibility. Nomic Embed's performance on long-context tasks and its call for improved evaluation paradigms highlight its potential to advance the field of text embeddings. The reported results show Nomic Embed outperforming OpenAI's text-embedding-ada-002 and text-embedding-3-small on both short- and long-context tasks.