MixedBread.ai, known for its work in artificial intelligence, has introduced a novel method called Binary Matryoshka Representation Learning (Binary MRL) for reducing the memory footprint of embeddings used in natural language processing (NLP) applications. Embeddings underpin core NLP functions such as recommendation systems, retrieval, and similarity search. Despite their importance, their memory-intensive nature is a significant concern, particularly when working with massive datasets.
In typical state-of-the-art models, the memory requirement for embeddings is high because the models produce high-dimensional outputs (e.g., up to 1024 dimensions) stored as float32 values. This demand for extensive memory during storage and retrieval is what MixedBread.ai's new Binary MRL method is designed to tackle.
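A back-of-the-envelope calculation makes the scale of the problem concrete. The corpus size of 100 million documents below is a hypothetical figure chosen for illustration; the 1024 dimensions and float32 format match the example above.

```python
# Back-of-the-envelope estimate of raw embedding storage in float32.
# The 100 million document count is a hypothetical assumption.
num_docs = 100_000_000
dims = 1024
bytes_per_float32 = 4

total_bytes = num_docs * dims * bytes_per_float32
print(f"{total_bytes / 1024**3:.0f} GiB")  # ~381 GiB before any compression
```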
To address this challenge, the Binary MRL method combines two approaches: Matryoshka Representation Learning (MRL) and Vector Quantization. MRL reduces the output dimensionality of the embedding model without sacrificing accuracy: the model is trained to concentrate the most critical information in the leading dimensions, so the trailing, less important dimensions can be truncated. Vector Quantization then shrinks each remaining dimension by representing it as a binary value instead of a floating-point number. A minimal sketch of both ideas follows.
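In the sketch below, a random vector stands in for the full-precision 1024-dimensional output of an MRL-trained model; the 512-dimension truncation point and the zero threshold for binarization are illustrative assumptions, not details confirmed by MixedBread.ai.

```python
import numpy as np

# Placeholder for a full-precision embedding from an MRL-trained model.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(1024).astype(np.float32)

# MRL: the leading dimensions carry the most information, so the
# vector can simply be cut short (here to 512 dimensions; assumed).
truncated = embedding[:512]

# Vector quantization: keep only the sign of each value as one bit
# (zero threshold is an assumed, common binarization choice).
bits = (truncated > 0).astype(np.uint8)

# Pack 8 bits per byte: 512 dimensions -> 64 bytes instead of 2048.
packed = np.packbits(bits)
print(packed.nbytes)  # 64
```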
The Binary MRL method therefore operates in two stages. First, MRL techniques reduce the output dimensions of the embedding model: the model is trained to retain salient information in fewer dimensions, and the redundant dimensions are subsequently truncated. Next, Vector Quantization represents each dimension of the reduced embedding as a binary value. Together, these steps shrink the memory footprint of embeddings dramatically while retaining the important semantic information. Evaluations of Binary MRL on multiple datasets show that the method can maintain over 90% of the original model's performance while using far smaller embeddings.
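Because the compressed embeddings are bit vectors, retrieval over them can use Hamming distance (the number of differing bits), which is a standard technique for binary embeddings rather than a detail confirmed for Binary MRL specifically. The sketch below assumes bit-packed arrays like the `packed` vector above, with `corpus` and `query` as hypothetical stand-ins.

```python
import numpy as np

# Nearest-neighbour search over bit-packed binary embeddings
# (64 bytes = 512 bits each) via Hamming distance.
def hamming_distances(query: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    differing = np.bitwise_xor(corpus, query)            # mark differing bits
    return np.unpackbits(differing, axis=1).sum(axis=1)  # popcount per row

rng = np.random.default_rng(1)
corpus = rng.integers(0, 256, size=(10_000, 64), dtype=np.uint8)
query = rng.integers(0, 256, size=64, dtype=np.uint8)

top5 = np.argsort(hamming_distances(query, corpus))[:5]
print(top5)  # indices of the five closest corpus vectors
```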
This integration of MRL and Vector Quantization into Binary MRL is a significant step toward addressing the memory limitations and scalability issues of embeddings in NLP applications. Besides reducing the cost of large-scale retrieval, the Binary MRL approach makes feasible tasks that were previously impractical due to memory constraints.
Overall, the Binary MRL method achieves significant memory compression of NLP embeddings without compromising their utility and effectiveness. The approach yields an over 98% (64x) reduction in memory usage while retaining more than 90% of the model's performance, transforming the scalability of vector search and embedding-based applications.
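The headline figures decompose cleanly across the two steps. The 1024-to-512 truncation below is an assumed split; combined with 32-bit-to-1-bit quantization, it reproduces exactly the 64x and over-98% numbers reported above.

```python
# Decomposition of the reported compression ratio (assumed split).
float32_bits = 1024 * 32   # 32,768 bits per original embedding
binary_bits = 512 * 1      # 512 bits per Binary MRL embedding

print(float32_bits / binary_bits)               # 64.0x compression
print(f"{1 - binary_bits / float32_bits:.1%}")  # 98.4% memory saved
```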