Google’s AI research team has introduced an update to its ScaNN (Scalable Nearest Neighbors) vector search library, which is intended to address the growing need for efficient vector similarity search, a fundamental component of many machine learning algorithms. Current methods for calculating vector similarity are adequate for small datasets, but as datasets grow and new applications emerge, the need for better performance and scalability becomes more pressing. The latest advancement to ScaNN is SOAR (Spilling with Orthogonality-Amplified Residuals), a novel algorithm designed to speed up vector search while reducing the work each search requires.
ScaNN’s existing approach is clustering-based: each vector in the dataset is assigned to a single k-means cluster. This approach runs into difficulty when the query vector is nearly parallel to a vector’s residual, defined as the difference between the vector and its assigned cluster centre. In that situation the query’s similarity to the cluster centre is not a true representation of its similarity to the individual vectors within that cluster, and nearest neighbors are frequently missed. SOAR addresses the issue in two ways. First, it allows vectors to be assigned to multiple clusters, which introduces redundancy. Second, it modifies the loss function so that this redundancy is independent and useful, ensuring that secondary clusters make a significant contribution to the search process.
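The failure case described above can be illustrated with a small sketch. It assumes inner-product similarity and uses toy 2-D values invented for illustration; the helper names `dot` and `sub` are hypothetical and not part of ScaNN. The point is that the true score of a vector splits into a cluster-centre term plus a residual term, and the centre term alone is misleading when the query lines up with the residual.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

# Toy values, chosen only for illustration.
center = [1.0, 0.0]            # centre of the cluster that holds x
x = [1.0, 1.0]                 # datapoint assigned to that cluster
residual = sub(x, center)      # r = x - c

def score_parts(query):
    """Return (true score, centre term, residual term) for a query."""
    return dot(query, x), dot(query, center), dot(query, residual)

# A query almost parallel to the residual: the centre term is small,
# so the cluster looks unpromising even though x scores well.
parallel_query = [0.2, 1.0]
true_p, center_p, resid_p = score_parts(parallel_query)

# A query orthogonal to the residual: the centre term equals the true score.
orthogonal_query = [1.0, 0.0]
true_o, center_o, resid_o = score_parts(orthogonal_query)

print(true_p, center_p, resid_p)
print(true_o, center_o, resid_o)
```

For the near-parallel query the centre term is about 0.2 against a true score of about 1.2, whereas for the orthogonal query the two agree. A search that ranks clusters by the centre term alone would therefore tend to skip this cluster in the first case.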
Implementing SOAR involves assigning vectors to multiple clusters and using a modified loss function that promotes orthogonal residuals. The result is a significant boost in search accuracy at a fixed computational cost or, equivalently, a reduction in the search cost required to reach the same accuracy. Trials reveal that adding SOAR allows ScaNN to retain its primary strengths, including low memory usage, fast indexing speeds, and hardware-friendly memory access patterns, whilst gaining further algorithmic advantages. ScaNN with SOAR has been shown to achieve query throughputs several times greater than those of libraries with comparable indexing times. In several benchmark tests, including the ann-benchmarks glove-100 dataset and the Big-ANN 2023 benchmarks, ScaNN with SOAR proved to be the superior choice for vector search performance.
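As a rough illustration of the idea, the sketch below picks a primary cluster by ordinary nearest-centre assignment and then a secondary cluster by minimising a loss that adds a penalty on the part of the secondary residual that is parallel to the primary residual. The exact loss, the function name `spilled_assignment`, and the weight `lam` are assumptions made for this example; they are not taken from the ScaNN code base, and the sketch is not the library's implementation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def spilled_assignment(x, centers, lam=1.0):
    """Return (primary, secondary) cluster indices for vector x.

    lam is a hypothetical weight on the parallel-residual penalty;
    lam=0 reduces the secondary choice to plain second-nearest centre.
    """
    residuals = [sub(x, c) for c in centers]

    # Primary: standard k-means assignment (smallest squared residual norm).
    primary = min(range(len(centers)),
                  key=lambda i: dot(residuals[i], residuals[i]))
    r = residuals[primary]
    r_norm_sq = dot(r, r)

    def secondary_loss(i):
        r2 = residuals[i]
        # Squared length of the component of r2 that is parallel to r.
        parallel_sq = dot(r, r2) ** 2 / r_norm_sq if r_norm_sq > 0 else 0.0
        return dot(r2, r2) + lam * parallel_sq

    candidates = [i for i in range(len(centers)) if i != primary]
    secondary = min(candidates, key=secondary_loss)
    return primary, secondary

# Toy values, chosen only for illustration.
x = [1.0, 1.0]
centers = [
    [1.0, 0.0],    # nearest centre; residual points along the second axis
    [1.0, -0.2],   # second nearest, but its residual is parallel to the first
    [-0.3, 1.0],   # slightly farther, with an orthogonal residual
]

plain = spilled_assignment(x, centers, lam=0.0)
amplified = spilled_assignment(x, centers, lam=1.0)
print(plain, amplified)
```

With `lam=0.0` the secondary cluster is simply the second-nearest centre (index 1), whose residual points the same way as the primary one and so fails for the same queries. With `lam=1.0` the penalty shifts the choice to index 2, whose residual is orthogonal to the primary residual, which is the kind of independent redundancy the text describes.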
To summarise, the paper presents SOAR as a solution to the challenge of carrying out efficient vector similarity search with the ScaNN library. By introducing redundancy and improving the assignment process, SOAR markedly raises search accuracy and performance without harming core metrics such as memory usage and indexing speed. The result demonstrates the critical role of algorithmic innovation in meeting the continually escalating demands that machine learning applications place on vector search.