Large language models (LLMs) and large vision models (LVMs) pose significant deployment challenges, particularly balancing low inference overhead against the need to switch adapters rapidly. Traditional methods such as Low Rank Adaptation (LoRA) often result in either increased latency or the loss of rapid switching capability. This is particularly problematic in resource-constrained settings like mobile devices.
LoRA is efficient to train, but once its low-rank update is fused into the base model it modifies a sizable proportion of the base weights, making adapter switching costly in both memory and latency. The problem is exacerbated when multiple adapters are deployed together, since each one interferes with the others and degrades the overall model's performance. Existing sparse adaptation techniques likewise struggle with rapid adapter switching and concept retention.
Researchers at Qualcomm AI have proposed a Sparse High Rank Adapters (SHiRA) framework to tackle these challenges. SHiRA alters only 1-2% of the base model's weights, enabling rapid adapter switching. This sparsity not only reduces the number of weight updates needed but also helps mitigate concept loss. The team used gradient masking during training to update only the essential weights, retaining maximum performance with minimal changes. SHiRA's design is lightweight, making it well suited to mobile devices and other resource-constrained environments.
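To make the gradient-masking idea concrete, here is a minimal PyTorch sketch in which a sparse binary mask selects roughly 1% of a layer's weights as trainable and a gradient hook zeroes out updates to everything else. The helper names (`make_sparse_mask`, `apply_gradient_mask`) and the random mask strategy are illustrative assumptions, not part of any released SHiRA code.

```python
import torch
import torch.nn as nn

def make_sparse_mask(weight: torch.Tensor, sparsity: float = 0.01) -> torch.Tensor:
    """Randomly mark a `sparsity` fraction of entries as trainable (illustrative strategy)."""
    mask = torch.zeros_like(weight, dtype=torch.bool)
    num_trainable = max(1, int(sparsity * weight.numel()))
    idx = torch.randperm(weight.numel())[:num_trainable]
    mask.view(-1)[idx] = True
    return mask

def apply_gradient_mask(param: nn.Parameter, mask: torch.Tensor) -> None:
    """Register a hook that zeroes gradients outside the mask, so only ~1% of weights update."""
    param.register_hook(lambda grad: grad * mask)

# Usage: adapt only ~1% of a linear layer's weights during finetuning.
layer = nn.Linear(4096, 4096)
mask = make_sparse_mask(layer.weight, sparsity=0.01)
apply_gradient_mask(layer.weight, mask)
```

Because the optimizer still sees the full weight tensor, the mask alone determines which entries actually move; swapping the random selection for a magnitude- or sensitivity-based criterion changes only how the mask is built, not the training loop.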
The researchers implemented SHiRA using gradient masking, where a sparse mask dictates which weights are trainable. Strategies for generating these masks include random selection, weight magnitude, gradient magnitude, and SNIP-style connection sensitivity scores. Adapters can be switched readily by storing only the non-zero weights and their indices and applying them via efficient scatter operations at inference time. The implementation is also memory- and latency-efficient, consuming up to 16% less peak GPU memory than standard LoRA.
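The following sketch illustrates the scatter-based switching described above: an adapter is stored as the flat indices and values of its sparse weight changes, overwritten into the base weight when applied, and reverted when removed. The `SparseAdapter` class and its `apply`/`remove` methods are hypothetical names for illustration under the stated assumptions, not the actual SHiRA implementation.

```python
import torch

class SparseAdapter:
    """Stores only the adapted entries of a weight tensor (indices + values)."""

    def __init__(self, adapted_weight: torch.Tensor, mask: torch.Tensor):
        # Flat positions of the 1-2% of entries this adapter changes, and their new values.
        self.indices = mask.view(-1).nonzero(as_tuple=True)[0]
        self.values = adapted_weight.view(-1)[self.indices].clone()
        self.backup = None  # original base values, saved when the adapter is applied

    def apply(self, base_weight: torch.Tensor) -> None:
        """Overwrite the sparse entries of the base weight with the adapter values."""
        with torch.no_grad():
            flat = base_weight.view(-1)
            self.backup = flat[self.indices].clone()
            flat.scatter_(0, self.indices, self.values)

    def remove(self, base_weight: torch.Tensor) -> None:
        """Restore the original base entries, allowing instant switching to another adapter."""
        with torch.no_grad():
            base_weight.view(-1).scatter_(0, self.indices, self.backup)
```

Because only a small index/value pair per layer is stored and scattered, switching between adapters touches a tiny fraction of the model's memory instead of rewriting full weight matrices.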
Experiments demonstrated SHiRA's strong performance on both LLMs and LVMs. It consistently outperformed traditional LoRA methods, achieving up to 2.7% better accuracy on commonsense reasoning tasks and high image quality on style transfer tasks. These results highlight SHiRA's ability to avoid the concept loss that frequently affects LoRA in multi-adapter settings while preserving image generation quality. By changing only 1-2% of the base model's weights, SHiRA allows rapid adapter switching with limited inference overhead.
In conclusion, SHiRA addresses the critical issues of rapid adapter switching and concept loss in multi-adapter settings while keeping inference overhead low. Changing only 1-2% of the base model's weights yields an effective, practical solution for deploying large models in resource-constrained environments. This approach has significant implications for AI research and deployment, potentially paving the way for further advancements and practical applications.