Artificial intelligence, particularly large language models (LLMs), faces the critical challenge of balancing model performance against practical constraints such as privacy, cost, and device compatibility. Large cloud-based models offer high accuracy but rely on constant internet connectivity, raising concerns about privacy breaches and incurring high costs. Deploying models on edge devices instead introduces its own challenges: hardware limitations make it difficult to maintain low latency and high accuracy.
Existing models such as Gemma-2B, Gemma-7B, and Llama-7B, along with frameworks like llama.cpp and MLC LLM, have aimed to improve AI efficiency and accessibility. Projects such as NexusRaven, Toolformer, and ToolAlpaca have advanced function calling in AI, approaching the efficacy of GPT-4. Techniques like LoRA have made fine-tuning feasible under GPU constraints. However, balancing model size against operational efficiency remains a critical limitation, particularly for applications on constrained devices that require low latency and high accuracy.
Addressing this issue, Stanford University researchers introduced Octopus v2, an advanced on-device language model. The model differs from its predecessors in that it greatly reduces latency and enhances accuracy for on-device applications by fine-tuning with functional tokens. This approach enables precise function calling, surpassing GPT-4 in efficiency and speed, and notably reduces the context length requirement by 95%.
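To make the functional-token idea concrete, the sketch below shows how each callable API can be collapsed into a single dedicated token, so the model emits one token plus arguments rather than a free-form function name. The token names (`<fn_0>`, etc.) and the function set are illustrative assumptions, not the model's actual vocabulary:

```python
# Illustrative sketch of functional tokens: each API gets one dedicated
# token, so a call is generated as a single token plus its arguments.
# Token names and APIs here are assumptions, not Octopus v2's vocabulary.
FUNCTIONAL_TOKENS = {
    "<fn_0>": "take_a_photo",
    "<fn_1>": "get_trending_news",
    "<fn_2>": "set_timer",
}

def decode_call(model_output: str) -> str:
    """Translate a generation like '<fn_2>(minutes=10)' back into a
    conventional function-call string."""
    for token, fn_name in FUNCTIONAL_TOKENS.items():
        if model_output.startswith(token):
            return fn_name + model_output[len(token):]
    raise ValueError(f"no functional token found in {model_output!r}")

print(decode_call("<fn_2>(minutes=10)"))  # -> set_timer(minutes=10)
```

Because the model selects among a few dedicated tokens rather than generating function names from lengthy API descriptions in the prompt, those descriptions can be dropped from the context, which is where the large context-length savings come from.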
Octopus v2 was fine-tuned from Google DeepMind’s Gemma 2B model, which has 2 billion parameters. It was trained on a tailored dataset of Android API calls comprising both positive and negative examples to improve function-calling precision. The researchers applied both full-model fine-tuning and Low-Rank Adaptation (LoRA) during training to optimize on-device performance. The model’s primary innovation is the inclusion of functional tokens during fine-tuning, which significantly reduces latency and context length requirements. This allows Octopus v2 to achieve highly accurate and efficient function calling on edge devices without requiring large amounts of computational resources.
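As a rough sketch of what such a setup might look like with off-the-shelf tooling, the snippet below adds functional tokens to a tokenizer and attaches LoRA adapters to a Gemma-2B checkpoint. The Hugging Face `transformers`/`peft` stack, the checkpoint name, the token names, and all hyperparameters are assumptions for illustration; the paper’s actual training code is not reproduced here:

```python
# Hedged sketch: LoRA fine-tuning a Gemma-2B checkpoint with added
# functional tokens (Hugging Face transformers + peft assumed).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-2b"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Register one special token per target function, then resize the
# embedding matrix so the new tokens get their own trainable rows.
functional_tokens = ["<fn_0>", "<fn_1>", "<fn_2>"]
tokenizer.add_special_tokens({"additional_special_tokens": functional_tokens})
model.resize_token_embeddings(len(tokenizer))

# Low-rank adapters on the attention projections keep the number of
# trainable parameters small; the embedding and output layers stay
# trainable so the newly added functional tokens can be learned.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Training then proceeds with a standard causal-LM objective over
# (query, functional-token call) pairs from the curated dataset.
```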
In benchmark tests, Octopus v2 outperformed GPT-4, attaining a 99.524% accuracy rate on function-calling tasks. It also showed a substantial reduction in response time, with latency cut to 0.38 seconds per call, a 35-fold improvement over earlier models. And it demonstrated efficiency in on-device operation, requiring 95% less context length for processing. Together, these results position Octopus v2 as a significant advancement in on-device language model technology.
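For context on how such figures might be computed, a minimal evaluation harness could time each call and score exact matches against a labeled set. Here, `generate_call` and the evaluation data are placeholders, not the paper’s actual benchmark code:

```python
# Minimal sketch of measuring per-call latency and function-calling
# accuracy; generate_call and eval_set are placeholders.
import time

def evaluate(generate_call, eval_set):
    correct, latencies = 0, []
    for query, expected in eval_set:
        start = time.perf_counter()
        output = generate_call(query)       # e.g. "<fn_2>(minutes=10)"
        latencies.append(time.perf_counter() - start)
        correct += output == expected       # exact match on call + args
    return correct / len(eval_set), sum(latencies) / len(latencies)
```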
In essence, Octopus v2, developed by researchers at Stanford University, represents a substantial improvement in on-device language modeling. By achieving 99.524% accuracy and reducing latency to just 0.38 seconds per call, it addresses the key performance challenges of on-device AI. Its fine-tuning approach with functional tokens considerably reduces context length, enhancing operational efficiency. The model’s viability in real-world applications underlines its potential to meet the growing demand for efficient, performant on-device AI solutions.
Full details of and credit for this research belong to the authors of the published research paper.