Together AI has introduced a new inference stack, marking a significant step forward in AI inference. The stack decodes four times faster than the open-source vLLM and outperforms industry-leading commercial solutions such as Amazon Bedrock, Azure AI, OctoAI, and Fireworks by 1.3x to 2.5x. Named the Together Inference Engine, it can process over 400 tokens per second on Meta Llama 3 8B and incorporates Together AI's latest innovations, including faster GEMM and MHA kernels, FlashAttention-3, quality-preserving quantization, and speculative decoding techniques.
In addition to the inference stack, Together AI launched the Together Turbo and Together Lite endpoints, beginning with Meta Llama 3 before expanding to other models. These endpoints allow businesses to balance performance, quality, and cost-efficiency. Together Turbo offers performance that closely matches full-precision FP16 models, establishing itself as the fastest engine for Nvidia GPUs and a cost-effective, highly accurate option for generative AI at production scale. The Together Lite endpoints use INT4 quantization to deliver the most scalable and cost-efficient Llama 3 models, priced at only $0.10 per million tokens, six times cheaper than GPT-4o-mini.
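For context on that last figure: assuming GPT-4o-mini output tokens were list-priced at roughly $0.60 per million at the time (an assumption; the announcement does not state the comparison price), the ratio works out to $0.60 / $0.10 = 6x.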
Together AI’s new release includes three tiers: Together Turbo Endpoints, Together Lite Endpoints, and Together Reference Endpoints. The Turbo Endpoints deliver fast FP8 performance while maintaining high quality, outperforming other FP8 solutions on AlpacaEval 2.0 by up to 2.5 points and costing 17 times less than the GPT-4o models. The Lite Endpoints offer cost-efficient, scalable Llama 3 models with excellent quality relative to full-precision implementations. The Reference Endpoints offer the fastest full-precision FP16 support for Meta Llama 3 models, delivering performance up to 4 times faster than vLLM.
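To illustrate the general intuition behind serving models at reduced precision, the following is a minimal sketch of symmetric per-channel INT4 weight quantization in Python with NumPy. It is not Together AI's implementation, and the function names are made up for illustration; their Turbo and Lite endpoints rely on proprietary quantization techniques whose details are not described here.

```python
import numpy as np

def quantize_int4_per_channel(w: np.ndarray):
    """Symmetric per-output-channel INT4 quantization (integer values in [-8, 7])."""
    # One scale per output channel (row), chosen so the largest magnitude maps to 7.
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)            # avoid divide-by-zero
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)  # real INT4 packs two per byte
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover a floating-point approximation of the original weights."""
    return q.astype(np.float32) * scales

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)   # toy weight matrix
    q, s = quantize_int4_per_channel(w)
    w_hat = dequantize(q, s)
    rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
    print(f"relative reconstruction error: {rel_err:.4f}")
```

Storing weights at 4 bits instead of 16 cuts memory and bandwidth by roughly 4x, which is where much of the serving-cost saving comes from; production systems pair this with calibration and more sophisticated, accuracy-aware methods to keep quality close to full precision.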
The Together Inference Engine incorporates numerous technical breakthroughs, ensuring that quality is not compromised even as it delivers leading performance. Its gains come from advances such as proprietary kernels, FlashAttention-3, and highly accurate quantization techniques. Together AI has also focused on cost efficiency: the Together Turbo endpoints reduce costs by over 10 times compared to GPT-4o, and the Together Lite endpoints cut costs by 12 times compared to vLLM.
The Together Inference Engine continually integrates the latest innovations from Together AI’s in-house research and the broader AI community. Features such as FlashAttention-3 and speculative decoding algorithms like Medusa and Sequoia underscore this ongoing optimization effort, while quality-preserving quantization keeps model accuracy close to full precision even at lower bit widths. The stack gives businesses the flexibility to scale their applications according to their performance, quality, and cost-efficiency requirements, and Together AI says it looks forward to seeing the innovative applications developers build with these tools.
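As a rough illustration of the speculative decoding idea, the sketch below uses toy next-token distributions: a cheap "draft" model proposes several tokens ahead, and the "target" model verifies them with an accept/reject rule that preserves the target distribution. Medusa and Sequoia are considerably more sophisticated (multiple decoding heads and tree-structured drafts, respectively); this is only the basic rejection-sampling scheme, and all functions and distributions here are hypothetical stand-ins rather than any real model API.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16  # toy vocabulary size

def target_probs(context):
    """Stand-in for the large target model's next-token distribution."""
    logits = np.sin(np.arange(VOCAB) + len(context))
    e = np.exp(logits - logits.max())
    return e / e.sum()

def draft_probs(context):
    """Stand-in for a cheaper draft model: a blurred version of the target."""
    return 0.7 * target_probs(context) + 0.3 / VOCAB

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model."""
    draft_tokens, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        t = int(rng.choice(VOCAB, p=q))
        draft_tokens.append(t)
        draft_dists.append(q)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t, q in zip(draft_tokens, draft_dists):
        p = target_probs(ctx)
        # Accept the drafted token with probability min(1, p(t) / q(t)).
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
            ctx.append(t)
        else:
            # On rejection, resample from the residual distribution max(p - q, 0).
            residual = np.maximum(p - q, 0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(VOCAB, p=residual)))
            return accepted
    # All drafts accepted: take one bonus token directly from the target model.
    accepted.append(int(rng.choice(VOCAB, p=target_probs(ctx))))
    return accepted

if __name__ == "__main__":
    print("tokens emitted this step:", speculative_step([1, 2, 3]))
```

Because the accept/reject rule reproduces the target model's output distribution exactly, the speedup comes from verifying several drafted tokens in a single target-model pass rather than from changing what the model generates.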