Amazon SageMaker has introduced a new capability that can help reduce the time it takes for the generative artificial intelligence (AI) models it hosts to scale automatically, improving the responsiveness of AI applications when demand is volatile. Foundation models (FMs) and large language models (LLMs) bring new challenges to generative AI inference: a single request can take several seconds to process, and each model copy can often handle only a limited number of simultaneous requests.
To address these challenges, SageMaker offers endpoints for generative AI inference that reduce FM deployment costs by 50% and latency by 20% on average, and that can deliver up to twice the throughput at roughly half the cost. SageMaker also provides streaming support for LLMs, returning tokens in real time as they are generated rather than waiting for the full response, which improves the experience of generative AI applications such as conversational AI assistants.
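As a minimal sketch of the streaming capability, the snippet below invokes an endpoint with the SageMaker runtime's streaming API and prints tokens as they arrive. The endpoint name and JSON payload are hypothetical; the exact request and response format depends on the container serving your model.

```python
import boto3

# Runtime client for invoking a deployed SageMaker endpoint.
smr = boto3.client("sagemaker-runtime")

# Hypothetical endpoint name and payload; adjust both to match the
# serving container's expected schema.
response = smr.invoke_endpoint_with_response_stream(
    EndpointName="my-llm-endpoint",
    ContentType="application/json",
    Body=b'{"inputs": "Explain auto scaling in one sentence.", '
         b'"parameters": {"max_new_tokens": 128}}',
)

# Tokens arrive as a stream of PayloadPart events instead of one
# final response, so the application can render them incrementally.
for event in response["Body"]:
    if "PayloadPart" in event:
        print(event["PayloadPart"]["Bytes"].decode("utf-8"), end="", flush=True)
```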
To optimise real-time inference workloads, SageMaker uses Application Auto Scaling, which adjusts the number of instances and the number of model copies deployed in response to real-time changes in demand. When in-flight requests exceed a configured threshold, auto scaling adds instances and deploys extra model copies to meet the heightened demand. This adaptive scaling ensures resources are optimally utilised, balancing performance needs with cost.
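For example, before any scaling policy can take effect, the endpoint variant must be registered with Application Auto Scaling. The sketch below, with a hypothetical endpoint and variant name, sets the bounds within which SageMaker is allowed to scale:

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical endpoint and variant names.
resource_id = "endpoint/my-llm-endpoint/variant/AllTraffic"

# Tell Application Auto Scaling which resource it may scale,
# and the minimum and maximum instance counts it may use.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)
```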
SageMaker now emits two new sub-minute metrics: ConcurrentRequestsPerModel, which measures the overall concurrency of requests per model, and ConcurrentRequestsPerCopy, which measures concurrency when SageMaker real-time inference components are used. These metrics give a more direct and accurate representation of the load on the system, so models can scale more quickly and users get a more responsive experience from AI applications.
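A target-tracking policy can scale on this concurrency signal, as in the sketch below. The policy and resource names are hypothetical, and the predefined metric type string is an assumption based on this launch; verify the exact name against the current Application Auto Scaling documentation before relying on it.

```python
import boto3

aas = boto3.client("application-autoscaling")

aas.put_scaling_policy(
    PolicyName="concurrency-target-tracking",  # hypothetical policy name
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/my-llm-endpoint/variant/AllTraffic",
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            # Assumed name of the sub-minute concurrency metric.
            "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution",
        },
        # Scale out when average concurrent requests per model exceed this value.
        "TargetValue": 5.0,
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```

Because the metric is emitted at sub-minute granularity, a policy like this can react to a spike in concurrent requests noticeably sooner than one driven by the older per-minute invocation metrics.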
In conclusion, Amazon SageMaker’s new enhancements allow users to improve the speed and responsiveness of generative AI applications, with Application Auto Scaling adapting and optimising the system to changing demand in real time. Amazon encourages users to test the new metrics and evaluate the improvements on FM and LLM workloads running on SageMaker endpoints.