NVIDIA and Amazon Web Services (AWS) have announced that NVIDIA Inference Microservices (NIM) now integrate with Amazon SageMaker. The integration lets users deploy and optimize industry-leading large language models (LLMs), and by combining technologies such as NVIDIA TensorRT, NVIDIA TensorRT-LLM, and NVIDIA Triton Inference Server, it cuts deployment time from days to minutes. The service runs on NVIDIA GPU-accelerated instances hosted by SageMaker.
NIM, listed on the AWS Marketplace as part of NVIDIA's AI Enterprise software platform, is a collection of inference microservices that bring the power of state-of-the-art LLMs to applications, providing natural language processing (NLP) and understanding capabilities. Users can build chatbots, document summarizers, and other NLP-driven applications using pre-built NVIDIA containers that host optimized LLMs, or create custom containers using NIM tools.
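As a rough sketch of what deploying one of these pre-built containers can look like with the SageMaker Python SDK: the container image URI, environment variable, endpoint name, and instance type below are illustrative placeholders rather than actual NIM artifacts, which are documented in the AWS Marketplace listing.

```python
# Minimal sketch: deploying a pre-built NIM container to a SageMaker
# endpoint with the SageMaker Python SDK. The image URI, environment
# variable, and instance type are placeholders for illustration only.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = Model(
    image_uri="<nim-container-image-uri-from-aws-marketplace>",  # placeholder
    role=role,
    env={"NIM_MODEL_NAME": "llama2-7b"},  # hypothetical variable name
    sagemaker_session=session,
)

# Deploy onto an NVIDIA GPU instance; the instance type you need
# depends on the size of the model being served.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="nim-llm-endpoint",
)
```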
NIM provides pre-optimized inference engines for a wide variety of commonly used models, including Llama 2, Mistral-7B-Instruct, Mixtral-8x7B, NVIDIA Nemotron-3 22B Persona, and Code Llama 70B. These models ship with pre-built NVIDIA TensorRT engines curated for specific NVIDIA GPUs and tuned with optimal hyperparameters for maximum performance and ease of deployment.
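Once an endpoint is up, invoking it follows the usual SageMaker runtime pattern. A minimal sketch with boto3 follows; the JSON payload schema (field names like `prompt` and `max_tokens`) is an assumption for illustration, since the actual request format depends on the NIM container's API.

```python
# Minimal sketch: invoking the deployed endpoint with boto3. The payload
# schema is assumed; consult the NIM container documentation for the
# actual request format.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "prompt": "Summarize the following document: ...",
    "max_tokens": 256,   # assumed parameter name
    "temperature": 0.2,  # assumed parameter name
}

response = runtime.invoke_endpoint(
    EndpointName="nim-llm-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```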
If a model is not among the curated NVIDIA models, users can turn to NIM utilities such as the Model Repo Generator, which builds a TensorRT-LLM-accelerated engine and a NIM-format model directory from a straightforward YAML file. NIM's serving stack also uses techniques such as in-flight batching, which admits new requests into a running batch as earlier requests finish generating, maximizing the utilization of compute instances and GPUs.
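The announcement does not show the Model Repo Generator's YAML schema, but as a purely illustrative sketch of the kind of settings such a configuration might carry (every key name below is hypothetical), it could look something like this:

```python
# Purely illustrative: the sort of configuration a Model Repo Generator
# YAML might describe. All key names here are hypothetical -- the real
# schema ships with NIM's own documentation.
import yaml  # PyYAML

config = {
    "model_name": "my-custom-llm",        # hypothetical
    "base_checkpoint": "/models/my-llm",  # hypothetical
    "tensorrt_llm": {
        "precision": "fp16",       # hypothetical
        "max_batch_size": 64,      # hypothetical
        "target_gpu": "A100",      # hypothetical
    },
}

with open("model_repo_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```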
Running NIM on SageMaker layers SageMaker's hosting capabilities on top of NIM's performance and cost optimizations: users can scale out the number of instances backing their model, conduct blue/green deployments, evaluate workloads with shadow testing, and monitor endpoints with Amazon CloudWatch.
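Scaling out, for instance, uses the standard Application Auto Scaling integration for SageMaker endpoints. The sketch below assumes the endpoint name from the earlier example and SageMaker's default variant name, `AllTraffic`; the target value is an arbitrary illustration.

```python
# Sketch: registering a SageMaker endpoint variant with Application Auto
# Scaling so the instance count tracks invocation load. Endpoint and
# variant names are carried over from the deployment sketch above.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/nim-llm-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target tracking: add or remove instances to hold the number of
# invocations per instance near the target value.
autoscaling.put_scaling_policy(
    PolicyName="nim-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,  # illustrative target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```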
Future developments for NIM include support for Parameter-Efficient Fine-Tuning (PEFT) customization methods such as LoRA and P-tuning, as well as support for Triton Inference Server, TensorRT-LLM, and vLLM backends. As part of the NVIDIA AI Enterprise software subscription, NIM is available as a paid offering on the AWS Marketplace.
AWS and NVIDIA are also working on an in-depth guide for NIM on SageMaker, to be released in the near future. In the meantime, they encourage users to learn more about deploying LLMs on SageMaker with NVIDIA microservices and to try out the benefits the integration makes available.