
Machine learning (ML) models have grown to the point where training and inference require significant computational resources. Monitoring these models and their performance is therefore crucial for fine-tuning and cost optimization. AWS addresses this with a solution built from several of its tools and services.

The AWS CDK Observability Accelerator is a collection of modules for setting up observability on Amazon EKS clusters. It guides users through monitoring the performance of ML chips in an Amazon Elastic Kubernetes Service (Amazon EKS) cluster backed by Amazon Elastic Compute Cloud (Amazon EC2) instances. The accelerator is structured around patterns, which act as deployable units that provision multiple resources. It also incorporates Amazon Managed Grafana dashboards, an AWS Distro for OpenTelemetry collector for metrics collection, and Amazon Managed Service for Prometheus for metrics storage.

In this solution, an Amazon EKS cluster is set up with a node group of Inf1 instances that use the Amazon EKS accelerated Amazon Linux AMI. The solution deploys the AWS Neuron device plugin so that Kubernetes workloads can access the ML chips, and uses the Neuron software tools to expose metrics to Amazon Managed Service for Prometheus. These metrics are presented in Amazon Managed Grafana through a corresponding dashboard.
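To illustrate what "accessing the ML chips from Kubernetes" means in practice, here is a minimal sketch of a Pod manifest that requests a Neuron device through the extended resource the device plugin advertises. The resource name aws.amazon.com/neuron is the one the Neuron device plugin is generally documented to register; the pod name and container image are hypothetical placeholders, not part of the original solution.

```python
import json


def neuron_pod_manifest(name: str, image: str, neuron_devices: int = 1) -> dict:
    """Build a minimal Pod manifest that requests Neuron devices.

    The scheduler can only place this Pod on a node where the Neuron
    device plugin has advertised the "aws.amazon.com/neuron" resource.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,  # hypothetical image, replace with your own
                    "resources": {
                        # Extended resources are requested under "limits".
                        "limits": {"aws.amazon.com/neuron": str(neuron_devices)}
                    },
                }
            ]
        },
    }


# Hypothetical names used purely for illustration.
manifest = neuron_pod_manifest("inference-demo", "example.com/my-inference-image:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl, since kubectl accepts JSON manifests as well as YAML.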

The setup has several prerequisites, including the AWS Command Line Interface (AWS CLI), Node.js and npm, and a set of commands that must be run to prepare the environment.
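A quick way to confirm that the command-line prerequisites are installed is to look for them on the PATH. The sketch below assumes the executables are named aws, node, and npm, which is their usual naming; it does not check versions or run any of the environment setup commands.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names for the AWS CLI, Node.js, and npm.
REQUIRED_TOOLS = ("aws", "node", "npm")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found on PATH.")
```

The which parameter exists only so the lookup can be swapped out, for example in a test.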

Once the solution has been deployed, users can validate it by running the AWS CLI update-kubeconfig command and verifying the resources that were created. They can also confirm that the neuron-device-plugin-daemonset and neuron-monitor DaemonSets are running.
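The DaemonSet check can also be scripted. The following sketch assumes kubectl is already configured for the cluster and inspects the JSON returned by kubectl get daemonset. It searches all namespaces because the namespace used by the solution is not stated here, and it uses the standard DaemonSet status fields desiredNumberScheduled and numberReady.

```python
import json
import subprocess
from typing import Iterable, List

# DaemonSet names as given in the text above.
EXPECTED = ("neuron-device-plugin-daemonset", "neuron-monitor")


def not_ready_daemonsets(daemonset_list: dict, expected: Iterable[str] = EXPECTED) -> List[str]:
    """Return expected DaemonSet names that are absent or not fully ready."""
    status_by_name = {
        item["metadata"]["name"]: item.get("status", {})
        for item in daemonset_list.get("items", [])
    }
    problems = []
    for name in expected:
        status = status_by_name.get(name)
        if status is None:
            problems.append(name)  # DaemonSet does not exist
            continue
        desired = status.get("desiredNumberScheduled", 0)
        ready = status.get("numberReady", 0)
        # Zero desired pods usually means no matching nodes, so flag that too.
        if desired == 0 or ready < desired:
            problems.append(name)
    return problems


def fetch_daemonsets() -> dict:
    """Read all DaemonSets via kubectl (requires a configured kubeconfig)."""
    result = subprocess.run(
        ["kubectl", "get", "daemonset", "--all-namespaces", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling not_ready_daemonsets(fetch_daemonsets()) against a healthy deployment should return an empty list; any names it returns are the DaemonSets to investigate.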

The Grafana Neuron dashboard can be used to visualize data from within an Amazon Managed Grafana workspace. If the dashboards are accidentally deleted, they are automatically re-provisioned.
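The dashboard panels are driven by PromQL queries against the Amazon Managed Service for Prometheus workspace. As a rough sketch of what such a query looks like outside Grafana, the helper below builds the URL for an instant query against a workspace's Prometheus-compatible query endpoint. The region, the workspace ID, and the metric name neuroncore_utilization_ratio are illustrative assumptions rather than values taken from the solution, and a real request to this URL would additionally need to be signed with AWS Signature Version 4.

```python
from urllib.parse import quote, urlencode


def amp_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the instant-query URL for an Amazon Managed Service for Prometheus workspace."""
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{quote(workspace_id)}"
    )
    # urlencode percent-encodes the PromQL expression for use in a query string.
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


# Placeholder region and workspace ID; the metric name is an assumption.
url = amp_query_url("us-west-2", "ws-EXAMPLE", "avg(neuroncore_utilization_ratio)")
print(url)
```

The code only constructs the URL; sending it requires credentials with permission to query the workspace.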

This solution simplifies implementing observability in an EKS cluster with EC2 Inf1 instances using open source tools. It streamlines deploying the Neuron device plugin as well as collecting and mapping telemetry data. Metrics are sourced from Amazon Managed Service for Prometheus and displayed on the Neuron dashboard in Amazon Managed Grafana.
