The Weather Company (TWCo) needed a robust Machine Learning Operations (MLOps) platform to support its growing data science team and to create predictive, privacy-friendly machine learning (ML) models. The existing cloud environment lacked transparency for ML jobs and monitoring, making collaboration challenging. TWCo partnered with the AWS Machine Learning Solutions Lab (MLSL) to enhance its MLOps platform using AWS services and solutions.
AWS provided TWCo with a full stack of services to establish an MLOps platform in the cloud that is customizable to its needs, including Amazon SageMaker, AWS CloudFormation, and Amazon CloudWatch. This initiative reduced infrastructure management time by 90% and model deployment time by 20%.
AWS CloudFormation provisioned the infrastructure, AWS CloudTrail monitored account activity, and Amazon CloudWatch collected and visualized real-time logs to support automation. AWS CodeBuild served as the continuous integration service and AWS CodeCommit as the source control repository. AWS CodePipeline automated pipeline releases, Amazon SageMaker supported data exploration, model training, and model deployment, and Amazon S3 provided storage.
Two primary pipelines were established: a Training pipeline and an Inference pipeline. The Training pipeline consumed features and labels stored as CSV-formatted files on Amazon S3 and chained components to preprocess the data, train a model, and evaluate it. The Inference pipeline handled on-demand batch inference and monitoring tasks. The architecture was flexible and facilitated collaboration between team roles such as data scientists and ML engineers.
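The preprocess → train → evaluate flow of the Training pipeline can be sketched in outline. The sketch below is a conceptual, pure-Python stand-in, not TWCo's actual SageMaker pipeline code: the column names, CSV layout, and trivial threshold "model" are all illustrative assumptions.

```python
import csv
import io

# Illustrative CSV of weather features plus a label column, standing in
# for the file the Training pipeline would read from Amazon S3.
RAW_CSV = """temp_c,humidity,label
21.0,0.40,1
15.5,0.80,0
30.2,0.35,1
10.1,0.90,0
"""

def preprocess(raw: str):
    """Parse the CSV into numeric feature tuples and integer labels."""
    rows = list(csv.DictReader(io.StringIO(raw)))
    features = [(float(r["temp_c"]), float(r["humidity"])) for r in rows]
    labels = [int(r["label"]) for r in rows]
    return features, labels

def train(features, labels):
    """'Train' a trivial threshold model: predict 1 when temperature
    reaches the mean temperature of the positive examples."""
    pos_temps = [f[0] for f, y in zip(features, labels) if y == 1]
    threshold = sum(pos_temps) / len(pos_temps)
    return {"threshold": threshold}

def evaluate(model, features, labels):
    """Compute simple accuracy, mirroring the pipeline's evaluate step."""
    preds = [1 if f[0] >= model["threshold"] else 0 for f in features]
    correct = sum(p == y for p, y in zip(preds, labels))
    return correct / len(labels)

# Run the three stages in order, as the Training pipeline would.
features, labels = preprocess(RAW_CSV)
model = train(features, labels)
accuracy = evaluate(model, features, labels)
```

In the real platform each stage would run as a separate SageMaker pipeline step with its inputs and outputs on S3; the point here is only the staged structure.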
Model experimentation is one of the sub-components of the MLOps architecture, and improvements in this area substantially enhance data scientists' productivity and the model development process. The SageMaker project template plays a significant role in establishing standardized, scalable infrastructure, eliminating the need for complex setup and management.
The process used the Amazon SageMaker SDK to train and deploy ML models on SageMaker, and the Boto3 SDK to create roles and provision SageMaker resources. SageMaker Projects delivered standardized MLOps infrastructure and templates for rapid iteration, and AWS Service Catalog simplified provisioning resources at scale.
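As one concrete illustration of the Boto3 side, a SageMaker execution role is created with a trust policy that allows the SageMaker service to assume it. The sketch below only composes that policy document as plain JSON; it does not call AWS, and the role name mentioned in the comment is hypothetical.

```python
import json

# Trust policy allowing the SageMaker service principal to assume the
# role. This document shape is standard for SageMaker execution roles.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Serialized form, as it would be passed to Boto3's IAM create_role call
# as AssumeRolePolicyDocument (e.g. for a hypothetical role named
# "twco-sagemaker-execution-role").
policy_json = json.dumps(trust_policy)
```

Provisioning at scale then shifts from hand-written calls like this to the Service Catalog products that SageMaker Projects instantiates.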
TWCo’s enhanced MLOps platform helped the company create ML models that improved the user experience and revealed how weather conditions affect users’ daily routines and business operations. The partnership with AWS enabled TWCo to scale up its data science operations, optimize its ML workflows, and reduce both model deployment time and infrastructure management time. AWS’s scalable MLOps solutions offer efficient and effective ways to manage and deploy ML models at scale.