Building MLOps pipeline for intensive ML and AI projects


Client Background

The Data Science team of a global company that helps customers, including global ones, improve their marketing activities. The team was created to research new opportunities in the Artificial Intelligence and Machine Learning (AI/ML) domain for marketing optimization.

Business Challenge

The Data Science team consists of senior specialists who solve ML/AI challenges experimentally and have achieved promising results. The business pushes the team to deliver “sellable” products rather than experiments only, but the team hits obstacles in every release cycle and fights the same problems over and over again.

The main challenges:

  • Manual operations are the primary way of doing everything, including data preparation, training, and deployment
  • Quality is a constant struggle because the QA process is manual and time-consuming
  • No realistic timelines, because no one can predict the next obstacle to overcome
  • Released versions are ephemeral because no clear artifacts of the process are defined
  • The process is not stable: repeating the same steps does not produce the same result

Value Delivered

Project status:

  • Proof-of-Concept (PoC) phase successfully delivered, demonstrating the potential of MLOps practices
  • Minimum Viable Product (MVP) delivered, automating the basic train – evaluate – deploy flow

The overall project timeline:

  1. PoC phase – 3 weeks
  2. MVP phase – 2 months

Value delivered:

  1. PoC phase includes:
    • analysis of existing processes and challenges
    • research and selection of the best-fitting tools, with cloud-agnostic options in mind
    • prototype of the MLOps pipeline to cover training, validation, and deployment for models
  2. MVP phase includes:
    • pipeline for model training, evaluation, and candidate registration
    • approval process for candidate models
    • pipeline for model deployment into production and exposure as HTTP API endpoint
    • training and education
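The heart of the MVP flow is an automated quality gate between training and registration: a candidate model is only registered for manual approval if it meets the quality bar. A minimal sketch of such a gate in plain Python (the metric name, threshold logic, and function name are illustrative assumptions, not the client's actual code):

```python
# Sketch of an automated evaluation gate: a candidate model is registered
# for manual approval only if it beats the current production baseline on
# the chosen metric. Metric name and min_gain are illustrative assumptions.

def should_register(candidate_metrics: dict, baseline_metrics: dict,
                    metric: str = "accuracy", min_gain: float = 0.0) -> bool:
    """Return True if the candidate improves on the production baseline."""
    return candidate_metrics[metric] >= baseline_metrics[metric] + min_gain

# Example: a candidate that improves accuracy passes the gate.
baseline = {"accuracy": 0.91}
candidate = {"accuracy": 0.93}
print(should_register(candidate, baseline))  # True
```

In the real pipeline this decision runs automatically after evaluation, so humans only review candidates that already cleared the bar.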

SUCCESS STORY IN DETAIL

Background

ML- and AI-based features have become “must-have” options for any industry-leading product. A common situation: a small team is created as an experimental one, quickly shows results that are sold to the customer, and is then expected to “grow” the functionality. That is where the problems begin if you do not have a proper Software Development Life Cycle (SDLC) that guarantees the repeatability and predictability of results.

So, the objective of the MLOps pipeline is to bring that repeatability and predictability to the SDLC, remove manual routine, and reduce costs through faster time-to-market.

The article Machine Learning operations maturity model shows the typical path a data science team passes on the way to a mature MLOps practice. We started at Level 0 and targeted Level 3 out of 4.

Solution

Since the objective was to streamline the team’s work, we proposed to focus on the primary flow common to any data science project: train – validate – deploy. The expectation was to build a template the team could re-use across multiple projects, with the ability to adapt it to a specific case if necessary. Another important concern was to make it as cloud-agnostic as possible.

The decision was made to use AWS SageMaker, since the AWS platform has been widely adopted across the client’s projects.

The template has been created to cover the following activities:

  • Git-based machine learning project with proper auto-triggers
  • Model training with proper training and test data sets management
  • Auto model evaluation to ensure proper quality
  • Candidate Model registration with manual Approval for promotion into Production
  • Auto-deployment to the Production environment with corresponding HTTP API endpoint exposure
  • Serverless inference execution to reduce costs during experiments and testing
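The registration-and-approval step above can be illustrated with a small in-memory sketch of a versioned model registry with a manual approval gate. This is a stand-in for the managed registry the pipeline actually uses; the class names, statuses, and S3 paths below are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Sketch of a versioned model registry with a manual approval gate,
# mirroring the candidate-registration flow described above. Class names,
# status strings, and artifact URIs are illustrative assumptions.

@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    status: str = "PendingManualApproval"  # -> "Approved" or "Rejected"

@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)

    def register(self, artifact_uri: str) -> ModelVersion:
        """Register a candidate; versions are numbered sequentially."""
        mv = ModelVersion(version=len(self.versions) + 1,
                          artifact_uri=artifact_uri)
        self.versions.append(mv)
        return mv

    def approve(self, version: int) -> None:
        """Manual approval promotes a candidate toward production."""
        self.versions[version - 1].status = "Approved"

    def latest_approved(self):
        """Only approved versions are eligible for deployment."""
        approved = [v for v in self.versions if v.status == "Approved"]
        return approved[-1] if approved else None

registry = ModelRegistry()
registry.register("s3://models/candidate-v1.tar.gz")
registry.register("s3://models/candidate-v2.tar.gz")
registry.approve(1)  # only version 1 passes manual approval
print(registry.latest_approved().version)  # 1
```

The key design point is that deployment reads only from the approved set, so an unreviewed candidate can never reach production by accident.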

As a result, the team received a structured approach to model refinement with predictable timings for basic operations, including experiments and testing. The pipeline has well-defined artifacts with corresponding version control and release management. Any model version can be deployed at any moment, in any number of instances, so advanced experiments and testing (for instance, A/B testing) become possible.
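Because any version can be deployed side by side, incoming traffic can be split between variants for an A/B test. One common approach is deterministic hash-based routing, so each user consistently hits the same model; a minimal sketch (the 90/10 split and variant names are illustrative assumptions, not the client's setup):

```python
import hashlib

# Sketch of deterministic A/B routing between two deployed model versions:
# each user is assigned to a variant based on a hash of their id, so
# repeated requests from the same user hit the same model.
# The 90/10 split and the variant names are illustrative assumptions.

def pick_variant(user_id: str, split_b: float = 0.10) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "model-b" if bucket < split_b else "model-a"

# The same user always gets the same variant across requests.
print(pick_variant("user-42") == pick_variant("user-42"))  # True
```

Sticky assignment like this keeps per-user metrics clean, which matters when comparing candidate and production models on live traffic.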


Lessons learned

  • MLOps pipelines are complex to build, but they pay back the invested effort
  • Popular cloud providers have out-of-the-box solutions that can be adapted to the specific needs of your project
  • MLOps as a culture helps stabilize and structure the work of a Data Science team and improve its performance