Six Stages of Data-Centric MLOps

Changsin Lee
9 min read · Mar 12, 2022

What are the steps to ensure data quality in MLOps?

· 1. Scoping
· 2. Collecting
    · Privacy-protected
    · Trustworthy
    · Balance
    · Diversity
· 3. Labeling
· 4. Training
· 5. Deploying
· 6. Monitoring
· Conclusion
· Reference


In the previous article, I argued that, given the historical development of AI, MLOps needs to operate around data. This article covers the details of how to manage data-centric MLOps.

Servicing an AI system in production requires an engineering approach: operations need to be systematic and repeatable, supported by the necessary tools and processes.

A typical ML pipeline goes through six stages: scoping, collecting, labeling, training, deploying, and monitoring.

As you will see, data concerns arise at every stage.

1. Scoping

At the Scoping stage, big questions such as the following need to be answered:

  • What problems do we need to solve?
  • Do we need AI, Machine Learning, or Deep Learning solutions?
  • What types of data would be good? (image, video, audio, text, numeric, structured, LiDAR, etc.)
  • What should the class labels be?
  • How many classes and how much data do we need for each class?
  • How do we measure performance?
  • What should the performance metrics be?
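One way to make the answers concrete is to write them down in a form that can be checked. The following is a minimal sketch, not a prescribed format: the `ProjectScope` class, its field names, and the example values (a dashcam detection task, a macro F1 target of 0.85) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProjectScope:
    """Hypothetical record of the answers to the scoping questions."""
    problem: str
    data_types: list[str]
    class_labels: list[str]
    min_samples_per_class: int
    metric: str            # name of the chosen metric, e.g. "macro_f1"
    target_score: float    # minimum acceptable value of the metric, in (0, 1]

    def validate(self) -> list[str]:
        """Return a list of open issues; an empty list means nothing is missing."""
        issues = []
        if not self.problem.strip():
            issues.append("problem statement is empty")
        if len(self.class_labels) < 2:
            issues.append("need at least two class labels")
        if len(set(self.class_labels)) != len(self.class_labels):
            issues.append("class labels contain duplicates")
        if self.min_samples_per_class <= 0:
            issues.append("min_samples_per_class must be positive")
        if not 0.0 < self.target_score <= 1.0:
            issues.append("target_score must be in (0, 1]")
        return issues


# Illustrative values only; they do not come from a real project.
scope = ProjectScope(
    problem="Detect pedestrians and vehicles in dashcam images",
    data_types=["image"],
    class_labels=["pedestrian", "car", "truck"],
    min_samples_per_class=1000,
    metric="macro_f1",
    target_score=0.85,
)
print(scope.validate())
```

Keeping the scope in a structure like this means later stages can read the class labels and targets from one place instead of re-deciding them.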

After these initial questions are answered, we are ready to move to the most important phase of AI development: collecting and labeling data.

2. Collecting

During the Collecting phase, the goal is to collect data that are privacy-protected, trustworthy, balanced, and diverse.
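Of these four properties, balance is probably the easiest to check mechanically once the data has class labels. The sketch below counts samples per class and flags the dataset when the largest class is much bigger than the smallest. The function name `class_balance_report` and the `max_ratio` threshold of 3.0 are assumptions made for this example, not an established standard.

```python
from collections import Counter


def class_balance_report(labels, max_ratio=3.0):
    """Summarize label counts and flag imbalance.

    max_ratio is an assumed threshold: the dataset is flagged as unbalanced
    when the most frequent class has more than max_ratio times as many
    samples as the rarest class.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    largest = max(counts.values())
    smallest = min(counts.values())
    ratio = largest / smallest
    return {
        "counts": dict(counts),
        "imbalance_ratio": ratio,
        "balanced": ratio <= max_ratio,
    }


# Made-up label distribution for illustration.
labels = ["car"] * 800 + ["pedestrian"] * 150 + ["truck"] * 50
report = class_balance_report(labels)
print(report["counts"])
print(report["imbalance_ratio"])  # 800 / 50 = 16.0
print(report["balanced"])         # False, since 16.0 > 3.0
```

A single ratio is a coarse signal. It says nothing about diversity within a class, so it complements a review of the collected data and does not replace one.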

Privacy-protected

The EU was the first to regulate the use of training data for AI through the GDPR (General Data Protection Regulation), followed by China’s PIPL (Personal Information Protection Law)…
