What are the steps to ensure data quality in MLOps?
In the previous article, I argued that MLOps needs to operate around data, given the historical development of AI. This article focuses on the details of how to manage data-centric MLOps.
Servicing an AI system in production requires an engineering approach: operations need to be systematic and repeatable, supported by the necessary tools and processes.
A typical ML pipeline goes through six stages:
As you will see, data concerns arise at every stage.
At the Scoping stage, big questions need to be answered, such as:
- What problems do we need to solve?
- Do we need AI, Machine Learning, or Deep Learning solutions?
- What types of data would be good? (image, video, audio, text, numeric, structured, LiDAR, etc.)
- What should the class labels be?
- How many classes and how much data do we need for each class?
- How do we measure performance?
- What should the performance metrics be?
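To make the last two questions concrete, here is a minimal sketch of how performance might be measured for a binary classification problem. It assumes precision, recall, and F1 are the chosen metrics, which is only one possible answer to the scoping question. The function name `precision_recall_f1` and the example labels are hypothetical, introduced purely for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for one positive class.

    y_true and y_pred are equal-length sequences of labels.
    All names here are illustrative assumptions, not part of any library.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    # True positives: predicted positive and actually positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted positive but actually negative.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: predicted negative but actually positive.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical ground-truth labels and model predictions.
    truth = [1, 0, 1, 1, 0]
    preds = [1, 1, 0, 1, 0]
    print(precision_recall_f1(truth, preds))
```

Agreeing on a metric like this during scoping matters because it also shapes how much data each class needs: a metric that is sensitive to rare classes will demand more examples of them.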
After these initial questions are answered, we are ready to move to the most important phase of AI development: collecting and labeling data.
During the Collecting phase, the goal is to collect data that are privacy-protected, trustworthy, balanced, and diverse.
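Of the four goals above, balance is the easiest to check automatically. The following is a minimal sketch, assuming the labels are available as a simple list. The function names `class_balance` and `underrepresented`, the `min_share` threshold, and the example labels are all hypothetical choices for illustration; a suitable threshold depends on the problem.

```python
from collections import Counter


def class_balance(labels):
    """Return each class's share of the data and the max/min count ratio."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    # Fraction of the dataset that belongs to each class.
    shares = {label: n / total for label, n in counts.items()}
    # Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
    imbalance_ratio = max(counts.values()) / min(counts.values())
    return shares, imbalance_ratio


def underrepresented(labels, min_share=0.1):
    """List the classes whose share falls below min_share (an assumed threshold)."""
    shares, _ = class_balance(labels)
    return sorted(label for label, share in shares.items() if share < min_share)


if __name__ == "__main__":
    # Hypothetical labels for a small image dataset.
    labels = ["cat"] * 8 + ["dog"] * 1 + ["bird"] * 1
    shares, ratio = class_balance(labels)
    print(shares, ratio)
    print(underrepresented(labels, min_share=0.2))
```

A check like this only covers balance. Privacy protection, trustworthiness, and diversity usually need human review and domain-specific criteria in addition to simple counts.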