A machine learning (ML) process must be reliable and repeatable by people with little ML background (citizen data scientists and business users) as well as by data scientists.
It’s essential to use a project framework when delivering an ML project. The framework should:
- Provide a basic structure for recording processes, challenges, and other experiences as the project progresses
- Allow projects to be replicated
- Provide an aid to project planning and management so that the project proceeds to a satisfactory conclusion
- Be a “comfort factor” for new adopters
- Reduce dependency on “stars” in the machine learning project team, so the project can be continued or replicated by any team member
The most commonly used framework is the Cross Industry Standard Process for Data Mining (CRISP-DM). The initiative was launched in 1996, led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). Over 300 organizations contributed to the process model.
The goal was to create a data-centric project methodology that was:
- Application/Industry neutral
- Tool neutral
- Focused on business issues as well as technical analysis
The CRISP-DM methodology is a hierarchical process model:
- At the top level, the process is divided into six different generic phases, ranging from business understanding to deployment of project results.
- The next level elaborates each of these phases, comprising several generic tasks. At this level, the description is generic enough to cover all data science scenarios.
- The third level specializes these tasks for specific situations. For example, the generic task might be cleaning data, and the specialized task could be cleaning of numeric or categorical values.
- The fourth level is the process instance—the record of actions, decisions, and results of an actual execution of a data mining project.
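The four-level hierarchy above can be sketched as a small data structure. This is purely illustrative: the phase and generic-task names follow CRISP-DM, but the dictionaries, the example specialized tasks, and the `expand` helper are assumptions made for the sketch, not part of the standard.

```python
# Illustrative sketch of CRISP-DM's hierarchical process model.
# Level 1: a generic phase; Level 2: its generic tasks.
phases = {
    "Data Preparation": ["select data", "clean data", "construct data"],
}

# Level 3: a generic task specialized for a concrete situation,
# e.g. "clean data" split into numeric vs. categorical cleaning.
specialized = {
    "clean data": ["clean numeric values", "clean categorical values"],
}

def expand(phase):
    """Expand a phase's generic tasks into specialized tasks where defined."""
    tasks = []
    for task in phases[phase]:
        tasks.extend(specialized.get(task, [task]))
    return tasks

print(expand("Data Preparation"))
# ['select data', 'clean numeric values', 'clean categorical values', 'construct data']
```

Level 4 would then be a dated log of which of these specialized tasks were actually executed on a given project, along with their outcomes.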
The six generic phases are represented in the diagram:
- Business Understanding
  - Confirm the project objectives and requirements from the business perspective
  - Define the data science approach that will answer the specific business objective
- Data Understanding
  - Collect the initial data and become familiar with it
  - Identify data quality problems
- Data Preparation
  - Select data tables, records, and attributes
  - Undertake any data transformation and cleaning that is required
- Modeling
  - Select modeling techniques
  - Calibrate model parameters and build the models
- Evaluation
  - Confirm that the business objectives have been achieved
- Deployment
  - Deploy models and “productionise” if required
  - Develop and implement a repeatable process that enables the organisation to monitor and maintain each model’s performance
Of course, the process continues after a solution has been deployed. The lessons learned during the process can trigger new, often more focused business questions and subsequent data science processes will benefit from the experiences of previous ones. This is depicted by the circular arrow.
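The cyclical nature of the process can be sketched as a simple loop over the six phases. This is a minimal sketch: the phase names come from CRISP-DM, but the `run_crisp_dm_cycle` function, its iteration count, and the log structure are hypothetical placeholders for real project work.

```python
# Minimal sketch of the CRISP-DM cycle: the six phases run in order,
# and lessons learned feed into a new iteration (the circular arrow).
PHASES = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

def run_crisp_dm_cycle(max_iterations=2):
    """Run the phases in sequence, repeating for each project iteration."""
    log = []
    for iteration in range(1, max_iterations + 1):
        for phase in PHASES:
            log.append((iteration, phase))  # placeholder for real phase work
    return log

history = run_crisp_dm_cycle()
print(len(history))  # 12 entries: 6 phases x 2 iterations
```

In a real project the loop would not be a fixed count: evaluation results and newly surfaced business questions decide whether another pass is needed.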
Since CRISP-DM was first conceived in 1996, much has changed, especially with big data, ML use cases, and IT advancements. Large, rapidly changing data sets, streamed data, real-time output, and self-learning models were not part of the data mining landscape twenty years ago. CRISP-DM also doesn’t cover the important aspects of model deployment and monitoring in sufficient detail. As a result, the requirements of many modern ML projects don’t fit well with this old framework.
In my next blog, I’ll present a new framework that works well with modern ML projects in the Intelligent Enterprise.
For an in-depth look into the intelligent possibilities for your business, review the August 2018 Forrester Consulting study, Powering The Intelligent Enterprise With AI, Machine Learning, And Predictive Analytics, commissioned by SAP.