Welcome to week 2 of Machine Learning in a Box!
Data Science Project Methodology
Before we get started
In case you are jumping on a moving train, here is the link to the introduction blog of the Machine Learning in a Box series.
First of all, I’d like to say thank you for the engagement, both in terms of traffic and on social media, related to this initiative. Now the message is clear (and I can start feeling the pressure on my shoulders).
Also here is the link to the SAP CodeTalk recording introducing the Machine Learning in a Box project with Ian Thain.
For those who have never heard of SAP CodeTalk, these are 5-to-10-minute interview videos with real developers, focusing on interesting projects and topics they are working on with SAP technology.
You can also find the SAP CodeTalk YouTube playlist here.
Quick recap from last week
Based on the provided articles, we can agree that Machine Learning is a subset of Data Science.
In my opinion, you can’t do machine learning without understanding some data science concepts. However, Machine Learning focuses on techniques that leverage the data to fine-tune the use of algorithms, whereas Data Science encompasses a much broader scope and includes a wider spectrum of roles (Data Architects, Data Engineers, Statisticians, etc.).
And of course, if you have a different point of view, let’s open the discussion.
Data Science Project Methodology
I know a lot of you are very eager to get started with installing software, playing with data and writing scripts, but before that we need to set the scene so that you will be successful in running your first machine learning projects.
And to execute successfully, and to be repeatedly successful, you will need to apply a methodology, just like for any other kind of project!
You cannot imagine how many times I saw people who started digging and mining data without identifying an objective, and guess what was the outcome… not really good.
There are a number of different project approaches that have been developed and constantly refined to ensure data science projects are reliable, repeatable and successful.
If we exclude the “home grown” project methodologies, the most commonly used methodologies are SEMMA and CRISP-DM.
As you will read, CRISP-DM is probably the way forward, as it will allow you to better engage with the “business” and implement an iterative approach that covers all phases including the deployment.
SEMMA, which stands for “Sample, Explore, Modify, Model and Assess”, is a popular project methodology developed by the SAS Institute.
The SEMMA process phases are the following:
|Sample||The process starts with data sampling and partitioning. The data set should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently.|
|Explore||This is the data understanding and discovery phase, usually executed with a set of visualizations. It covers the discovery of anticipated and unanticipated relationships between the variables, but also anomalies.|
|Modify||This is also known as data preparation and feature selection and engineering where you will select, create and transform variables.|
|Model||In this phase, the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.|
|Assess||The evaluation of the modeling results that shows the reliability and usefulness of the created models.|
For reference, here is the Wikipedia page related to SEMMA: https://en.wikipedia.org/wiki/SEMMA
CRISP-DM, which stands for “CRoss-Industry Standard Process for Data Mining”, is the most commonly used approach by data mining experts and was introduced in 1996 by Daimler Chrysler (then Daimler-Benz), SPSS (then ISL) and NCR (Teradata).
The CRISP-DM phases are the following:
|Business Understanding||This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.|
|Data Understanding||The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.|
|Data Preparation||The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.|
|Modeling||In this phase, the focus is on applying various modeling (data mining) techniques on the prepared variables in order to create models that possibly provide the desired outcome.|
|Evaluation||At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.|
|Deployment||Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring or data mining process. In many cases, it will be the customer, not the data analyst, who will carry out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand up front the actions which will need to be carried out in order to actually make use of the created models.|
The arrows in the process diagram indicate the most important and frequent dependencies between phases.
The outer circle in the diagram symbolizes the cyclic nature of the data mining process itself.
A data mining process continues after a solution has been deployed.
The lessons learned during the process can trigger new, often more focused business questions. Subsequent data mining processes will benefit from the experiences of previous ones.
(Source: Wikipedia, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining)
The diagram below depicts the different phases and the associated tasks:
CRISP-DM Phase 1 – Business Understanding
This phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. This requires:
- Business Objective
- It states, from a business perspective, what the client really wants to accomplish, which usually addresses a business pain, like “Churn reduction to increase profit”.
- A proper statement will help uncover important factors, at the beginning, that can influence the outcome of the project.
- By neglecting this step, a great deal of effort might be spent producing the right answers to the wrong questions.
- Current Situation
- Assessing the current situation also helps to uncover more details and facts about all the resources, constraints, assumptions and other factors that should be considered.
- Data Mining Goals
- Compared to the Business Objective, which states objectives in business terminology, the Data Mining Goals state the project objectives in data mining terms (algorithm families or techniques to be applied).
- For example, when the business goal is “Increase catalog sales to existing customers”, a data mining goal can be “Predict how many widgets a customer will buy, given their purchases over the past three years, demographic information (age, salary, city) and the price of the item”.
- This will be “the list of questions used to solve the problem”.
- Project Planning
- This should result in the production of a project plan which describes the intended plan for achieving the data mining goals and the business goals.
- The plan should specify the anticipated set of steps to be performed during the rest of the project including an initial selection of tools and techniques.
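To make the catalog-sales example above a bit more concrete, here is a minimal Python sketch of how such a data mining goal could be framed as a supervised learning problem: the inputs named in the goal become features, and the number of widgets bought becomes the target. The class name, the field names and the sample values are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass
class CustomerRecord:
    """One example for the hypothetical widget-sales goal."""
    past_purchases: int   # purchases over the past three years
    age: int
    salary: float
    city: str
    item_price: float
    widgets_bought: int   # the target: what the model should predict


FEATURES = ["past_purchases", "age", "salary", "city", "item_price"]
TARGET = "widgets_bought"


def to_features_and_target(records):
    """Split records into feature rows (X) and target values (y)."""
    X = [[getattr(r, name) for name in FEATURES] for r in records]
    y = [getattr(r, TARGET) for r in records]
    return X, y


# Made-up records, for illustration only.
records = [
    CustomerRecord(12, 34, 52000.0, "Paris", 9.99, 3),
    CustomerRecord(2, 51, 48000.0, "Berlin", 9.99, 0),
]
X, y = to_features_and_target(records)
```

Writing the goal down this way early on makes it easier to check, during Data Understanding, whether every listed input is actually available.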
CRISP-DM Phase 2 – Data Understanding
This phase usually starts with an initial data collection, then proceeds with activities in order to get familiar with the data and finally verify that the data is appropriate for your needs.
You can then start identifying data quality problems, discovering first insights into the data or detecting interesting subsets to form hypotheses for hidden information.
Here are the tasks to be completed during this phase:
- Collect data
- Outline data requirements from the project resources and verify data availability
- Acquire the data listed in the project resources.
- Describe data
- Examine the “gross” or “surface” properties of the acquired data.
- Report on the results.
- Explore data
- Tackles the data mining questions, which can be addressed using querying, visualization and reporting:
- Distribution of key attributes, results of simple aggregations.
- Relations between pairs or small numbers of attributes.
- Properties of significant sub-populations, simple statistical analyses.
- May address the data mining goals directly.
- May contribute to or refine the data description and quality reports.
- May feed into the transformation and other data preparation needed.
- Verify data quality
- Examine the quality of the data, addressing questions such as: “Is the data complete?”, “Are there missing values in the data?”
- Find Outliers
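As an illustration of the "describe", "verify data quality" and "find outliers" tasks, here is a small sketch using only the Python standard library. It assumes missing values are represented as `None`, and it uses Tukey's 1.5 × IQR rule to flag outliers, which is just one common convention among many. The `ages` column is made-up data.

```python
import statistics


def describe(values):
    """Summarize a numeric column that may contain missing values (None)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
    }


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in present if v < low or v > high]


# Made-up column with one missing value and one suspicious entry.
ages = [23, 25, 31, 35, None, 38, 41, 44, 47, 120]
summary = describe(ages)
outliers = iqr_outliers(ages)
print(summary)
print("Outliers:", outliers)
```

Whether a flagged value is an error or a genuine extreme case is a question for the business, not something the rule can decide.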
CRISP-DM Phase 3 – Data Preparation
This phase covers all activities related to the construction of the final dataset from the initial raw data.
The data preparation tasks are likely to be performed multiple times and not in any prescribed order.
These tasks include table, record and attribute selection as well as transformation and cleaning of data that will feed the modeling tools.
The data preparation phase covers the following tasks:
- Select data
- This is where you will decide which data you have access to and which one will be used in your data mining activities.
- You will have to articulate the rationale for using or not using the data (availability, volume, quality, etc.).
- Clean data
- Data is never perfect or clean: there are always inconsistencies, missing values and sometimes outliers.
- You’ll most likely need to “adjust” by excluding some cases or individual variables, or by replacing some data with default values or with a more sophisticated technique (like imputation).
- Construct and Integrate data
- This is where you will derive new variables, build aggregations or merge data sets (tables).
- This task will be heavily influenced by the data cleaning you executed before.
- Format data
- Sometimes the modeling tool requires specific formats to be applied to your mining data set.
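The cleaning and construction tasks above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: rows are dictionaries, missing values are `None`, imputation uses the column mean, and the derived variable is a simple ratio. The column names and values are hypothetical.

```python
import statistics


def drop_incomplete(rows, column):
    """Exclude the cases where `column` is missing."""
    return [r for r in rows if r[column] is not None]


def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of the known values."""
    mean = statistics.mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: mean if r[column] is None else r[column]} for r in rows]


def add_ratio(rows, numerator, denominator, name):
    """Construct a derived variable as the ratio of two existing columns."""
    return [{**r, name: r[numerator] / r[denominator]} for r in rows]


# Made-up raw data with missing values.
raw = [
    {"customer": "A", "salary": 40000.0, "spend": 2000.0},
    {"customer": "B", "salary": None, "spend": 1500.0},
    {"customer": "C", "salary": 60000.0, "spend": None},
    {"customer": "D", "salary": 50000.0, "spend": 1000.0},
]

cleaned = drop_incomplete(raw, "spend")   # exclude cases without a spend value
cleaned = impute_mean(cleaned, "salary")  # fill in the missing salary
prepared = add_ratio(cleaned, "spend", "salary", "spend_ratio")
```

Note that the order matters here: customer C is excluded before the mean salary is computed, so the imputed value depends on the earlier cleaning decision.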
CRISP-DM Phase 4 – Modeling
After all the preparation work, this is where you will start building models.
The modeling phase covers the following tasks:
- Select the modeling techniques
- This will be based on your data mining objectives, like classification, regression, time series etc.
- Sometimes, multiple modeling techniques can be used to achieve the same data mining goal.
- Design test scenario(s)
- Before actually building a model, define the procedure or mechanism to test the model’s quality and validity.
- For example, in classification, it is common to use error rates as quality measures.
- Therefore, typically you will separate the dataset into a “learn” and “test” set, build the model on the “learn” set and estimate its quality on the “test” set.
- Build model(s)
- When building models, most tools give you the option of adjusting a variety of settings, and these settings have an impact on the structure of the final model
- Some model types can easily be translated into a simple equation; other, more complex ones may require more sophisticated formats.
- Model(s) assessment
- You can now interpret the models according to domain knowledge, the data mining success criteria and the desired test design.
- This opens the discussion with the business analysts and the domain experts.
- This phase only considers the models, whereas the “Evaluation Phase” also takes into account all the other results that were produced in the course of the project.
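The test design described above (split into a “learn” and a “test” set, build on one, measure the error rate on the other) can be sketched as follows. The "model" is a deliberately trivial nearest-centroid classifier on a single feature, chosen only to keep the example short; the data, labels and function names are made up for illustration.

```python
import random


def split_learn_test(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def fit_centroids(learn):
    """Build a toy model: the mean feature value of each class."""
    sums, counts = {}, {}
    for x, label in learn:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def error_rate(centroids, test):
    """Share of test cases the model gets wrong."""
    errors = sum(1 for x, label in test if predict(centroids, x) != label)
    return errors / len(test)


# Made-up, well-separated data: (feature value, class label).
data = [(1.0 + 0.1 * i, "stay") for i in range(10)]
data += [(5.0 + 0.1 * i, "churn") for i in range(10)]

learn, test = split_learn_test(data)
model = fit_centroids(learn)
test_error = error_rate(model, test)
print(f"Error rate on the test set: {test_error:.2f}")
```

The important design choice is that the test set plays no part in building the model, so the error rate is an estimate of how the model behaves on data it has not seen.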
CRISP-DM Phase 5 – Evaluation
During this phase, you will evaluate the model(s) and review the previous steps to confirm that the business objectives are met.
At the end of this phase, you will be able to decide whether or not the produced results should be used (which might not be just scores, but also insights, issues etc.).
The evaluation phase covers the following tasks:
- Evaluate results
- Assess the degree to which the model meets the business objectives.
- Determine if there is some business reason why this model is deficient.
- Test the model(s) on real test applications if time and budget constraints permit.
- Assess any other data mining results generated.
- Identify additional challenges, information or hints for future directions.
- Reviewing the process
- Conduct a more thorough review of the data mining engagement in order to determine if there is any important factor or task that has been overlooked.
- Review any quality assurance issues.
- “Did we correctly build the model?”
- Determining the next steps
- Decide whether to complete the project and move on to deployment, if appropriate, or whether to initiate further iterations or set up new data mining projects.
- Determine if there are any remaining resources or budget constraints that influence the decision.
CRISP-DM Phase 6 – Deployment
The knowledge gained during the project will now need to be organized and presented in a way that the customer can use.
However, depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise.
The deployment phase covers the following tasks:
- Planning deployment
- Deploy the data mining result(s) into the business.
- Document the procedure for future deployment.
- Planning monitoring and maintenance
- This is where the data mining results become part of the day-to-day business environment.
- Helps to avoid unnecessarily long periods of incorrect usage of data mining results.
- Requires a detailed monitoring process.
- Reporting final results
- The project leader and team write up a final report.
- This might only be a summary of the project and experiences.
- May be a final and comprehensive presentation of the data mining result(s).
- Reviewing project
- Assess what went right and what went wrong, what was done well and what needs to be improved.
As you were reading (and I hope you enjoyed it), you probably realized how crucial it is to apply a robust methodology in order to be successful with your Data Science and Machine Learning projects.
Also, we saw that data activities can consume a significant amount of time (sometimes up to 60-80% of the overall project effort). And there is a good reason for that!
There is something just as bad as “answering the wrong question with the right data”, which is “answering the right question with the wrong data”.
Ultimately, we should always focus on delivering what the customer needs.
If you want to get additional details about each phase and the associated tasks, I encourage you to have a look at the Getting Started with Data Science openSAP course from Stuart Clarke.
The above content summarizes my TechEd 2016 session ANP160: “SAP Predictive Analytics – 101 Discovery Session” (with some updates).
Now, let’s see if we can keep the traction going with more thoughts, comments and contributions!
My ask for this week is for you to share your opinion about the CRISP-DM methodology or any other methodology you have used so far. What benefits or bottlenecks do you see in implementing such a methodology for your Machine Learning projects?
(Remember sharing && giving feedback is caring!)
UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs:
- Introducing “Project: Machine Learning in a Box”
- Machine Learning in a Box (week 2) : Project Methodologies
- Recap Machine Learning in a Box (week 2) : Project Methodologies
- Machine Learning in a Box (week 3) : Algorithms Learning Styles
- Machine Learning in a Box (week 4) : Get your environment up and running
- Machine Learning in a Box (week 5) : Upload Machine Learning Datasets
- Machine Learning in a Box (week 6) : SAP HANA R Integration
- Machine Learning in a Box (week 7) : Jupyter Notebook
- Machine Learning in a Box (week 8) : SAP HANA EML and TensorFlow Integration