Using data mining best practices to ensure optimal predictive flow
This video shows an example of how to build an end-to-end machine learning model using SAP Predictive Analysis.
It also walks you through training your model with bias in your data in mind.
It demonstrates the effect of sampling incorrectly from a sorted, and therefore biased, dataset. The dataset is based on the well-known Iris dataset that is provided with R. Let me know if you would like a copy of the dataset so that you can try this yourself.
Finally, the trained model is applied to new data in order to make predictions.
As the video illustrates, how you sample your data has a strong influence on the outcome of your data mining model.
To reduce bias in the data, how would you ensure that the records in both samples are picked at random and that none are reused?
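To make the effect concrete outside the tool, here is a minimal Python sketch (not the flow from the video). It assumes only the layout of the Iris data as shipped with R: 150 rows, 50 per species, sorted by species. The function names first_n and simple_random are my own and merely mirror the option names in SAP Predictive Analysis.

```python
import random
from collections import Counter

# Species labels in the order the Iris data ships with R:
# 50 rows per species, sorted by species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50

def first_n(rows, n):
    """Take the first n rows in their stored order."""
    return rows[:n]

def simple_random(rows, n, seed=42):
    """Draw n rows at random, without replacement."""
    return random.Random(seed).sample(rows, n)

first_counts = Counter(first_n(labels, 100))
random_counts = Counter(simple_random(labels, 100))

print("First N:      ", dict(first_counts))
print("Simple Random:", dict(random_counts))
```

With First N the training sample never contains a single virginica row, so a model trained on it cannot learn that class. The random draw picks up rows from all three species.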
SAP Predictive Analysis offers the following sampling options: First N, Last N, Every N, Simple Random and Systematic Random. Which one would you choose?
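As a rough illustration of what these five options could mean, here is a Python sketch. It is my reading of the option names, not SAP's implementation: in particular, whether Every N starts at the first or the N-th row, and how Systematic Random picks its start, are assumptions.

```python
import random

def first_n(rows, n):
    """First N: the first n rows in stored order."""
    return rows[:n]

def last_n(rows, n):
    """Last N: the last n rows in stored order."""
    return rows[-n:] if n > 0 else []

def every_n(rows, n):
    """Every N: every n-th row, starting with the first (assumption)."""
    return rows[::n]

def simple_random(rows, n, rng):
    """Simple Random: n rows drawn at random, without replacement."""
    return rng.sample(rows, n)

def systematic_random(rows, step, rng):
    """Systematic Random: random start within the first step rows,
    then every step-th row after that (assumption)."""
    start = rng.randrange(step)
    return rows[start::step]

rows = list(range(1, 13))   # twelve row ids, 1..12
rng = random.Random(0)      # fixed seed so the run is repeatable

print(first_n(rows, 4))
print(last_n(rows, 4))
print(every_n(rows, 3))
print(simple_random(rows, 4, rng))
print(systematic_random(rows, 3, rng))
```

Only the random options break the link between a row's position and its chance of being picked. First N and Last N depend entirely on how the data is sorted, and Every N can still pick up any pattern that repeats with the same period.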
To reduce the risk of over-learning (overfitting), make sure that no record is reused across the training, testing and validation samples.
This matters especially when the data is sorted, because sampling by position would then introduce a bias in the result.
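One common way to get random samples with no reuse is to shuffle the row positions once and then cut the shuffled list into consecutive pieces. The Python sketch below shows that idea. The function name split_without_reuse and the 60/20/20 proportions are illustrative assumptions, not settings taken from SAP Predictive Analysis or the video.

```python
import random

def split_without_reuse(rows, train_frac=0.6, test_frac=0.2, seed=7):
    """Shuffle row positions once, then slice them into three
    non-overlapping parts: training, testing and validation."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)

    n_train = round(len(rows) * train_frac)
    n_test = round(len(rows) * test_frac)

    train_idx = indices[:n_train]
    test_idx = indices[n_train:n_train + n_test]
    valid_idx = indices[n_train + n_test:]   # whatever is left over

    def pick(idx):
        return [rows[i] for i in idx]

    return pick(train_idx), pick(test_idx), pick(valid_idx)

# Row id plus species, sorted by species like the Iris data in R.
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
rows = list(enumerate(species))

train, test, valid = split_without_reuse(rows)
print(len(train), len(test), len(valid))
```

Because every position appears exactly once in the shuffled list, no record can end up in two samples, and the shuffle removes the effect of the original sort order.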