If you are jumping on a moving train, here is the link to the introduction blog of the Machine Learning in a Box series, which allows you to follow the series from the start. At the end of that introduction blog, you will find the links to each element of the series.
Before we get started, a quick recap from last week
Last week, we saw how a project methodology could help you become successful with your Machine Learning projects.
Here is a link to the Machine Learning in a Box week 2 recap, which I wrote before starting this post about Algorithms Learning Styles. In it, you will find some personal thoughts about the CRISP-DM methodology.
Welcome to week 3 of Machine Learning in a Box!
Algorithms Learning Styles
When I started my journey at KXEN, I didn’t have a pedigree in data mining or data science. I had a tech support and programming background. So, I had an understanding of what the word algorithm means in terms of programming, and I discovered that in data science it means the same thing.
When we were at school, we all solved algebra problems like “Find the equation of the line that passes through the points (-1, -1) and (1, 2)” or “Find the minima and maxima of the function f(x) = x^4 − 8x^2 + 5”. And we did that manually… by applying an algorithm we learned during our classroom study.
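To make the point that a school exercise is already an algorithm, here is a minimal Python sketch of the first problem. The function name `line_through` is my own choice for illustration, and exact fractions are used only to keep the result readable.

```python
from fractions import Fraction


def line_through(p, q):
    """Return (slope, intercept) of the line passing through points p and q."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        raise ValueError("vertical line: the slope is undefined")
    slope = Fraction(y2 - y1, x2 - x1)   # rise over run
    intercept = y1 - slope * x1          # solve y = slope * x + intercept for intercept
    return slope, intercept


slope, intercept = line_through((-1, -1), (1, 2))
print(f"y = {slope} * x + {intercept}")  # prints: y = 3/2 * x + 1/2
```

The steps are exactly the ones we applied by hand: compute the slope from the two points, then substitute one point to get the intercept.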
And there are plenty of algorithms to help you solve a single type of problem, which in our Machine Learning project is usually represented by our data mining goal.
So, we need a way to organize our toolbox of algorithms. There are many ways to organize and group algorithms together, and here I will use something called the “learning style”.
Using the “learning style” helps you think about how you will be preparing and using your data to build your model. Ultimately, you will try and pick the most appropriate algorithms to test and compare results.
Let’s take a look now at the main learning styles for machine learning algorithms, and the associated sub-categories.
With supervised learning, you will infer a function using a set of labeled data where the outcome (the target) is known. This dataset is also known as the training data set.
Each record of the training dataset can be represented as a pair consisting of an input vector of features (also called variables or dimensions) and the associated output value.
Therefore, the goal of a supervised learning algorithm is to analyze the training data and produce a function that can score a new input vector of features and return the predicted output value.
This will require the algorithm to generalize patterns (in the inferred function) from the training data in order to correctly determine the output value for any new and unseen input vector of features in a “reasonable” way.
There is a wide range of supervised learning algorithms, and they all come with their strengths and weaknesses. This implies there isn’t a “magic” algorithm that can address all supervised learning problems.
You can further group supervised learning algorithms like this:
- Classification
This is applicable when your target is represented as a category or a class, like “true” and “false” or “A” and “B” for a binary classification, or “A”, “B” and “C” for a multi-class classification.
The following diagram depicts a simple classification example where each icon is positioned based on its input values (x1 and x2 axes) and colored based on its output value. The inferred function is the green line (a linear function here), and each question mark is a new input that the inferred function will assign to one side or the other.
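As a rough illustration of a linear classification, here is a minimal sketch using a perceptron, which is only one of many algorithms able to infer a separating line. The data points, the class names “A” and “B”, and the function names are all made up for this example.

```python
def train_perceptron(points, labels, epochs=20):
    """Learn (w1, w2, b) so that the line w1*x1 + w2*x2 + b = 0 separates the classes.

    labels must be +1 or -1. The weights only change when a point is misclassified.
    """
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:  # wrong side of (or on) the line
                w1 += y * x1
                w2 += y * x2
                b += y
    return w1, w2, b


def classify(model, point):
    """Assign a new input to one side of the learned line or the other."""
    w1, w2, b = model
    return "A" if w1 * point[0] + w2 * point[1] + b > 0 else "B"


# Hypothetical labeled training data: +1 stands for class "A", -1 for class "B".
points = [(2.0, 3.0), (3.0, 2.5), (4.0, 4.0), (-1.0, -1.0), (-2.0, 0.5), (0.0, -2.0)]
labels = [1, 1, 1, -1, -1, -1]

model = train_perceptron(points, labels)
print(classify(model, (3.0, 3.0)))    # a "question mark" near the first group
print(classify(model, (-1.5, -1.0)))  # a "question mark" near the second group
```

The perceptron only works when a straight line can actually separate the two classes, which is the case in this toy dataset.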
- Regression
This is applicable when your target is represented as a continuous number, like a financial revenue, a weight or a temperature.
The following diagram depicts a simple regression example where each mark is positioned based on its input value (x axis) and its output value (y axis). The inferred function is the green line, which gives you the “y” output value for any “x” input value.
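Here is a minimal sketch of such a regression using ordinary least squares on a single input, which is one common way to infer the line but not the only one. The sample values and the function names `fit_line` and `predict` are assumptions made for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Return the "y" output value for any "x" input value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data: one input value and one continuous output per record.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

model = fit_line(xs, ys)
print(model)              # slope close to 1.99, intercept close to 0.05
print(predict(model, 6))  # prediction for an unseen input
```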
- Time series forecasting
This is applicable when your training dataset represents a signal or a series of values, and you need to infer the next N values using the previous data.
Some people may argue that time series forecasting is a kind of regression, except that the inferred function for time series will produce a series of values instead of a unique value like in a regression.
In addition, the data set structure for time series requires an “order” column with unique values (usually a date, but could be an increment column in some cases).
The following example shows a series of points at fixed intervals (the blue dots). The function inferred by the time series algorithm (the green line) is a cosine function that can be used to predict the next 5 values (the red dots).
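The cosine example can be sketched as follows. This is a simplified illustration that assumes the period of the signal is already known and fits only the amplitude terms by least squares; real time series algorithms usually have to detect trend and cycles themselves. The position of each value in the list plays the role of the “order” column, and all names and values are made up.

```python
import math


def fit_cosine(values, period):
    """Least-squares fit of a*cos(w*t) + b*sin(w*t) to values observed at t = 0, 1, 2, ..."""
    w = 2 * math.pi / period
    cos_t = [math.cos(w * t) for t in range(len(values))]
    sin_t = [math.sin(w * t) for t in range(len(values))]
    # Build and solve the 2x2 normal equations for a and b.
    cc = sum(c * c for c in cos_t)
    ss = sum(s * s for s in sin_t)
    cs = sum(c * s for c, s in zip(cos_t, sin_t))
    cy = sum(c * y for c, y in zip(cos_t, values))
    sy = sum(s * y for s, y in zip(sin_t, values))
    det = cc * ss - cs * cs
    a = (cy * ss - sy * cs) / det
    b = (sy * cc - cy * cs) / det
    return a, b, w


def forecast(model, start, steps):
    """Evaluate the inferred function on the next `steps` time positions."""
    a, b, w = model
    return [a * math.cos(w * t) + b * math.sin(w * t) for t in range(start, start + steps)]


# Hypothetical history: 24 points at fixed intervals following a cosine of period 12.
history = [2.0 * math.cos(2 * math.pi * t / 12) for t in range(24)]

model = fit_cosine(history, period=12)
next_values = forecast(model, start=len(history), steps=5)
print([round(v, 3) for v in next_values])
```

Note that the output is a series of 5 values rather than a single one, which is the distinction with a regression mentioned above.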
To summarize, the big difference between a classification and a regression is the representation of the target variable (the output): one is discrete (categories) and the other is continuous.
As opposed to supervised learning, with unsupervised learning, you will infer a function using a set of unlabeled data (no defined outcome).
Therefore, the inferred function is meant to describe the hidden underlying structure, patterns or distribution in the data. Unlike supervised learning, there is no known outcome to compare against, so there is no direct way to measure the accuracy or relevance of the structures and patterns found.
You can further group unsupervised learning algorithms like this:
- Clustering
This type of algorithm is applicable when you need to define groups of entities (a.k.a. clusters) based on the “similarity” or “distance” of the entity attributes compared to the overall distribution. Each clustering algorithm has its own grouping strategy, based on the distance to a center, the group density, the group distribution, etc. Likewise, some algorithms allow overlapping clusters or residual (unassigned) items, while others prevent them.
In the following example, the algorithm has defined 5 clusters using the distance to the center.
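Grouping by distance to a center can be sketched with the k-means procedure, which is one well-known center-based strategy (I am not claiming it is the one used for the picture). To keep the example short it uses 2 clusters instead of 5, and the points and starting centers are made up.

```python
import math


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and moving the centers."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its points (unchanged if empty).
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical unlabeled data: there is no target column, only attributes.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]

centers, clusters = kmeans(points, centers=[(1.0, 1.0), (8.0, 8.0)])
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

The result depends on the number of clusters and on the starting centers, both of which are choices you make and not something the data tells you.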
- Association rules
You can apply this type of algorithm to a transactional dataset that links items together, or users to items, when your goal is to extract rules about those relations. A common example of a rule is that you buy X when you buy Y. Of course, rules can be longer, involving multiple items or enforcing a certain sequence.
In the example below, a set of rules is extracted from a series of user shopping transactions.
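A minimal sketch of rule extraction, limited to rules between two items, could look like this. It scores each candidate rule with the usual support and confidence measures; the shopping baskets, the thresholds and the function name `pair_rules` are invented for the example, and real algorithms such as Apriori avoid testing every combination.

```python
from itertools import permutations


def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return rules "if x is bought then y is bought" as (x, y, support, confidence)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for x, y in permutations(items, 2):
        both = sum(1 for t in transactions if x in t and y in t)
        has_x = sum(1 for t in transactions if x in t)
        support = both / n                               # share of baskets with x and y
        confidence = both / has_x if has_x else 0.0      # share of x baskets that also hold y
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Hypothetical shopping transactions, one set of items per basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

rules = pair_rules(transactions)
for x, y, support, confidence in rules:
    print(f"{x} -> {y} (support={support:.2f}, confidence={confidence:.2f})")
```

With these baskets, only the bread and butter pair is frequent enough to pass both thresholds, in both directions.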
Other Learning Styles
With semi-supervised learning, only a portion of the input data is labeled, which means that the algorithm must learn the structures to organize the data as well as make predictions.
Labeling all your data can become really expensive and time-consuming, or worse, the data could be wrongly labeled.
If you take an image library as an example, only a small portion of the images will be labeled.
Therefore, both unsupervised and supervised techniques are leveraged to make the best use of the unlabeled data, by clustering it with the labeled data or by making best-guess predictions, and all of that is used to build the model.
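One simple way to picture the “best guess” idea is self-training with a nearest-neighbor rule: the unlabeled point closest to an already labeled point receives that point's label, and the process repeats. This is only a sketch of one possible technique, and the points, labels and function name are made up.

```python
import math


def self_train(labeled, unlabeled):
    """Pseudo-label unlabeled points one at a time, starting with the most confident guess.

    labeled is a list of (point, label) pairs, unlabeled is a list of points.
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    while remaining:
        # The most confident guess is the unlabeled point nearest to any labeled point.
        _, point, label = min(
            ((math.dist(p, q), p, lab) for p in remaining for q, lab in labeled),
            key=lambda candidate: candidate[0],
        )
        labeled.append((point, label))
        remaining.remove(point)
    return labeled


# Hypothetical data: only 2 of the 8 points come with a label.
labeled = [((0.0, 0.0), "cat"), ((10.0, 10.0), "dog")]
unlabeled = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (9.0, 9.0), (8.0, 8.0), (7.0, 7.0)]

result = dict(self_train(labeled, unlabeled))
print(result)
```

The risk with this approach is that an early wrong guess is then treated as a true label and propagated to other points.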
With reinforcement learning, the algorithm tries to find the “best ways” (a sequence of decisions or actions) to earn the greatest “reward”.
Typically, at every step a decision is taken in an environment, which leads to a reward and a new state. By repeating this many times, the algorithm learns how to improve its decisions and its ability to earn greater rewards.
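The decision, reward and state loop can be sketched with tabular Q-learning, one classic reinforcement learning algorithm. The environment here is a made-up corridor of 5 positions where only reaching the last one pays a reward, and every parameter value is an arbitrary choice for illustration.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values for a corridor where only the last state gives a reward."""
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 moves left, action 1 moves right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Decision: mostly pick the best known action, sometimes explore at random.
            best = max(q[state])
            greedy = [a for a in (0, 1) if q[state][a] == best]
            action = rng.randrange(2) if rng.random() < epsilon else rng.choice(greedy)
            # Environment: returns the new state and the reward.
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Learning: nudge the estimate toward reward + discounted future value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
policy = ["left" if left > right else "right" for left, right in q[:-1]]
print(policy)  # the learned decision for each non-goal position
```

After enough episodes, the learned policy is to move right from every position, which is the shortest way to the reward.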
To write this blog, I leveraged several sources for inspiration, details and ideas. I’ll try to group them here in alphabetical order:
- Machine Learning 101 by Towards Data Science
- Machine Learning Explained by Ronald van Loon
I hope that this blog helps clear up some of the lingo around Machine Learning, and helps you understand that these algorithms are here to help you produce the best “functions” using your training data (labeled or not), which you can then apply to new sets of data.
Next week, we will start looking at what to install to get started. So, get your internet connection ready to download SAP HANA, express edition and some additional components and tools.
Quick question to you guys:
Would you find it useful to use Slack to discuss this blog series and engage?
Any other proposal is welcome!
(Remember sharing && giving feedback is caring!)
UPDATE: Here are the links to all the Machine Learning in a Box weekly blogs:
- Introducing “Project: Machine Learning in a Box”
- Machine Learning in a Box (week 2) : Project Methodologies
- Recap Machine Learning in a Box (week 2) : Project Methodologies
- Machine Learning in a Box (week 3) : Algorithms Learning Styles
- Machine Learning in a Box (week 4) : Get your environment up and running
- Machine Learning in a Box (week 5) : Upload Machine Learning Datasets
- Machine Learning in a Box (week 6) : SAP HANA R Integration
- Machine Learning in a Box (week 7) : Jupyter Notebook
- Machine Learning in a Box (week 8) : SAP HANA EML and TensorFlow Integration