Evgeny Arnautov

Machine Learning inside Data Attribute Recommendation

TL;DR: Data Attribute Recommendation uses dynamically created deep learning models (built with TensorFlow and trained on GPUs).

Introduction

Whether you are considering using the Data Attribute Recommendation service in production, comparing it with other ML solutions, or are simply a “deep dive” person, at some point you will want to know how Data Attribute Recommendation makes predictions. In this blog post, I give more technical details and answer the most common questions.

I assume that you are familiar with the basic concepts of machine learning. If not, please have a look at the links at the end of this blog post for further information.

 

Task

Data Attribute Recommendation solves single-label and multi-label classification problems in the field of supervised machine learning. Supervised means that training data with labels (the correct values, i.e. the target column) must be provided first in order to train a model. By multi-label, I mean that one model may be trained to predict multiple target columns at once, independently of each other (e.g. product type, production plant, and danger flag). Similarly, single-label means the model predicts only a single target column. These tasks are handled by separate templates that we provide through the service's API.

Features/Labels

Before uploading data to Data Attribute Recommendation, the user must specify its structure, in other words, name all the features and labels along with their types. This is called a data set schema.

Example
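As an illustration, a schema for a product data set could be written down as in the Python sketch below. The key names (`name`, `features`, `labels`, `label`, `type`) and the type values are assumptions made for this example, not the documented request format of the service, and `column_names` is a hypothetical helper.

```python
# Hypothetical schema: key names and type values are illustrative
# assumptions, not the documented request format of the service.
schema = {
    "name": "product-schema",
    "features": [
        {"label": "description", "type": "TEXT"},
        {"label": "manufacturer", "type": "CATEGORY"},
        {"label": "price", "type": "NUMBER"},
    ],
    "labels": [
        {"label": "product_type", "type": "CATEGORY"},
        {"label": "production_plant", "type": "CATEGORY"},
    ],
}


def column_names(schema):
    """Return all column names: features first, then labels, in schema order."""
    columns = schema["features"] + schema["labels"]
    return [column["label"] for column in columns]


print(column_names(schema))
```

Two label columns in the schema correspond to the multi-label case; a schema with a single label column would describe a single-label task.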

Dataset

Data Attribute Recommendation accepts data in UTF-8 encoded CSV format with a semicolon (;) delimiter. To reduce the size of the transmitted data, the CSV file may be compressed with gzip; in that case, the .gz extension should be added to the file name.
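A file in this format can be produced with the Python standard library alone. The sketch below is a minimal example; the function name `write_dataset`, the column names, and the file name are assumptions chosen for illustration.

```python
import csv
import gzip
import os
import tempfile


def write_dataset(path, header, rows):
    """Write rows as UTF-8, semicolon-delimited CSV, compressed with gzip."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter=";")
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical columns and values, used only to exercise the function.
path = os.path.join(tempfile.mkdtemp(), "products.csv.gz")
write_dataset(
    path,
    ["description", "manufacturer", "product_type"],
    [["Cordless drill 18V", "ACME", "power_tools"],
     ["Paint; white, 5l", "Colorex", "paints"]],
)

# Read the file back to check the delimiter and the quoting of ';' in values.
with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f, delimiter=";"))
```

Using the `csv` module rather than joining strings by hand means that values containing a semicolon are quoted automatically.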

Model templates

Before speaking about machine learning models, it makes sense to introduce another notion: model templates. In a nutshell, a model template is a set of data preprocessing steps and logic that creates a model. Those templates are a result of multiple co-innovation projects with customers from different industries and with different use cases.

A model template consists of preprocessing and model builder logic.

So far, Data Attribute Recommendation has three built-in model templates; the deep learning ones use artificial neural networks (deep learning architectures) that are feed-forward networks. This is the link to the current list of templates. In the future, additional templates with other approaches (RNNs, CNNs, ensembles) may be added to the service as needed.

The second part of a model template consists of model building logic that takes a data set schema as input and generates an empty (untrained) model for it. The consequence is that a unique model architecture is created for every dataset.
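To make the idea of schema-driven model building concrete, here is a minimal sketch in plain Python. It does not reproduce the service's actual builder: the function `build_architecture`, the encoder names, the hidden layer sizes, and the schema key names are all assumptions for illustration. It only shows how a feed-forward layout can be derived from a schema, with one input branch per feature and one output head per label.

```python
# Assumed mapping from feature type to an input encoder; illustrative only.
ENCODERS = {
    "TEXT": "text_embedding",
    "CATEGORY": "category_embedding",
    "NUMBER": "normalization",
}


def build_architecture(schema, hidden_units=(128, 64)):
    """Describe an untrained feed-forward model derived from a schema.

    One input encoder per feature, shared dense hidden layers,
    and one softmax output head per label column.
    """
    inputs = [(f["label"], ENCODERS[f["type"]]) for f in schema["features"]]
    hidden = [("dense", units) for units in hidden_units]
    outputs = [(l["label"], "softmax") for l in schema["labels"]]
    return {"inputs": inputs, "hidden": hidden, "outputs": outputs}


# Two hypothetical schemas lead to two different architectures.
single_label = {
    "features": [{"label": "description", "type": "TEXT"}],
    "labels": [{"label": "product_type", "type": "CATEGORY"}],
}
multi_label = {
    "features": [{"label": "description", "type": "TEXT"},
                 {"label": "price", "type": "NUMBER"}],
    "labels": [{"label": "product_type", "type": "CATEGORY"},
               {"label": "danger_flag", "type": "CATEGORY"}],
}

arch_single = build_architecture(single_label)
arch_multi = build_architecture(multi_label)
```

In a real implementation, each entry of this description would be turned into framework layers (for example in TensorFlow); the plain dictionaries here keep the sketch free of dependencies.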

Recently, Data Attribute Recommendation added a new template to its offering: the AutoML Template. The AutoML Template automates a number of steps in the machine learning process, from data preparation and feature selection to model choice and model parameter optimisation. It can find the best preprocessing and modelling algorithms, along with the best hyper-parameters, for the given input data within its algorithm portfolio.

 

Machine learning model

Let’s imagine you’ve uploaded your data along with a data set schema and it has been successfully validated. You’ve also chosen the model template that best fits your needs. Now you are ready to train a model, which is just another API call. When you do this, Data Attribute Recommendation requests a GPU container (or a CPU container, if you are on a trial or if you choose the AutoML Template) from the underlying platform and trains a model with your data. A few points are worth mentioning here:

For the deep learning templates:

  • First, TensorFlow is used as a framework for training and prediction.
  • Second, before starting the training, the service performs a random split based on the label values and divides the data into three parts (training, validation, and test) in the proportion 80/10/10. After the model is trained, you can see metrics such as accuracy or F1-score, computed on the test part of the data, via a GET API call on the model.
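The 80/10/10 split described above can be sketched as follows. This is not the service's code: it assumes that "based on label values" means a stratified split, in which each label value is divided in the same proportion, and the function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, train_frac=0.8, val_frac=0.1, seed=0):
    """Randomly split rows into train/validation/test, per label value.

    Each label group is shuffled and cut separately, so every split keeps
    roughly the original label distribution. The test split receives
    whatever remains after the training and validation cuts.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    train, validation, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = round(len(group) * train_frac)
        n_val = round(len(group) * val_frac)
        train.extend(group[:n_train])
        validation.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])
    return train, validation, test


# Hypothetical data: 100 rows of (id, label) with two equally frequent labels.
rows = [(i, "A") for i in range(50)] + [(i, "B") for i in range(50, 100)]
train, validation, test = stratified_split(rows, label_of=lambda row: row[1])
```

With very rare label values, some groups are too small to appear in all three splits; how the service treats such cases is not covered here.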

It’s important to understand that the final model may be presented as a combination of training data, dataset schema, and model template.

For the AutoML Template:

  • First, AutoML Template trains several models, each of which uses a different combination of preprocessing algorithms or ML algorithms with a different set of hyper-parameters.
  • Second, AutoML chooses the best ML model based on each model's performance on the validation set.
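The selection step can be illustrated with a small sketch. The candidates below are simple threshold rules standing in for trained models, accuracy is assumed as the performance measure, and the names `accuracy` and `select_best` are hypothetical; the AutoML Template's real candidates and metric may differ.

```python
def accuracy(predict, examples):
    """Fraction of (input, label) examples that predict() gets right."""
    correct = sum(1 for x, y in examples if predict(x) == y)
    return correct / len(examples)


def select_best(candidates, validation):
    """Return (name, score) of the candidate with the best validation accuracy."""
    scores = {name: accuracy(predict, validation)
              for name, predict in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy validation set and stand-in "models" with different hyper-parameters.
validation = [(1, "low"), (4, "low"), (6, "high"), (9, "high")]
candidates = {
    "threshold_3": lambda x: "high" if x > 3 else "low",
    "threshold_5": lambda x: "high" if x > 5 else "low",
    "always_low": lambda x: "low",
}

best_name, best_score = select_best(candidates, validation)
print(best_name, best_score)
```

Scoring on the validation set rather than the training set is what keeps the choice from simply rewarding the candidate that memorised the training data best.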

Final words

Initially, this post aimed to give a brief introduction to the machine learning part of Data Attribute Recommendation, but it is a challenge to isolate only this area when speaking about a comprehensive end-to-end implementation.

 

I hope this is useful to you. It would be great to receive your feedback/questions in the comments. If you’d like to read more on Data Attribute Recommendation, and be notified about new posts, follow Data Attribute Recommendation.

 

Useful links:
