SAP Leonardo ML Foundation - Retraining part 1

fabianl · ‎02-15-2018

Introduction

This post continues the story from the last blog, where we saw how to get access to SAP Leonardo ML Foundation and which steps are required to be allowed to call the APIs.

1	Getting started with SAP Leonardo ML Foundation on SAP Cloud Platform Cloud Foundry
2	SAP Leonardo ML Foundation - Retraining part 1 (this blog)
3	SAP Leonardo ML Foundation - Retraining part 2
4	SAP Leonardo ML Foundation - Bring your own model (BYOM)

SAP Leonardo ML Foundation Architecture:

In the following lines we focus on checking and executing the retraining of the "image" classifier with our own data.

In this blog I want to focus on the doing and not on ML in general. Furthermore, you cannot use the "retraining" functionality with a trial version!

Important: Currently only the "Image Classifier Service" can be used for the retraining.

Let's start...

In general the retraining consists of the following four steps:

Uploading the data for the training

Executing the retraining job

Deploying the model

Executing the image classifier API

In detail we want

Please also check the SAP Help documentation.

Data, data, data

The first thing we need is, of course, some data as the source for training our new model.

Since spring is hopefully not far away, we are just using some nice flower data ;o)

Another (the real) reason is that "Tensorflow" provides an archive for that and we want to start simple.

Anyway, other good resources for pictures are of course ImageNet or the Fatkun Batch Download Image plugin for Chrome.

As mentioned before, we start by downloading the flower archive from here to our local device.

A part of this data will be used later for our own "flower" model with the SAP Leonardo ML Image Classification service.

Get started and check the API

A good starting point is simply to enter the "retraining URL" in a browser and have a look at the Swagger UI to get a first idea of which options we have:

In general we have three main parts for the retraining:

jobs

deployments

models

Data preparation

Before we can execute one of the APIs we need to prepare our data and upload it to AWS.

To start simple, I've decided to reduce the amount of data that comes with the archive provided by Tensorflow. I think three categories of flowers will do.

For this create the following data structure:

+-- flowers

    +-- training

        +--roses

        +--sunflowers

        +--tulips

    +-- test

        +--roses

        +--sunflowers

        +--tulips

    +-- validation

        +--roses

        +--sunflowers

        +--tulips

As documented, we need to structure the data into the 3 folders "training", "test" and "validation".

Furthermore, we split our source data 80-10-10 (~80% training, ~10% test and ~10% validation).
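The 80-10-10 split can be sketched in a few lines of Python. This is only a minimal illustration and not part of the SAP tooling: the function name `split_dataset`, the fixed seed and the sample file names are my own assumptions. You would run it once per flower category and then copy the files into the matching folders.

```python
import random

def split_dataset(files, train=0.8, test=0.1, seed=42):
    """Shuffle file names and split them into training/test/validation lists."""
    shuffled = list(files)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return {
        "training": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        # Whatever is left over (about 10 %) goes to validation.
        "validation": shuffled[n_train + n_test:],
    }

# Hypothetical file names for one category, e.g. "roses".
roses = [f"rose_{i:03d}.jpg" for i in range(100)]
parts = split_dataset(roses)
print({name: len(items) for name, items in parts.items()})
```

Shuffling before slicing avoids all images from one photo series ending up in the same folder.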

Access the AWS object store

To get access to the object storage, which runs on Amazon Web Services (AWS), we can use "minio" to operate directly on the S3 object store.

You can get the minio client here: link

In addition, we can also access the data via a UI.

For this, and also for the CLI access, we first need to initialize our file system (this needs to be done only once) by executing the following API call:

HTTP Method	GET
URL	<JOB_SUBMISSION_API_URL>
PATH	/v1/storage/endpoint
HEADER	Authorization (OAuth2 Access Token)

As a response we now get something like this:

{

    "access_key": "<access key>",

    "endpoint": "<endpoint>.files.eu-central-1.aws.ml.hana.ondemand.com",

    "message": "The endpoint is ready to use.",

    "secret_key": "<secret key>",

    "status": "Ready"

}
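If you prefer a script over a REST client, the call above could be prepared like this with the Python standard library. Treat it as a sketch: the host name and token are placeholders, and sending the token with the "Bearer" scheme is my assumption about how the OAuth2 access token is passed.

```python
import urllib.request

def build_storage_endpoint_request(base_url, access_token):
    """Prepare the GET request for /v1/storage/endpoint without sending it."""
    url = base_url.rstrip("/") + "/v1/storage/endpoint"
    return urllib.request.Request(
        url,
        # Assumption: the OAuth2 access token is sent as a Bearer token.
        headers={"Authorization": "Bearer " + access_token},
        method="GET",
    )

# Placeholder values; use your own <JOB_SUBMISSION_API_URL> and token.
req = build_storage_endpoint_request("https://job-submission.example.com", "my-token")
print(req.get_method(), req.full_url)
```

To really execute the call, pass `req` to `urllib.request.urlopen` and read the "access_key", "secret_key" and "endpoint" fields from the JSON response.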

The Minio UI

To get access to the S3 store via the minio UI, enter the URL and log on with the "access key" and the "secret key":

Afterwards we are able to see our bucket (data) with some data:

The CLI access

For access via the CLI, we again start with the authentication:

>mc.exe config host add saps3 https://<your endon aws s3>.files.eu-central-1.aws.ml.hana.ondemand.com <access key> <secret key>

Added `saps3` successfully.

Afterwards we can use the "mc" command to e.g. list our data (buckets):

mc.exe ls <bucket>/<directory>

Update: Using "cyberduck"

In addition to the previous tools, you can also use "cyberduck" to connect to your AWS S3 file system.

Create a new AWS S3 connection by entering the required data:

As a result you can access the data here:

Upload our data

Now it's time to upload our "custom" data which we want to use for our "retraining".

The easiest way is to copy our files by executing the cp command:

mc.exe cp -r E:\0_SAPCP\8_ML\1_SAP_ML\0_Development\1_first_try\flowers saps3\data

...bc557236c7_n.jpg:  146.19 MB / 146.19 MB [================================================] 100.00% 484.60 KB/s 5m8s

Afterwards we can see our uploaded data in our AWS S3 bucket:

In case something goes wrong, you can also use the following command to delete your data / bucket:

mc.exe rm --recursive --dangerous --force saps3/data

A complete overview of all commands can be displayed with the "--help" parameter.

mc.exe --help

Time for the retraining... execute the job

Now that our data is in place for the training, we can call the corresponding API:

Details:

HTTP Method	POST
URL	<RETRAIN_API_URL>
PATH	/v1/jobs
HEADER	Authorization (OAuth2 Access Token)

And the following Body:

{

  "mode": "image",

  "options": {

    "dataset": "flowers",

    "modelName": "flowers-demo"

  }

}

As a response we now get the "job id":

{

    "id": "flowers-2018-02-15t0851z"

}
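The same job submission could also be prepared from Python. Again this is just a sketch under assumptions: the host name and token are placeholders, the "Bearer" scheme and the JSON content type are my guesses, and only the path and body fields come from the call shown above.

```python
import json
import urllib.request

def build_retrain_job_request(base_url, access_token, dataset, model_name):
    """Prepare the POST request for /v1/jobs without sending it."""
    body = {
        "mode": "image",
        "options": {"dataset": dataset, "modelName": model_name},
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumptions: Bearer token and JSON content type.
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholder values; use your own <RETRAIN_API_URL> and token.
job_req = build_retrain_job_request(
    "https://retrain.example.com", "my-token", "flowers", "flowers-demo"
)
print(job_req.get_method(), job_req.full_url)
print(job_req.data.decode("utf-8"))
```

The "dataset" value has to match the folder name we uploaded to the bucket, in our case "flowers".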

By executing the corresponding GET method we can retrieve the details and the status of all "jobs":

or only our new job:

We get a response something like this:

{

    "processedTime": "2018-02-15T08:54:31.541131",

    "status": {

        "startTime": null,

        "submissionTime": null,

        "id": "flowers-2018-02-15t0851z",

        "finishTime": null,

        "status": "Pending/Scheduled"

    }

}

{

    "processedTime": "2018-02-15T08:57:34.844304",

    "status": {

        "startTime": "2018-02-15T08:57:33Z",

        "submissionTime": "2018-02-15T08:55:36Z",

        "id": "flowers-2018-02-15t0851z",

        "finishTime": null,

        "status": "Running"

    }

}

And finally, you can see it took a while:

{

    "processedTime": "2018-02-15T09:03:15.181445",

    "status": {

        "submissionTime": "2018-02-15T08:55:36Z",

        "id": "flowers-2018-02-15t0851z",

        "startTime": "2018-02-15T08:57:33Z",

        "finishTime": "2018-02-15T09:02:32Z",

        "status": "Succeeded"

    }

}
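Instead of calling the GET method by hand again and again, the status check can be wrapped in a small polling loop. The sketch below is my own helper, not something the service provides: `fetch_status` stands for any function that returns the parsed JSON shown above, and "Failed" as a second final state is an assumption (I have only seen "Pending/Scheduled", "Running" and "Succeeded").

```python
import time

# "Succeeded" appears in the responses above; "Failed" is an assumption.
FINAL_STATES = {"Succeeded", "Failed"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=60, sleep=time.sleep):
    """Call fetch_status() until the job reaches a final state."""
    for _ in range(max_polls):
        response = fetch_status()
        state = response["status"]["status"]
        if state in FINAL_STATES:
            return response
        sleep(poll_seconds)
    raise TimeoutError("job did not reach a final state in time")

# Simulated responses, so the sketch runs without any network access.
responses = iter([
    {"status": {"status": "Pending/Scheduled"}},
    {"status": {"status": "Running"}},
    {"status": {"status": "Succeeded", "finishTime": "2018-02-15T09:02:32Z"}},
])
result = wait_for_job(lambda: next(responses), sleep=lambda seconds: None)
print(result["status"]["status"])
```

In a real script `fetch_status` would send the GET request for our job id and return the decoded JSON.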

Let's check the logs

Before we start with the final deployment, let's take a short look at our AWS S3 file system.

There we can now see some additional folders:

>mc.exe ls saps3/data/

[2018-02-15 10:04:34 CET]     0B flowers-2018-02-15t0851z\

[2018-02-15 10:04:34 CET]     0B flowers\

[2018-02-15 10:04:34 CET]     0B jobs\

Now we display the content of our "job id" folder:

mc.exe ls -r saps3/data/flowers-2018-02-15t0851z

[2018-02-15 10:02:31 CET]  12KiB retraining.log

Furthermore, if we take a deeper look at the log file, we get the information about the retraining:

mc.exe cat saps3/data/flowers-2018-02-15t0851z\retraining.log



Scanning dataset flowers ...

Dataset used: flowers

Dataset has labels: ['roses', 'sunflowers', 'tulips']

2228 images are used for training

180 images are used for validation

200 images are used for test

********** Summary for epoch: 0 **********

2018-02-15 09:00:08: Step 0: Train accuracy = 87.5%%

2018-02-15 09:00:08: Step 0: Cross entropy = 0.451392

2018-02-15 09:00:09: Step 0: Validation accuracy = 86.1%% (N=180)

2018-02-15 09:00:09: Step 0: Validation cross entropy = 0.437444

Saving intermediate result.

********** Summary for epoch: 1 **********

2018-02-15 09:00:13: Step 1: Train accuracy = 93.8%%

2018-02-15 09:00:13: Step 1: Cross entropy = 0.291782

2018-02-15 09:00:13: Step 1: Validation accuracy = 92.2%% (N=180)

2018-02-15 09:00:13: Step 1: Validation cross entropy = 0.320360

Saving intermediate result.

.....
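If you want to follow the accuracy over the epochs without scrolling through the whole file, the validation lines can be pulled out with a regular expression. This is a small sketch that assumes the log format stays exactly as shown above (including the doubled "%%"); the function name is my own.

```python
import re

# Matches lines like "Step 1: Validation accuracy = 92.2%% (N=180)".
PATTERN = re.compile(r"Step (\d+): Validation accuracy = ([\d.]+)%")

def validation_accuracy_by_step(log_text):
    """Return a dict that maps the step number to the validation accuracy in percent."""
    return {int(step): float(acc) for step, acc in PATTERN.findall(log_text)}

sample = """\
2018-02-15 09:00:09: Step 0: Validation accuracy = 86.1%% (N=180)
2018-02-15 09:00:09: Step 0: Validation cross entropy = 0.437444
2018-02-15 09:00:13: Step 1: Train accuracy = 93.8%%
2018-02-15 09:00:13: Step 1: Validation accuracy = 92.2%% (N=180)
"""
print(validation_accuracy_by_step(sample))
```

The text for `log_text` can come, for example, from the output of the "mc.exe cat" command above saved to a local file.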

At the end of this file we get the "Summary" of our training:

##########################################

########### Retraining Summary ###########

##########################################

Job id: flowers-2018-02-15t0851z

Training batch size  : 64

Learning rate : 0.001000

Total retraining epochs : 100

Retraining is stopped after 10 consecutive epochs which show no improvement in accurracy.

Epoch with best accuracy : 27

Best validation accuracy : 1.000000

Final test accuracy is : 0.985000

The exported model will predict top 3 classifications

Retraining started at: 2018-02-15 08:57:34

Retraining ended at: 2018-02-15 09:01:59

Restoring parameters from /home/model/interval-model-27

No assets to save.

No assets to write.

SavedModel written to: /home/model/tfs/saved_model.pb

TF Serving model saved.

Retraining lasted: 0:04:25.357850

Model is uploaded to repository with name flowers-demo and version 3.
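The line about stopping "after 10 consecutive epochs which show no improvement" describes what is usually called early stopping. How the service implements it internally is not visible to us, so the following is only a generic sketch of the idea with names of my own choosing: remember the best epoch and stop once `patience` epochs in a row have not beaten it.

```python
def best_epoch_with_patience(accuracies, patience=10):
    """Return (best_epoch, best_accuracy, last_epoch_run) for a list of
    per-epoch validation accuracies, stopping early when `patience`
    consecutive epochs bring no improvement."""
    best_epoch, best_acc = -1, float("-inf")
    for epoch, acc in enumerate(accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop here.
            return best_epoch, best_acc, epoch
    return best_epoch, best_acc, len(accuracies) - 1

# Made-up accuracies: improvement until epoch 2, then a plateau.
history = [0.5, 0.6, 0.7] + [0.7] * 20
print(best_epoch_with_patience(history, patience=3))
```

With this logic the model from the best epoch is the one worth keeping, which fits the "Restoring parameters from /home/model/interval-model-27" line in our log.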

A short explanation of the "Epoch" and "Batch Size" terminology can be found here: link

Epoch: One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE

Batch Size: Total number of training examples present in a single batch.
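To connect the two terms with the numbers from our log: with 2228 training images and a batch size of 64, one epoch needs several batches (iterations). A quick calculation, assuming the last, smaller batch is also processed (whether the service does that is not stated in the log):

```python
import math

def iterations_per_epoch(num_examples, batch_size):
    """Number of batches needed to pass the whole dataset through once."""
    return math.ceil(num_examples / batch_size)

# 2228 training images and batch size 64, as reported in retraining.log.
print(iterations_per_epoch(2228, 64))  # 35
```

So 34 full batches of 64 images plus one final batch with the remaining 52 images make up one epoch.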

In the next blog we will continue the retraining by deploying the model and finally testing and executing our "new" model by adapting the standard "Image Classifier" API.

cheers,

fabian

Helpful Links

SAP Leonardo ML Foundation: https://help.sap.com/viewer/product/SAP_LEONARDO_MACHINE_LEARNING_FOUNDATION/1.0/en-US

Tensorflow flowers dataset: http://download.tensorflow.org/example_images/flower_photos.tgz

Minio Client: https://docs.minio.io/docs/minio-client-quickstart-guide

Tensorflow: https://www.tensorflow.org

ImageNet: http://image-net.org

Fatkun Batch Download Image: https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf...

Epoch vs Batch Size vs Iterations: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9