Introduction

This post continues the story from the last blog, where we covered how to get access to SAP Leonardo ML Foundation and which steps are required before you are allowed to call the APIs.

 

SAP Leonardo ML Foundation Architecture:

 

In the upcoming lines we now want to focus on checking and executing the retraining of the “image” classifier with our own data.

In this blog I want to focus on the doing and not on ML in general. Furthermore, you cannot use the “retraining” functionality with a trial version!

 

Important: Currently only the “Image Classifier Service” can be used for the retraining.

 

Let's start...

 

In general the retraining consists of the following four steps:

  1. Uploading the data for the training
  2. Executing the retraining job
  3. Deploying the model
  4. Executing the image classifier API

 

In detail we want

 

Please also check the SAP Help documentation.

 

Data, data, data

The first thing we need is, of course, some data as the source for training our new model.

Since spring is hopefully not far away, we are just using some nice flower data ;o)

Another (the real) reason is that “TensorFlow” provides an archive for that and we want to start simple.

Anyway, other good resources for getting pictures are of course ImageNet or the Fatkun Batch Download Image plugin for Chrome.

As mentioned before, we start by downloading the flower archive from here to our local device.

A part of this data will be used later for our own “flower” model with the SAP Leonardo ML Image Classification service.

Get started and check the API

A good starting point is simply to enter the “retraining URL” in a browser and have a look at the Swagger UI to get a first idea of which options we have:

In general we have three main parts for the retraining:

  • jobs
  • deployments
  • models

 

Data preparation

Before we can execute one of the APIs we need to prepare our data and upload it to AWS.

To start simple I've decided to reduce the amount of data that comes with the archive provided by TensorFlow. I think three categories of flowers will do.

For this create the following data structure:

+-- flowers
    +-- training
        +--roses
        +--sunflowers
        +--tulips
    +-- test
        +--roses
        +--sunflowers
        +--tulips
    +-- validation
        +--roses
        +--sunflowers
        +--tulips

As documented, we need to structure the data into the three folders “training”, “test” and “validation”.

Furthermore, we split our source data 80-10-10 (~80% training, ~10% test and ~10% validation).
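The split can also be scripted instead of copying files by hand. The following is only a minimal sketch: the function names `split_files` and `build_dataset`, the fixed random seed and the source folder name `flower_photos` are assumptions for illustration and are not part of the SAP documentation.

```python
import random
import shutil
from pathlib import Path


def split_files(files, train_pct=80, test_pct=10, seed=42):
    """Shuffle the files reproducibly and split them into three lists."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_train = len(files) * train_pct // 100
    n_test = len(files) * test_pct // 100
    return {
        "training": files[:n_train],
        "test": files[n_train:n_train + n_test],
        # Everything that is left over goes into validation.
        "validation": files[n_train + n_test:],
    }


def build_dataset(source_dir, target_dir, labels=("roses", "sunflowers", "tulips")):
    """Copy the images of each label into <target>/<split>/<label>/."""
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    for label in labels:
        images = [p for p in (source_dir / label).iterdir()
                  if p.suffix.lower() == ".jpg"]
        for split, members in split_files(images).items():
            out = target_dir / split / label
            out.mkdir(parents=True, exist_ok=True)
            for image in members:
                shutil.copy2(image, out / image.name)
```

Assuming the archive was extracted to a local folder `flower_photos`, calling `build_dataset("flower_photos", "flowers")` would produce the folder structure shown above.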

Access the AWS object store

To get access to the object storage, which runs on Amazon Web Services (AWS), we can use “minio” to operate directly on the S3 object store.

You can get the minio client here: link

Additionally, we can also access the data via a UI.

For this, and also for the CLI access, we first need to initialize our file system (this needs to be done only once) by executing the following API call:

HTTP Method GET
URL <JOB_SUBMISSION_API_URL>
PATH /v1/storage/endpoint
HEADER Authorization (OAuth2 Access Token)

As response we now get something like this:

{
    "access_key": "<access key>",
    "endpoint": "<endpoint>.files.eu-central-1.aws.ml.hana.ondemand.com",
    "message": "The endpoint is ready to use.",
    "secret_key": "<secret key>",
    "status": "Ready"
}
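If you prefer a script over a REST client, the same call can be sketched in Python with the standard library only. This is a sketch under assumptions: the function names are made up for illustration, and sending the OAuth2 access token with the “Bearer” scheme is an assumption that the post does not spell out.

```python
import json
import urllib.request


def build_storage_request(job_submission_api_url, access_token):
    """Build the GET request for the storage endpoint described above."""
    return urllib.request.Request(
        job_submission_api_url.rstrip("/") + "/v1/storage/endpoint",
        # Assumption: the OAuth2 access token is passed as a Bearer token.
        headers={"Authorization": "Bearer " + access_token},
        method="GET",
    )


def fetch_storage_credentials(job_submission_api_url, access_token):
    """Send the request and return the parsed JSON response."""
    request = build_storage_request(job_submission_api_url, access_token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def mc_host_arguments(credentials, alias="saps3"):
    """Turn the response into the arguments of 'mc config host add'."""
    return [
        "config", "host", "add", alias,
        "https://" + credentials["endpoint"],
        credentials["access_key"],
        credentials["secret_key"],
    ]
```

The helper `mc_host_arguments` only assembles the arguments for the minio client from the “endpoint”, “access_key” and “secret_key” fields of the response; it does not run anything.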
The Minio UI

To get access to the S3 store via the Minio UI, enter the URL and log on with the “access key” and the “secret key”:

Afterwards we are able to see our bucket (data) with some data:

The CLI access

For access via the CLI, we again start with the authentication:

>mc.exe config host add saps3 https://<your endon aws s3>.files.eu-central-1.aws.ml.hana.ondemand.com <access key> <secret key>
Added `saps3` successfully.

Afterwards we can use the “mc” command to e.g. list our data (buckets):

mc.exe ls <bucket>/<directory>

Update: Using “cyberduck”

In addition to the previous tools you can also use “cyberduck” to connect to your AWS S3 file system.

Create a new AWS S3 connection by entering the required data:

As a result you can access the data here:

Upload our data

Now it's time to upload our “custom” data, which we want to use for our “retraining”.

The easiest way is to copy our files by executing the cp command:

mc.exe cp -r E:\0_SAPCP\8_ML\1_SAP_ML\0_Development\1_first_try\flowers saps3\data
...bc557236c7_n.jpg:  146.19 MB / 146.19 MB [================================================] 100.00% 484.60 KB/s 5m8s

Afterwards we can see our uploaded data in our AWS S3 bucket:

 

In case something goes wrong, you can also use the following command to delete your data / bucket:

mc.exe rm --recursive --dangerous --force saps3/data

A complete overview of all commands can be found by passing the “--help” parameter:

mc.exe --help

 

Time for the retraining... execute the job

Now that our data is in place for the training, we can call the corresponding API:

Details:

HTTP Method  POST
URL <RETRAIN_API_URL>
PATH /v1/jobs
HEADER Authorization (OAuth2 Access Token)

And the following Body:

{
  "mode": "image",
  "options": {
    "dataset": "flowers",
    "modelName": "flowers-demo"
  }
}

As response we now get the “job id”:

{
    "id": "flowers-2018-02-15t0851z"
}
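The job submission can be sketched in Python as well. Again this is only an illustration: the function names are hypothetical, and both the “Bearer” scheme and the “Content-Type: application/json” header are assumptions on top of what is documented above.

```python
import json
import urllib.request


def build_job_request(retrain_api_url, access_token, dataset, model_name):
    """Build the POST request for /v1/jobs with the body shown above."""
    body = {
        "mode": "image",
        "options": {"dataset": dataset, "modelName": model_name},
    }
    return urllib.request.Request(
        retrain_api_url.rstrip("/") + "/v1/jobs",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Assumptions: Bearer scheme and JSON content type.
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_job(retrain_api_url, access_token, dataset, model_name):
    """Submit the retraining job and return the job id from the response."""
    request = build_job_request(retrain_api_url, access_token, dataset, model_name)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["id"]
```

With the values from this blog the call would be `submit_job(<RETRAIN_API_URL>, <token>, "flowers", "flowers-demo")`.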

 

By executing the corresponding GET method we can retrieve the details and the status of all “jobs”:

or only our new job:

We get a response like this:

{
    "processedTime": "2018-02-15T08:54:31.541131",
    "status": {
        "startTime": null,
        "submissionTime": null,
        "id": "flowers-2018-02-15t0851z",
        "finishTime": null,
        "status": "Pending/Scheduled"
    }
}
{
    "processedTime": "2018-02-15T08:57:34.844304",
    "status": {
        "startTime": "2018-02-15T08:57:33Z",
        "submissionTime": "2018-02-15T08:55:36Z",
        "id": "flowers-2018-02-15t0851z",
        "finishTime": null,
        "status": "Running"
    }
}

And finally, as you can see, it took a while:

{
    "processedTime": "2018-02-15T09:03:15.181445",
    "status": {
        "submissionTime": "2018-02-15T08:55:36Z",
        "id": "flowers-2018-02-15t0851z",
        "startTime": "2018-02-15T08:57:33Z",
        "finishTime": "2018-02-15T09:02:32Z",
        "status": "Succeeded"
    }
}
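Instead of repeating the GET call by hand, the status can be polled until the job has finished. The sketch below makes several assumptions: the path `/v1/jobs/<job id>` for a single job is inferred from the POST path and not confirmed here, the “Bearer” scheme is assumed, and “Failed” is taken as the second final state next to “Succeeded”. All function names are hypothetical.

```python
import json
import time
import urllib.request

# Assumption: these are the final states of a retraining job.
FINAL_STATES = {"Succeeded", "Failed"}


def job_status_request(retrain_api_url, access_token, job_id):
    """Build the GET request for one job (the path is an assumption)."""
    return urllib.request.Request(
        retrain_api_url.rstrip("/") + "/v1/jobs/" + job_id,
        headers={"Authorization": "Bearer " + access_token},
        method="GET",
    )


def fetch_status(retrain_api_url, access_token, job_id):
    """Return the parsed status response of one job."""
    request = job_status_request(retrain_api_url, access_token, job_id)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_job(fetch, job_id, interval_seconds=30, max_polls=60, sleep=time.sleep):
    """Call fetch(job_id) repeatedly until the job reaches a final state."""
    for _ in range(max_polls):
        payload = fetch(job_id)
        # The state sits in the nested "status" object of the response.
        if payload["status"]["status"] in FINAL_STATES:
            return payload
        sleep(interval_seconds)
    raise TimeoutError("job " + job_id + " did not finish in time")
```

`wait_for_job` takes the fetch function as a parameter, so it can be used with `lambda job_id: fetch_status(url, token, job_id)` or with a stub while testing.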

 

Let's check the logs

Before we start with the final deployment, we take a short look at our AWS S3 file system.

And there we can now see some additional folders:

>mc.exe ls saps3/data/
[2018-02-15 10:04:34 CET]     0B flowers-2018-02-15t0851z\
[2018-02-15 10:04:34 CET]     0B flowers\
[2018-02-15 10:04:34 CET]     0B jobs\

Now we display the content of our “job id” folder:

mc.exe ls -r saps3/data/flowers-2018-02-15t0851z
[2018-02-15 10:02:31 CET]  12KiB retraining.log

Furthermore, if we take a deeper look at the log file we get the information about the retraining:

mc.exe cat saps3/data/flowers-2018-02-15t0851z\retraining.log

Scanning dataset flowers ...
Dataset used: flowers
Dataset has labels: ['roses', 'sunflowers', 'tulips']
2228 images are used for training
180 images are used for validation
200 images are used for test
********** Summary for epoch: 0 **********
2018-02-15 09:00:08: Step 0: Train accuracy = 87.5%%
2018-02-15 09:00:08: Step 0: Cross entropy = 0.451392
2018-02-15 09:00:09: Step 0: Validation accuracy = 86.1%% (N=180)
2018-02-15 09:00:09: Step 0: Validation cross entropy = 0.437444
Saving intermediate result.
********** Summary for epoch: 1 **********
2018-02-15 09:00:13: Step 1: Train accuracy = 93.8%%
2018-02-15 09:00:13: Step 1: Cross entropy = 0.291782
2018-02-15 09:00:13: Step 1: Validation accuracy = 92.2%% (N=180)
2018-02-15 09:00:13: Step 1: Validation cross entropy = 0.320360
Saving intermediate result.
.....
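To follow the accuracy over the epochs without reading the whole file, the log lines can be parsed with a small script. This is a sketch that assumes the line format stays exactly as in the excerpt above; the function name `parse_accuracies` is made up for illustration.

```python
import re

# Matches e.g. "Step 1: Validation accuracy = 92.2%% (N=180)".
ACCURACY_LINE = re.compile(r"Step (\d+): (Train|Validation) accuracy = ([\d.]+)%")


def parse_accuracies(log_text):
    """Return {step: {"train": value, "validation": value}} in percent."""
    result = {}
    for match in ACCURACY_LINE.finditer(log_text):
        step = int(match.group(1))
        kind = match.group(2).lower()
        result.setdefault(step, {})[kind] = float(match.group(3))
    return result
```

The text passed to `parse_accuracies` would be the content of retraining.log, for example the output of the “mc cat” command shown above.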

At the end of this file we get the “Summary” of our training:

##########################################
########### Retraining Summary ###########
##########################################
Job id: flowers-2018-02-15t0851z
Training batch size  : 64
Learning rate : 0.001000
Total retraining epochs : 100
Retraining is stopped after 10 consecutive epochs which show no improvement in accurracy.
Epoch with best accuracy : 27
Best validation accuracy : 1.000000
Final test accuracy is : 0.985000
The exported model will predict top 3 classifications
Retraining started at: 2018-02-15 08:57:34
Retraining ended at: 2018-02-15 09:01:59
Restoring parameters from /home/model/interval-model-27
No assets to save.
No assets to write.
SavedModel written to: /home/model/tfs/saved_model.pb
TF Serving model saved.
Retraining lasted: 0:04:25.357850
Model is uploaded to repository with name flowers-demo and version 3.

 

A short explanation of the “Epoch” and “Batch Size” terminology can be found here: link

Epoch: One Epoch is when an ENTIRE dataset is passed forward and backward through the neural network only ONCE

Batch Size: Total number of training examples present in a single batch.
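To make the two terms concrete with the numbers from the log above (2228 training images, batch size 64), here is a small calculation. It is only a sketch of the general arithmetic and assumes that a final, smaller batch is still processed; how the service batches the data internally is not documented here.

```python
import math


def batches_per_epoch(num_examples, batch_size):
    """Return (number of batches in one epoch, size of the last batch)."""
    count = math.ceil(num_examples / batch_size)
    last = num_examples - (count - 1) * batch_size
    return count, last


print(batches_per_epoch(2228, 64))  # prints (35, 52)
```

So under this assumption one epoch over our training data consists of 34 full batches of 64 images plus one batch with the remaining 52 images.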

 

In the next blog we will continue the retraining by deploying the model and finally testing and executing our “new” model by adapting the standard “Image Classifier” API.

 

cheers,

fabian

 

Helpful Links


SAP Leonardo ML Foundation: https://help.sap.com/viewer/product/SAP_LEONARDO_MACHINE_LEARNING_FOUNDATION/1.0/en-US

Tensorflow flowers dataset: http://download.tensorflow.org/example_images/flower_photos.tgz

Minio Client: https://docs.minio.io/docs/minio-client-quickstart-guide

Tensorflow: https://www.tensorflow.org

Image net: http://image-net.org

Fatkun Batch Download Image: https://chrome.google.com/webstore/detail/fatkun-batch-download-ima/nnjjahlikiabnchcpehcpkdeckfgnohf?hl=en

Epoch vs Batch Size vs Iterations: https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9

 

 


9 Comments


  1. Former Member

    Hello Fabian,

    I tried this example in the Canary CF with the standard ml-foundation plan.

    The JOB_SUBMISSION_API_URL for canary cf is: https://training.internalprod.eu-central-1.mlf-aws-prod.com. But when I try <JOB_SUBMISSION_API_URL>/v1/storage/endpoint in the browser or via Postman, I receive the error “This server does not support the training service v1 API. If you are using the cf sapml plugin, please update to the latest version.” I was expecting “access_key” and “secret_key” among other fields in the response. Would you know why this happens?

    Thanks for your help!!!

     

    Thank you and Regards,

    Santosh

     

    1. Former Member

      Hi Santosh

      You are using the wrong URL.

      The URL should be: <IMAGE_RETRAIN_API_URL>/api/v2/image/retraining/storage

      Use the OAuth2 access token as Authorization in the header. This is the same token you got in Fabian's previous blog in this series.

      Thanks

      Biraj Das

      1. Former Member

        Hi Biraj,

        Thank you very much. It works now. I received the access_key, endpoint and secret_key.

        But earlier, as an alternative, I tried the command cf sapml fs config after installing the sapml plugin for cf. The response is identical to that of <IMAGE_RETRAIN_API_URL>/api/v2/image/retraining/storage.

         

        Thank you & Regards,
        Santosh

  2. Former Member

    Hi Fabian,
    great blog! With the 1803 release, SAP Leonardo Machine Learning Foundation offers a new TensorFlow model version (1.3) and an updated Retraining API version (v2). Could you please include this?
    Thanks a lot,
    Hannah

  3. Fabian Lehmann Post author

    Hi Hannah,

    ok, I've noticed there is a change... it's now updated in the BYOM blog.

    I've seen there was a change for the retraining API, but this is not active on our service instance.

    A ticket has just been opened to clarify this.

     

    best,

    fabian

     

    1. Fabian Lehmann Post author

      Update: The new features like the v2 API are currently not available on “productive”.

      SAP told me the 1803 release will be available in one or two weeks.

       

      br,

      fabian

  4. Former Member

    Hello Fabian,

    I have a scenario which requires large volumes of images for retraining. Based on the split suggested for “training-test-validation (80-10-10)” I now have 54k images for training and 6.7k each for test and validation. Using the minio client I was able to copy this data to S3. But when I run the retrain API (internal PROD) “https://mlfinternalproduction-retrain-image-api.cfapps.sap.hana.ondemand.com/api/v2/image/retraining/jobs”, I don't see any exception, but after a while the status is returned as “Failed”. However, the retraining.log doesn't have any information or exception regarding the failure. It has a few low frequency WARNINGs for a few labels. I don't think this is the reason.

    Additionally, it has a few messages of the form “Creating features for test/client image 1 – 64.”

    Is there any recommendation on the number of images to be uploaded “per label”? In some cases, I have only one image of training data for a given label and a maximum of 22k for another label.

    I don't see a “retraining successfully completed” message in the log. Do you have any suggestion on how to troubleshoot this issue?

    Thank you & Regards,

    Santosh

    1. Former Member

      Hi Santosh

      As per the TensorFlow guidelines you should have a minimum of 50 images in each category. Falling short of this causes the retraining of the model to fail.

      Thanks

      Biraj Das

      1. Former Member

        Hi Biraj,

        Thank you for the quick response. I tried maintaining the recommended set of images, but there is still no change in the result. The retrain job runs with no exceptions or success message in the log, just a few info messages, and the GET request on the job id returns the “Failed” status. I have over 20 labels, each with a different count of images like 605, 128, 22351, 905 etc. Could this be the reason for the failure? This might sound strange, but from the logs I believe the retrain job is stopping abruptly with no information to act upon. I also tried different values for the batch size (1024 is the maximum possible), time, memory etc., but there was no change in the result.

         

        Thank you & Regards,

        Santosh
