Easily Manage Large Excel Files with SAP Intelligent RPA’s Cloud Studio
Hello fellow bot builders!
Have you ever worked with very large Excel files (several thousand rows and columns) and wanted to optimize the process of automating them? In this blog post, I will present best practices for doing that through different examples and use cases.
When you need to handle large Excel files (or other data files) in Cloud Studio, you need to optimize your automation. Optimization is often necessary because you can quickly run into memory management problems or timeouts.
Here are some steps to easily manage large Excel files with SAP Intelligent RPA.
|Using the “Excel Cloud Link” activity is not always suitable for large and complex Excel files; you must adjust certain settings, such as the Timeout, to optimize your automation and keep it running fast and uninterrupted.|
Splitting the Excel file line-by-line
The easiest, but not the most efficient, way to read an Excel file is to go through it line by line. For medium-sized files, this lets you read the data without too many constraints. A simple implementation would be the following:
Note: It is important to use a sub-automation as above, to separate the parent job from the unit job that manages the current line.
In this implementation, our automation opens an Excel instance, retrieves the size of the Workbook (number of rows and columns), and loops over each row of the table with an iteration counter.
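To make the control flow concrete, here is a minimal sketch of the same loop in plain JavaScript. The helper names (`getWorkbookSize`, `processRow`) are ours, not part of the irpa_excel package, and the "workbook" is a simple 2D array so the logic can run outside Cloud Studio:

```javascript
// Hypothetical stand-in for the "get size of the Workbook" step: here the
// workbook sheet is a plain 2D array instead of a live Excel instance.
function getWorkbookSize(sheet) {
  return { rows: sheet.length, columns: sheet[0] ? sheet[0].length : 0 };
}

// Parent job: loops over each row with an iteration counter and delegates
// the current line to a "sub-automation" callback.
function processLineByLine(sheet, processRow) {
  const { rows } = getWorkbookSize(sheet);
  const results = [];
  for (let i = 0; i < rows; i++) {
    results.push(processRow(sheet[i], i)); // unit job for the current line
  }
  return results;
}

// Example "sub-automation": sum the numeric cells of one row.
const sumRow = (row) => row.reduce((acc, cell) => acc + cell, 0);

const sheet = [[1, 2], [3, 4], [5, 6]];
console.log(processLineByLine(sheet, sumRow)); // [3, 7, 11]
```

In the real automation, the callback would be the sub-automation that reads or writes the current line, keeping the parent job separate from the unit job.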
Like any automation that interacts with files, the execution time depends on the number of rows in the file. This can become a problem with large files: errors such as Timeouts or Agent disconnections can stop the automation. It is then necessary to choose and manage a specific Timeout for your automations.
Choosing and managing a Specific Timeout for your Automations
Imagine that you have several thousand lines in your Excel file, and you need to read from or write into your Workbook. Each activity and automation has a Timeout, and reading and processing many lines may take some time. Thus, you will often need to modify the default Timeout of your automation.
|By default, the maximum timeout for the full automation is 600 seconds (10 minutes) and 30 seconds for each asynchronous activity.|
For your automation, you will be able to edit this Timeout duration directly from the Info section on the right-hand side panel. You have the possibility to change the maximum duration of the entire automation, as well as the duration of asynchronous processes.
It is important to measure the maximum duration not only of the main automation but also of its sub-automations, either in simulation or with unit tests.
Why does the Timeout value depend on your project needs?
Each project has its specific needs, and Timeout management must depend on them. Four factors must be considered: time, infrastructure power, file specificity, and the number of available Agents.
The worst-case scenario is the highest number of rows you will ever need to process. Once you have created your use case processing automation, you should be able to estimate how long it will take to handle a full file. Always choose a value higher than the worst case to avoid a Timeout error, and test regularly.
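The sizing rule above can be expressed as a small helper. This is an illustrative calculation, not an SAP API: measure the per-row duration in a unit test, then configure the Timeout from the worst-case row count plus a safety margin:

```javascript
// Illustrative sizing helper: estimate the Timeout value to configure from
// the worst-case row count and a measured per-row processing time.
function estimateTimeoutSeconds(worstCaseRows, secondsPerRow, safetyFactor = 1.5) {
  // The safety factor keeps the configured Timeout above the worst case.
  return Math.ceil(worstCaseRows * secondsPerRow * safetyFactor);
}

// e.g. 5,000 rows at ~0.2 s each: 1,000 s worst case, 1,500 s configured.
console.log(estimateTimeoutSeconds(5000, 0.2)); // 1500
```

If the estimate exceeds the 600-second default, that is your signal to raise the automation's Timeout in the Info panel before running on real data.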
Splitting the Excel file into different blocks of a variable size
If the needs of your project mean that managing your files line by line is not a viable solution, there is a method that allows you to optimize your bot: splitting the file into blocks.
Once your Timeout is correctly configured as described in the previous step, you can start to build your main automation from the Cloud Studio using Excel Application activities (irpa_excel package), as for the line-by-line method. Our teams have validated this approach in tests with one million cells in Excel.
|As previously explained, the “Excel Cloud Link” Activity is not optimized for handling and mapping large Excel files. Learn more about Excel Data Mapping on the SAP Intelligent RPA Documentation.|
A simplified diagram of what you need to do will be, for example, the following:
- Open the Excel Instance
- Open the Workbook from the Excel Instance
- Save the size of the Columns and Rows
- Get the number of blocks you want (for example, with an input)
- Browse your file by blocks using a loop
- Do what you want to do on it with a sub-automation (read / write)
- Add a new Job for each block (and sub-automation) to increase performance (see the Optimizing Automation by adding new Jobs simultaneously section)
- Close the Excel Instance
For example, with a Cloud Studio automation we can represent this method as follows:
|Example: You have an Excel file of 50,000 lines, and you decide to split it into 500 blocks of 100 lines each. You will then have a loop of 500 iterations instead of 50,000, which will increase the performance of your automation. Estimating the automation execution time is very important: if this time exceeds a certain value, a Timeout occurs, and you will have to take this into account.|
This represents a significant time saving on Excel files with a lot of data. However, launching so many processes one after the other is not necessarily the best possible optimization for your automation. A complement is the addition of simultaneous Jobs; see the section Optimizing Automation by adding new Jobs simultaneously for more information.
Optimizing Automation by adding new Jobs simultaneously with multiple Agents
When your automation is ready, let’s see how to optimize the operations you do on the Excel file.
As seen previously, your Excel automation splits your file into different blocks of data (in this case, the rows of your Excel file). These rows are processed block by block for reading or writing data, but a solution that would considerably optimize your automation is to launch Jobs simultaneously.
A Job (the operation you perform on your file) can be represented by a sub-automation that processes the data of the selected block. By launching several Jobs at the same time, your blocks are processed in parallel, independently of one another. You also no longer have to wait for one iteration to finish before starting the next.
To trigger a sub-automation as a new Job, you can simply drag and drop the name of your automation from the Tools/Automations section on the right-hand side panel and check the box “Trigger as a new job”. This will automatically launch the job in parallel with the others.
|Cloud Studio integrates a series of Activities in the Monitoring category. They allow you to measure the performance of your bot with a Timer to be integrated into the steps of the automations and sub-automations. You can consult the Business Activity Monitoring documentation for more information.|
The good practice with this method is to use several Virtual Machines (with an agent on each VM) at the same time. A main file shared across all agents and VMs lets users track which lines have been processed in real time.
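One simple way to spread the blocks over several agents is a round-robin assignment. The sketch below is illustrative (agent and block identifiers are hypothetical, not a Cloud Studio API) and shows how each agent ends up with a balanced share of the work:

```javascript
// Illustrative sketch: assign block indexes to agents round-robin so each
// VM/agent receives a balanced share of the blocks to process.
function assignBlocksToAgents(blockCount, agentCount) {
  const assignments = Array.from({ length: agentCount }, () => []);
  for (let block = 0; block < blockCount; block++) {
    assignments[block % agentCount].push(block);
  }
  return assignments;
}

// 10 blocks spread over 3 agents: shares of 4, 3, and 3 blocks.
console.log(assignBlocksToAgents(10, 3).map((a) => a.length)); // [4, 3, 3]
```

Each agent then only triggers Jobs for its own block list, and the shared main file records which blocks are done.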
For more information about the Unattended Mode and the configuration of Virtual Machines, you can consult the documentation How to Set Up an Unattended Bot.
|Using a large number of VMs can overload the application you are working on, and also increases machine operating costs and environmental impact. Be sure to find the right balance for your infrastructure.|
How to manage concurrent access in Write mode?
Using different Agents working at the same time has its limitations: depending on your environment architecture, there may be simultaneous-access issues. These problems appear when you try to write at the same time to the same file in a directory shared between several Agents, such as multiple VMs. These limitations are caused by Windows, which does not allow several processes to write to the same file simultaneously.
However, this is not a problem in Read mode, and you can find solutions by, for example, importing the data from the shared file into a local file.
One More Thing
The management of large Excel files in your automations requires upstream thinking to adapt perfectly to your needs. Asking the question “What do I want to do with the file?” is very important. Do you want to modify it in real time? Retrieve data for further processing? Share data processing with multiple Agents? Every need is different and requires setting up your automation in a certain way, and that’s what we’ve seen here.
The example presented here, with the block-splitting method, is available on the Store; you can download and test it on your own project. Feel free to comment on this blog post and watch the How to Manage Large Excel Files Tutorial Video on the SAP Process Automation YouTube Channel.