SAP Intelligent RPA – Multi-Format PDF Extraction
Co-Authored with Prasanth Padmanabhan Menon
Introduction and Business Requirement
Many businesses have manual processes to handle transformation of data from PDF into their ERP systems. Daily, these businesses receive numerous emails containing PDF documents with information that must be read, transformed, and brought into their ERP systems. These could include Supplier Invoices, Bank Details, Receipts, Credit Memos, among others. For the large volume of information in these documents to make its way into a system, an enormous amount of time investment, resources and repetitive manual effort is necessary.
With SAP Intelligent RPA, this can now be automated and made seamless. To support data extraction from PDF and transformation into ERP (SAP ECC, SAP S/4HANA), a configurable framework which handles multiple PDF formats, that supports multiple transformation approaches, and that provides scalability to cater to multiple outcomes has been built.
The next few sections of this blog post will detail the Custom Bot framework, its architecture and one working business scenario where this has been successfully used.
Overview of Configurable Data Extraction Architecture:
This custom bot is built on a five-layer configurable, scalable and adaptable architecture. These five layers are:
- Data layer: identifies the type of PDF
- Transformation layer: routes to the right transformation approach
- Configuration Layer: allows for customer specific key field and REGEX mappings
- Service consumption layer: calls different transformation mechanisms
- Deployment Layer: builds payload and calls API to interact with SAP S/4HANA
How the Bot works
From an end users’ perspective, the automation of PDF extraction and transformation is a seamless experience. Irrespective of the Input format, transformation mechanism or output deployment method, the end user will only have to schedule the Bot to run from the cloud factory and the framework takes care of the rest of the routing.
The simple process flow from the end users’ perspective for all bot runs is shown below:
Working Example: Scenario for Invoice Processing via Multi-Format PDF Extraction
Let us take an example of vendor invoices received in two different PDF formats from suppliers via email. Information from these PDF files will be read by the bot from the email and transformed into SAP S/4HANA as supplier invoices via an API.
Firstly, the end user performs just one step of Scheduling the BOT on cloud factory.
Subsequently, the framework takes over and the following steps explain how the bot works using all the different layers:
Data Layer: Identification of input format: The framework can differentiate Machine Readable (Digital) PDF from Image PDF. As a first step, the email is read, the file format bifurcation is done, and the files are downloaded into a local folder.
The following two input files have invoice relevant data in different formats. Also, the terminology used for data labels like purchase order, date or amount are different in both files as well. These two different invoices can be easily handled by the framework.
Transformation Layer: Depending on whether we are dealing with a Machine-Readable PDF or Image PDF, the framework starts the RPA workflow routing accordingly.
Configuration Layer: To support different PDF formats and different terminology for labels, we have a configuration file. The following configurations can be used to ensure the bot picks up customer specific data more accurately
- Attribute Configuration: Key-value mapping to ensure we pick the data irrespective of terminology
- REGEX Configuration: Regular expression configuration to pick the data more accurately in a recursive manner. This can be scaled to a large extent depending on the different business objects being used. For the example here, invoice related information is being extracted using different REGEX lookup options which loops through each attribute and applies the regex pattern check to retrieve the most accurate values
Service Consumption Layer: After the data extraction, the framework is now ready with a raw data set that has to be interpreted. Here again, depending on Machine Readable PDF or Image PDF the approach is different.
- Image PDF: For these types of PDF’s, an Optical Character Recognition (OCR) approach must be used. The framework currently uses either open source Tesseract or SAP Document Information Extraction service to process OCR outputs. This is again scalable to other OCR solutions. The output of the OCR transformation is raw text content which is put through the REGEX lookup again to retrieve the correct attributes in. json format.
Deployment Layer: The .json output from the service layer will then be used for building the payload to call the API that will create supplier invoices in SAP S/4HANA Cloud. For this example, supplier invoice API has been considered, however, this is adaptable for other business objects and for usage with other API’s as well.
In this way, with a single framework, multi-format PDF extraction can be used for process automation of different business objects in ERP. This current example is deployed for the latest version of SAP S/4HANA Cloud and it is adaptable for use with SAP S/4HANA On-premise and SAP ECC as well.
If you would like see this in action or require more information about consuming this framework, please get in touch with: SAP Intelligent RPA – Product Success APJ