Author: Will Conlon

Part 3 – Perform OCR on a PDF using a microservice hosted on SAP BTP, Kyma Runtime

Get creative using SAP Business Technology Platform, Kyma Runtime! Part 3

If you read Part 1 of this blog series, you’ll have seen how I built a simple frontend user interface with the Flask Python package that lets the user select a file and upload it to a container running in an SAP BTP, Kyma runtime pod. In Part 2, I upgraded the frontend by leveraging the low-code/no-code solution SAP AppGyver.

In Part 3, I’ll share an example of what this now makes possible: performing OCR (Optical Character Recognition) on the uploaded file and bringing the extracted information back to the SAP AppGyver frontend, which can deliver enormous business value. This involves extending the code written in Part 1 and Part 2 with a new Python file that performs the OCR on the uploaded file. I won’t give a detailed rundown of running this locally, but simply put: get the Docker container running locally to test it before deploying to SAP BTP, Kyma runtime.

Overview

I’ve created a completely fictitious form and companies for the purposes of this example, as seen in Figure 1. The motivation behind this use case is that organizations often find themselves with huge amounts of unorganized documents containing important information that, if extracted, can be used within an ERP context, with especially high-value uses in analytics, compliance and business processes.

Figure 1. OCR and annotation of mock form to extract specific data

The Python code from app.py in Part 1 gets a slight uplift: it imports some additional dependencies and triggers the OCR function when a POST request containing a PDF file is received.

import json
import docExtraction
import os
from flask import Flask, request
from werkzeug.utils import secure_filename

UPLOAD_FOLDER = 'uploadFolder/'
ALLOWED_EXTENSIONS = {'txt', 'pdf', 'png', 'jpg', 'jpeg', 'gif'}

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = UPLOAD_FOLDER

def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/upload/', methods=['GET', 'POST'])
def upload_file():
    if request.method == 'POST':
        # check if the post request has the file part
        print('File type is ' + str(request.files), flush=True)

        if 'file' not in request.files:
            print('No file part', flush=True)

            # message for appgyver alert
            return 'No file part'

        file = request.files['file']
        # If the user does not select a file, the browser submits an
        # empty file without a filename.
        if file.filename == '':
            print('No selected file', flush=True)

            # message for appgyver alert
            return 'No selected file'

        if file and allowed_file(file.filename):

            # create a secure filename
            filename = secure_filename(file.filename)

            # save file to the upload folder (uploadFolder/)
            filepath = os.path.join(app.config['UPLOAD_FOLDER'], filename)
            file.save(filepath)
            print('PDF saved to directory: ' + str(UPLOAD_FOLDER), flush=True)

            # call OCR function in docExtraction.py
            rois = docExtraction.process(filepath)
            print('OCR complete and saved to directory: ' + str(UPLOAD_FOLDER), flush=True)

            # json response
            return json.dumps(rois, sort_keys=True, indent=4)

        # message for appgyver alert
        print('The file name was not allowed.', flush=True)
        return ('The file name was not allowed.')

    return '''
    <!doctype html>
    <title>Upload new File</title>
    <h1>Upload File (Test View) </h1>
    <form method=post enctype=multipart/form-data>
      <input type=file name=file>
      <input type=submit value=Upload>
    </form>
    <p>This is a test page to test a file upload without using frontend</p>
    '''

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Note that I’ve imported docExtraction, a Python module (docExtraction.py) that contains the OCR logic used to process the document.
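One thing to keep in mind with the route above: ALLOWED_EXTENSIONS still accepts image and text files, but docExtraction.process hands the saved path to pdf2image, which expects a PDF. A minimal sketch of an extra guard, assuming you only want PDFs to reach the OCR step, is to check the leading bytes of the upload for the %PDF- signature. The helper name looks_like_pdf is my own placeholder and is not part of Flask or the code above.

```python
import io

PDF_SIGNATURE = b"%PDF-"

def looks_like_pdf(stream):
    """Return True if the binary stream starts with the PDF signature.

    `stream` is any seekable binary file object, for example the
    `stream` attribute of an uploaded Werkzeug FileStorage.
    This is a strict check: it assumes the signature sits at the very
    start of the file. The read position is restored afterwards so the
    file can still be saved.
    """
    start = stream.tell()
    header = stream.read(len(PDF_SIGNATURE))
    stream.seek(start)
    return header == PDF_SIGNATURE

# quick check with in-memory data instead of a real upload
print(looks_like_pdf(io.BytesIO(b"%PDF-1.7 example bytes")))  # True
print(looks_like_pdf(io.BytesIO(b"plain text file")))         # False
```

In the route, a call such as looks_like_pdf(file.stream) could sit next to allowed_file(file.filename), so that anything else gets the "not allowed" message instead of failing inside the OCR step.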

Optical Character Recognition

I’d like to note that there are lots of superb ways of performing OCR, and of particular note is SAP’s Document Information Extraction, which is part of the AI Business Services portfolio. The example that follows could be used where there is a specific reason to do so. The most common example I’ve seen is where there is an industry or document standard and different organizations or departments create forms that adhere to that standard BUT DO NOT share the same document structure, ESPECIALLY where documents are manually scanned, causing rotation and degradation of quality!

There are 3 main packages used in my program to run the solution on top of what I’ve built in previous parts of the blog series. These are:

  • OpenCV – A cross-platform, open source, highly optimized library focused on real-time computer vision applications. It is free for commercial use; OpenCV 4.5.0 and later are released under the Apache 2.0 License (earlier versions used the BSD 3-Clause License). I use the opencv-python wrapper package, which provides the OpenCV Python bindings under the MIT License. Note – I’d likely consider the headless option (opencv-python-headless) for production usage, which strips away dependencies that wouldn’t be needed.
  • pdf2image – A Python module under the MIT License that (quite simply) converts a PDF to PIL Image objects so I can process them with OpenCV above. It literally turns something complex into a single line of code.
  • Tesseract – An open source text recognition (OCR) engine available under the Apache 2.0 License. I use the pytesseract Python wrapper to read the text from the image I extract with pdf2image above. Again, this turns something complex into a single line of code, even though I then have to filter through the text later to find what I want.

So now I’ll put these components together in a python file called docExtraction.py as follows:

import pdf2image
import cv2
import numpy
import datetime
import pytesseract

def process(filepath):

    # set up regions of interest: [name, x, y, w, h, extracted text (blank until getOCR)]
    rois = [['CompanyName', 1450, 2380, 2250, 88, ''],
            ['ACNARBN', 1450, 2475, 2250, 88, ''],
            ['Address1', 1450, 2570, 2250, 88, ''],
            ['TownCity', 1450, 2750, 1250, 88, ''],
            ['State', 1450, 2848, 650, 84, ''],
            ['Postcode', 2815, 2848, 600, 84, ''],
            ['Country', 1450, 2942, 2250, 85, ''],
            ['Phone', 1450, 3046, 650, 85, ''],
            ['Email', 1450, 3155, 2250, 88, ''],
            ['BlockList', 1650, 1945, 2050, 88, ''],
            ['Date', 1650, 5120, 500, 90, '']]

    # declare var to hold images
    images = []

    # add every page in pdf as an image
    images.extend(list(map(lambda image: cv2.cvtColor(numpy.asarray(image), code=cv2.COLOR_RGB2BGR),
                           pdf2image.convert_from_path(filepath, dpi=500))))

    # if more than 1 page in pdf, then add loop e.g. for i in range(len(images)):
    # since my example is a single page I'll only look at page 1, i.e. images[0]
    images[0] = draw_border(images[0])

    # extract text in regions of interest and add to our ROIS.
    images[0], rois = getOCR(images[0], rois)

    values = []

    for r in range(len(rois)):
        values.append([rois[r][0], rois[r][5]])

    return values

def getOCR(image, rois):

    for i in range(len(rois)):

        # set local variables for  region of interest rectangle
        x, y, w, h = rois[i][1], rois[i][2], rois[i][3], rois[i][4]

        # create new local image with just region of interest
        image_roi = image[y:y+h, x:x+w]

        # convert colour region of interest to grayscale
        gray = cv2.cvtColor(image_roi, cv2.COLOR_BGR2GRAY)

        # get the text from region of interest
        rois[i][5] = pytesseract.image_to_string(gray)

        # draw regions of interest on original image
        cv2.rectangle(image, (x, y), (x + w, y + h), (241, 196, 15), 2)

    # write the annotated image once, after all regions have been drawn
    cv2.imwrite("uploadFolder/Output.png", image)

    return image, rois

def draw_border(image):
    hImg, wImg, _ = image.shape

    cv2.line(image, (0, 100), (int(wImg), 100), (34, 126, 230), 5)
    cv2.line(image, (int(wImg) - 100, 0), (int(wImg) - 100, int(hImg)), (34, 126, 230), 5)
    cv2.line(image, (int(wImg), int(hImg) - 100), (0, int(hImg) - 100), (34, 126, 230), 5)
    cv2.line(image, (100, int(hImg)), (100, 0), (34, 126, 230), 5)

    # datetime object containing current date and time
    now = datetime.datetime.now()
    # dd/mm/YY H:M:S
    dt_string = now.strftime("%d/%m/%Y %H:%M:%S")

    cv2.putText(image, "Processed - " + dt_string, (140, 80), cv2.FONT_HERSHEY_PLAIN, 4, (34, 126, 230), 4)

    return image

Once a file is uploaded, the process function is called with the path to the file (now in the SAP BTP, Kyma runtime pod). Inside it, I first declare my regions of interest (ROIs). In this case they are hard-coded, and this is where you could exit into your own code to define the logic around the ROIs in your own document and how to determine them. NOTE – It can make sense for many use cases to run OCR first to find key values, horizontal lines, boxes, date formats etc. and use these elements to define where the ROIs should be!
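To make the idea in the note a little more concrete, here is a rough sketch of deriving an ROI from the position of a printed label rather than hard-coding it. It assumes you have already run pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT) over the page, which returns a dict of parallel lists including 'text', 'left', 'top', 'width' and 'height'. The function name roi_right_of_label, the gap and width values, and the sample data are all made up for illustration.

```python
def roi_right_of_label(ocr_data, label, gap=20, width=2250):
    """Return [x, y, w, h] for the area to the right of the first word
    matching `label`, or None if the label was not recognised.

    `ocr_data` is a dict of parallel lists with the keys 'text', 'left',
    'top', 'width' and 'height' (the shape of pytesseract's DICT output).
    `gap` and `width` are illustrative pixel values, not tuned ones.
    """
    for i, word in enumerate(ocr_data["text"]):
        if word.strip().lower() == label.lower():
            x = ocr_data["left"][i] + ocr_data["width"][i] + gap
            y = ocr_data["top"][i]
            h = ocr_data["height"][i]
            return [x, y, width, h]
    return None

# hand-written stand-in for real image_to_data output
sample = {
    "text": ["", "Phone", "Email"],
    "left": [0, 900, 900],
    "top": [0, 3046, 3155],
    "width": [0, 300, 280],
    "height": [0, 85, 88],
}

print(roi_right_of_label(sample, "Email"))  # [1200, 3155, 2250, 88]
print(roi_right_of_label(sample, "Fax"))    # None
```

The returned list has the same x, y, w, h order as the hard-coded entries, so it could feed the same cropping step. Multi-word labels and scanned pages with rotation would need more work than this.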

Each page of the PDF is then converted to an image and added to a list. The first page is given a border and timestamp (via the draw_border function using OpenCV) and sent to my getOCR function, so that ONLY the regions of interest undergo text recognition using the pytesseract image_to_string() function. The extracted text is then written into the ROI list, and the field names and values are returned to app.py and ultimately sent as the JSON response to the POST request.
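The strings that come back from image_to_string() typically carry trailing whitespace such as a newline and, depending on the Tesseract version, a form-feed character as well, so it can be worth tidying them before they go back to the frontend. A minimal sketch, assuming that collapsing all whitespace is acceptable for single-line fields like these. clean_ocr_text is a hypothetical helper name and the sample strings are invented.

```python
def clean_ocr_text(raw):
    """Collapse runs of whitespace (including newlines and form feeds)
    into single spaces and trim both ends."""
    return " ".join(raw.split())

print(repr(clean_ocr_text("ACME Pty Ltd\n\x0c")))        # 'ACME Pty Ltd'
print(repr(clean_ocr_text("  12  Example\nStreet \n")))  # '12 Example Street'
```

In getOCR this could wrap the call, for example rois[i][5] = clean_ocr_text(pytesseract.image_to_string(gray)). For multi-line fields you may prefer to keep the line breaks.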

Running this locally and uploading through the basic HTML test page (instead of AppGyver), as seen in Part 1, should give a JSON response similar to Figure 2. NOTE – If you want to configure this locally, don’t forget to add the necessary environment variables!

Figure 2. Extracted OCR information from uploaded file returned in JSON format.
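Because process returns a list of [name, text] pairs, the JSON response is an array of two-element arrays in the same order as the ROIs. The sketch below shows that shape next to an alternative object keyed by field name, which a client could read without relying on positions. The values are made-up placeholders and not the contents of Figure 2.

```python
import json

# stand-in for the list returned by docExtraction.process()
values = [["CompanyName", "Example Co"], ["Postcode", "0000"]]

# what app.py returns today: a JSON array of [name, text] pairs
as_pairs = json.dumps(values, sort_keys=True, indent=4)

# alternative: a JSON object keyed by field name
as_object = json.dumps(dict(values), sort_keys=True, indent=4)

print(json.loads(as_pairs)[1][1])         # 0000
print(json.loads(as_object)["Postcode"])  # 0000
```

Switching to the keyed object would also mean changing how the frontend reads the response, so treat it as an option rather than something the rest of this post depends on.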

Containerize Application and Deploy

Now I can containerize this and run it on SAP BTP, Kyma runtime. However, my container is going to need some additional commands and requirements to facilitate these new OCR features. The following Dockerfile works well, though I’m admittedly not an expert here, so it’s likely the container could be thinner and more secure:

FROM ubuntu:18.04

ENV DEBIAN_FRONTEND=noninteractive

WORKDIR /program

RUN apt-get update \
  && apt-get -y install tesseract-ocr \
  && apt-get install -y python3 python3-distutils python3-pip \
  && cd /usr/local/bin \
  && ln -s /usr/bin/python3 python \
  && pip3 --no-cache-dir install --upgrade pip \
  && rm -rf /var/lib/apt/lists/*

RUN apt-get update \
  && apt-get install ffmpeg libsm6 libxext6 poppler-utils -y
RUN pip3 install pytesseract
RUN pip3 install opencv-python
RUN pip3 install pillow

COPY . .
RUN pip3 install -r requirements.txt

EXPOSE 5000

CMD ["python3", "./app.py"]

The requirements.txt also needs the additional components:

Flask~=2.0.3
Werkzeug~=2.0.3
pdf2image~=1.16.0
opencv-python~=4.6.0.66
numpy~=1.19.5
pytesseract~=0.3.8

To deploy this to SAP BTP, Kyma runtime, I follow the same steps as in Part 1 but with some minor updates to the YAML file, mostly giving additional storage and memory now that we’ve got a heavier container with OpenCV, Tesseract and pdf2image.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: documentprocessing
spec:
  selector:
    matchLabels:
      app: documentprocessing
  replicas: 1
  template:
    metadata:
      labels:
        app: documentprocessing
    spec:
      containers:
      - env:
        - name: PORT
          value: "5000"
        image: /documentprocessing  # replace  with your Docker Hub account name
        name: documentprocessing
        ports:
        - containerPort: 5000
        resources:
          limits:
            ephemeral-storage: 2048M
            memory: 2048M
          requests:
            cpu: 100m
            ephemeral-storage: 2048M
            memory: 2048M
---
apiVersion: v1
kind: Service
metadata:
  name: documentprocessing-service
  labels:
    app: documentprocessing
spec:
  ports:
  - name: http
    port: 5000
  selector:
    app: documentprocessing
---
apiVersion: gateway.kyma-project.io/v1alpha1
kind: APIRule
metadata:
  name: documentprocessing-api
  labels:
    app: documentprocessing
spec:
  gateway: kyma-gateway.kyma-system.svc.cluster.local
  rules:
  - accessStrategies:
    - handler: allow
    methods:
    - GET
    - POST
    path: /.*
  service:
    host: documentprocessing-subd-node..kyma.ondemand.com # replace  with the values of your account
    name: documentprocessing-service
    port: 5000

This provides a deployment, a service and an API rule for the application. If you’re looking to deploy something similar, remember to add your own host and cluster details to avoid issues. This can be tested in the same way as shown in Part 1, using the base URL (without SAP AppGyver) and the HTML test page returned by Flask.
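If you would rather test the deployed endpoint from a script than from the HTML page, the sketch below builds the same kind of multipart/form-data POST using only the Python standard library. The function name build_upload_request and the example.invalid host are placeholders of mine; swap in the host from your own API rule. Nothing is sent until urlopen() is called.

```python
import urllib.request
import uuid

def build_upload_request(url, filename, content):
    """Build a POST request that sends `content` as the form field 'file',
    which is the field name the Flask route reads from request.files."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        b"--" + boundary.encode(),
        ('Content-Disposition: form-data; name="file"; '
         f'filename="{filename}"').encode(),
        b"Content-Type: application/pdf",
        b"",
        content,
        b"--" + boundary.encode() + b"--",
        b"",
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# placeholder URL and dummy bytes; read a real PDF with open(path, "rb").read()
req = build_upload_request("https://example.invalid/upload/", "form.pdf", b"%PDF-1.4 ...")
print(req.get_method())  # POST

# uncomment to actually send the request and print the JSON text
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```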

Updates to the SAP AppGyver frontend

So now that I’ve got the backend working, I’d like to improve upon the frontend. The upload functionality shown in Part 2 allows me to upload the document, but there is far more business value to be unlocked by bringing the extracted OCR data back to the frontend application. The process is quite straightforward: when a document is uploaded, it triggers docExtraction.py, which populates the region of interest list (ROIs) and returns the result as JSON. This JSON is parsed and populated into text fields in the SAP AppGyver application.

The low-code JavaScript flow object from Part 2 needs to be extended to use response.json(), which returns a promise that resolves with the result of parsing the body text as JSON. Since I know the expected JSON structure, I can easily return each element from my custom JavaScript to an SAP AppGyver page variable (called ‘List’), as seen below in Figure 3.

Figure 3. SAP AppGyver flow for upload and page variable structure

NOTE – Everything below the component flow in Figure 3 is from the variable view and displayed in this way to simply demonstrate the variable structure.

The JS Upload File code is updated as follows:

//  Goal is to take the output of the 'Pick Files' flow function, submit it to the Flask route using multipart/form-data encoding, and populate the Page Variable with the parsed JSON response.

//  Declare 2 inputs.
//  - First is the endpoint URL where we want to upload our file. We've hard coded this value.
//  - Second is the file with its 6 object properties from the output of 'Pick Files'
let { url, file } = inputs

//  Get the path of the file selected in 'Pick Files'. Note - Only allowed single file upload so hard-coded to first i.e. file[0]
let path = await fetch(file[0].path)
console.log('Path:' + path)

//  Transform path into blob
let blob = await path.blob()

//  Declare the form we'll submit with the payload
const formData = new FormData()

//  Append upload details into the formData
formData.append('file', blob, file[0].name)

try {

  // POST the formData and parse the text/html (utf-8) response into JSON format
  const response = await fetch(url, { method: 'POST', body: formData }) 
  const parsed = await response.json();

  return [0, { 
    CompanyName: parsed[0][1], 
    ACNARBN: parsed[1][1],
    Address1: parsed[2][1],
    TownCity: parsed[3][1],
    State: parsed[4][1],
    Postcode: parsed[5][1],
    Country: parsed[6][1],
    Phone: parsed[7][1],
    Email: parsed[8][1],
    BlockList: parsed[9][1],
    Date: parsed[10][1]
    } ]

} catch  {

  return [0, { result: "File Error" } ]
}

The last step here is binding the output of the JS Upload File flow to the page variable. As seen in an earlier step, the page variable ‘List’ is an object with text components for each member of the dataset and needs to match the output of JS Upload File as seen in Figure 4, ensuring that the page variable is populated as “Object with properties”. Once the page variable is populated, the data extracts can be used as Content to fill text components.

Figure 4. Binding configuration between JS Upload File and Page Variable

Testing upload, data extraction and display in frontend

Now I can try uploading a file and get a response from the backend to populate my SAP AppGyver table in the frontend. I’ve used the web app frontend for SAP AppGyver on an Android device for the example seen in Figure 5.

Figure 5. End-to-end test of file upload with OCR and response in frontend

This gives me quite a lot of flexibility to solve real-world business problems by using OCR on files to extract key-value pairs. Note that this is an extremely simple example, which can now be built upon using OpenCV and Tesseract in Python.

I’d certainly recommend looking into SAP’s Document Information Extraction solution in the first instance for use cases such as these since it has a very (very!) fast time to value for many requirements.

With data now flowing to and from SAP AppGyver and SAP BTP, Kyma runtime, a next step could be sending the PDF document and the annotated OCR image showing the ROIs to persistent storage, and posting the extracted data to a database for downstream processing or analytics.

If you have found this helpful, or have some feedback to share on this topic, please leave a comment and follow my profile for future posts.


For more information please see:

SAP Business Technology Platform Topic Page
Ask questions about SAP Business Technology Platform
Read other SAP Business Technology Platform articles and follow blog posts

There are other links embedded in the Blog which may be of use.
