How to access NFS shares from within SAP Data Hub ...

jens_rannacher · ‎03-15-2018

Network File System (NFS) is a distributed filesystem protocol that is commonly used to share files over the network. It enables users to mount remote directories on their servers and access the remote files in the same way local storage is accessed.

This tutorial explains how to access data stored on an NFS share from within SAP Data Hub Pipelines (on-premise).

Overview

The process to achieve this is as follows:

1. Create a Persistent Volume (PV) for the NFS share in your Kubernetes cluster.

2. Create a Persistent Volume Claim (PVC) in your Kubernetes cluster that claims the PV from step 1.

3. Create an SAP Data Hub Pipeline with a File Consumer operator that reads from a local path.

4. Add the File Consumer to an Operator Group and specify a mount point for the NFS volume within the group that matches the local path from step 3.

NFS file share

To follow the steps in this tutorial, you need a running NFS server that exports a share with read/write permissions. For illustration purposes, we use

NFS Server Hostname: nfs-server-host

NFS Remote Directory: /remote_dir

in all commands. Make sure that you replace the hostname and the remote directory with your own NFS settings.

For demo purposes, we have placed two files in our remote directory:

[root@nfs-server-host remote_dir]# ls /remote_dir/
file1.txt  file2.txt

1. Create an NFS-based Persistent Volume

During runtime, the SAP Data Hub Pipeline runs Pipeline Operators as processes in Pods (groups of one or more containers) in the Kubernetes cluster. That means, to access data that is stored on an NFS share from within an Operator, the NFS share must be mounted in the corresponding Pod.

An NFS Volume (https://kubernetes.io/docs/concepts/storage/volumes/#nfs) allows an existing NFS share to be mounted into a Pod and this can be managed by the Kubernetes PersistentVolume (PV) API (https://kubernetes.io/docs/concepts/storage/persistent-volumes/):

Save the following PersistentVolume definition to a file, for example, nfs-pv.yaml and replace the server and the path with your NFS share details accordingly:

kind: PersistentVolume
apiVersion: v1
metadata:
  name: nfs-share-pv
spec:
  capacity:
    storage: 10Gi
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server-host
    path: /remote_dir
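If you manage several shares, the manifest can also be generated instead of edited by hand. The following is a minimal sketch and not part of SAP Data Hub or kubectl: the helper name render_pv and its parameters are hypothetical, and the field values simply mirror the definition above.

```python
from string import Template

# Template mirroring the PersistentVolume definition from this tutorial.
PV_TEMPLATE = Template("""\
kind: PersistentVolume
apiVersion: v1
metadata:
  name: $name
spec:
  capacity:
    storage: $size
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  nfs:
    server: $server
    path: $path
""")


def render_pv(server, path, name="nfs-share-pv", size="10Gi"):
    """Return the PV manifest as YAML text (hypothetical helper)."""
    if not path.startswith("/"):
        raise ValueError("NFS export path must be absolute")
    return PV_TEMPLATE.substitute(name=name, size=size, server=server, path=path)
```

Calling print(render_pv("nfs-server-host", "/remote_dir")) prints the manifest text, which can then be saved as nfs-pv.yaml.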

Create the PersistentVolume, for example with kubectl. PersistentVolumes are cluster-scoped, so the namespace flag has no effect on this command; it is shown here only for consistency with the later steps:

[root@jumpbox ~]# kubectl create -f nfs-pv.yaml -n <namespace>
persistentvolume "nfs-share-pv" created

Verify that the PersistentVolume was created:

[root@jumpbox ~]# kubectl get pv -n <namespace>
NAME                   CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM                         STORAGECLASS   REASON    AGE
nfs-share-pv           10Gi       RWX           Retain          Available                                                          5m

2. Create a Persistent Volume Claim

A PersistentVolumeClaim (PVC) (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) is a request for storage by a user. A claim can request a specific volume size and access modes; based on these two attributes, the PVC is bound to a single PV. Once a PV is bound to a PVC, it cannot be bound to another PVC. However, multiple Pods can use the same PVC, which is exactly what happens when an SAP Data Hub Pipeline is executed with an Operator Group that has a volume mount point specified.
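The matching rule described here can be sketched in a few lines. This is a simplified model for illustration only, not the Kubernetes implementation: the real controller also takes storageClassName, label selectors and volumeMode into account, and the dictionary layout and the names to_bytes and can_bind are assumptions made for this example.

```python
# Simplified model of PV/PVC matching; storageClassName, selectors and
# volumeMode, which the real controller also checks, are not modelled.

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Convert a quantity such as '10Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # plain number of bytes


def can_bind(pv, pvc):
    """True if the PV offers every requested access mode and enough capacity."""
    modes_ok = set(pvc["accessModes"]) <= set(pv["accessModes"])
    size_ok = to_bytes(pv["capacity"]) >= to_bytes(pvc["request"])
    unclaimed = pv.get("claim") is None  # a PV is bound to at most one PVC
    return modes_ok and size_ok and unclaimed


pv = {"accessModes": ["ReadWriteMany"], "capacity": "10Gi", "claim": None}
pvc = {"accessModes": ["ReadWriteMany"], "request": "10Gi"}
print(can_bind(pv, pvc))  # True
```

With the values used in this tutorial (10Gi and ReadWriteMany on both sides), the claim and the volume match.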

Save the following PersistentVolumeClaim definition to a file, for example, nfs-pvc.yaml. The claim does not reference the NFS server or path; it is matched to the PV through the requested access mode and storage size:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-share-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

Create the PersistentVolumeClaim with kubectl (Make sure to specify the namespace where the SAP Data Hub Distributed Runtime is installed):

[root@jumpbox ~]# kubectl create -f nfs-pvc.yaml -n <namespace>
persistentvolumeclaim "nfs-share-pvc" created

Verify that the PVC was created and is bound to the NFS-based PV:

[root@jumpbox ~]# kubectl get pvc -n <namespace>
NAME            STATUS    VOLUME         CAPACITY   ACCESSMODES   STORAGECLASS   AGE
nfs-share-pvc   Bound     nfs-share-pv   10Gi       RWX                          1m
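The same check can be scripted, for example before a pipeline is started automatically. The sketch below assumes that kubectl is on the PATH and has access to the cluster; the function names pvc_phase and fetch_pvc_json are hypothetical, and the sample JSON is a heavily trimmed stand-in for the real kubectl output.

```python
import json
import subprocess


def pvc_phase(pvc_json):
    """Extract status.phase from the JSON printed by `kubectl get pvc -o json`."""
    return json.loads(pvc_json).get("status", {}).get("phase", "Unknown")


def fetch_pvc_json(name, namespace):
    """Run kubectl for a single PVC; requires kubectl and cluster access."""
    result = subprocess.run(
        ["kubectl", "get", "pvc", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


# Trimmed sample of what kubectl returns for a bound claim.
sample = '{"metadata": {"name": "nfs-share-pvc"}, "status": {"phase": "Bound"}}'
print(pvc_phase(sample))  # Bound
```

Against a live cluster, pvc_phase(fetch_pvc_json("nfs-share-pvc", "<namespace>")) should return Bound once the claim has been matched to the PV; a value of Pending means that no suitable volume has been found yet.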

3. Create an SAP Data Hub Pipeline with a File Consumer Operator

Create a new Graph in the SAP Data Hub Pipeline Modeler

Add a File Consumer Operator

Add a Terminal Operator

Connect the OutFilename Port of the File Consumer with the in1 Port of the Terminal:

Right-click the File Consumer and click on Open Configuration:

Set the path to /nfs_share (this is where we will later mount the NFS remote directory) and optionally add a pattern, for example .*.txt, to the pattern field (this restricts the directory listing to text files in the NFS share):
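The pattern looks like a regular expression. Assuming it is interpreted as one and matched against the whole file name (the operator's exact matching rules are not stated in this tutorial), the sketch below uses Python's re module purely as an illustration of what .*.txt accepts. The file names data.csv and notes_txt are made up for the example.

```python
import re

names = ["file1.txt", "file2.txt", "data.csv", "notes_txt"]

# Pattern from the tutorial: the second dot is unescaped, so it matches any character.
loose = re.compile(r".*.txt")
# Stricter variant: a literal dot and an anchor at the end of the name.
strict = re.compile(r".*\.txt$")

loose_hits = [n for n in names if loose.fullmatch(n)]
strict_hits = [n for n in names if strict.fullmatch(n)]

print(loose_hits)   # ['file1.txt', 'file2.txt', 'notes_txt']
print(strict_hits)  # ['file1.txt', 'file2.txt']
```

For the two demo files both patterns give the same result; the stricter form only matters if the share contains names that end in txt without a preceding dot.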

4. Add a Group and specify the Volume Mount

Right-click the File Consumer Operator and click on Group:

Right-click into the Group field and click on Open Configuration:

Give the Group a meaningful description, for example NFS Mount:

Open the JSON definition of the Graph:

Navigate to the JSON definition of the Group defined before:

Add an attribute volumes to the existing groups object that references the PVC and specifies where the NFS volume should be mounted within the corresponding Pod:

"volumes": { "nfs-share-pvc": "/nfs_share" }

This should result in a JSON document looking similar to this:
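The exact layout of a graph's JSON is not spelled out in this tutorial, so the following is only a minimal sketch under assumptions: it supposes that groups is a list of objects with name, nodes and metadata keys, which is a hypothetical structure, and that the volumes attribute belongs on the group entry. Only the volumes attribute and its value are taken from the text above; the helper add_volume and the names group1 and fileconsumer1 are invented for illustration.

```python
import json


def add_volume(graph, group_name, pvc_name, mount_path):
    """Attach a PVC mount to the named group and return the graph."""
    for group in graph.get("groups", []):
        if group.get("name") == group_name:
            group.setdefault("volumes", {})[pvc_name] = mount_path
            return graph
    raise KeyError(f"group {group_name!r} not found")


# Hypothetical, heavily trimmed graph; a real one carries many more fields.
graph = {
    "groups": [
        {"name": "group1", "nodes": ["fileconsumer1"],
         "metadata": {"description": "NFS Mount"}}
    ]
}

add_volume(graph, "group1", "nfs-share-pvc", "/nfs_share")
print(json.dumps(graph["groups"][0]["volumes"]))
# {"nfs-share-pvc": "/nfs_share"}
```

The key is the name of the PVC created in step 2 and the value is the mount point inside the Pod, which must match the path configured in the File Consumer.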

Switch back to the Diagram View and then Save and Execute the Graph.

When you right-click the Terminal and click on Open UI, you should see the names of all text files stored in the NFS share, polled every second:

That's it.