
Network File System (NFS) is a distributed filesystem protocol that is commonly used to share files over the network. It enables users to mount remote directories on their servers and access the remote files in the same way local storage is accessed.

This tutorial explains how to access data stored on an NFS share from within SAP Data Hub Pipelines (on-premise).

Overview

The process to achieve this is as follows:

  1. Create a Persistent Volume (PV) for the NFS share in your Kubernetes cluster
  2. Create a Persistent Volume Claim (PVC) in your Kubernetes cluster which claims the PV (1)
  3. Create an SAP Data Hub Pipeline with a File Consumer operator that reads from a local path
  4. Add the File Consumer to an Operator Group and specify a mount point for the NFS Volume within the Group matching the local path (3)

NFS file share

To follow the steps of this tutorial, you must have an NFS server running and a share exported with read/write permissions. For illustration purposes, we use

  • NFS Server Hostname: nfs-server-host
  • NFS Remote Directory: /remote_dir

in all the commands. Make sure that you replace the hostname and the remote directory with your own NFS settings.

For demo purposes, we have placed two files in our remote directory:

[root@nfs-server-host remote_dir]# ls /remote_dir/
file1.txt  file2.txt

1. Create an NFS-based Persistent Volume

During runtime, the SAP Data Hub Pipeline runs Pipeline Operators as processes in Pods (groups of one or more containers) in the Kubernetes cluster. That means, to access data that is stored on an NFS share from within an Operator, the NFS share must be mounted in the corresponding Pod.

An NFS Volume (https://kubernetes.io/docs/concepts/storage/volumes/#nfs) allows an existing NFS share to be mounted into a Pod and this can be managed by the Kubernetes PersistentVolume (PV) API (https://kubernetes.io/docs/concepts/storage/persistent-volumes/):

  • Save the following PersistentVolume definition to a file, for example nfs-pv.yaml, and replace the server and the path with your NFS share details:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: nfs-share-pv
spec:
  capacity:
    storage: 10Gi
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server-host
    path: /remote_dir
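  • Create the PersistentVolume, for example with kubectl. PersistentVolumes are cluster-scoped, so the -n <namespace> option has no effect on the PV itself; the namespace where the SAP Data Hub Distributed Runtime is installed becomes relevant for the PersistentVolumeClaim in step 2: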
[root@jumpbox ~]# kubectl create -f nfs-pv.yaml -n <namespace>
persistentvolume "nfs-share-pv" created
  • Verify that the PersistentVolume was created:
[root@mjumpbox ~]# kubectl get pv -n <namespace>
NAME                   CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM                         STORAGECLASS   REASON    AGE
nfs-share-pv           10Gi       RWX           Retain          Available                                                          5m
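
If you need the same PersistentVolume for several NFS exports, the manifest can also be generated rather than edited by hand. The following is a minimal sketch, not part of the original setup: the helper name nfs_pv_manifest and its parameters are hypothetical, and it emits JSON instead of YAML (kubectl create -f accepts both formats).

```python
import json


def nfs_pv_manifest(name, server, path, size="10Gi"):
    """Build a PersistentVolume manifest for an NFS export as a plain dict.

    Hypothetical helper: the field values mirror the nfs-pv.yaml shown above.
    """
    return {
        "kind": "PersistentVolume",
        "apiVersion": "v1",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "persistentVolumeReclaimPolicy": "Retain",
            "accessModes": ["ReadWriteMany"],
            "nfs": {"server": server, "path": path},
        },
    }


# Same values as in the tutorial; replace them with your NFS settings.
manifest = nfs_pv_manifest("nfs-share-pv", "nfs-server-host", "/remote_dir")
pv_json = json.dumps(manifest, indent=2)
print(pv_json)
```

Redirecting the printed output to a file such as nfs-pv.json and passing that file to kubectl create -f should yield the same PV as the YAML version.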

 

2. Create a Persistent Volume Claim

A PersistentVolumeClaim (PVC) (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) is a request for storage by a user. The Claim can request a specific volume size and access modes and based on these two attributes, a PVC is bound to a single PV. When a PV is bound to a PVC, that PV cannot be bound to another PVC. However, multiple Pods can use the same PVC. This is exactly what is happening when executing an SAP Data Hub Pipeline with an Operator Group that has a Volume mount point specified.
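
To make the binding rule concrete, here is a deliberately simplified model of it in Python. It only looks at the two attributes named above (size and access modes) plus the one-claim-per-volume rule; the real Kubernetes binder also considers things like the storage class and selectors, which this sketch ignores. The dictionary layout and the function names are illustrative assumptions, not a Kubernetes API.

```python
# Binary quantity suffixes as used in the manifests (e.g. "10Gi").
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Convert a quantity such as '10Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def can_bind(pv, pvc):
    """Simplified check: the PV is unclaimed, large enough, and offers all requested modes."""
    unclaimed = pv.get("claimed_by") is None
    big_enough = to_bytes(pv["capacity"]) >= to_bytes(pvc["request"])
    modes_ok = set(pvc["access_modes"]) <= set(pv["access_modes"])
    return unclaimed and big_enough and modes_ok


pv = {"name": "nfs-share-pv", "capacity": "10Gi",
      "access_modes": ["ReadWriteMany"], "claimed_by": None}
pvc = {"name": "nfs-share-pvc", "request": "10Gi",
       "access_modes": ["ReadWriteMany"]}

print(can_bind(pv, pvc))           # True: size and access modes match
pv["claimed_by"] = pvc["name"]     # once bound, the PV is taken
print(can_bind(pv, {"name": "other", "request": "1Gi",
                    "access_modes": ["ReadWriteMany"]}))  # False
```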

  • Save the following PersistentVolumeClaim definition to a file, for example nfs-pvc.yaml. The claim does not reference the NFS server or path; it only requests an access mode and a storage size, which must be satisfiable by the PV created in step 1:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-share-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  • Create the PersistentVolumeClaim with kubectl (Make sure to specify the namespace where the SAP Data Hub Distributed Runtime is installed):
[root@jumpbox ~]# kubectl create -f nfs-pvc.yaml -n <namespace>
persistentvolumeclaim "nfs-share-pvc" created
  • Verify that the PVC was created and is bound to the NFS-based PV:
[root@mjumpbox ~]# kubectl get pvc -n <namespace>
NAME            STATUS    VOLUME         CAPACITY   ACCESSMODES   STORAGECLASS   AGE
nfs-share-pvc   Bound     nfs-share-pv   10Gi       RWX                          1m
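
If you want to script this check instead of reading the table, the claim's phase and bound volume are available in the JSON form of the object (status.phase and spec.volumeName). The sketch below parses a hand-written, heavily trimmed sample that stands in for the output of kubectl get pvc nfs-share-pvc -o json; the sample and the helper name bound_volume are illustrative assumptions, not captured output.

```python
import json

# Hypothetical, trimmed stand-in for `kubectl get pvc nfs-share-pvc -o json`.
sample = """
{
  "kind": "PersistentVolumeClaim",
  "metadata": {"name": "nfs-share-pvc"},
  "spec": {"volumeName": "nfs-share-pv"},
  "status": {"phase": "Bound"}
}
"""


def bound_volume(pvc_json):
    """Return the name of the bound PV, or None if the claim is not in phase Bound."""
    pvc = json.loads(pvc_json)
    if pvc.get("status", {}).get("phase") != "Bound":
        return None
    return pvc.get("spec", {}).get("volumeName")


print(bound_volume(sample))  # nfs-share-pv
```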

 

3. Create an SAP Data Hub Pipeline with a File Consumer Operator

  • Create a new Graph in the SAP Data Hub Pipeline Modeler
  • Add a File Consumer Operator
  • Add a Terminal Operator
  • Connect the OutFilename Port of the File Consumer with the in1 Port of the Terminal:

  • Right-click the File Consumer and click on Open Configuration:

  • Set the path to /nfs_share (this is where we will later mount the NFS remote directory). Optionally, add a pattern such as .*.txt to the pattern field so that only text files in the NFS share are considered when reading the directory content:
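
The example value .*.txt suggests that the pattern field is interpreted as a regular expression, although the post does not state which matching engine the File Consumer uses. Under that assumption, and using Python's re module purely for illustration, the sketch below shows a subtlety: the unescaped dot before txt matches any character, so a stricter pattern would escape it. The file names other than file1.txt and file2.txt are made up for the demonstration.

```python
import re

# file1.txt and file2.txt come from the demo share; the rest are hypothetical.
names = ["file1.txt", "file2.txt", "notes.md", "report_txt"]

# Pattern as used in the tutorial: the last "." matches ANY character.
loose = [n for n in names if re.fullmatch(r".*.txt", n)]

# Stricter variant: "\." only matches a literal dot.
strict = [n for n in names if re.fullmatch(r".*\.txt", n)]

print(loose)   # ['file1.txt', 'file2.txt', 'report_txt']
print(strict)  # ['file1.txt', 'file2.txt']
```

For the two demo files both patterns behave the same; the difference only matters if the share contains names like the hypothetical report_txt.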

 

4. Add a Group and specify the Volume Mount

  • Right-click the File Consumer Operator and click on Group:

  • Right-click into the Group field and click on Open Configuration:

  • Give the Group a meaningful description, for example NFS Mount:

  • Open the JSON definition of the Graph:

  • Navigate to the JSON definition of the Group defined before:

  • Add an attribute volumes to the existing groups object that references the PVC and specifies where the NFS volume should be mounted within the corresponding Pod:
"volumes": { "nfs-share-pvc": "/nfs_share" }
  • This should result in a JSON document looking similar to this:

  • Switch back to the Diagram View and then Save and Execute the Graph.

 

  • When you right-click the Terminal and click on Open UI, you should see the names of all text files stored on the NFS share, listed again each second as the directory is polled:
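
For reference, the manual JSON edit from step 4 can be pictured as the small transformation below. This is only a sketch: the surrounding graph structure (a groups list whose entries carry name, nodes and metadata.description) and the names group1 and fileconsumer1 are assumptions made for illustration. Only the volumes attribute and its value are taken from the tutorial, so compare against the JSON view of your own Graph before relying on it.

```python
import json

# Hypothetical, trimmed graph definition; the real one contains more keys.
graph = {
    "groups": [
        {
            "name": "group1",                        # assumed name
            "nodes": ["fileconsumer1"],              # assumed operator id
            "metadata": {"description": "NFS Mount"},
        }
    ]
}


def add_volume_mount(graph, description, claim_name, mount_path):
    """Add a volumes entry to the group with the given description."""
    for group in graph.get("groups", []):
        if group.get("metadata", {}).get("description") == description:
            group.setdefault("volumes", {})[claim_name] = mount_path
            return group
    raise ValueError("no group with description " + repr(description))


add_volume_mount(graph, "NFS Mount", "nfs-share-pvc", "/nfs_share")
print(json.dumps(graph["groups"][0], indent=2))
```

The key is the name of the PVC from step 2 and the value is the mount point, which has to match the path configured on the File Consumer in step 3.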

That’s it.
