How to access NFS shares from within SAP Data Hub Pipelines
Network File System (NFS) is a distributed filesystem protocol that is commonly used to share files over the network. It enables users to mount remote directories on their servers and access the remote files in the same way local storage is accessed.
This tutorial explains how to access data stored on an NFS share from within SAP Data Hub Pipelines (on-premise).
Overview
The process to achieve this is as follows:
- Create a Persistent Volume (PV) for the NFS share in your Kubernetes cluster
- Create a Persistent Volume Claim (PVC) in your Kubernetes cluster which claims the PV created in step 1
- Create an SAP Data Hub Pipeline with a File Consumer operator that reads from a local path
- Add the File Consumer to an Operator Group and specify a mount point for the NFS volume within the Group that matches the local path from step 3
NFS file share
In order to perform the following steps of this tutorial, you must have an NFS server running and a share with read/write permissions exported. For illustration purposes, we use
- NFS Server Hostname: nfs-server-host
- NFS Remote Directory: /remote_dir
in all the commands. Please make sure that you replace the hostname and the remote directory with your NFS settings accordingly.
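If the directory is not exported yet, a minimal export configuration on the NFS server could look like the following sketch (the export options, in particular the wildcard client and no_root_squash, are only an example and should be adapted to your security requirements):
[root@nfs-server-host ~]# cat /etc/exports
/remote_dir *(rw,sync,no_root_squash)
[root@nfs-server-host ~]# exportfs -ra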
For demo purposes, we have placed two files in our remote directory:
[root@nfs-server-host remote_dir]# ls /remote_dir/
file1.txt file2.txt
1. Create an NFS-based Persistent Volume
At runtime, an SAP Data Hub Pipeline runs its Operators as processes in Pods (groups of one or more containers) in the Kubernetes cluster. That means that, to access data stored on an NFS share from within an Operator, the NFS share must be mounted in the corresponding Pod.
An NFS Volume (https://kubernetes.io/docs/concepts/storage/volumes/#nfs) allows an existing NFS share to be mounted into a Pod and this can be managed by the Kubernetes PersistentVolume (PV) API (https://kubernetes.io/docs/concepts/storage/persistent-volumes/):
- Save the following PersistentVolume definition to a file, for example, nfs-pv.yaml and replace the server and the path with your NFS share details accordingly:
kind: PersistentVolume
apiVersion: v1
metadata:
  name: nfs-share-pv
spec:
  capacity:
    storage: 10Gi
  persistentVolumeReclaimPolicy: Retain
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs-server-host
    path: /remote_dir
- Create the PersistentVolume, for example with kubectl (make sure to specify the namespace where the SAP Data Hub Distributed Runtime is installed):
[root@jumpbox ~]# kubectl create -f nfs-pv.yaml -n <namespace>
persistentvolume "nfs-share-pv" created
- Verify that the PersistentVolume was created:
[root@jumpbox ~]# kubectl get pv -n <namespace>
NAME           CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
nfs-share-pv   10Gi       RWX           Retain          Available                                   5m
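Optionally, you can check from one of the Kubernetes worker nodes that the export is actually reachable, for example with showmount (this assumes the NFS client utilities are installed on the node; the node name below is just a placeholder):
[root@k8s-node ~]# showmount -e nfs-server-host
Export list for nfs-server-host:
/remote_dir *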
2. Create a Persistent Volume Claim
A PersistentVolumeClaim (PVC) (https://kubernetes.io/docs/concepts/storage/persistent-volumes/#persistentvolumeclaims) is a request for storage by a user. The Claim can request a specific volume size and access modes, and based on these two attributes a PVC is bound to a single PV. Once a PV is bound to a PVC, that PV cannot be bound to another PVC. However, multiple Pods can use the same PVC, and this is exactly what happens when executing an SAP Data Hub Pipeline with an Operator Group that has a Volume mount point specified.
- Save the following PersistentVolumeClaim definition to a file, for example, nfs-pvc.yaml. The requested access mode and storage size must be compatible with the PersistentVolume created above so that the Claim can be bound to it:
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nfs-share-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
- Create the PersistentVolumeClaim with kubectl (Make sure to specify the namespace where the SAP Data Hub Distributed Runtime is installed):
[root@jumpbox ~]# kubectl create -f nfs-pvc.yaml -n <namespace>
persistentvolumeclaim "nfs-share-pvc" created
- Verify that the PVC was created and is bound to the NFS-based PV:
[root@jumpbox ~]# kubectl get pvc -n <namespace>
NAME            STATUS   VOLUME         CAPACITY   ACCESSMODES   STORAGECLASS   AGE
nfs-share-pvc   Bound    nfs-share-pv   10Gi       RWX                          1m
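If the Claim stays in status Pending instead, kubectl describe can help to find out why (for example, when the requested size or access mode does not match the PV):
[root@jumpbox ~]# kubectl describe pvc nfs-share-pvc -n <namespace>
In the output, Status should be Bound and Volume should point to nfs-share-pv.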
3. Create an SAP Data Hub Pipeline with a File Consumer Operator
- Create a new Graph in the SAP Data Hub Pipeline Modeler
- Add a File Consumer Operator
- Add a Terminal Operator
- Connect the OutFilename Port of the File Consumer with the in1 Port of the Terminal:
- Right-click the File Consumer and click on Open Configuration:
- Set the path to /nfs_share (this is where we will later mount the NFS remote directory) and optionally add, for example, .*.txt to the pattern field (this restricts the directory listing to text files on the NFS share).
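For reference, the File Consumer's entry in the Graph's JSON definition might then contain a config section similar to the following sketch (only the two fields set above are shown):
"config": {
  "path": "/nfs_share",
  "pattern": ".*.txt"
}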
4. Add a Group and specify the Volume Mount
- Right-Click the File Consumer Operator and click on Group:
- Right Click into the Group field and click on Open Configuration:
- Give the Group a meaningful description, for example NFS Mount:
- Open the JSON definition of the Graph:
- Navigate to the JSON definition of the Group defined before:
- Add an attribute volumes to the existing groups object that references the PVC and specifies where the NFS volume should be mounted within the corresponding Pod:
"volumes": { "nfs-share-pvc": "/nfs_share" }
- This should result in a JSON document looking similar to this:
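(The group and operator instance names below, group1 and fileconsumer1, are only illustrative; the exact structure and the generated names in your Graph may differ slightly.)
"groups": [
  {
    "name": "group1",
    "nodes": [
      "fileconsumer1"
    ],
    "description": "NFS Mount",
    "volumes": {
      "nfs-share-pvc": "/nfs_share"
    }
  }
]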
- Switch back to the Diagram View and then Save and Execute the Graph.
- When you right-click the Terminal and click on Open UI, you should see the names of all text files stored on the NFS share, polled every second:
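The exact format of the emitted entries depends on the operator version; based on the two demo files placed on the share earlier, the Terminal output could, for example, look like this (repeated every second):
/nfs_share/file1.txt
/nfs_share/file2.txt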
That’s it.
In the File Consumer configuration, how can I read a file from the local host, I mean from wherever Data Hub is installed?
Hi Vikky,
when Data Hub is running in cluster mode (i.e., installed on Kubernetes), the path is always local to the Pod/container where the Operator is running. In order to access files from the local host, you have to create a local persistent volume (https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/) instead of an NFS volume and then follow the same steps described in this post.
Best regards
Jens
Thanks a lot for the information.