As the topic lead for SAP HANA (which includes SAP Data Hub and the SAP HANA Data Management Suite)
within the SAP Mentors influencer program, I am often asked what SAP Data Hub actually is.
Of course, there is the official SAP Data Hub web site, a great blog by the SAP Data Hub product owner Marc Hartz, and a blog of my own that puts SAP Data Hub into perspective with the other SAP HANA Data Management Suite components.
In this blog, however, I take a reverse-engineering look at what is available in SAP Data Hub 2.4.0 today:
Big data access platform
In the past, data was often replicated into the applications that needed to operate on it, using some sort of ETL (Extract, Transform, Load) tool. That is no longer feasible with big data, because it is already a challenge to store the data once and make it accessible, e.g. via MapReduce in Apache Hadoop. Therefore, SAP Data Hub accesses such data in situ:
These are the currently Supported Connection Types with their respective Capabilities.
In conjunction with SAP Agile Data Preparation in particular, and as a big data enhancement to what I described in my SAP Agile Data Preparation Tutorial, only a subset of the data is downloaded temporarily to design the data transformation, which can then be applied to the whole dataset:
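To make the sample-then-apply idea more tangible, here is a minimal Python sketch. It is not SAP Agile Data Preparation or any SAP Data Hub API: the function names, the column names and the made-up dataset are all my own assumptions for illustration. The transformation is designed by looking at a small sample only, and the resulting function is then run over the whole dataset.

```python
import random


def draw_sample(rows, size, seed=0):
    """Pick a small random subset of the rows for transformation design."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))


def _is_number(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def design_transformation(sample):
    """Infer the numeric columns from the sample and return a row-level function."""
    columns = sample[0].keys()
    numeric = {c for c in columns if all(_is_number(r[c]) for r in sample)}

    def transform(row):
        # Cast numeric columns to float, trim whitespace everywhere else.
        return {c: float(v) if c in numeric else v.strip() for c, v in row.items()}

    return transform


def apply_to_whole(rows, transform):
    """Apply the designed transformation lazily, one row at a time."""
    for row in rows:
        yield transform(row)


# Hypothetical dataset: 1000 raw string records.
whole_dataset = [{"station": f" S{i} ", "temp": str(20 + i % 5)} for i in range(1000)]

sample = draw_sample(whole_dataset, 50)        # only this subset is "downloaded"
transform = design_transformation(sample)      # design on the sample
result = list(apply_to_whole(whole_dataset, transform))  # apply to everything
```

The design step never touches more than 50 rows, which is the point: the expensive pass over the full data happens only once, with a transformation that is already known.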
Metadata in SAP Data Hub is managed via the SAP Data Hub Metadata Explorer:
As a data scientist searching, for example, for temperature information, I am presented with all the catalogs that potentially contain such data:
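Conceptually, such a search is a keyword match over the metadata that has been collected for each dataset. The following Python sketch illustrates that idea only; it is not how the Metadata Explorer is implemented, and the `CatalogEntry` structure, the connection names and the paths are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one published dataset."""
    connection: str
    path: str
    columns: list
    tags: list = field(default_factory=list)


def search_catalog(entries, keyword):
    """Return the entries whose path, column names or tags mention the keyword."""
    needle = keyword.lower()
    hits = []
    for entry in entries:
        haystack = [entry.path, *entry.columns, *entry.tags]
        if any(needle in text.lower() for text in haystack):
            hits.append(entry)
    return hits


# Made-up catalog content for illustration.
catalog = [
    CatalogEntry("HDFS", "/sensors/weather.csv", ["station", "temperature", "humidity"]),
    CatalogEntry("S3", "/sales/orders.parquet", ["order_id", "amount"]),
    CatalogEntry("HANA", "PLANT.READINGS", ["ts", "value"], tags=["Temperature"]),
]

matches = search_catalog(catalog, "temperature")
for entry in matches:
    print(entry.connection, entry.path)
```

Note that the search runs against metadata only, so the underlying data stays where it is.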
Massively parallel data processing platform
All SAP Data Hub operations, including the profiling tasks above, are executed as Kubernetes Pods, in this case distributed over one driver pod and three execution pods:
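The driver and execution pattern can be sketched in a few lines of Python. This is an analogy under my own assumptions, not SAP Data Hub code: three worker threads stand in for the three execution pods, the calling function plays the driver, and the profiling task is reduced to count, minimum and maximum of a made-up list of temperatures.

```python
from concurrent.futures import ThreadPoolExecutor


def profile_partition(values):
    """Work done by one executor: profile its own partition of the data."""
    return {"count": len(values), "min": min(values), "max": max(values)}


def merge_profiles(profiles):
    """Work done by the driver: combine the partial profiles into one."""
    return {
        "count": sum(p["count"] for p in profiles),
        "min": min(p["min"] for p in profiles),
        "max": max(p["max"] for p in profiles),
    }


def run_driver(values, executors=3):
    """Split the data, fan it out to the executors and merge the results.

    Assumes a non-empty list of values.
    """
    partitions = [values[i::executors] for i in range(executors)]
    partitions = [p for p in partitions if p]  # skip empty partitions
    with ThreadPoolExecutor(max_workers=executors) as pool:
        partial = list(pool.map(profile_partition, partitions))
    return merge_profiles(partial)


# Hypothetical temperature readings.
temperatures = [21.5, 19.0, 23.4, 18.2, 25.1, 20.0, 22.7]
profile = run_driver(temperatures)
print(profile)
```

Because count, minimum and maximum can be merged from partial results, each executor only ever needs to see its own partition, which is what makes this kind of profiling easy to parallelize.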
This is not only a state-of-the-art cloud-native architecture but, in my opinion, also potentially the foundation for other future SAP products.