The widespread adoption of Spark as a distributed processing engine over the last 3 years has been shifting the profile of big data systems towards in-memory processing; in a similar manner to the in-memory shifts in the database product world.
Federated in-memory data processing on big-data clusters is complex because of the need to distribute processing over tens (hundreds, or thousands) of machines/nodes/workers. While database systems may have dense complexity around concurrency, transaction management and procedural code execution; no vendor database is designed to run across hundreds of machines.
Apache Spark (“a fast and general purpose cluster computing engine”) was designed to manage distributed workloads in-memory: including split and merge of tasks and results, data shuffling between nodes, failure handling, multiple APIs (Java, R, scala and Python).
Spark generally runs alongside MapReduce on Hadoop clusters (or data lakes) where data processing can be scheduled on either processing engine depending on which is more suitable. Unlike on-prem clusters, cloud data lakes are usually containerized; so the number of cluster nodes is not limited by the number of physical machines in the cluster. By dynamically adding containers to the cluster when the workload is high, Spark can redistribute its workload to the new containers, scaling the processing power of the cluster. Fixed node data lakes, where each physical machine is a cluster worker node are being replaced by the next generation of containerized clusters, that can rapidly scale up and down as workload demands. These systems are a natural fit for the cloud where complex system administration and security can be centralized for large clusters of servers.
Running Spark on a containerized cluster also enables scaling of each of Spark’s sub-engines: Spark SQL, Spark Streaming, Spark mlib (Machine Learning), and Spark GraphX (Graph engine): opening up new possibilities for processing enterprise data at scale, in memory.
Instead of traditional file based ETL, Spark Streaming processes incoming data using tight cycle batches that behave like batch pipelines. Spark Structured Streaming (released with Spark 2.x) has simplified concepts and syntax for stream definition and modeling. Anyone that understands SQL can now write structured streams. Streaming queries resemble SELECT statements that pipeline data to the next streaming layer; where it can be persisted (parquet or csv) or it can be aggregated in-memory across all cluster nodes while still being queried externally as a global temporary view.
Structured streaming enables in-memory aggregation of streaming data: in other words – one of the key features of on-premise streaming. By pairing Spark Streaming with on-prem (or on-cloud) streaming products, it is now possible to build low-latency high-volume streaming pipelines of data into containerized big-data systems, where data can be queried and aggregated, at scale, throughout the pipeline. Sensitive data can be tokenized before being sent to the cloud so that data science analysis and aggregation is still possible without revealing personally identifiable data.
With a rich portfolio of data platform products, SAP is uniquely positioned to build high performance Continuous Processing platforms, including SAP HANA, SAP HANA Streaming Analytics, SAP Cloud Platform Big Data Services, SAP HANA Smart Data Access and SAP Analytics Cloud; paired with various open source products that customers may already be running, such as Apache Kafka, Prometheus and Grafana.
Each product serves a purpose: SAP Streaming Analytics runs on-prem (or in-country) as a native service for the SAP HANA database – enabling functionality like tokenization of sensitive data, database vaulting of generated tokens, filtering and realtime aggregation of recent streaming activity (minutes, hours). Spark Structured Streaming, running on the big-data cluster, processes multiple concurrent streams of incoming data, maintaining in-memory aggregates of much larger volumes (days, weeks); repartitioning and reformating the streams to save as parquet file for rapid SQL query retrieval. In between the two streaming endpoints, Apache Kafka manages the high throughput, low latency data feed that can be parallelized, managed and monitored at the level of an enterprise application. Secure, low latency, reliable streaming to global endpoints is not only feasible, but it has become desirable: big data can be sent to platforms where is the most secure and the best managed.
My next blog post will provide more details about a system running inside SAP, processing hundreds of millions of streaming messages per day between Singapore and Germany.