Having used SAP IQ since 1996, I have been involved in a significant number of customer journeys with the technology. I have also helped shape the product as well as help architect new technologies like SAP HANA dynamic tiering, SAP Nearline Store for Business Warehouse, and SAP Information Lifecycle Management for SAP Applications. Through this journey, I have collected quite a lot of experience with building new and innovative systems with SAP IQ. The main outcome of this has been the SAP IQ Hardware Sizing Guide (https://www.sap.com/documents/2017/02/363ddfab-a77c-0010-82c7-eda71af511fa.html). What I’ve started to notice, however, is that customer implementations are getting larger and larger, yet we tend to see that they are architected, deployed, and operationally managed the same as smaller systems. The main focus of this living blog will be to share my experiences with building these systems not only from a hardware and resource level, but also from an operational perspective.
For the sake of this blog and various calculations and algorithms used, let us assume that the system being built is an SAP IQ Multiplex with 40 nodes and a total database size of 1 PB (after compression). Each node will have 80 cores and 2 TB RAM.
SAP IQ has a manual section that covers the absolute limits of SAP IQ. This is covered in the “Introduction to SAP IQ” manual here: https://help.sap.com/viewer/a896c6a184f21015b5bcf4c7a967df07/126.96.36.199/en-US/a59810dd84f2101588df81a6f71e8ba3.html
It is important to note that these are the absolute, programmatic limits of SAP which may differ from the practical or best practices limits.
When dealing with massive systems, some things just take longer. For instance, backing up a database that is 100TB will be faster than backing up a system that is 10x larger at 1000TB (1PB). It is imperative that we realistically set expectations so that we don’t set ourselves up for failure with unrealistic expectations. We must be prepared to change how we view operational tasks at full scale and forget how we would have done those same tasks on a system that is 1-10% the size.
Database Consistency Checks
A system that is hundreds of TB or even 1 PB or more will have to change how consistency checks are done. While improvements have been made in the internals of these checks, the size of these systems still drive an increased runtime of these checks. We simply cannot use the whole database consistency check as we would on smaller systems. While it is much easier to issue a single statement and check the output, it is not efficient enough for large scale systems.
What has worked quite well is to simply check a subset of objects during this maintenance cycle. Some customers have implemented a weekly schedule where individual objects are checked quarterly, but the process runs weekly on a rolling subset of tables.
What has worked quite well for the very large systems is to use some of the object metadata available in IQ. Every table has a timestamp for when the data was last changed. If the data hasn’t changed since the last time a consistency check was run, there is little or no need to run a consistency check again. This leads to more efficient and focused consistency checks over time. Granted, the first time the process is run all objects must be checked as it was never run before. During the next run, though, only those tables whose data has changed since the last check need to be checked again.
Backups and Recovery
Every database needs to be backed up for high availability and disaster recovery protection. The issue is that traditional mechanisms don’t work at this scale. Backing up a database to disk or table at 5 TB/hour works quite well for a system that is less than 50-100 TB in size. At that speed, it would take 200 hours (over 8 days) to back up a 1PB database. Recovery time would be at least as long. You then have to factor in the number of tapes that it would take to hold such a large backup, if tapes are used.
One solution that has been used by IQ customers for over 15 years, is to use storage level backups. In short, leverage the storage technology to copy/replicate the IQ database to an alternate storage infrastructure. This will double the disk space needed to 2 PB, but it does provide for a near instantaneous recovery as it is a mirror image of the database and requires no recovery process. This backup copy has also been used to perform consistency checks so as not to impact production. As consistency checks usually come out clean, we don’t waste production resources. When an anomaly does occur, it would then be addressed in the production system so that the next check run on the backup comes out clean.
An alternate solution, that has been gaining some traction, is to use multiple dbspaces within SAP IQ to store different data sets. Historical data that is not changed, needs to be backed up just once. If that data were in a dedicated dbspace, it could be backed up once and not touched again until its data changes. The goal here is to reduce the total amount of data being backed up to only that data that has change, not the historical information that is static.
With system of this size, it is quite common to have an architecture that must support data loading and aggregation that is constantly running. Given that loads will be continuously running, it is important to dedicate a subset of SAP IQ nodes to data loading. This will allow for a consistent load performance and for systems that can be properly tuned for loading data. I would not recommend having the nodes handle both queries and data loading. We want to isolate the major tasks (read vs write) to nodes dedicate to that purpose. It can even be broken down further so that nodes dedicated to loading are further dedicated to loading certain tables.
End User Querying
Too many times, customers rush straight into DQP (dynamic query processing) in systems of this size. DQP allows a single query (SELECT statement) to be broken into very small work units that are then distributed across a subset of nodes in the multiplex. On the surface, this sounds perfectly normal. Experience, though, shows that this does not work well for all queries. While IQ does an admirable job at determining which queries to run in DQP and which to leave on a single SMP host (non-DQP mode), this is not a one size fits all situation. Query behavior needs to be investigated to see whether or not DQP would help. Once that decision is made, it is critical that the workload is tested both with and without DQP so that a proper decision can be made as to the direction the system takes on. The good news is that the DQP infrastructure is flexible enough to allow these changes to happen at any time without having to rearchitect the system or data.
The SAP IQ Hardware Sizing Guide explains, in detail, the proper sizing formulas for storage in an SAP IQ environment. Please follow this guidance for optimal performance in most use cases. To paraphrase that guidance, storage should be configured to deliver 50-100 MB/sec of IO per core in the system. In this 40 node system with 80 cores per node, there are a total of 3200 cores. The underlying storage should be able to deliver 16000-32000 MB/sec of total IO (3200 cores * 50 MB/sec, 3200 cores * 100 MB/sec).
IQ Page Size
Given that these systems are large, the page size should be 512KB and can only be set when the database is created. Normally, the page size would be much lower, however, these systems have significant RAM (1-2TB or more) to handle the increased page size without putting an undue burden on storage. It also allows the storage subsystem to be more effective with reading and writing data in much larger chunks.
System Main (IQ_SYSTEM_MAIN)
System main should be sized as a function of the main store(s) in the multiplex. As a general rule, we recommend sizing system main at 20% of the overall size of the IQ instance. When dealing with a system of this size, that is a significant amount of space at 200TB. Rather, consider allocating 10% of the database size aside for system main.
It is absolutely critical that the storage for IQ_SYSTEM_MAIN be on the fastest possible shared storage subsystem. SSDs or flash disks are strongly recommended. Each raw device that is presented to IQ should be able to sustain 500 MB/sec to 1 GB/sec of IO or more. This is necessary so that the checkpoint process, TLV, and freelist do not encounter IO bottlenecks. As database sizes increase so do the internal structures needed to manage that database (block map, freelist, dbspace mapping). These are critical for any data manipulation and are constantly revised. If the system main device(s) is slow, it will slow down data manipulation operations, commits, and checkpoints. In prior benchmarks (12PB SAP IQ Guinness World Record), we have seen checkpoint operations take 30 minutes or longer due to the sheer size. The product was enhanced to eliminate the internal bottlenecks, but raw device speed still plays a significant role in performance.
Under no circumstances should user data and indexes be placed on the system main dbspace. This is a system area and should only be used by SAP IQ for internal and system related activities.
In general, there is not much to be said about how to properly size the user defined main dbspace(s) in SAP IQ. This must be sized to store your data for a period of time that the business requires.
What is often overlooked is that this should always be oversized. Typically, 5-10% of this space should be free at all times. We want this extra space available so that when tasks like rebuilding indexes take place, there is enough free storage to accommodate the in-flight changes. The size needed will completely depend on the implementation, data, compression ratios, and row count or total size. In reality, you need enough extra space to contain the in-flight changes for all objects that are being rebuilt at a point in time. As this is generally unknown, 5-10% offers enough extra space to accommodate most systems.
Lastly, it is critical to follow the proper storage throughput sizing guidelines referenced in the SAP IQ Hardware Sizing Guide. Data load and query performance will hinge on having storage that can serve data up fast enough for the applications.
As mentioned in the “Backups and Recovery” section above, the use of multiple dbspaces is a way to help isolate data sets and offer a more robust environment with a data storage infrastructure that can be tailored to the needs of the data. Multiple dbspaces allows you to direct data to a class of storage that makes sense for the data: hot data on ultra-fast disk, less frequently used data on slower, less expensive disks, etc. Multiple dbspaces also allows you to develop a backup strategy that is much more flexible in nature in that older data can be safeguarded once while newer, more frequently changed data would need to be backed up more frequently.
Local Temporary Storage (IQ_SYSTEM_TEMP)
The IQ temp store provides two use cases. It is used for query and loading workspace as well as the storage location for temp tables. While there are options to shift both use cases to the shared temporary storage model, this is not recommended. Best practices and successful implementations show that using the local temporary storage for temporary tables and load/query work product is the best path forward.
As this is temporary storage, storage redundancy is not required. I see customers shifting from deep storage protection to faster storage that can serve data up quicker. In extreme cases, I even see customers moving their local temporary storage to either SSDs for flash storage for the ultimate in performance of such a volatile area.
IQ_SYSTEM_TEMP is generally sized, in total, at 10-25% the total size of the IQ database. For a system of this size, that would be 100-250TB distributed across all nodes. This is a general starting point and may need to be adjusted up or down based on the activity on individual nodes. For instance, nodes that are doing more loading than queries may need less storage, while highly concurrent query nodes could need more.
The temp sizing is more art than science. By that I mean, that it is quite hard to know up front what the sizing should be. Sizing will vary widely as more users are added, as more complicated queries are used, whether or not DQP is enabled, and whether or not the application logic creates temporary storage structures.
Shared Temporary Storage (IQ_SHARED_TEMP)
Shared temporary storage is used for intermediate result sharing of query fragments in a DQP (dynamic query processing) scenario. IQ can use either physical shared storage or a network interface for this result sharing. It is NOT recommended to use the network-based approach for intermediate result sharing due to performance and latency issues. I would also recommend not using shared temporary storage for temp tables and local work product. Allow IQ_SHARED_TEMP to focus solely on DQP intermediate result sharing and nothing else.
IQ_SHARED_TEMP is only needed if they plan on running DQP. I usually use the same sizing metric for shared temp as I do for system temp. 10-25% the size of the IQ database. It is important to note that IQ prefers to keep space allocated to a node once it has already been requested but is done being used. This prevents IQ from having to constantly ask for disk space only to free it up seconds later when it isn’t needed. The side effect of this is that the space will be reserved for a node and not usable by any other node, until it has been completely freed.
It is strongly recommended that the network infrastructure be divided into two aspects: public and private. The private network would be dedicated to IQ node-to-node communication while the public network would be used to serve up the results to end users (client-server communication).
I would use a network that has a minimum of 10 gbit/sec throughput. For the private network, just one channel/controller per host is needed. The public network may require multiple network interfaces depending on the amount of data that is being sent between the clients and servers. While most networks cap at 10 gbit/sec, it is quite common to see bonded networks used to increase throughput while keeping the architect and infrastructure simple (https://en.wikipedia.org/wiki/Channel_bonding).
The SAP IQ Hardware Sizing Guide recommends starting each of the three main caches at 30% of the machine RAM. For a system with 2 TB, that would be 600GB for main cache (-iqmc), 600GB for temp cache (-iqtc), and 600GB for large memory (-iqlm), leaving 200 GB of RAM to the OS and any non-IQ applications. Once the system is established, monitor the cache usage to adjust as necessary. I cannot overstate how critical it is to have enough large memory for IQ, regardless of what the monitor output shows. We have seen, time after time, that the large memory section is reduced to the point of causing bottlenecks and being completely ineffective. My preference is to raise and lower the main and temp caches without touching large memory.
It is recommended to leave the IQ threads setting to default (-iqmt). The catalog, however, should have threads (-gn) set to 1.5x the number of connections (-gm). If -gm (connections) is set to 100, then -gn (threads) should be set to 150.
It is critical that the 1.5 threads per connection metric be followed so as to not starve the IQ workload in the catalog.
It is also important not to set the number of connections significantly higher than the expected peak. IQ carries overhead for each possible connection. If you set IQ to the maximum of 1024 connections, we allocate threads and resources for 1024 concurrent users. This is extremely wasteful when the actual concurrency is much lower. It is perfectly normal for these options to change on a node by node basis as the workloads would be very different.
Connection Threads (Max_IQ_Threads_Per_Connection and Max_IQ_Threads_Per_Team)
Once connected, a user will have a pool of up to 144 threads (default) to use for processing their workload. Generally, it is not necessary to change these options. However, we have seen that for high concurrency systems with continuous data loading, that having such a large value slows down the processing time.
As the concurrency increases, so does the total number of threads needed for the system to perform the work. Due to the high thread setting (144) per connection, each additional user caused a significant amount of overhead at the OS level to manage the number of threads needed to perform concurrent loads. In short, IQ will send too many threads to the OS causing significant overhead just to manage the threads.
What we can infer from this, and prior tests, is that lowering the overall threads needed for a load, or series of loads, needs will reduce the impact on the system. The impact is reduced simply because the OS does not need to manage nearly as many threads. If the thread counts are reduced from 144 to 5, this results in the OS needing to manage 96% fewer threads. With this, the percentage of time that the OS spends in system (thread management) is replaced by a higher user consumption of the CPUs and a more efficient SAP IQ system.
As with all options, it is best to test them for your particular scenario as the impact can vary widely.
When moving into systems that manage hundreds of terabytes to petabytes of compressed data, it is imperative to understand that nearly every aspect of building, maintaining, and managing a system of this magnitude. I cannot stress enough how setting proper expectations is critical. It is unrealistic to expect a system of this size to perform the same as something that is 1/10th the size.