Please Note: This blog provides links to information in SAP Service Marketplace, which requires that you log in there. If you need to register for a Service Marketplace login, or have forgotten your password, try the useful FAQ. If you already have a Service Marketplace login and would like to associate it with your SDN login, then we can help you right here.
Introduction You will certainly agree if I say that having large volumes of old data in the database will not really benefit the performance of your system, for example, when you are displaying individual business documents or running an analysis across several documents. The reason why this statement seems uncontentious is obvious: the larger the amount of data that the display or reporting program has to search through, the longer it takes to locate and display the data in question. But can this statement be regarded as “set in stone”, or as a kind of “physical law” that always applies? Can we deduce from this that system performance is more or less proportional to the size of the database, meaning that, if you reduce the data volume in the database, the performance inevitably increases by the same degree? In this blog I’ll try to shed some light on these questions and find out under which circumstances old data actually impairs performance and under which not.
Of Old and New Data
To get started, let’s briefly revisit the term “old data” mentioned above. What does it actually mean? Well, “old” refers to data pertaining to closed business processes that are not relevant for operational use any more. This type of data is also called “business complete”. It is not required any more for daily business, therefore, it is only accessed in rare cases, for example, during a tax or internal audit, or to fulfill product liability requirements. The scenario depicted in the first paragraph would certainly hold true if the data in the database were accessed at random with no distinction being made between old and new data. But, why on earth should a user ever define their selection criteria in such a way that the system does not only select the data in question, but also all kind of other data not needed for the task the user wants to accomplish? Assuming that users will never do this on purpose, we can also assume that, under normal circumstances, only new data is requested by the application. This implies that old data will not negatively affect performance, as it is never touched. Well, things are not as simple as they may seem because many performance problems are related to technical issues rather than to the selection behavior of users. For our discussion, it is important to visualize that data is always read from the hard disk in blocks, data units that are processed as a whole. This applies to both the tables holding the actual data, and the indexes used to access the data. In this blog I will concentrate on the mixture of old and new data in indexes, because the type of index used for accessing data strongly affects the I/O rate.
The Sorting Order is the Decisive Factor
So, even if a user restricts the data selection in an exemplary fashion, it cannot be avoided that old data from the index be read into the random access memory (RAM) of course, only if the data block transferred contains a mixture of new and old data. But how can this happen? Well, basically, a mixture of new and old data in the database tables can occur whenever data is deleted from the database. When new data is added to the database chances are high that it will be inserted into these “gaps”. For indexes, however, deletion is not a prerequisite for obtaining a mixture of old and new data, because new entries are always inserted according to the values of the index fields whether a suitable gap is available or not (if necessary, the database will even split a block to meet the requirement). Regarding the possible reuse of gaps in indexes we must know that index entries can be generally sorted in two different ways: either chronologically (organized in reference to a time factor) or non-chronologically (organized in reference to a factor other than time, e.g. GUIDs). In chronologically sorted indexes, which also includes indexes organized by document numbers (provided they ascend with time), old and new data never resides in close proximity to each other, making it virtually impossible that a “contaminated” data block with new and old data be read into the memory. With non-chronologically sorted indexes, however, things are different. As such an index does not ascend over time, new data will be inserted in locations distributed across the entire width of the index. This ends up in many data blocks containing a mixture of new and old data. When such a data block is read into the memory the buffer quality for query accesses goes down, leading to a considerable decrease in performance.
So, what is the message that I want to pass along to you with these reflections? Simply put, if improving performance is the major impetus for your project, get rid of old data that is accessed using a non-chronologically sorted index. In most cases this can be accomplished by archiving. Archiving data that is accessed using a chronologically sorted index, however, will free disk space, but it will never improve performance. (Theoretically, data archiving can also reduce the index depth, which may result in a performance increase. As this is virtually impossible to establish, I leave this aspect aside.)
If you want more detailed information on the influence of data archiving on performance, see our white paper “Performance Aspects of Data Archiving” available in the SAP Service Marketplace under /data-archiving -> Media Library -> Literature & Brochures.