Hannes Kuehnemund

IO-Schedulers on Linux

Scheduler on Linux

One might argue that schedulers are not needed in an operating system, as they introduce another level of complexity into planning and understanding how things work. This view is outdated: without an IO scheduler, the kernel would issue every write request in the order in which it receives it, which would thrash the disk or disk subsystem. Furthermore, if the kernel has to read data from a completely different part of the disk between two write requests, the disk head has to seek from one location to the other and back again to fulfill the read request in between the two writes. Optimizing disk requests is therefore the main purpose of IO schedulers.

Further and more technical information can be found in the Linux kernel source in the folder Documentation/block. Basically all text documents in this folder describe current or outdated schedulers.

Overview of currently available schedulers on Linux:

as:

  • setting antic_expire to 0 makes as behave like the deadline scheduler
  • based on the notion that a disk device has only one physical seeking head
  • implements 5 layers of scheduler policies

cfq:

  • currently cfq version 3 (cfqv3) is available, but for the test only cfqv2 was available
  • implements three generic scheduling classes
  • priority can be set according to the process nice level

deadline:

  • favors reads over writes
  • requests have an expire time
  • reorders requests to improve I/O

noop:

  • perfect scheduler for e.g. flash memory cards
  • requests are queued in FIFO order
  • only the last requests added may be merged into one request
The noop scheduler was not tested during the benchmark!

Setup of benchmark

The software setup which I used to test the three most common schedulers (as, cfq and deadline) is the following. The operating system is SLES10, which runs on a dual-socket 3.6 GHz Xeon EM64T with Hyperthreading enabled. Three 36GB SCSI disks are installed. The partitioning is as follows: sda1 (8GB) is the root filesystem, sda2 (4GB) is swap, and the remaining sda3 (24GB), sdb1 (36GB) and sdc1 (36GB) are put into one big LVM volume group. This volume group is used exclusively by one big logical volume mounted to /usr/sap.

On top of the SLES10 operating system a new SAP system is installed manually. This is done by installing the MaxDB database binaries, creating the SAP instance folders, extracting the SAP kernel binaries, copying the R3load files onto the logical volume, creating the instance profiles and, before the measurement starts, creating six data volumes of 15GB each and a log volume of 2GB on the logical volume. After all these steps, the next one is the import of all R3load export files with R3load.

What is measured

During the load phase five R3load processes load data into the database simultaneously. The import is started with the five biggest export files out of the 34 available. When an R3load process is done with its file, it proceeds to the next one until all files are loaded into the database. These 34 files contain more than 44000 tables, which are all imported into the database. The tables use approximately 73GB in the database. After the import of every table a MaxDB savepoint is triggered, which writes all data currently in the data cache to the data volumes. This ensures that there are several concurrent writes on the disk. After all files are processed, the initialization reports RSWBOINS and RADDBDIF are executed and the import is done.
The time from the start of the import until the end of the import phase, including the execution of the reports, is measured and compared in the end. One might argue that the duration of creating the data and log volumes should also be measured, but creating volumes by writing zeros sequentially onto the disk is probably not affected by schedulers that much. Furthermore, the LVM configuration introduces another level of complexity which may have an influence on the results; however, I like LVM, and using LVM for benchmarks is not forbidden! Needless to say, after each run the system was cleaned up and rebooted to have a fresh environment for the next benchmark.

Benchmark Results with DBLoad

I leave the following numbers uncommented. They may leave space for some discussion, therefore I draw no conclusion myself. The first column shows the scheduler used (cfq, as or deadline), the second column shows the seconds it took to load the database.

scheduler   load time (s)
cfq         25908
as          26645
deadline    28217
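
For readers who find raw seconds hard to compare, the following sketch only restates the three measured values as hours, minutes and seconds and as the relative difference to the fastest run. It adds no new measurements; the helper names as_hms and slowdown_percent are made up for this illustration.

```python
# Load times in seconds, copied from the table above.
load_times = {"cfq": 25908, "as": 26645, "deadline": 28217}


def as_hms(seconds):
    """Format a duration in seconds as hours, minutes and seconds."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def slowdown_percent(seconds, baseline):
    """Relative difference to the baseline run, in percent."""
    return (seconds - baseline) / baseline * 100


fastest = min(load_times.values())
for name, seconds in load_times.items():
    print(f"{name:<9}{as_hms(seconds)}  +{slowdown_percent(seconds, fastest):.1f}%")
```

This prints 7h 11m 48s for cfq, 7h 24m 05s (+2.8%) for as and 7h 50m 17s (+8.9%) for deadline.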

      Former Member
      Probably cfq is taking more time because it is reordering requests.

      What are the 5 layers of scheduling policy you mentioned in AS?

      Can you tell the efficiencies of these scheduling algorithms in O notation? We will get an idea of how these algos work.

      Btw, how much RAM did that system have? Is it 8 GB? That's what I can figure from the swap you mentioned.


      Hannes Kuehnemund
      Blog Post Author
      Hi Puru,

      thanks for your questions, I hope my answers are the ones you are looking for.

      I'm not sure if I understood you right, but cfq is taking less time (and not more time). The cfq scheduler more or less implements the nice/renice process features for IO. Starting with cfq3 you can use ionice to set the IO priority of a process. Basically all schedulers reorder requests (except noop). The algorithm or rule set behind them is what matters.

      The AS scheduling policies can be taken from the Linux kernel documentation: Documentation/block/as-iosched.txt (as I don't like to copy and paste the whole text file in here)

      Actually, there is a nice document available from Hao-Ran Liu focusing on these schedulers. Just type his name into any search engine. On his homepage there are several PDFs linked.

      The machine had 4GB of RAM and 4GB of SWAP installed.

      Former Member
      Hi Hannes,
      That was a typo! I actually meant deadline and not cfq. But anyways you answered that thing. Thanks.

      I have another question. I do not know whether it fits here or not, but I will ask it anyway 🙂
      "Starting with cfq3 you can use ionice to set the IO priority of a process" ... This means that every I/O process will have a nice value. But how does the kernel know whether the process needs I/O or not? Consider this code

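      #include <stdio.h>
      #include <string.h>
      int main(int argc, char **argv)
      {
          if (argc == 1)                   /* started without arguments: read from stdin */
              scanf("%s", argv[1]);        /* argv[1] is NULL in this case */
          strcpy(argv[1], "hello World");  /* writes into the argument vector */
          printf("%s", argv[1]);
          return 0;
      }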

      Note: This will dump core when built with gcc 🙂 but will run with the Turbo C compiler.

      Now how does the compiler decide whether this process should have an ionice value or not, since this process can run in various ways? Please note that I am considering only the Input part of I/O.

      I do not have a GNU/Linux system installed on my PC. Since you are an SAP employee, you can email me that file. I would be grateful.

      I have generally noticed, and actually implemented, the rule that swap should always be half of physical RAM. Why have you decided to keep the same value for both of them? Any performance issues?

      Hannes Kuehnemund
      Blog Post Author
      Hi Puru,

      with cfqv3 the scheduling policy stays almost the same as in cfqv2, but now you have the possibility to redefine the IO priority. Thus compiling a program on a filesystem with the cfqv3 scheduler will take the same time as on cfqv2 filesystems.

      Being able to renice IO rather than CPU can be useful on a database server, where you give the database processes a higher IO priority than other processes. Whether you use ionice is up to you. It is used like renice for CPU time, which is also invoked manually.
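
As an illustration of the database server case, here is a minimal sketch of how such a call could be scripted. It assumes the ionice command from util-linux is installed and that the caller has sufficient privileges. The function names and the PID 4711 are made up for this example, and the class numbers are the ones the ionice tool uses (1 realtime, 2 best-effort, 3 idle).

```python
import subprocess

# I/O scheduling classes as numbered by the ionice tool from util-linux.
IO_CLASSES = {"realtime": 1, "best-effort": 2, "idle": 3}


def build_ionice_command(pid, io_class="best-effort", level=0):
    """Return the argument list that sets the I/O class and level of a running process."""
    if io_class not in IO_CLASSES:
        raise ValueError(f"unknown I/O class: {io_class}")
    command = ["ionice", "-c", str(IO_CLASSES[io_class])]
    if io_class != "idle":  # the idle class takes no priority level
        if not 0 <= level <= 7:
            raise ValueError("level must be between 0 (highest) and 7 (lowest)")
        command += ["-n", str(level)]
    return command + ["-p", str(pid)]


def apply_io_priority(pid, io_class="best-effort", level=0):
    """Run ionice for the given process (not called in this example)."""
    subprocess.run(build_ionice_command(pid, io_class, level), check=True)


# 4711 is a made-up PID standing in for a database process.
print(" ".join(build_ionice_command(4711, "best-effort", 0)))
# prints "ionice -c 2 -n 0 -p 4711"
```

Building the argument list separately from running it keeps the example checkable on a machine without ionice.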

      The SWAP/RAM ratio for an SAP system should be at least 50%. With 4GB of SWAP in the test machine we can still add another 4GB of RAM without repartitioning the disks, which would take very long.