Increased rate of hanging Linux OS startups on some Microsoft Azure VM types
Please be informed that we have detected an increased rate of hanging Linux OS startups when you use the following Microsoft Azure VM types: E32(S), E64(S), D14(S), D15(S), H16m. If you have not encountered this problem yet, it is very likely to occur after the next suspend and activate of the instance.
The symptoms are that upon VM startup the CPU goes to 100% and you cannot log in via SSH. If you want to dig deeper into the problem, you can enable the advanced diagnostics in the Azure portal and look in the serial console for a repeating message like “[ 32.152002] BUG: soft lockup - CPU#0 stuck for 22s! [migration/0:8]”.
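If you have saved the serial console output to a file, a quick search can confirm the symptom. The following is a minimal sketch, not an official tool: the function name count_soft_lockups and the idea of a locally saved log file are assumptions made for illustration.

```shell
#!/bin/sh
# count_soft_lockups FILE
# Print how many lines of FILE contain the kernel's soft lockup message.
# FILE is assumed to be a copy of the serial console log that you saved yourself.
count_soft_lockups() {
    # grep -c prints the number of matching lines. It returns a non-zero
    # exit code when nothing matches, so "|| true" keeps the function
    # usable in scripts that run with "set -e".
    grep -c 'BUG: soft lockup' "$1" || true
}

# Example call (the file name is hypothetical):
#   count_soft_lockups serial-console.log
```

A count greater than zero, especially with the same CPU number repeating, matches the behavior described above.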
The recommended solution is the following three-step procedure (A, B and C):
A: Change these VM sizes to E16S, M32ls or M32ts to be able to start the OS. Proceed as follows:
- In SAP Cloud Appliance Library, choose Instances to display the list of available solution instances.
- Choose Edit for the solution instance you want to change.
- On the Virtual Machines tab page, edit the size of the virtual machines.
- Save your entries.
Note that the new VM size may reduce the performance of your solution instance (E16S compared to E32S) or may incur slightly higher costs (M32ls and M32ts compared to E32S). However, if you find the performance and the costs acceptable, you can stop here. If you decide to proceed, please keep in mind that you can rely on community support only.
B: Follow the steps described in “Option I” or “Option II” depending on your preference:
Option I: Add the parameter “numa=off” to the kernel command-line. Proceed as follows:
- Log on to the Linux OS via SSH as the root user. For more information, see the “How to connect to a running instance via the secure shell protocol (SSH)?” question in this wiki page.
- In the file /etc/default/grub, append the parameter “numa=off” to the value of the property GRUB_CMDLINE_LINUX_DEFAULT. If the property doesn’t exist, add it to the file. The resulting line looks like this:
GRUB_CMDLINE_LINUX_DEFAULT="xxxxxx xxxxxx=xxxxxx numa=off"
- Save the file.
- Execute the following command and wait for the execution to finish: grub2-mkconfig -o /boot/grub2/grub.cfg
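The file edit in Option I can also be scripted. The sketch below is one possible way to do it, assuming a POSIX shell with grep and sed. The function name add_numa_off and the .bak backup file are assumptions made for illustration, and the function takes the file as an argument so that you can try it on a copy first.

```shell
#!/bin/sh
# add_numa_off FILE
# Append numa=off to GRUB_CMDLINE_LINUX_DEFAULT in FILE, or add the
# property if it is missing. Running it twice does not add it twice.
add_numa_off() {
    file=$1
    # Nothing to do if the parameter is already part of the property.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*numa=off' "$file"; then
        return 0
    fi
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT="' "$file"; then
        # Keep a backup, then insert " numa=off" before the closing quote.
        cp "$file" "$file.bak"
        sed 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 numa=off"/' \
            "$file.bak" > "$file"
    else
        # The property does not exist yet: add it at the end of the file.
        printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="numa=off"' >> "$file"
    fi
}

# Example call on the VM (run as root):
#   add_numa_off /etc/default/grub
```

After the call, the grub2-mkconfig command from the last step still has to be executed for the change to take effect.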
Option II: Update the Linux kernel for each affected VM. Proceed as follows:
- Log on to the Linux OS via SSH with the root user.
- Check the kernel version in the command line with the following command: uname -r
For example, the system will display the following output: 3.12.74-60.64.69-default. We have tested that an upgrade to kernel version 4.4.138-94.39-default solves the problem.
- Add the update repository (alias SLES12-SP3-Updates) with the following command: zypper ar http://smt-azure.susecloud.net/repo/SUSE/Updates/SLE-SERVER/12-SP3/x86_64/update?credentials=SMT-http_smt-azure_susecloud_net SLES12-SP3-Updates
- Refresh the zypper repositories with the command: zypper refresh
- In the output, check that the newly added repository SLES12-SP3-Updates is up to date.
- Start the kernel update via the command: zypper update kernel-default
- When prompted, type “y” and press Enter. Wait for the kernel update to finish.
- Reboot the VM via the command: reboot
- After the operation is successfully finished, check the kernel version again with the command: uname -r
The output should now contain 4.4.138-94.39-default (or a newer version, depending on when you update the kernel).
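To check the result without comparing version strings by eye, you can use a small helper like the one below. This is a sketch that assumes a sort command with the -V (version sort) option, as provided by GNU coreutils. The function name kernel_at_least is an assumption made for illustration.

```shell
#!/bin/sh
# kernel_at_least CURRENT MINIMUM
# Succeed (exit code 0) if CURRENT is the same as or newer than MINIMUM.
kernel_at_least() {
    # sort -V orders version strings numerically per component, so the
    # first line of the sorted pair is the older of the two versions.
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example call on the VM:
#   kernel_at_least "$(uname -r)" 4.4.138-94.39-default && echo "kernel is new enough"
```

With the example output from before the update, 3.12.74-60.64.69-default, the check fails. With 4.4.138-94.39-default or a newer version, it succeeds.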
C: Change the VM sizes that you changed in step A back to their original sizes. For more information on how to edit the size, see step A in this blog.
Note that we recommend that you do not interrupt the startup of the SAP systems with the reboot or the resize operation. The startup of the SAP systems usually takes up to 10 minutes after the OS boot.