Monday, October 10, 2016

Storage performance troubleshooting with ESXTOP [Guide]

As you know, ESXTOP is a utility bundled with ESXi that lets you monitor and troubleshoot the performance of network, CPU or storage. This post is about storage performance troubleshooting with ESXTOP.

We will focus on storage because storage is very often the main source of latency, the weakest-performing element in the whole chain. VMs perform slowly, but where does the latency come from? Is it at the VM level, the LUN level or the disk adapter (HBA) level?

While ESXTOP is a command-line utility, note that there is a nice free GUI tool from VMware called Visual ESXTOP. It integrates into the vSphere Web Client. Note that I have not personally tested the tool with the latest vSphere 6.0 U1 release.

What to monitor/troubleshoot?
Per LUN
Per VM
Per disk adapter (HBA mode)
Let’s monitor a LUN with ESXTOP
1. Start ESXTOP and press u (lowercase, as ESXTOP keys are case-sensitive) to switch to the disk device view (LUN mode).




2. Press f (Field Order) to choose the fields you want to display. Then hit Enter to confirm.


3. Press s and then 3 (or a smaller/bigger value) to set the auto-update interval to 3 seconds. To view the whole device name (the complete naa identifier) you have to widen the column: press Shift + L and enter "32" (or another larger number).


Let’s try to monitor Disk View (hba mode).
1. Start ESXTOP and press d (lowercase) to switch to the disk adapter view (HBA mode). To view the whole device name (the complete naa identifier) you have to widen the column: press Shift + L and enter "32" (or another larger number).


2. From here you can hit f (Field Order) to choose the fields you want to display. (A small star is displayed next to each visible field.) When you are done, hit Enter.


3. Press s and then 3 to set the auto-update interval to 3 seconds (you can enter a smaller/bigger value as you wish).


Monitor VM performance (Per VM)
1. Start ESXTOP and press v (lowercase) to switch to the disk VM view.


2. Again, press f (Field Order) to choose the fields you want to display. Then hit Enter to confirm.

3. And again, press s and then 3 (or a smaller/bigger value) to set the auto-update interval to 3 seconds. To view the whole device name (the complete naa identifier) you have to widen the column: press Shift + L and enter "32" (or another larger number).
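The interactive views above are fine for a quick look, but ESXTOP can also run in batch mode (for example esxtop -b -d 3 -n 10 > capture.csv) and write its counters to a CSV file. Here is a minimal Python sketch for pulling the latency columns out of such a capture. It assumes the perfmon-style column headers (\\host\Group(instance)\Counter) and the counter names listed in COUNTERS; both may differ between ESXi versions, so check them against your own file. The function name latency_by_device is just a name I made up for the example.

```python
import csv
import io
import re

# Assumed perfmon-style header layout: \\host\Group(instance)\Counter
HEADER_RE = re.compile(
    r"^\\\\(?P<host>[^\\]+)\\(?P<group>[^\\(]+)"
    r"(?:\((?P<instance>.*)\))?\\(?P<counter>[^\\]+)$"
)

# Assumed batch-mode counter names mapped to the interactive column names.
COUNTERS = {
    "Average Driver MilliSec/Command": "DAVG",
    "Average Kernel MilliSec/Command": "KAVG",
    "Average Guest MilliSec/Command": "GAVG",
}


def latency_by_device(csv_text):
    """Return {instance: {"DAVG": [...], "KAVG": [...], "GAVG": [...]}}."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)

    # Remember which column indexes hold a latency counter we care about.
    wanted = {}
    for idx, name in enumerate(header):
        match = HEADER_RE.match(name)
        if match and match.group("counter") in COUNTERS:
            wanted[idx] = (match.group("instance"), COUNTERS[match.group("counter")])

    # Collect one value per sample row for every selected column.
    result = {}
    for row in reader:
        for idx, (instance, key) in wanted.items():
            result.setdefault(instance, {}).setdefault(key, []).append(float(row[idx]))
    return result
```

Read the capture with open("capture.csv").read() and pass the text to latency_by_device; the timestamp column and any unrelated counters are simply skipped.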


What do the different columns represent?
Now let's go through the ESXTOP output and identify the different columns.

CMDS/s –  Total number of commands per second. This includes IOPS (Input/Output Operations Per Second) as well as other SCSI commands such as SCSI reservations, locks, vendor string requests and unit attention commands. All of these flow to or come from the device or virtual machine being monitored.

DAVG/cmd  –  Average response time in milliseconds per command sent to the device.

KAVG/cmd –  Average time in milliseconds a command spends in the VMkernel.

GAVG/cmd  – Average response time in milliseconds per command as seen at the guest operating system level. It is the sum of the two values above: DAVG + KAVG = GAVG
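To make the formula concrete, here is a small Python sketch that adds DAVG and KAVG to get GAVG and points at the layer that looks slow. The two thresholds (25 ms for DAVG, 2 ms for KAVG) are commonly quoted rules of thumb and are my assumption, not something ESXTOP reports; the names LatencySample and locate_latency are made up for the example.

```python
from dataclasses import dataclass
from typing import List

# Assumed rule-of-thumb thresholds in milliseconds (tune for your storage).
DAVG_WARN_MS = 25.0
KAVG_WARN_MS = 2.0


@dataclass
class LatencySample:
    davg_ms: float  # DAVG/cmd: latency at the device
    kavg_ms: float  # KAVG/cmd: latency added in the VMkernel

    @property
    def gavg_ms(self) -> float:
        # GAVG = DAVG + KAVG, as seen by the guest
        return self.davg_ms + self.kavg_ms


def locate_latency(sample: LatencySample) -> List[str]:
    """Return the layers whose latency exceeds the assumed thresholds."""
    suspects = []
    if sample.davg_ms > DAVG_WARN_MS:
        suspects.append("device")
    if sample.kavg_ms > KAVG_WARN_MS:
        suspects.append("vmkernel")
    return suspects
```

For example, a sample with DAVG 30 ms and KAVG 0.5 ms gives a GAVG of 30.5 ms and points at the device (array, fabric or HBA) rather than the hypervisor.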
CPU

When troubleshooting CPU performance for your virtual machines the following counters are the most important.

%USED, %RDY, %CSTP

%USED tells you how much time the virtual machine spent executing CPU cycles on the physical CPU.

%RDY is a key performance indicator, so always start with this one. It shows how much time your virtual machine wanted to execute CPU cycles but could not get access to a physical CPU; in other words, how much time it spent waiting in a "queue". I normally expect this value to be below 5% (this equals 1000 ms in the vCenter real-time performance graphs, which sample over 20-second intervals).
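The relation between the ESXTOP percentage and the vCenter millisecond value is plain arithmetic, sketched here in Python. It assumes the 20-second sampling interval of the real-time charts and a per-vCPU value; other chart intervals need a different interval_s. The function names are made up for the example.

```python
def ready_ms_to_percent(ready_ms: float, interval_s: float = 20.0) -> float:
    """Convert a vCenter CPU ready summation (ms) to a percentage.

    interval_s is the chart sampling interval; 20 s is assumed here,
    which matches the vCenter real-time charts.
    """
    return ready_ms * 100.0 / (interval_s * 1000.0)


def percent_to_ready_ms(ready_pct: float, interval_s: float = 20.0) -> float:
    """Convert a %RDY percentage back to milliseconds per interval."""
    return ready_pct * interval_s * 1000.0 / 100.0
```

With the default interval, 1000 ms of ready time corresponds to 5%, and 5% corresponds to 1000 ms.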

%CSTP (co-stop) tells you how much time a vCPU of a virtual machine with multiple vCPUs spent stopped, waiting for its other vCPUs to catch up. If this number is higher than 3% you should consider lowering the number of vCPUs in your virtual machine.
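The two rules of thumb above (%RDY below 5%, %CSTP not higher than 3%) can be written down as a quick check. This is only a sketch in Python: the limits are the ones from this post, the values are assumed to be per vCPU, and cpu_findings is a name invented for the example.

```python
from typing import List

RDY_LIMIT_PCT = 5.0   # %RDY should stay below this value
CSTP_LIMIT_PCT = 3.0  # %CSTP higher than this suggests too many vCPUs


def cpu_findings(rdy_pct: float, cstp_pct: float) -> List[str]:
    """Return human-readable findings for one VM's CPU counters."""
    findings = []
    if rdy_pct >= RDY_LIMIT_PCT:
        findings.append("high ready time: VM is queuing for a physical CPU")
    if cstp_pct > CSTP_LIMIT_PCT:
        findings.append("high co-stop: consider fewer vCPUs")
    return findings
```

An empty list means neither counter crossed its limit; it does not prove the VM is healthy, only that these two checks passed.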
Memory

When troubleshooting memory performance, these are the counters you want to focus on from a virtual machine perspective.

MCTL?, MCTLSZ, SWCUR, SWR/s, SWW/s

MCTL? This column is either Y (yes) or N (no). Y means that the balloon driver is installed. The balloon driver is installed automatically with VMware Tools and should be present in every virtual machine. If this column says N, figure out why.

MCTLSZ This column shows how far the balloon is inflated inside the virtual machine. A value of 500 MB means that the balloon driver inside the guest operating system has "stolen" 500 MB from Windows/Linux etc. You would expect to see a value of 0 (zero) in this column.

SWCUR tells you how much of the virtual machine's memory currently sits in the .vswp file. A value of 500 MB means that 500 MB is in the swap file. This does not necessarily mean bad performance; to figure out whether your virtual machine is suffering from hypervisor swapping you need to look at the next two counters. In a healthy environment you want this value to be 0 (zero).

SWR/s This value tells you the read activity against your swap file. If you see a non-zero number here, your virtual machine is suffering from hypervisor swapping.

SWW/s This value tells you the write activity against your swap file. You want to see 0 (zero) here; any number above 0 is bad.
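Put together, the memory counters read like a small decision list. Here is a hedged Python sketch of it; the rules are the ones described above, while the function name memory_findings and the wording of the findings are my own.

```python
from typing import List


def memory_findings(
    mctl_installed: bool,  # MCTL? column (Y -> True, N -> False)
    mctlsz_mb: float,      # MCTLSZ: current balloon size
    swcur_mb: float,       # SWCUR: memory currently in the .vswp file
    swr_per_s: float,      # SWR/s: swap read rate
    sww_per_s: float,      # SWW/s: swap write rate
) -> List[str]:
    """Return findings for one VM based on its memory counters."""
    findings = []
    if not mctl_installed:
        findings.append("balloon driver missing: check VMware Tools")
    if mctlsz_mb > 0:
        findings.append("balloon inflated: host is reclaiming guest memory")
    if swcur_mb > 0:
        findings.append("memory in swap file: not necessarily a problem yet")
    if swr_per_s > 0 or sww_per_s > 0:
        findings.append("active hypervisor swapping: performance is suffering")
    return findings
```

A healthy VM, with the balloon driver installed and all four values at zero, returns an empty list.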


