Preamble
VMware vSphere is a virtualization platform that lets users create and operate virtual machines (VMs) on physical servers using those servers’ underlying resources. Organizations can use vSphere to reduce expenses, centralize infrastructure management, and create fault-tolerant virtual environments. Instead of dedicating a single physical server to each application, virtualization divides a server’s resources among multiple VMs, letting you run multiple isolated operating systems with different workloads on a single machine. The result is more efficient use of physical resources and lower hardware and maintenance costs.
One of VMware’s VM cluster management tools is the Distributed Resource Scheduler (DRS), which employs vMotion to automatically distribute shared physical resources to VMs based on their demands. When a server is expected to be unavailable for some time (e.g., for maintenance) or is overwhelmed, administrators can use vMotion to relocate a VM to another server with no downtime. Together, DRS and vMotion make your virtual environment more resilient and fault-tolerant.
If your company uses vSphere to run applications, you need to monitor your environment’s overall performance and capacity at many tiers, including the VMs that run workloads and the underlying hosts. This ensures that your vSphere infrastructure’s available resources are sufficient to fulfill the demands of the apps and services running on it.
Performance and capacity management are inextricably linked: if you lack the requisite resource capacity, your apps and workloads may encounter bottlenecks, leading to poor performance or even downtime. Monitoring can help vSphere administrators rightsize virtual machines so that resources are apportioned optimally. If you’re a developer, keeping an eye on vSphere helps ensure that your VM-based apps behave as expected.
This post will go over some essential metrics that can help you understand your vSphere infrastructure’s health, performance, and capacity. It includes metrics from your vSphere infrastructure’s physical and virtual components, which are separated into the following categories:
- Summary metrics
- CPU metrics
- Memory metrics
- Disk metrics
- Network metrics
Let’s look at how vSphere works before we get into these metrics. You can also skip straight to the metrics.
How vSphere works
vSphere is a virtualization platform made up of several different components. There are two key components to be mindful of when monitoring:
- ESXi hypervisors
- The vCenter Server
ESXi is a bare-metal hypervisor that runs on each physical server and allows vSphere to run virtual machines on it. ESXi hypervisors run the VMkernel operating system on their underlying bare-metal hosts. The VMkernel is responsible for abstracting resources from the servers on which ESXi hypervisors are installed and supplying them to virtual machines.
ESXi hosts are servers that run the ESXi hypervisor. By default, ESXi hosts assign physical resources to each running VM based on various factors, including the resources available (on a single host or across several hosts), the number of VMs presently running, and the VMs’ resource consumption. There are three options you can adjust to manage and optimize how the ESXi host allocates resources if resources are overcommitted (i.e., total resource allocation exceeds capacity):
- You can use shares to prioritize particular VMs by establishing their proportional claim to resources. For example, a VM with half as many memory shares as another is entitled to only half as much memory.
- Reservations specify a guaranteed minimum quantity of resources that the ESXi host will allocate to a virtual machine.
- Limits specify the maximum amount of a resource the ESXi host will allocate to each VM.
By default, VMs receive shares in proportion to their allotted resources. In other words, a VM with twice the vCPUs of another VM will have twice as many CPU shares. But even if a VM has many more shares, the host will never allocate it more resources than its configured limit, and any defined reservations are always honored.
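To make the interplay of shares, reservations, and limits concrete, here is a minimal sketch of proportional, share-based allocation. This is a toy model for illustration only, not the actual ESXi scheduler: it grants reservations first, then splits the remaining capacity in proportion to shares, capped at each VM’s limit.

```python
def allocate(capacity_mhz, vms):
    """Toy share-based CPU allocation (illustrative, not the ESXi scheduler).

    Each VM dict has 'name', 'shares', 'reservation', and 'limit' (MHz).
    Reservations are granted first; leftover capacity is distributed in
    proportion to shares, never exceeding a VM's limit.
    """
    alloc = {vm['name']: vm['reservation'] for vm in vms}
    remaining = capacity_mhz - sum(alloc.values())
    active = [vm for vm in vms if alloc[vm['name']] < vm['limit']]
    while remaining > 1e-9 and active:
        total_shares = sum(vm['shares'] for vm in active)
        if total_shares == 0:
            break
        distributed = 0.0
        for vm in active:
            # Proportional grant, capped by how far the VM is from its limit.
            grant = min(remaining * vm['shares'] / total_shares,
                        vm['limit'] - alloc[vm['name']])
            alloc[vm['name']] += grant
            distributed += grant
        remaining -= distributed
        active = [vm for vm in active if alloc[vm['name']] < vm['limit'] - 1e-9]
        if distributed < 1e-9:
            break
    return alloc
```

For example, on a 3000 MHz host, a VM with 2000 shares ends up with twice the CPU of a VM with 1000 shares (2000 MHz vs. 1000 MHz) when neither has a reservation or a binding limit.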
You can also use resource pools to divide the physical resources of one or more ESXi hosts into logical units. Resource pools are arranged hierarchically: a parent pool can contain one or more virtual machines, or it can be divided into child pools that share the parent pool’s resources, and each child pool can be further partitioned. Resource pools give vSphere resource management more flexibility, allowing you to separate resource use throughout your enterprise (e.g., different departments and administrators can be assigned their own resource pools).
The vCenter Server
The vCenter Server is the centralized management component of vSphere for ESXi hosts. It allows vSphere administrators to keep track of the health and condition of all their associated ESXi hosts, and it offers a centralized tool for creating, configuring, and monitoring virtual machines on the hosts it controls. You can create clusters of ESXi hosts managed by a single vCenter Server; virtual machines on hosts in the same cluster share resources such as CPU, memory, storage, and network bandwidth.
The vCenter Server can be installed in two ways. You can install and run it on a physical or virtual server running Microsoft Windows Server 2008 SP2 or later, or you can use a vCenter Server Appliance (vCSA), a Linux virtual machine specialized to host vCenter Server. A data center cluster is depicted in the diagram below. The cluster consists of three ESXi hosts, each hosting two virtual machines (including the vCenter Server), each running its own applications and operating system.
Key vSphere metrics to monitor
Now that we’re familiar with several of vSphere’s main components and its overall design, let’s look at the essential metrics you’ll want to pay special attention to when monitoring your vSphere environment at the VM, host, and cluster levels. While vSphere generates hundreds of metrics, we’ve selected a few critical ones to pay attention to. As previously stated, these metrics can be divided into five categories:
- Summary metrics that provide high-level information on the size and health of the infrastructure
- CPU metrics that track utilization, availability, and readiness
- Memory metrics that track swapping, ballooning, and overhead
- Disk metrics that provide visibility into disk health and performance
- Network metrics that track network activity and throughput
We’ll also look at vSphere events, which give information about cluster activity and the health and state of the virtual environment’s components.
It’s critical to watch the status and performance of each tier of your environment when monitoring vSphere, from VMs to the ESXi hosts that execute them to the clusters that make up your infrastructure. It’s important to remember which parts of your system provide resources and which consume them. Virtual machines, for example, consume resources from physical providers such as ESXi hosts and clusters. Because resource pools can act as both parent and child simultaneously (the child of one resource pool can be the parent of another), they can both provide and consume resources. When you’re monitoring vSphere, you’ll want to make sure that resources are readily available, that they’re being used efficiently, and that specific areas of your infrastructure aren’t consuming too much of them at the expense of the rest of your environment.
This article references our Monitoring 101 series, which provides a framework for metric collection and alerting.
Summary metrics
Summary metrics from vSphere give you a high-level picture of the size and health of your infrastructure, including the number of clusters in your environment and the number of hosts and virtual machines that are currently active. You can make better allocation decisions if you keep track of the size of your vSphere environment. You’ll know how many hosts and VMs will require resources.
| Metric | Description | Metric type |
|---|---|---|
| Host count | Total number of hosts in your environment | Other |
| VM count | Total number of virtual machines in your environment | Other |
Host count and VM count are metrics to keep an eye on.
Total counts of ESXi hosts and virtual machines can give you a good idea of how healthy your vSphere environment is. It’s worth investigating if the reported number of hosts or VMs differs dramatically from what you expect. If the number of virtual machines drops unexpectedly, it could indicate misconfiguration or resource contention on the hosts. You can troubleshoot the cause of missing VMs by looking into your vSphere logs.
CPU metrics
There are two types of CPU metrics to examine in vSphere: physical CPU (pCPU) and virtual CPU (vCPU). pCPU refers to the number of processors available on physical hosts, while vCPU refers to the number of logical processors on a host that are assigned to a virtual machine.
While a VM sees its vCPUs as physical processing capacity, any workloads it runs actually execute on its host’s pCPUs. ESXi hosts schedule VM workloads across all available pCPUs by default, which means that all VMs on the host share processor time. There is no contention if the total number of assigned vCPUs across all VMs is equal to or less than the total number of pCPUs available. However, because it is unusual for all VMs to demand 100 percent of their vCPUs simultaneously, it’s typical for the number of allocated vCPUs across all VMs to exceed the number of available pCPUs (i.e., to be overcommitted) for more efficient resource usage. As that ratio rises, though, more VMs must wait for one another to finish running before they can access the physical CPU. The longer VMs wait for CPU access, the longer tasks take to complete and the lower the overall VM performance.
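The vCPU-to-pCPU ratio described above is easy to compute and track. A minimal sketch (the alerting threshold here is illustrative, not VMware guidance):

```python
def vcpu_overcommit_ratio(total_vcpus, total_pcpus):
    """Ratio of vCPUs allocated across all VMs to the host's pCPUs.

    A ratio of exactly 1.0 or below means no overcommitment; higher
    values mean VMs may wait for each other to access physical CPU.
    """
    return total_vcpus / total_pcpus

# e.g., 12 vCPUs allocated across all VMs on a host with 8 pCPUs
ratio = vcpu_overcommit_ratio(12, 8)  # 1.5: CPU is overcommitted
```

Some overcommitment is normal and efficient; the point of tracking the ratio over time is to notice when it climbs high enough that CPU readiness (covered below) starts to rise.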
To monitor your vSphere installation successfully, it’s critical to collect CPU metrics at both the physical and virtual levels. Knowing how much CPU is available on your hosts and across clusters, and how much your virtual machines are using, will help you determine whether your virtual environment is functioning well, whether you need to scale it up or down, and whether to change CPU allocation to specific VMs by setting shares, reservations, and limits.
| Metric | Description | Metric type |
|---|---|---|
| `cpu.readiness.avg` | Average percentage of time a VM is spending in a ready state, waiting to access pCPU | Resource: Saturation |
| `cpu.wait` | Total amount of time (ms) a VM is spending in a wait state (i.e., VM has access to CPU but is waiting on additional VMkernel operations) | Work: Performance |
| `cpu.usage.avg` | Percentage of an ESXi host’s pCPU capacity being used by the VMs running on it | Resource: Utilization |
| `cpu.TotalCapacity.avg` | Total pCPU capacity (MHz) of an ESXi host available to VMs | Resource: Availability |
CPU readiness is a metric to keep an eye on.
The CPU readiness metric monitors the time a virtual machine is ready to run a workload but must wait for the ESXi host to schedule it due to a lack of available physical CPU. Monitoring CPU readiness time can help you determine whether your virtual machines are performing efficiently or wasting too much time waiting, unable to complete their tasks.
While some CPU readiness time is expected, VMware recommends that you issue an alert if this measure exceeds 5%. VMs that spend a large portion of their time in the ready state may be unable to execute tasks, resulting in poor application performance, timeout problems, and possible downtime.
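A simple check against the 5 percent threshold mentioned above might look like the following sketch (the function and VM names are hypothetical; in practice this logic would live in your monitoring tool’s alert configuration):

```python
READINESS_ALERT_PCT = 5.0  # VMware's suggested alerting threshold

def check_cpu_readiness(vm_name, readiness_pct):
    """Return an alert message if a VM's cpu.readiness.avg exceeds
    the threshold, or None if the VM is healthy."""
    if readiness_pct > READINESS_ALERT_PCT:
        return (f"{vm_name}: cpu.readiness.avg at {readiness_pct:.1f}% "
                f"exceeds {READINESS_ALERT_PCT}% threshold")
    return None
```

For example, a VM reporting 7.2 percent readiness would trigger the alert, while one at 3 percent would not.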
Too many VMs competing for CPU on the same ESXi host is the main cause of long readiness times, but other factors can play a role. For example, you might be using CPU affinity, which assigns VMs to a specific subset of the ESXi host’s CPUs. In some circumstances, CPU affinity can aid application performance by ensuring that virtual machines performing specific workloads run on particular physical CPUs. However, setting CPU affinity on many VMs may result in high CPU readiness as those VMs compete for the same processors. To avoid this, limit the number of VMs with CPU affinity enabled.
High CPU readiness can also result from overly strict CPU limits on VMs. Though limits can help avoid over-allocating CPU, if they are too low to accommodate spikes in consumption, the ESXi host may be unable to schedule a VM’s workload when it requests more CPU than the limit allows.
CPU wait time is a metric to keep an eye on.
The CPU wait metric tells you how much time a VM scheduled by the ESXi host spends idle or waiting for VMkernel activities to complete before executing (in contrast to CPU readiness, which measures the time a VM waits for available CPU). I/O operations and memory swapping are two VMkernel activities that can increase CPU wait time.
A long CPU wait isn’t always indicative of a problem. Because the CPU wait metric combines a VM’s idle time (cpu.idle) with time spent waiting for the VMkernel to perform individual tasks, a high value may simply indicate that the VM has completed its tasks and is idle. To see whether excessive wait times are caused by I/O activities or memory swapping, compare the reported CPU wait time with CPU idle time.
If you’ve determined that excessive CPU wait times are due to VMKernel activity, this could result in poor VM performance, and you should look into the memory and disk metrics we’ll look at later in this post to figure out what’s causing it.
CPU usage is a metric to keep an eye on.
CPU usage is an important metric to track at different levels since it can indicate your vSphere environment’s performance. The host-level cpu.usage.avg metric lets administrators see how much of an ESXi host’s physical CPU is being used by its virtual machines. If VMs consume a significant percentage of a host’s CPU (e.g., more than 90 percent), the host’s CPU readiness may grow, resulting in latency issues as VMs compete for resources.
It’s also helpful to keep a close eye on CPU usage at the virtual machine level. Depending on their workloads, it may be normal for VMs on specific hosts to use CPU near capacity (e.g., during scheduled high-load jobs), so it is vital to monitor this metric to establish a baseline and then look for abnormal behavior.
Monitoring CPU usage on both your hosts and individual VMs can help you spot problems like underutilized hosts or poorly placed VMs. If your VMs are continuously using a lot of CPU, for example, you’ll need to either scale up your ESXi hosts, alter the CPU allocation parameters of your VMs, or, if the VMs are running on a standalone server, join it to a cluster to get more CPU.
CPU total capacity is a metric to keep an eye on.
Because all virtual machines on the same host or cluster share the same pCPU, total capacity can be a shared resource constraint and a good place to look for performance issues. Total capacity in vSphere refers to the total amount of pCPU, measured in megahertz, available to schedule to VMs. It is determined by the physical capacity (number of processors and cores) and the specifications of your ESXi hosts (for example, whether they support hyperthreading).
You can view this metric at the per-host or per-cluster level for a more comprehensive picture of available CPU resources. If you’ve been alerted to high CPU readiness, it could be a sign that your host’s total available capacity is low and VMs have been forced to wait longer for CPU access. You can fix this, for example, by joining your ESXi host to a cluster with more CPU capacity.
Memory metrics
Memory, like CPU, can be a significant resource bottleneck in virtual environments when multiple VMs must share a limited underlying capacity. In vSphere, there are three tiers of memory to be mindful of:
- Host physical memory (the memory available to ESXi hypervisors from the underlying hosts).
- Guest physical memory (the memory available to operating systems running on VMs).
- Guest virtual memory (the memory available at the application level of a VM).
Each virtual machine (VM) has a set amount of physical RAM that the guest operating system can use. This configured size is not the same as the amount of RAM the host will allot to it, which is determined by the VM’s requirements and any configured shares, limitations, or reservations. For instance, despite a VM’s specified capacity of 2 GB, the ESXi host may only need to allocate 1 GB due to its actual workload (i.e., any running applications or processes). It’s worth noting that if a VM’s memory limit isn’t defined, its configured size becomes the default limit.
When a virtual machine starts, the underlying host’s ESXi hypervisor constructs a set of memory addresses that match the memory addresses supplied to the virtual machine’s guest operating system. When an application running on a VM tries to read from or write to a memory page, the VM’s guest OS acts as it would on a non-virtualized system, translating between guest virtual memory and guest physical memory. The guest OS, however, does not have access to the host’s physical memory and hence cannot allocate it. Instead, the ESXi hypervisor on the host intercepts memory requests and maps them to the host’s physical memory. ESXi also keeps track of both memory translations in shadow page tables: guest virtual to guest physical, and guest physical to host physical. This maintains memory consistency across all tiers.
Because of this method of memory virtualization, each VM sees only its own memory usage, while the ESXi host allocates and manages memory for all running VMs. However, the ESXi host has no way of knowing when a VM frees up or deallocates guest physical memory, and a VM has no idea when the ESXi host needs to allocate memory to other virtual machines. This is not a problem as long as the total physical memory used by all running VMs (plus any necessary overhead memory) fits within the host’s physical memory. When memory is overcommitted, however, ESXi hosts will use memory reclamation techniques like ballooning and swapping to recover free memory from VMs and allot it to others. Overcommitment and memory reclamation can help optimize memory utilization, but it’s also crucial to keep an eye on metrics that track ballooning and swapping, as either can cause VM performance to suffer. We’ll go over these processes further in the relevant metric sections below.
| Metric | Description | Metric type |
|---|---|---|
| `mem.vmmemctl` | Amount of memory (KiB) in the memory balloon driver that the host will reclaim when it’s low on memory | Resource: Saturation |
| `mem.swapin` | Amount of memory (KiB) an ESXi host swaps in to a VM from disk (physical storage) | Resource: Saturation |
| `mem.swapout` | Amount of memory (KiB) an ESXi host swaps out from a VM to disk (physical storage) | Resource: Saturation |
| `mem.active` | Amount of allocated memory (KiB) a host’s VMkernel estimates a VM is actively using | Resource: Utilization |
| `mem.consumed` | Amount of host physical memory (KiB) that is actually allocated to a VM | Resource: Utilization |
| `mem.usage.avg` | Percentage of a VM’s configured memory (at the VM level) or a host’s physical memory (at the host level) currently in use | Resource: Utilization |
| `mem.TotalCapacity.avg` | Amount of host physical memory (MiB) reserved for and available to VMs | Resource: Utilization |
The capacity of the balloon driver (vmmemctl) is a metric to be aware of.
In vSphere, a balloon driver (called vmmemctl) can be deployed on each VM. If an ESXi host runs low on physical memory to allocate (i.e., less than 6 percent free), it can reclaim memory from VMs’ guest physical memory. Because the ESXi hypervisor has no notion of which memory is no longer in use inside a VM, it sends requests to the balloon driver to “inflate” by claiming unused memory within the guest. The ESXi host can then deallocate the corresponding mapped host physical memory and allocate it to other VMs. This technique is known as memory ballooning.
While ballooning can help ESXi hosts under memory pressure, it can also degrade VM performance if the guest operating system later needs memory that the balloon driver has claimed and the host has reclaimed. And if ballooning isn’t enough, ESXi hosts may resort to memory swapping to meet VM memory demands, resulting in a significant decrease in application performance.
If your environment is healthy and virtual machines are correctly sized, memory ballooning should rarely occur. As a result, any positive value for mem.vmmemctl should trigger an alert, since it indicates that the ESXi host is running out of memory.
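Because any positive mem.vmmemctl value is alert-worthy, the check itself is trivial. A minimal sketch (the sample data structure is hypothetical; real values would come from your monitoring integration):

```python
def ballooning_alert(samples):
    """Given a mapping of VM name -> mem.vmmemctl value (KiB),
    return the VMs whose balloon driver holds any memory.

    Any positive value means the host has started reclaiming memory
    via ballooning and is worth alerting on."""
    return [vm for vm, kib in samples.items() if kib > 0]
```

For example, `ballooning_alert({'app01': 0, 'db01': 2048})` flags only `db01`.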
Memory swapped in/out is a metric to be aware of.
When an ESXi host creates a virtual machine, it also creates swap files, which live in physical disk storage. The swap file size is the virtual machine’s configured size minus any reserved memory. For example, if a VM has 3 GB of configured memory and a 1 GB reservation, it will have a 2 GB swap file. A VM’s swap files are stored on shared storage alongside its virtual disk by default.
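The swap file sizing rule above is simple arithmetic, shown here as a small sketch:

```python
def swap_file_size_gb(configured_gb, reservation_gb):
    """Swap file size = configured memory minus memory reservation."""
    return max(configured_gb - reservation_gb, 0)

# The example from the text: 3 GB configured, 1 GB reserved -> 2 GB swap file.
assert swap_file_size_gb(3, 1) == 2
```

Note that a fully reserved VM (reservation equal to configured size) needs no swap file at all, which is one reason reservations are sometimes used for latency-sensitive workloads.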
If a host’s physical memory runs low and memory ballooning isn’t reclaiming enough memory quickly enough to satisfy requests, ESXi hosts will start using swap space to read and write data that would typically go to memory. This procedure is called memory swapping.
Memory swapping should be a last resort because reading and writing to disk takes far longer than using memory and can significantly slow down a VM. Set alerts to notify you of any surges in swapping so you can decide whether to resize virtual machines to keep swapping to a minimum. If you see increased swapping, you should also examine the status of the VM balloon drivers, as swapping could mean that ballooning failed to reclaim enough memory.
Active memory and consumed memory are two metrics to keep an eye on.
To determine precisely how much memory is actively in use by VMs, the VMkernel would have to monitor every memory page that is read from or written to. Because that would require far too much overhead, the VMkernel instead estimates each VM’s active memory utilization through statistical sampling, and reports this estimate in KiB as the mem.active metric. The mem.consumed metric, by contrast, represents the amount of memory actually allocated to a VM from the underlying host.
Active memory is a good real-time indicator of your virtual machines’ memory utilization, and monitoring it alongside consumed memory can help you determine whether they have enough memory. If a VM’s active memory is consistently lower than its consumed memory, it has more memory allotted to it than it requires, and the host correspondingly has less memory available for other VMs. Consider adjusting the virtual machine’s size or memory reservation to fix this.
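The active-versus-consumed comparison can be sketched as a simple rightsizing screen. The 50 percent threshold here is an illustrative assumption, not a VMware recommendation; pick one that matches your environment’s baseline:

```python
def oversized_vms(stats, ratio_threshold=0.5):
    """Flag VMs whose active memory is well below their consumed memory.

    stats: mapping of VM name -> (mem_active_kib, mem_consumed_kib).
    VMs where active/consumed falls below ratio_threshold are candidates
    for downsizing. Threshold is illustrative.
    """
    flagged = []
    for vm, (active, consumed) in stats.items():
        if consumed > 0 and active / consumed < ratio_threshold:
            flagged.append(vm)
    return flagged
```

A VM actively using 1 GiB of its 4 GiB allocation would be flagged; one using 3 GiB of 4 GiB would not.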
Memory utilization is a metric to keep an eye on.
At the VM level, the mem.usage metric monitors how much of a VM’s configured memory is currently used. A virtual machine should not consistently use all of its available memory: if it does, and its ESXi host cannot allocate it extra memory, the VM will be less resilient to memory spikes. If this is the case, consider resizing the VM’s memory, altering its allocation settings (shares, reservations, etc.), or moving the VM to a cluster with more memory.
At the host level, memory usage refers to how much of an ESXi host’s physical memory is in use. If memory utilization on the host is persistently high, the host may be unable to provide memory to the VMs that require it, necessitating more frequent memory ballooning or even memory swapping.
Disk metrics
Virtual machines store their operating system files and guest applications in large files (or groups of files) known as virtual disks (also known as VMDK or Virtual Machine Disk files). Virtual machines come with one virtual disk by default, but you can add more. Virtual disks are stored in datastores, which, depending on configuration, might be located in a variety of shared storage locations. Cloud storage, storage area networks (SANs), and logical unit number (LUN) storage devices are all storage solutions for datastores.
vSphere reports disk I/O and capacity metrics for datastores, virtual machines, and ESXi hosts. Because numerous hosts and VMs can share datastores, monitoring at the datastore level gives you a high-level, aggregated picture of disk performance. However, if you want to track the health of a single VM or host, you also need to watch the performance of both virtual and physical disks. Monitoring disk metrics at each of these levels will help you better understand your cluster’s health and identify issues.
Virtual machines use storage controllers to access the virtual disks in a datastore. Because VMs send commands to datastores through the ESXi hosts they run on, storage controllers allow VMs to transmit commands to the host, which then routes them to the proper virtual disk. Monitoring metrics that provide insight into command latency and throughput can help you ensure that hosts and VMs can access storage effectively and without interruption.
| Metric | Description | Metric type |
|---|---|---|
| `disk.commandsAborted` | Total number of I/O commands aborted by the ESXi host | Work: Error |
| `disk.busReset` | Number of disk bus reset commands issued by a virtual machine | Work: Error |
| `diskspace.provisioned.latest` | Amount of storage (KB) available in a datastore | Resource: Utilization |
| `virtualDisk.actualUsage` | Amount of datastore storage (KB) that is actually being used by the VMs running on a host | Resource: Utilization |
| `disk.totalLatency.avg` | Average amount of time (ms) it takes an ESXi host to process a command issued by a VM | Work: Performance |
| `<component>.readLatency.avg` | Average amount of time (ms) it takes the specified component to process a read command | Work: Performance |
| `<component>.writeLatency.avg` | Average amount of time (ms) it takes the specified component to process a write command | Work: Performance |
| `disk.queueLatency.avg` | Average amount of time (ms) each I/O command spends in the VMkernel queue before being executed | Work: Performance |
| `<component>.read.avg` | Average amount of data (KB/s) read by the specified component | Work: Throughput |
| `<component>.write.avg` | Average amount of data (KB/s) written to the specified component | Work: Throughput |
| `disk.usage.avg` | Average disk I/O (KB/s) of a specified component | Work: Throughput |
Disk commands aborted is a metric to be aware of.
In vSphere, datastores on a single storage device can support many virtual machines. If there is a rush of commands from VMs, the storage hardware hosting the datastores may become overloaded and unresponsive, and the ESXi host will abort the commands sent to it. Because aborted commands can cause VMs to run slowly or crash, you’ll want the disk.commandsAborted metric to stay at zero. If an ESXi host begins to abort commands and you’ve identified disproportionately high VM command traffic to a single datastore as the cause, you can redistribute VMs across several storage backends to avoid sending all requests to that datastore.
Disk bus resets are a metric to keep an eye on.
If a storage device becomes overburdened with read and write commands from an ESXi host, or if it experiences a hardware failure and cannot abort, it will flush all commands in its queue. This is known as a disk bus reset. Disk bus resets indicate a disk storage bottleneck and can degrade VM performance by requiring VMs to resubmit requests. Disk bus resets are uncommon in healthy vSphere environments, so any VM with a positive value for the disk.busReset metric should be investigated. To remedy this issue, administrators may need to use Storage vMotion to redistribute VMs and virtual disks across multiple datastores.
Datastore provisioned capacity and actual VM use are two metrics to watch.
Datastore capacity is finite. The diskspace.provisioned.latest metric measures how much storage space is available on the datastores an ESXi host communicates with, while virtualDisk.actualUsage tracks how much of that space the VMs on the host are actually using. Correlating these metrics can help you determine whether you’ve allocated enough disk space for your virtual machines.
Using nearly all of a datastore’s disk capacity can result in out-of-space issues and VM performance degradation. To avoid this, set up an alert for when VM utilization of your datastore’s allotted storage capacity exceeds a certain threshold (e.g., over 85 percent). If it does, consider expanding the datastore’s capacity, relocating VMs to another datastore, or deleting inactive VMs whose virtual disks are using up storage space.
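The capacity check described above correlates the two metrics directly. As a minimal sketch (the 85 percent threshold is the example value from the text; tune it for your environment):

```python
DATASTORE_USAGE_ALERT_PCT = 85.0  # example threshold from the text

def datastore_usage_pct(actual_usage_kb, provisioned_kb):
    """Percentage of provisioned datastore capacity in use, from
    virtualDisk.actualUsage and diskspace.provisioned.latest."""
    return 100.0 * actual_usage_kb / provisioned_kb

pct = datastore_usage_pct(900_000_000, 1_000_000_000)  # 90.0
should_alert = pct > DATASTORE_USAGE_ALERT_PCT         # True
```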
Disk latency is a metric to keep an eye on.
Monitoring latency is essential for ensuring that your virtual machines communicate with their virtual disks quickly and efficiently. Total disk latency is the time, in milliseconds, it takes an ESXi host to complete a request a VM makes to a datastore. Monitoring total disk latency can help you determine whether vSphere is working correctly. Latency spikes or persistently high latency are strong signs that something is wrong in your environment, though they can be caused by a range of things, from resource bottlenecks to application-level issues.
If total latency is high, you can check the average latency of read (disk.readLatency.avg) and write (disk.writeLatency.avg) operations to see whether one or the other is causing the problem. You can also break read and write latencies down at the VM, host, and datastore levels to see which inventory objects contribute to the overall rise.
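The read-versus-write comparison can be expressed as a small triage helper. This is an illustrative sketch; in practice you would feed it the `<component>.readLatency.avg` and `<component>.writeLatency.avg` values from your monitoring tool:

```python
def dominant_latency(read_latency_ms, write_latency_ms):
    """Identify whether read or write operations dominate disk latency,
    as a first triage step when total latency is high."""
    if read_latency_ms > write_latency_ms:
        return 'read'
    if write_latency_ms > read_latency_ms:
        return 'write'
    return 'balanced'
```

For example, a component reporting 25 ms average read latency against 5 ms write latency points you toward read-heavy workloads or a slow read path.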
Correlating high disk latency with other resource consumption measurements can help determine whether the root cause is a memory or CPU shortage. In that situation, you can figure out which virtual machines on your host or cluster are using the most resources and either provide more resources or move them to larger datastores. You can also look at queue latency to see whether a rise in queued-but-unprocessed requests preceded the increase in latency.
Queue latency is a metric to keep an eye on.
Storage devices, such as LUNs, have a maximum number of commands they can queue at any time, depending on their configuration. When the number of virtual machine commands delivered from an ESXi host exceeds what the storage device can handle, the excess commands are queued in the VMkernel. The disk.queueLatency.avg metric measures how long commands from a VM spend waiting in the VMkernel’s queue. The longer a command sits in the queue waiting to be processed by the disk, the lower the performance of the VM that submitted it. Because commands typically pile up in the queue when the storage device is slow to process current commands, high queue latency is directly linked to high overall latency.
To better understand your environment’s performance, monitor queue latency alongside disk.usage.avg. You can, for example, see if a spike in queue latency corresponds to a decline in overall throughput. Similarly, you can notice if increased throughput preceded a spike in queue latency due to your datastore’s inability to handle the increased load.
As with total latency, queue latency can be reduced by moving VMs to a datastore with more available capacity, increasing your datastore’s queue depth, or enabling storage I/O control. With storage I/O control enabled, you can assign storage resource shares to VMs and set a latency threshold above which vSphere will begin allocating storage access to VMs based on their shares. This can reduce queue latency and relieve I/O pressure.
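The share-based allocation idea behind storage I/O control can be sketched as follows: once measured latency crosses the configured threshold, each VM's slice of the datastore's I/O capacity becomes proportional to its share count. The function, VM names, and numbers below are assumptions for illustration, not vSphere's actual implementation.

```python
# Illustrative sketch of share-based storage allocation: when latency
# exceeds the threshold, split available IOPS across VMs in proportion
# to their shares. All names and values here are assumed for the example.
def allocate_iops(total_iops, shares, latency_ms, threshold_ms=30.0):
    """Return per-VM IOPS limits, throttling only when congested."""
    if latency_ms <= threshold_ms:
        return {vm: total_iops for vm in shares}  # no throttling needed
    total_shares = sum(shares.values())
    return {vm: total_iops * s / total_shares for vm, s in shares.items()}

shares = {"db-vm": 2000, "web-vm": 1000, "batch-vm": 1000}
print(allocate_iops(8000, shares, latency_ms=45))
# → {'db-vm': 4000.0, 'web-vm': 2000.0, 'batch-vm': 2000.0}
```

The key design point is that shares only matter under contention: below the latency threshold, every VM can use the full capacity of the datastore.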
Disk throughput is a metric to keep an eye on.
To ensure that your datastores, ESXi hosts, and VMs are processing read and write requests without interruption, monitor their I/O throughput for visibility into their activity. Monitoring throughput at several levels and comparing it to other metrics can help you identify bottlenecks and determine the source of a problem. For example, if a rise in VM read commands delivered to an ESXi host precedes a spike in total latency, it could suggest that the host is struggling to cope with the influx of requests.
While a sustained increase in throughput can have various causes, you can often mitigate the problem by adding memory to your VMs. With more memory, virtual machines can cache more data and rely less on swapping.
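The "throughput spike precedes latency spike" pattern described above can be checked mechanically. The sketch below, with assumed thresholds and sample data, tests whether any latency spike was preceded by a throughput spike within a small window of sampling intervals.

```python
# Illustrative sketch: detect whether a spike in throughput preceded a
# spike in latency, which would suggest the datastore struggled with
# increased load. Thresholds, window size, and data are assumed values.
def spike_indices(series, threshold):
    return [i for i, v in enumerate(series) if v > threshold]

def throughput_spike_preceded_latency(throughput, latency,
                                      tp_thresh, lat_thresh, window=3):
    """True if any latency spike has a throughput spike shortly before it."""
    lat_spikes = spike_indices(latency, lat_thresh)
    tp_spikes = spike_indices(throughput, tp_thresh)
    return any(any(0 < ls - ts <= window for ts in tp_spikes)
               for ls in lat_spikes)

throughput = [100, 110, 400, 390, 120, 115]  # e.g., KBps per interval
latency    = [5, 6, 7, 35, 40, 8]            # ms, same intervals
print(throughput_spike_preceded_latency(throughput, latency, 300, 25))
# → True
```

A result of True suggests adding capacity or rebalancing load; False points toward a cause unrelated to request volume, such as a failing device.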
Network metrics
A vSphere environment consists of one or more virtual networks of logically connected VMs running on the same host, plus the physical network connecting the servers that run the ESXi hosts. To ensure robust connectivity, it’s critical to collect usage and error metrics from both the physical and virtual networks across your environment. Network connectivity issues can prevent you from performing essential vSphere tasks that require network communication, such as VM provisioning and migration.
Monitoring the network throughput of your hosts and virtual machines can help you determine whether your network is performing normally or whether you need to change your network settings.
Network received and network transmitted are two metrics to watch.
These metrics track the network throughput, in kilobytes per second, of the object you’re looking at, whether a host or a virtual machine. Together with the total network utilization metric (net.usage.avg), they can give you a good idea of how much traffic flows between your ESXi hosts and VMs. After establishing a baseline of network behavior, you can set up an alert to notify you of any deviations (i.e., spikes or drops) that could indicate underlying hardware issues (e.g., a lost host connection) or a misconfigured Windows or Linux VM.
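A minimal sketch of the baseline-and-alert approach: learn a mean and standard deviation from historical throughput samples, then flag new samples that deviate by more than three standard deviations. The three-sigma rule and the sample values are assumptions for illustration; production alerting would use a longer history and likely a more robust baseline.

```python
# Illustrative sketch: build a throughput baseline from history, then
# flag samples that deviate beyond a sigma threshold. Sample data and
# the three-sigma cutoff are assumed values for the example.
from statistics import mean, stdev

def build_baseline(samples):
    return mean(samples), stdev(samples)

def deviates(value, baseline, sigmas=3.0):
    """True if the sample falls outside the baseline band."""
    mu, sd = baseline
    return abs(value - mu) > sigmas * sd

history = [220, 210, 230, 225, 215, 218, 222, 228]  # KBps, per interval
baseline = build_baseline(history)
print(deviates(224, baseline))  # → False (normal sample)
print(deviates(900, baseline))  # → True  (spike worth alerting on)
```

Both sustained drops (e.g., a lost host connection) and sudden spikes trip the same check, since it tests absolute deviation in either direction.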
Tasks and events
By default, vSphere captures tasks and events across your virtual environment’s VMs, ESXi hosts, and vCenter Server. Examples include user logins, VM power-offs, certificate expirations, and host connects and disconnects. vSphere tasks and events provide a high-level picture of your virtual environment’s health and activity, reporting failures and errors to alert you when your environment is unhealthy.
Because tasks can be scheduled, you can check their status to see where they are in the execution cycle. Monitoring tasks and events can also reveal how changes in your environment, such as virtual machine initialization, affect resource utilization and cause contention.
Each ESXi host records tasks and events in log files saved in multiple locations. A few essential files are:
- /var/log/vmkernel.log: VMkernel logs, which include information about device discovery, storage and network activity, and VM startup.
- /var/log/syslog.log: Information on scheduled tasks and interactions with ESXi hosts.
- /var/log/auth.log: Authentication logs, including data on user logins and logouts.
- /vmfs/volumes/<datastore>/<virtual machine>/vmware.log: Logs for individual virtual machines, including migrations and virtual hardware changes.
Monitoring the events in these log files can help you keep track of overall activity in your vSphere clusters, conduct audits, and analyze any issues that arise.
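The log files listed above can be scanned mechanically so that recurring warnings or errors stand out. The sketch below tallies VMkernel-style log lines by severity token; the line format and sample messages are assumptions for the example, since real vmkernel.log formats vary by ESXi version.

```python
# Illustrative sketch: tally VMkernel-style log lines by severity so
# recurring warnings stand out. The line format and sample entries are
# assumed for the example and will vary across ESXi versions.
import re
from collections import Counter

LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<level>\w+)\(\d+\)")

def severity_counts(lines):
    """Count log entries by their severity token (e.g., In, Wa)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[m.group("level")] += 1
    return counts

sample = [
    "2024-11-12T09:15:01Z In(182) vmkernel: device naa.600 discovered",
    "2024-11-12T09:15:04Z Wa(180) vmkernel: WARNING: queue depth reached",
    "2024-11-12T09:15:09Z Wa(180) vmkernel: WARNING: queue depth reached",
]
print(severity_counts(sample))  # → Counter({'Wa': 2, 'In': 1})
```

Aggregating counts like this over a time window is a simple way to spot a sudden burst of warnings before it becomes a visible performance problem.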
Get visibility into your virtual environment and its supporting hardware
In this post, we covered the major components of vSphere and several essential metrics that can help you verify your environment has the resources it requires and is working as intended. We also discussed how monitoring events alongside key metrics can give you a high-level picture of how your environment is doing.
About Enteros
IT organizations routinely spend days and weeks troubleshooting production database performance issues across multitudes of critical business systems. By enabling fast and reliable resolution of database performance problems, Enteros helps businesses generate and save millions in direct revenue, minimize wasted employee productivity, reduce the number of licenses, servers, and cloud resources, and maximize the productivity of application, database, and IT operations teams.
The views expressed on this blog are those of the author and do not necessarily reflect the opinions of Enteros Inc. This blog may contain links to the content of third-party sites. By providing such links, Enteros Inc. does not adopt, guarantee, approve, or endorse the information, views, or products available on such sites.