A quick post about an issue affecting VMware users of Intel’s new 10Gbps X540-AT2 NICs.
We began receiving alerts of high disk latency and network retransmits for a customer’s VMware VSAN cluster. Further investigation showed slow disk performance for Ethernet-based storage (VSAN, NFS, iSCSI) as well as vMotion failures. The VSAN logs also showed RDT issues.
The ixgbe driver was at version 126.96.36.199.14iov-NAPI (the default for ESXi 5.5 Update 2) and the NIC firmware was at version 16.0.24. This combination of newer firmware with an older driver was causing the performance problems we observed. To test this, we reinitialized the affected NICs; once they came back online, performance improved markedly for a short time before the issues returned. Updating the driver to 3.21.4iov resolved the performance issues permanently.
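To check hosts for the same mismatch, the driver and firmware versions a NIC reports can be compared against the fixed driver level. The following is a minimal Python sketch, assuming the text comes from `esxcli network nic get -n vmnicX` and that the relevant fields appear as `Key: Value` lines. The function names, the sample text, and the older version string in it are hypothetical and only illustrate the comparison; they were not captured from a real host.

```python
import re

# Illustrative sample only (hypothetical values, not real host output).
SAMPLE_OUTPUT = """
   Driver Info:
         Bus Info: 0000:02:00.0
         Driver: ixgbe
         Firmware Version: 0x80000389
         Version: 3.7.13iov
"""


def parse_driver_info(nic_output):
    """Pull driver name, driver version and firmware version from 'Key: Value' lines."""
    wanted = {"Driver": "driver", "Version": "version", "Firmware Version": "firmware"}
    info = {}
    for line in nic_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            info[wanted[key]] = value.strip()
    return info


def version_tuple(version):
    """Turn '3.21.4iov' into (3, 21, 4), ignoring any non-numeric suffix."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def driver_needs_update(info, minimum="3.21.4iov"):
    """True when the NIC uses ixgbe and its driver is older than the minimum."""
    if info.get("driver") != "ixgbe":
        return False
    return version_tuple(info.get("version", "")) < version_tuple(minimum)


if __name__ == "__main__":
    info = parse_driver_info(SAMPLE_OUTPUT)
    print(info, "needs update:", driver_needs_update(info))
```

The default minimum of 3.21.4iov is the driver level that fixed the issue in our case; the right target for another environment depends on the firmware in use and the vendor's compatibility guidance.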
Our VMware support desk has observed that when the firmware and drivers of I/O devices (LSI HBAs, network adapters) are not kept in “perfect alignment”, unexpected bugs, failures and crashes crop up. Good monitoring of both performance counters and ESXi host logs has helped us identify these issues before they become a problem. Establishing healthy baselines and testing new updates internally are key to safely updating customer ESXi hosts.
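As a rough illustration of what checking against a healthy baseline can look like, here is a small Python sketch that flags a latency sample sitting well above the baseline. The three-standard-deviation threshold, the function names, and the sample numbers are all assumptions made for this example, not a description of our monitoring stack.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize healthy-period samples (e.g. disk latency in ms) as (mean, stdev)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, sigmas=3.0):
    """True when a new sample exceeds the baseline mean by more than `sigmas` deviations."""
    average, spread = baseline
    return value > average + sigmas * spread


if __name__ == "__main__":
    # Hypothetical latency readings collected while the host was known to be healthy.
    healthy = [2.0, 2.5, 3.0, 2.5, 2.0]
    baseline = build_baseline(healthy)
    for reading in (2.8, 25.0):
        print(reading, "anomalous:", is_anomalous(reading, baseline))
```

A fixed sigma threshold is the simplest choice; counters with strong daily patterns would likely need per-time-window baselines instead.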
As a VMware service provider, we operate a hosted log analysis server that our managed services customers can push logs to. We aggregate these logs and use customer outages and other impactful events to identify and predict failures across all of our monitored customers’ environments.