VMware and Dell have confirmed that pass-through mode on the Dell H730 controller is causing issues in one of our Dell R730 deployments. We are being advised to update the controller's firmware and driver, as well as the backplane firmware.
Symptoms of H730 Issue
Symptoms logged in Dell iDRAC:
- No Symptoms
VMware Level Symptoms:
- Disks dropping out, reporting as unhealthy, and being removed.
- Error counts exceeded for drives (500).
- Rebooting the host restored access to the drives and allowed the disk groups to rebuild.
We have noticed a pattern: the issue affects every host in the cluster approximately 60 days after its last reboot. Rebooting the host seems to clear the error counter and reset the time until the next failure.
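To make the pattern concrete, here is a minimal sketch that projects when a host would be expected to fail, given its last boot time. The 60-day figure is only the approximate interval we observed, the function names are hypothetical, and how you collect the boot time from each host is left out.

```python
from datetime import datetime, timedelta

# Approximate interval observed between a reboot and the next failure.
# This is an observation from one cluster, not a documented limit.
FAILURE_WINDOW = timedelta(days=60)


def projected_failure(last_boot: datetime) -> datetime:
    """Return the approximate time the host is expected to start dropping disks."""
    return last_boot + FAILURE_WINDOW


def days_remaining(last_boot: datetime, now: datetime) -> int:
    """Whole days left before the projected failure (negative if overdue)."""
    return (projected_failure(last_boot) - now).days


# Example with made-up dates.
boot = datetime(2015, 3, 1)
print(projected_failure(boot).date())               # 2015-04-30
print(days_remaining(boot, datetime(2015, 4, 10)))  # 20
```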
The H730 and OEMs
The Dell H730 platform is a custom solution based on the LSI/Avago 3108. Dell has told us that they contract custom firmware and drivers for this product. This is similar to the issues we previously discovered on the LSI/Avago MegaRAID 2208 family of products. We have not heard of issues with other 3108-based solutions, but if you experience similar issues, please contact VMware and your OEM provider for assistance.
Drivers, Firmware and Back Planes
We have heard from VMware as well as Dell that the following updates will fix these issues.
- ESXi 6.0 Driver: lsi_mr3 version 6.605.08.00-6vmw.600.0.0.2494585 (inbox 6.0 driver)
- Dell H730 P Firmware: 25.3.0.00016
- Dell R730 Backplane firmware version: 3.03
- H730 controller series with ESXi 5.5u2:
  - New recommended firmware version: 25.3.0.00016
  - New recommended driver: megaraid_perc9 version 6.902.73.00
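If you want to check a fleet against the versions listed above, a rough sketch like the following may help. It assumes that comparing the dotted numeric fields left to right reflects the vendor's release ordering, and it ignores the build suffix after the first hyphen. The component keys and function names are hypothetical, and gathering the installed versions from each host is not shown.

```python
import re

# Recommended versions copied from the list above.
# The dictionary keys are hypothetical labels, not vendor identifiers.
RECOMMENDED = {
    "lsi_mr3": "6.605.08.00",           # ESXi 6.0 driver
    "megaraid_perc9": "6.902.73.00",    # ESXi 5.5u2 driver
    "h730_firmware": "25.3.0.00016",
    "backplane_firmware": "3.03",
}


def version_key(version: str) -> tuple:
    """Turn '6.605.08.00-6vmw...' into (6, 605, 8, 0) for comparison.

    Assumption: only the part before the first '-' matters, and each
    dotted field compares as an integer.
    """
    base = version.split("-", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", base))


def needs_update(component: str, installed: str) -> bool:
    """True if the installed version sorts below the recommended one."""
    return version_key(installed) < version_key(RECOMMENDED[component])


# Example with a made-up older firmware version.
print(needs_update("h730_firmware", "25.2.1.0037"))                   # True
print(needs_update("lsi_mr3", "6.605.08.00-6vmw.600.0.0.2494585"))    # False
```

Treat the result as a hint only; confirm the actual upgrade path with VMware and Dell before patching.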
Mitigation/Remediation/Monitoring of issue
If you cannot patch right away, we recommend rebooting each host once a month to clear the counter until you can.
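As a stopgap, the monthly reboot can be tracked with something like the sketch below. It picks out hosts whose uptime has reached 30 days and lists the longest-running first. The host names and dates are made up, the 30-day interval is simply our reading of "once a month", and rebooting hosts one at a time is an assumption about how you would want to run this in a cluster.

```python
from datetime import datetime, timedelta

# "Once a month", which stays well inside the roughly 60-day failure window.
REBOOT_INTERVAL = timedelta(days=30)


def hosts_due_for_reboot(last_boots: dict, now: datetime) -> list:
    """Return names of hosts with uptime >= REBOOT_INTERVAL, longest uptime first."""
    due = [
        (now - boot, name)
        for name, boot in last_boots.items()
        if now - boot >= REBOOT_INTERVAL
    ]
    return [name for _, name in sorted(due, reverse=True)]


# Example with hypothetical hosts.
boots = {
    "esx01": datetime(2015, 4, 1),
    "esx02": datetime(2015, 5, 20),
    "esx03": datetime(2015, 5, 1),
}
print(hosts_due_for_reboot(boots, datetime(2015, 6, 1)))  # ['esx01', 'esx03']
```

Work through the returned list one host at a time, and let any disk group rebuild finish before moving on to the next host.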
Our managed service teams are pushing patches out to impacted environments. Their monitoring of customer host logs flagged this issue, because they had dashboards set up for disk group failures and errors.