We’ve been having intermittent failures of our Hyper-V cluster recently, and, bizarrely, they were getting worse over time. The setup: 2 x PowerEdge R720s, 2 x PowerConnect 6224 switches (stacked), 2 x EqualLogic PS4100E boxes. The network on one node would fail completely; it could be the LAN team, the Live Migration network, CSV, etc. It was fairly random.
It turns out, after a lot of troubleshooting with Microsoft, that turning off anything relating to Virtual Machine Queues (VMQ) for the LAN team AND the associated physical NICs, plus upgrading the Broadcom driver from 16.4 to 16.6, has at least stabilized the situation. Where the cluster previously lasted 7–10 days, it has now run for over a month, including a host reboot. Still not 100% confident, but far better.
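For reference, on a Server 2012 Hyper-V host VMQ can be inspected and switched off per adapter from an elevated PowerShell prompt using the built-in NetAdapter cmdlets. A minimal sketch — the adapter names "NIC1" and "NIC2" are placeholders for whatever your team members are actually called:

```powershell
# Show which adapters currently have VMQ enabled
Get-NetAdapterVmq | Format-Table Name, Enabled, BaseProcessorNumber, MaxProcessors

# Disable VMQ on the physical NICs behind the LAN team
# ("NIC1"/"NIC2" are placeholder names - substitute your own adapters)
Disable-NetAdapterVmq -Name "NIC1"
Disable-NetAdapterVmq -Name "NIC2"

# Confirm the change took effect
Get-NetAdapterVmq -Name "NIC1", "NIC2" | Format-Table Name, Enabled
```

Some Broadcom drivers also expose VMQ as an advanced property in the NIC's driver settings; if you disable it there too, keep the two in sync so the setting survives driver updates.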
Incidentally, when chasing Broadcom drivers, be aware that the Dell package suite names are misleading: package 18.4 gives you driver 16.4.
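A quick way to see which driver version Windows actually loaded, regardless of what the Dell package was called, is to query the adapters directly (a sketch; the exact property names can vary slightly between OS and driver versions):

```powershell
# List physical NICs with the driver version Windows actually loaded
Get-NetAdapter -Physical |
    Select-Object Name, InterfaceDescription, DriverVersion, DriverDate |
    Format-Table -AutoSize
```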
My instinct tells me the driver was at fault: none of the Hyper-V settings had changed since the cluster was built, whereas the driver would almost certainly have been upgraded by an SUU DVD. I can’t prove any of this, but in any case the last failure was over a month ago, and it has carried on working since a complete reboot last Tuesday night.