Hyper-V replica broker fails to start after being manually shut down

There are numerous suggestions on the web; I found one thread on social.technet but, even though it was only last night, I've lost the URL…

Anyway, that thread described exactly what fixed it for me. The Replica Broker was created on DC somedc.mybiz.com. That DC went through a bit of a wobble and was due for decommission anyway, so we demoted it and moved the domain controller role onto a virtual machine.

It turns out that in the registry there's a string value, somewhere under HKLM\Cluster\Resources\{SOME RANDOM GUID VALUE}\parameters, called "CreatingDC". This was still pointing at the demoted DC. Edit the value to point it at the new DC (somedc_new.mybiz.com) and the Replica Broker starts up almost instantly.

UPDATE: Just to satisfy myself that this was the case, I failed the Replica Broker over to host 2 (still using the old DC) and it failed. I modified the registry value above and it started instantly, with no reboots or anything.
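For reference, the fix can be scripted rather than hunted down in regedit. This is a sketch only, assuming the registry layout described above; the resource GUID varies per cluster, so it searches for the value, and the DC names are the examples from this post:

```powershell
# Sketch: find any cluster resource whose CreatingDC parameter still points
# at the old (demoted) DC, and repoint it at the new one.
# Run elevated on a cluster node; both DC names below are examples.
$oldDC = 'somedc.mybiz.com'
$newDC = 'somedc_new.mybiz.com'

Get-ChildItem 'HKLM:\Cluster\Resources' | ForEach-Object {
    $params = Join-Path $_.PSPath 'Parameters'
    if (Test-Path $params) {
        $creatingDC = (Get-ItemProperty -Path $params -ErrorAction SilentlyContinue).CreatingDC
        if ($creatingDC -and $creatingDC -like "*$oldDC*") {
            Set-ItemProperty -Path $params -Name 'CreatingDC' -Value $newDC
            Write-Output "Updated CreatingDC under $params"
        }
    }
}
```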


Windows Server 2012 R2 Hyper-V networking failures

We’ve been having intermittent failures of our Hyper-V cluster recently, and it was, bizarrely, getting worse with time. The kit: 2 x PowerEdge R720s, 2 x PowerConnect 6224 switches (stacked), 2 x EqualLogic PS4100E boxes. The network on one node would fail completely: it could be the LAN team, or Live Migration, or CSV, etc. It was fairly random.

It turns out (after a lot of troubleshooting by Microsoft) that turning off anything relating to Virtual Machine Queues (VMQ) for the LAN team AND the associated physical NICs, plus upgrading the Broadcom driver from 16.4 to 16.6, has at least stabilized the situation. Where the cluster previously lasted 7-10 days between failures, it has now run for over a month, including a host reboot. Still not 100% confident, but far better.
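If you'd rather do the VMQ part from PowerShell than from each NIC's advanced properties dialog, it looks roughly like this. The adapter and team names are placeholders, not our real ones:

```powershell
# Show current VMQ state on all VMQ-capable adapters
Get-NetAdapterVmq | Format-Table Name, Enabled

# Disable VMQ on the physical team members AND the team interface itself
Disable-NetAdapterVmq -Name 'Broadcom NIC 1'
Disable-NetAdapterVmq -Name 'Broadcom NIC 2'
Disable-NetAdapterVmq -Name 'LAN Team'

# Confirm Enabled is now False throughout
Get-NetAdapterVmq | Format-Table Name, Enabled
```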

Incidentally, when chasing Broadcom drivers, the Dell package suite names are misleading: package 18.4 gives you driver 16.4.

My instinct tells me the driver was at fault: none of the Hyper-V settings had changed since the cluster was built, whereas the driver would almost certainly have been upgraded with an SUU DVD. I can't prove any of this, but the last failure was over a month ago and it's carried on working since a complete reboot last Tuesday night.

Windows Server 2012 Hyper-V guest clustering

Right. Having previously said how wonderful Microsoft clustering was, I did hit a bit of a wall in trying to cluster a couple of guests outside of the clustered roles on the hosts; i.e. my guests were not clustered virtual machines on the Server 2012 Hyper-V hosts, they were just normal Hyper-V guests.

The cluster (with node1 on host1) would create fine. Try adding node2 on host2… and the join would fail. Try creating the cluster with node1 on host1 and node2 on host2, and the cluster wouldn't even create. Both scenarios reported a timeout. There are very few solutions on the web about this, but eventually a TechNet social post pointed to this article: http://support.microsoft.com/kb/2872325, which works a treat. It took a good week or two to find, but once I'd un-bound the filtering protocol my cluster would create quite happily.
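For anyone who'd rather not dig through the KB article, the un-bind looks something like this in PowerShell. The vEthernet adapter name is an example; check yours with Get-NetAdapterBinding first:

```powershell
# Confirm the cluster performance filter is bound to the host vNICs...
Get-NetAdapterBinding -Name 'vEthernet*' |
    Where-Object DisplayName -like '*Failover Cluster*'

# ...then unbind it (repeat per vEthernet adapter, on both hosts)
Disable-NetAdapterBinding -Name 'vEthernet (LAN)' `
    -DisplayName 'Microsoft Failover Cluster Virtual Adapter Performance Filter'
```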

Out of interest, I re-bound the filtering protocol on the vEthernet ports on both hosts and tried re-creating the cluster with re-built guests running as Hyper-V clustered roles (still on different hosts), and this also worked instantly. I'd left the "Add all available storage" box ticked and, because it was the only iSCSI disk, the wizard picked it up and turned it into the quorum as part of the install.

I guess the answer is, just make all your VMs highly available…

Windows Server 2012 Hyper-V clustering basics

Right. First off, thanks to Microsoft for making clustering way easier. However, there's very little info about Server 2012: most of what you can easily find is all about Server 2008/2008 R2, which is (obviously) quite outdated.

Anyway, start at the start. I'm not saying any of this is right (or recommended, by the book, etc.) but it works:

  • When thinking about clustering Hyper-V, don’t get swayed by the Hyper-V side. You can’t just start clustering Hyper-V virtual machines. You need to start by clustering the Hyper-V hosts;
  • To do this successfully, you need at least 4 networks: CSV (Cluster Shared Volume), Live Migration, LAN, & iSCSI (ideally paired against different network cards for resilience);
  • You’ll need to install both Hyper-V & clustering roles on each host, AND the associated powershell cmdlet sets (for ease of use);
  • This is important: I'm basing this on EqualLogic SANs with the Dell Host Integration Tools (HIT) kit. If you use Microsoft's own iSCSI software, the recommendation (I'm guessing) is to use multiple iSCSI subnets; after completely rebuilding the network infrastructure to accommodate this, it turned out the HIT kit really doesn't like that set-up, so I've reverted back to a single iSCSI VLAN + subnet. A complete pain, but at least it's working again. With the HIT kit installed, run iscsicpl.exe and switch to the "Dell EqualLogic MPIO" tab. You should have 4 connections against every disk, except the quorum disk: only the host that currently owns the "Cluster Group" group has 4 connections to the quorum, while every other host has 2;
  • To install clustering, PowerShell “Install-WindowsFeature Failover-Clustering” and “Install-WindowsFeature RSAT-Clustering-PowerShell” (I know you can chain these together);
  • To create the cluster, open up “Failover Cluster Manager”, right-click “Failover Cluster Manager” and choose “Create Cluster…”;
  • Follow the wizard through; for the time being, only tick and assign an IP address to your preferred CSV/Cluster and LAN networks;
  • Do not tick “Add all available storage to the cluster”;
  • Run the validation tool; hopefully everything is green-ticked;
  • Finish creating the cluster;
  • Once your cluster is created, right-click on “Disks” and choose “Add Storage”. At the next screen, pick just your Quorum disk;
  • Once your Quorum disk shows up as "Available Storage", right-click the cluster itself, choose "More Actions", pick "Configure Cluster Quorum Settings…", then at the next screen choose "Add or change the quorum witness" and choose your Quorum disk;
  • Now you can add additional iSCSI LUNs as disks, right-click on them and “add to Cluster Shared Volumes” (or something similar);
  • At this point rename all the disks and networks to something useful, otherwise you’ll never know what you’re pointing at;
  • To configure the networks, make sure that:
  • Your preferred CSV/ cluster network is accessible only to the cluster, not clients;
  • Your preferred LAN network is accessible to the cluster AND clients;
  • Your preferred LiveMigration network is accessible only to the cluster, not clients;
  • Your preferred iSCSI network isn’t available to the cluster at all;
  • At this point, right-click on the “Networks” group and choose “Live Migration Settings…”;
  • De-select all networks apart from your preferred LiveMigration network;
  • The networks are assigned (allegedly) according to this article, and my experience is that this is right. With a bit of luck, if you now PowerShell "Get-ClusterNetwork -Cluster XXXX | Sort-Object Metric | FT Name, Metric, AutoMetric" you'll see that the CSV/Cluster network has the lowest metric, followed by the LiveMigration network, then iSCSI, then LAN;
  • This is the point at which you can start creating Hyper-V machines on those shared LUNs. So instead of creating the machine on C: or E:, you create it on \\host\c$\ClusterStorage\VolumeX. This looks like local storage, but of course it's just a mount point to an iSCSI LUN, which is why it's so easy to move virtual machines: they don't exist on any one host;
  • If you need your VMs to move with the cluster nodes, you need to add each VM as a highly available role inside the clustering tool. This then renders the Hyper-V management tools ineffective against the chosen machines, as they are now a cluster resource rather than a plain Hyper-V machine;
  • Also… you may get a lot of errors (event ID 1196) about failing to register the DNS name on some networks. I don't know this for definite, but if you look at ipconfig, all the private interfaces have DNS servers set, pointing to presumably imaginary IPv6 DNS addresses. I'm going to remove all the DNS servers from the private ranges and see what happens;
  • Okay… I think (hope) the way around the 1196 errors is to use "Set-DnsClient -InterfaceAlias X -RegisterThisConnectionsAddress $False". I'm hoping this stops it trying to register with the false DNS servers, as there appears to be no way of setting the DNS servers to be blank.
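Most of the wizard-driven steps above have PowerShell equivalents. A rough sketch of the whole sequence follows; the cluster name, node names, IP address, disk names and interface alias are all examples, not from the real build:

```powershell
# Install the features (chained, as mentioned above)
Install-WindowsFeature Failover-Clustering, RSAT-Clustering-PowerShell

# Validate, then create the cluster WITHOUT grabbing all available storage
Test-Cluster -Node host1, host2
New-Cluster -Name HVCLUS1 -Node host1, host2 -StaticAddress 10.0.1.50 -NoStorage

# Add just the quorum disk, then make it the witness
Get-ClusterAvailableDisk | Where-Object Name -eq 'Cluster Disk 1' | Add-ClusterDisk
Set-ClusterQuorum -NodeAndDiskMajority 'Cluster Disk 1'

# Add the data LUNs and convert them to Cluster Shared Volumes
Get-ClusterAvailableDisk | Add-ClusterDisk
Add-ClusterSharedVolume -Name 'Cluster Disk 2'

# Check the network metric ordering mentioned above
Get-ClusterNetwork | Sort-Object Metric | Format-Table Name, Metric, AutoMetric

# And the event 1196 workaround, repeated per private interface
Set-DnsClient -InterfaceAlias 'CSV' -RegisterThisConnectionsAddress $false
```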

When Server 2012 DeDuplication goes bad

This is a really weird one.

Our old SAN was really struggling for space despite having EMC DeDuplication switched on, so I commissioned a Server 2012 guest on some new SAN storage, and started serving files through this guest off a dynamic VHDX file (sat on another chunk of the same storage). Because our old SAN used NDMP backups, I had to use RoboCopy to migrate the data, which is slow but did the job. The reason for the intermediate VHDX is simply portability: we could pick it up off the SAN and dump it on any Windows server (even stand-alone), whereas anything stored directly on the SAN would obviously need to be connected up to servers on the iSCSI network. The VHDX file is connected to the guest on a SCSI bus, as this enables hot-removal, unlike the IDE buses.

We’d also been waiting for Backup Exec 2010 R3 SP3, as this was the first release of 2010 that at least enabled the agent to work on Server 2012.

This was all good so far; RoboCopy had, as expected, kept all the NTFS permissions correct, so it was just a case of sharing the folders out. As both shares now also sat on the same volume, I thought Windows Server 2012 DeDuplication could really get to work by DeDuping across the shares, which previously hadn't been possible due to the configuration. The DeDupe process started slowly, but I didn't think anything of this because the Celerra took ages to fully DeDupe all its volumes. I reasoned that this should be fine as it was the Hyper-V guest doing the DeDuplication; I presumed the guest should see the VHDX as just another block of storage, rather than a host machine trying to DeDuplicate the VHDX file itself.

This is where it all went a bit strange. Nobody had complained about access speeds or anything, yet Backup Exec was taking absolutely ages to back up this volume (I did some rough maths and figured out it would actually never complete a full backup inside a week, compared to "just" taking 48 hours or so previously). Then it turned out that it was completing "full" backups in less than an hour, because it wasn't actually doing full backups at all.

This is when it became apparent that there is a clash between the presentation layer of Server 2012 (what I would normally refer to as the Explorer shell, but this is Core), DeDuplication and dynamic VHDX files. The storage system within Windows knows how much actual data is on the volume, because it correctly reports x TB used. DeDuplication seems to go mad and actually un-DeDuplicate everything, so you end up with a space saving of 0 bytes (from 11GB… so it's gone backwards). And the… "explorer" shell goes further and loses all reason, reporting insanely small "size on disk" numbers for vast amounts of data (real-world example: 5.1TB of actual data supposedly takes up just 37GB on disk. Yeah, right). So, to be fair, this is why Backup Exec is being so erratic: it's asking Windows for the amount of data, and Windows replies that although there's supposedly 5TB of data, it's only taking up 37GB on disk.

The fix is currently unknown, as apparently even Microsoft haven't encountered this one. I've stopped the actual DeDupe jobs (even though it's still enabled against the disk) and set a RoboCopy job off to rehydrate (fingers crossed) everything onto yet another volume (this time pass-through to the new SAN) so that at least we can start getting valid backups. We're passing information on to Microsoft to see if they can come up with anything, and have discovered that this is perfectly repeatable: enable DeDupe against another (live but non-backed-up) dynamic VHDX and the same thing happens; Windows reports the actual amount of data as X but the size on disk as X-minus-a-lot-of-space.

Update 02-NOV-2013: still unresolved. Found a few discrepancies in NTFS permissions but this doesn’t explain much. The data seems to rehydrate onto another volume, but that’s maybe the wrong word as “rehydrate” implies it was deduplicated in the first place, which it wasn’t really. Still, at least we can get some form of backup for now.

Live VHDX extension struggle

Given we had a server with a CIFS share that was fast approaching its limit (on the file server, not just the share), I built a Hyper-V machine around two iSCSI volumes: one holds the Windows guest, the other holds the CIFS data. I knew the data was going to be about 4TB, so I created the CIFS share at that size; but because we've decided to use VHDX as the "visible" storage medium (rather than pass-through, purely for portability), I thought I'd start off with a 1TB VHDX file and then expand it "on the fly" to see how easy it was. I started RoboCopying all the data on the above CIFS share to the new VHDX, knowing it would start failing to copy at the 1TB mark.

And the answer was incredibly easy, if only I'd connected the 1TB VHDX to the guest's SCSI controller the first time. Basically, it's impossible to disconnect an IDE drive while the VM is live, but there's no such problem with SCSI (I should have realised this).

Step 1 was to use DISKPART on the guest to set disk 1 offline. Then, on the host, I ran Remove-VMHardDiskDrive -VMName xxxx -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 0 (no path for Remove-…, as it's only detaching the disk from the guest). Next, I ran Resize-VHD (possibly pointlessly, as it was a dynamic VHDX, but it worked… or at least didn't fail), then Add-VMHardDiskDrive -VMName xxxx -ControllerType SCSI -ControllerNumber 0 -ControllerLocation 0 -Path X:\xxxx.vhdx. I could then use DISKPART on the guest to switch the disk back online, and extend it with Server Manager (from another machine, as the storage server is Server Core) to, again deliberately, just 3.5TB, to put limits on the copy process for experimentation. As soon as the extend had finished, the RoboCopy process kicked back in. All that's left now is for Windows DeDuplication to sort out the copied data and we should be ready to migrate over (I'm not sure how much it can compress the data, given that it's all unique images which are already considerably compressed by the JPG format, etc. Currently running at 2% deduplication…)
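Put together, the whole round trip looks like this as a sketch. The VM name, VHDX path, disk number and drive letter are examples, and the guest-side DISKPART steps are shown here with the equivalent storage cmdlets instead:

```powershell
# -- In the guest: take the data disk offline first --
Set-Disk -Number 1 -IsOffline $true

# -- On the host: detach, resize, re-attach --
Remove-VMHardDiskDrive -VMName FS01 -ControllerType SCSI `
    -ControllerNumber 0 -ControllerLocation 0
Resize-VHD -Path 'X:\fs01-data.vhdx' -SizeBytes 3.5TB
Add-VMHardDiskDrive -VMName FS01 -ControllerType SCSI `
    -ControllerNumber 0 -ControllerLocation 0 -Path 'X:\fs01-data.vhdx'

# -- Back in the guest: online the disk and grow the partition --
Set-Disk -Number 1 -IsOffline $false
Resize-Partition -DriveLetter D `
    -Size (Get-PartitionSupportedSize -DriveLetter D).SizeMax
```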