This week I encountered multiple weird symptoms on a Horizon Pool.
Everything started with the pool throwing many provisioning errors until eventually no more desktops were in status ‘Available’. Later I noticed that it was not even possible to modify the number of datastores on the pool (Infrastructure Change) or to push a new image. Everything just hung.
Since it is not easy to understand the origin of such problems, in this post I will describe step by step how such errors can be troubleshot.
At pool level, mainly these error messages were reported:
Provisioning error occurred for Machine <hostname>: Resync operation failed
Provisioning error occurred for Machine <hostname>: Cloning failed for Machine
“Instant cloning failed” is usually a good indicator that something is odd with the internal Horizon VMs (cp-template and cp-replica VMs). So the first action is to understand the relation between the pool’s Master Image and its internal Horizon VMs; in large environments there are many of these. I navigated to the Pool Summary on the Connection Server and noted the “Golden Image” name.
The iccleanup.cmd utility can help to understand the relation between the Golden Image and its internal Horizon VMs.
It is usually located here on the connection server:
C:\Program Files\VMware\VMware View\Server\tools\bin
Syntax to start it:
iccleanup.cmd -vc vcName -uid userId [-skipCertVeri] [-clientId clientUuid]
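For example, with a hypothetical vCenter FQDN and administrative account (replace both with your own values), the call would look like this; the tool should then prompt for the password and drop you into its interactive shell:
iccleanup.cmd -vc vcenter01.lab.local -uid administrator@vsphere.local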
Once the utility has loaded, the ‘list’ command can be executed to retrieve the list of internal VMs (cp-template, cp-replica and cp-parent). Sample output:
27. <Master-VM-Name>/<Snapshot-Name>
cp-template-c09b6d60-26a2-4e24-b056-b45953f5289d
cp-replica-3a6d1c1a-3d64-45ab-a0cd-c64fb69ce3a8
cp-parent-a8be9b73-6b65-4f83-b414-69445d7b3757
cp-parent-73b0f6a0-dae3-4df1-a5f5-845994162edc
cp-parent-036101da-8e16-4ef0-8b81-ad2129b3e0a9
cp-replica-d90ac860-47d3-46eb-bde5-dccf16025e27
cp-parent-a66fc46d-d403-41a8-848b-88203aec299b
cp-parent-46bce3fe-63f8-41fc-a5f7-409d360c4b97
cp-replica-24705be4-3104-4419-b906-d7b72aee9da7
cp-parent-320ff1de-23e2-476e-a735-4997a95e5368
cp-parent-4bs00f66-3888-4ffc-9e97-48485dd4ce2c
I copied the list and went through each cp-parent VM and its Tasks & Events to understand more about the “cloning failed” problem. Oddly, none of the cp-parent VMs showed a recent cloning action (not even the failure I expected). I also noticed that two of the cp-parent VMs were powered off and had no snapshot, which is also strange. So apparently the cloning was not even starting.
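As a side note: instead of clicking through every VM, a quick PowerCLI sketch like the one below can summarize the power state and snapshot count of all cp-parent VMs (the vCenter name is a placeholder; the cmdlets are standard PowerCLI):
Connect-VIServer vcenter01.lab.local
Get-VM -Name "cp-parent-*" | Select-Object Name, PowerState, @{N="Snapshots";E={(Get-Snapshot -VM $_ | Measure-Object).Count}}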
I wanted to understand whether the pool was able to handle other actions such as an Infrastructure Change (triggered by changing the datastores mapped to the pool) or a push-image operation. I also wanted to isolate the issue a little further, so I edited the pool and removed all datastores apart from three. This triggered the Infrastructure Change, which is usually a quick activity when it only involves removals (in the backend it will attempt to remove cp-replica VMs from datastores that are no longer in use). In my case it was everything but quick, so I went to our syslog solution to understand what the task was doing.
My first search was:
<Pool Name> publish* image
With this I retrieved the PendingOperation task:
PendingOperation-vm-1822770:snapshot-1822951-InfrastructureChange
This task was stuck at 0%, eventually failed and restarted itself… and guess what: it failed again:
2024-02-08T15:10:04.091+01:00 ERROR (1B88-3394) [NgvcHelper] Unprime of pool <Pool Name> has failed. Timed out waiting for operation to complete. Total time waited 30 mins
So I went through the logs of this PendingOperation and noticed some events in which the instanceUUID of the Master VM was logged, so I searched for it to see whether there were any related events:
2024-02-08T14:09:43.813+01:00 DEBUG (1B88-3394) [VcCache] Getting virtual machine object for instanceUUID 5009cba1-bfb4-97f4-c957-3b5c7a760b65
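For reference, the searches themselves were nothing fancy; in our syslog tool a plain keyword query for the instanceUUID from the event above (exact syntax depends on your syslog product) was enough to bring up the related entries:
5009cba1-bfb4-97f4-c957-3b5c7a760b65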
And finally I found what I was hoping for, a task that was blocking everything:
2024-02-08T14:49:28.391+01:00 DEBUG (1B88-2F34) [LockManager] Task: ede1be65-08c6-4f41-8088-63681343ded3 is being BLOCKED because the needed lock combination: InternalVmLookupTask:Combination [lockGuid=ImageLinks-master.uuid=5009cba1-bfb4-97f4-c957-3b5c7a760b65&ss.id=1, type=READ] is currently owned by task: 37b6b3d1-8068-4247-83fc-4d7c827299be
I was able to link this task to a vCenter action “DisableDrsHaAction”, a hanging operation on the cluster that was stuck at 0% and greyed out…
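Before going for the big hammer, it can be worth checking whether the hung task can be cancelled gracefully. A rough PowerCLI sketch (assumes an existing Connect-VIServer session; greyed-out tasks often refuse to be cancelled, as happened in my case):
$stuck = Get-Task | Where-Object { $_.State -eq "Running" -and $_.PercentComplete -eq 0 }
$stuck | Select-Object Name, Id, StartTime
$stuck | Stop-Task -Confirm:$false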
So, finally, vCenter was restarted to kill this task. After the restart everything went back to normal.
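If a full reboot is not desired and you are running a vCenter Server Appliance, restarting only the vCenter services from an SSH session usually has the same effect:
service-control --stop --all
service-control --start --all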
I want to point out that this is not the only reason/trigger for such behaviour within a Horizon environment. I have seen situations in which an internal Horizon VM got unprotected (usually they are protected and cannot be modified) and was moved by DRS to other datastores, which also caused the cloning operations to fail. And other situations in which the underlying storage system failed and the internal Horizon VMs became invalid. I hope this post gives you an idea of how to start the troubleshooting when hitting such a mess. The key is to have syslogs stored externally and debug logging enabled on the Horizon side.