Bug Alert: VIFs get deleted when adding Host to VDS on NSX 4.1.1 in security-only deployment

NSX in security-only deployment has a long history of critical bugs such as “NSX-T DFW rules are not applied to VMs in security only environments” (https://kb.vmware.com/s/article/91390) but recently I came accross a new bug that is larger than anything that has happened before on the security-only deployment:

Adding a new Host without NSX installed to a VDS on NSX 4.1.1 while there are NSX prepared and at the same time not NSX prepared Clusters attached to the same VDS can cause a rare race-condition. The cleanup task, which is supposed to run on the cluster or standalone Host that isn’t prepared for NSX yet, mistakenly sees the TNP and TZ of of other clusters sharing the VDS as stales and tries to delete them.

Ultimately this can lead to a situation in which all VIFs (virtual interfaces) are getting deleted and therefore Firewall-Rules can no more be applied which results in a fallback to the default rule: deny any-any.

On NSX side basically all ports are gone. To confirm the symptoms also above mentioned KB is useful because it contains the necessary instructions to gather information about the applied FW rules.

summarize-dvfilter | grep -A 9 <vm-name.eth0>

vsipioctl getrules -f <nic name>

Btw, this bug is not present when overlay-networking is in use.

The fastest resolution path to establish the VIFs and connectivity again is to vMotion the affected VMs between the Hosts. This will re-create the deleted items and local control plane information.

A patch for this bug is not available yet. But is expected soon.

Here below are some syslog examples taken from a lab that could indicate such problem.

<193>2024-02-07T09:14:59.069Z esxi01.lab.local nsx-opsagent[2102003]: NSX 2102003 - [nsx@3987 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="4290638" level="INFO"] [DoVifPortOperation] request=[opId:[MP-initiated-detach-1717293235760] op:[MP_DETACH_PORT(1003)] vif:[] ls:[odh34b15-34c3-4a58-89d8-b64fdb5da67h] vmx:[] lp:[610a9cc0-c7d4-45d2-877e-e9e3d1542125]]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="info"] Delete logical switch [b7b1849b-0fb8-4141-9986-bd8548ebf61e]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="warn"] LSP_SWITCH_ID is not found for port [c5743555-dc17-53d7-b297-b28d57bd6c08]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="info"] l2: invalid routingDomainID for LS b7b1849b-0fb8-4141-9986-bd8548ebf61e, skip deleting
<193>2024-02-07T09:15:25.378Z esxi01.lab.local nsx-opsagent[2102003]: NSX 2102003 - [nsx@3987 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="4290638" level="WARNING"] [PortOp] Port [3d32d347-ab49-4d45-cb32-dfa0ge91554f] state get failed, error code [bad0003], skip clearing VIF