VMware NSX Manager application crash with Out of Memory due to large ApiTracker table in Corfu DB

Recently we were facing a problem on the NSX control plane, which suddenly crashed and became degraded. Initially we thought this could be solved by restarting the manager that was showing the problem. However, after the restart the problem moved to one of the other two controllers. We went on with further restarts of the entire control plane, but it did not help. We noticed that the affected service was ‘Proton’ and found an OOM heap dump that was automatically created after the service crash:

ls -l /image/core
-rw-------  1 uproton uproton   37 Sep 13 21:41 proton_oom.hprof.gz
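
Before digging into the database, it is worth confirming the failing service and the OOM itself. As a quick check you can query the service state in the NSX CLI and, as root, grep the Proton log for OOM errors (the log path below is the usual one on current NSX managers; verify it on your version):

get service manager

grep -i 'OutOfMemoryError' /var/log/proton/nsxapi.log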

At the latest after 15 minutes the problem appeared again, so we suspected a memory leak and opened an SR with Broadcom. After a rather short time it was discovered that the ApiTracker table in the Corfu DB (Corfu is a distributed database system built around a shared log abstraction that is replicated among the NSX controllers) had more than 10 million rows, although it is supposed to contain roughly 1,000 entries. We spent a good amount of time understanding the reason, since there should be an automatic cleanup job in place. This is how you can check the size of the ApiTracker table:

tail -f /var/log/corfu/corfu-compactor-audit.log

Look for output like the following, and pay special attention to the number of entries:

2024-09-13T23:23:45.020Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b, entries(10030616), cpSize(2228864537) bytes at snapshot Token(epoch=1518, sequence=8128115261) in 3685860 ms
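
To follow the trend over time, you can also grep the audit log for the stream ID of nsx$ApiTracker (it shows up in the 'open Corfu stream' lines; in our case 84fde90f-0df5-305f-371b-ce8b5ca4267b) and compare the entries() counters of successive checkpoints:

grep 'completed checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b' /var/log/corfu/corfu-compactor-audit.log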

These entries keep increasing because, due to a bug, a different table (EntityDeletionMarker) contains invalid entries. This can be the case if you upgraded the environment from older versions. GSS TSEs identified that this prevents the automatic cleanup job of the ApiTracker table from running.
This is how you can check whether there are invalid entries in EntityDeletionMarker:

./corfu_tool_runner.py --tool corfu-browser -n nsx -o showTable -t EntityDeletionMarker

After cleaning it up (do this only with GSS!) we were optimistic that the automatic job would kick in and clear the ApiTracker table.

./corfu_tool_runner.py -n nsx -t EntityDeletionMarker -o clearTable
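
Afterwards you can re-run the browser command from above to verify that the table is actually empty:

./corfu_tool_runner.py --tool corfu-browser -n nsx -o showTable -t EntityDeletionMarker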

We waited two iterations (the job starts every 15 minutes), but after 50 minutes we noticed that it had crashed and the node was again out of memory. So there was no other way than to increase the memory size of the controllers. This is unsupported and should be done only as a last resort with GSS assistance. The memory can be increased by modifying this file:

/usr/tanuki/conf/proton-tomcat-wrapper_48gb.conf

Afterwards the memory of the controllers needs to be increased at the virtual machine level as well. In our case we raised the Java heap size from 12% to 19.8%.
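
The heap size is controlled via the Tanuki wrapper configuration. As a sketch, a percentage-based maximum heap would be set with the standard Tanuki property shown below; verify the exact property name in your wrapper file before touching it, and again: only with GSS!

# /usr/tanuki/conf/proton-tomcat-wrapper_48gb.conf
wrapper.java.maxmemory.percent=19.8

Afterwards restart the service, e.g. via the NSX CLI:

restart service manager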

Finally, this fixed the issue and let the job complete the cleanup:

2024-09-13T22:22:19.068Z | INFO  |              Cmpt-chkpter-9000 |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$ApiTracker id 84fde90f-0df5-305f-371b-ce8b5ca4267b
2024-09-13T22:22:19.158Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: Started checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b at snapshot Token(epoch=1518, sequence=8128115261)
2024-09-13T23:23:45.020Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b, entries(10030616), cpSize(2228864537) bytes at snapshot Token(epoch=1518, sequence=8128115261) in 3685860 ms
2024-09-14T02:03:37.319Z | INFO  |              Cmpt-chkpter-9000 |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$ApiTracker id 84fde90f-0df5-305f-371b-ce8b5ca4267b
2024-09-14T02:03:37.470Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: Started checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b at snapshot Token(epoch=1528, sequence=8130630447)
2024-09-14T02:18:35.227Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b, entries(180000), cpSize(41796720) bytes at snapshot Token(epoch=1528, sequence=8130630447) in 897757 ms
2024-09-14T02:36:41.732Z | INFO  |              Cmpt-chkpter-9000 |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$ApiTracker id 84fde90f-0df5-305f-371b-ce8b5ca4267b
2024-09-14T02:36:41.840Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: Started checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b at snapshot Token(epoch=1528, sequence=8130913177)
2024-09-14T02:36:43.122Z | INFO  |              Cmpt-chkpter-9000 |   o.c.runtime.CheckpointWriter | appendCheckpoint: completed checkpoint for 84fde90f-0df5-305f-371b-ce8b5ca4267b, entries(839), cpSize(209764) bytes at snapshot Token(epoch=1528, sequence=8130913177) in 1281 ms
2024-09-14T02:51:30.477Z | INFO  |              Cmpt-chkpter-9000 |     o.c.runtime.view.SMRObject | ObjectBuilder: open Corfu stream nsx$ApiTracker id 84fde90f-0df5-305f-371b-ce8b5ca4267b

Bug Alert: VIFs get deleted when adding Host to VDS on NSX 4.1.1 in security-only deployment

NSX in security-only deployment has a long history of critical bugs, such as “NSX-T DFW rules are not applied to VMs in security only environments” (https://kb.vmware.com/s/article/91390), but recently I came across a new bug that is bigger than anything that has happened before in security-only deployments:

Adding a new host without NSX installed to a VDS on NSX 4.1.1, while both NSX-prepared and non-NSX-prepared clusters are attached to the same VDS, can cause a rare race condition. The cleanup task, which is supposed to run on the cluster or standalone host that isn’t prepared for NSX yet, mistakenly sees the TNP (Transport Node Profile) and TZ (Transport Zone) of the other clusters sharing the VDS as stale and tries to delete them.

Ultimately this can lead to a situation in which all VIFs (virtual interfaces) get deleted, so firewall rules can no longer be applied, which results in a fallback to the default rule: deny any-any.

On the NSX side basically all ports are gone. To confirm the symptoms, the KB mentioned above is also useful because it contains the necessary instructions to gather information about the applied firewall rules:

summarize-dvfilter | grep -A 9 <vm-name.eth0>

vsipioctl getrules -f <nic name>
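
If you want an overview across all VMs on a host instead of checking them one by one, a small (untested) loop over all filters can help; it assumes the usual 'Filter Name' lines in the vsipioctl getfilters output:

for f in $(vsipioctl getfilters | grep 'Filter Name' | awk '{print $NF}'); do
  echo "== $f =="
  vsipioctl getrules -f "$f"
done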

By the way, this bug is not present when overlay networking is in use.

The fastest resolution path to re-establish the VIFs and connectivity is to vMotion the affected VMs between hosts. This re-creates the deleted items and the local control plane information.

A patch for this bug is not available yet, but one is expected soon.

Below are some syslog examples, taken from a lab, that can indicate such a problem:

<193>2024-02-07T09:14:59.069Z esxi01.lab.local nsx-opsagent[2102003]: NSX 2102003 - [nsx@3987 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="4290638" level="INFO"] [DoVifPortOperation] request=[opId:[MP-initiated-detach-1717293235760] op:[MP_DETACH_PORT(1003)] vif:[] ls:[odh34b15-34c3-4a58-89d8-b64fdb5da67h] vmx:[] lp:[610a9cc0-c7d4-45d2-877e-e9e3d1542125]]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="info"] Delete logical switch [b7b1849b-0fb8-4141-9986-bd8548ebf61e]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="warn"] LSP_SWITCH_ID is not found for port [c5743555-dc17-53d7-b297-b28d57bd6c08]
<193>2024-02-07T09:14:59.683Z esxi01.lab.local cfgAgent[2101722]: NSX 2101722 - [nsx@3987 comp="nsx-controller" subcomp="cfgAgent" tid="33234501" level="info"] l2: invalid routingDomainID for LS b7b1849b-0fb8-4141-9986-bd8548ebf61e, skip deleting
<193>2024-02-07T09:15:25.378Z esxi01.lab.local nsx-opsagent[2102003]: NSX 2102003 - [nsx@3987 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="4290638" level="WARNING"] [PortOp] Port [3d32d347-ab49-4d45-cb32-dfa0ge91554f] state get failed, error code [bad0003], skip clearing VIF
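
On an affected host you can search for these messages directly; the log locations below are the typical ones on NSX-prepared ESXi hosts, so verify them on your version:

grep 'MP_DETACH_PORT' /var/run/log/nsx-opsagent.log
grep 'Delete logical switch' /var/run/log/cfgAgent.log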