I recently encountered this issue during a deployment and wasn't able to find much information about it on the Internet, so I figured I'd make a quick blog post to document the issue and the solution in case other people encounter the same issue.
When connecting UCS Fabric Interconnects to non-Cisco switches (in this case Juniper EX series), we noticed some strange behavior: once every 30 minutes or so, the uplink port-channel members would go down briefly, come back up, and then everything would be fine until the next occurrence. Each occurrence generated an intermittent F0727 fault - UCS complaining that its port channels had no operational members.
I turned up LACP traceoptions logging on the Juniper side to see if the problem was with the LACP protocol itself, but it didn't yield any useful information about the root cause.
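For reference, turning on LACP tracing on a Junos EX box looks roughly like this - a minimal sketch, with the log file name, size, and rotation count chosen arbitrarily:

set protocols lacp traceoptions file lacp-trace size 5m files 3
set protocols lacp traceoptions flag all
commit

The resulting trace can then be reviewed with "show log lacp-trace".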
I turned to the UCS logs to see if I could get any further information, and noticed these log entries repeated many times:
2018 Apr 11 13:25:51 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:28:07 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:30:21 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
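If you want to check for these messages yourself, they live in the FI's NX-OS log. Assuming the affected fabric is A, something along these lines from the UCS Manager CLI should surface them (the "| include DCBX" filter is just a convenience):

connect nxos a
show logging logfile | include DCBX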
A Google search then turned up this Cisco article - https://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116249-troubleshoot-nexus-00.html - which explains that LLDP was the actual root cause of the issue I was seeing.
From the Cisco.com Support article:
"Data Center Bridging Capability Exchange (DCBX) Type Length Values (TLV) are packaged within a Link Layer Discovery Protocol (LLDP) frame that is exchanged between the switch and the converged network adapter (CNA). One such Control Sub-TLV is used for acknowledgement (ACK), which is sequence-based. For example, the switch sends a Control Sub-TLV with a SeqNo of 1 and an AckNo of 2. The host is supposed to inverse this, and send an LLDP frame with a Control Sub-TLV with a SeqNo of 2 and an AckNo of 1. Refer to the Packet Captures section of this article for more details.
The switch expects this exchange from the host every 30 seconds. If the switch does not see this exchange for 100 Protocol Data Units (PDUs), which is 3000 seconds or 50 minutes, the switch disables with this error."
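In our case the "host" on the other end of the link was the Juniper uplink switch, which evidently wasn't sending the DCBX acknowledgements back - that's exactly what the NO_DCBX_ACKS_RECV messages are reporting. You can at least confirm that plain LLDP frames are flowing in both directions with the standard neighbor commands; the interface names below are examples, not our actual ports:

On the FI (NX-OS shell): show lldp neighbors interface ethernet 1/19
On the EX switch: show lldp neighbors interface xe-0/0/10

Seeing a neighbor on both sides while the NO_DCBX_ACKS messages keep appearing is consistent with the scenario the Cisco article describes.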
Okay, so now that I knew what the issue was, I started looking around for a way to disable LLDP within UCSM and came up empty. I Googled some more and wasn't able to find anything definitive on how to actually do it. So I finally admitted defeat and opened a Cisco TAC case - and the engineer very quickly responded that the reason I couldn't find a way to disable LLDP on the FIs is that the capability isn't exposed via UCSM or the CLI and has to be applied via a debug plugin (dplug). He also linked me to an enhancement request that would enable customers to enable/disable LLDP (you'll need a Cisco account to read it):
https://bst.cloudapps.cisco.com/bugsearch/bug/CSCue05053
A 15-minute call later, the dplug was loaded and we verified that LLDP was indeed disabled - further monitoring confirmed that disabling LLDP had resolved the flapping port channels.
An alternative would have been to disable LLDP on the upstream switches instead of on the UCS, but I elected to make the configuration change on the UCS side to keep our switches' LLDP configuration standardized.
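For completeness, the Junos side of that alternative is a one-liner per FI-facing interface (or you can remove the protocol outright with "delete protocols lldp"). The interface names here are placeholders, not our actual uplinks:

set protocols lldp interface xe-0/0/10 disable
set protocols lldp interface xe-0/0/11 disable
commit

After the change, the EX switch should drop out of the FI's "show lldp neighbors" output once the existing entries age out.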
Hopefully this will help someone else out there having the same issue - thanks for reading!