Cisco UCS LACP Port Channel Flapping

I recently ran into this problem during a deployment and couldn't find much information about it on the Internet, so I'm documenting the problem and its solution here in case anyone else hits it.

When connecting UCS Fabric Interconnects to non-Cisco switches (in this case Juniper EX series), we noticed some strange behavior: roughly once every 30 minutes, the uplink port-channel members would go down briefly and come back up, and then everything would be fine until the next occurrence. Each flap raised an intermittent F0727 fault - UCS complaining that its port channels had no operational members.

I turned up LACP traceoptions logging on the Juniper side to see whether LACP itself was at fault, but the traces gave no useful clue as to the root cause.

I turned to the UCS logs to see if I could get any further information, and noticed these log entries repeated many times:

2018 Apr 11 13:25:51 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:28:07 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:30:21 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
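
To see how regularly the warning fires, the timestamps can be pulled out of the log and compared. This is only a minimal Python sketch: the function names are my own, the three sample lines are the ones shown above, and it assumes an English locale so that "Apr" parses as a month abbreviation.

```python
from datetime import datetime

# Sample entries copied from the UCS log excerpt above.
LOG_LINES = [
    "2018 Apr 11 13:25:51 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
    "2018 Apr 11 13:28:07 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
    "2018 Apr 11 13:30:21 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
]


def dcbx_warning_times(lines, marker="NO_DCBX_ACKS_RECV"):
    """Return the timestamp of every log line that contains the marker."""
    times = []
    for line in lines:
        if marker not in line:
            continue
        # The first four whitespace-separated fields are: year, month, day, time.
        stamp = " ".join(line.split()[:4])
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between each pair of consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_seconds(dcbx_warning_times(LOG_LINES)))  # [136, 134]
```

For these three entries the gaps come out at a little over two minutes each, so the warning is recurring steadily rather than firing once per flap.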


A Google search then turned up this Cisco article - https://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116249-troubleshoot-nexus-00.html - which explains that LLDP was the actual root cause of the issue I was seeing.

From the Cisco.com Support article:

"Data Center Bridging Capability Exchange (DCBX) Type Length Values (TLV) are packaged within a Link Layer Discovery Protocol (LLDP) frame that is exchanged between the switch and the converged network adapter (CNA). One such Control Sub-TLV is used for acknowledgement (ACK), which is sequence-based. For example, the switch sends a Control Sub-TLV with a SeqNo of 1 and an AckNo of 2. The host is supposed to inverse this, and send an LLDP frame with a Control Sub-TLV with a SeqNo of 2 and an AckNo of 1. Refer to the Packet Captures section of this article for more details.

The switch expects this exchange from the host every 30 seconds. If the switch does not see this exchange for 100 Protocol Data Units (PDUs), which is 3000 seconds or 50 minutes, the switch disables with this error."
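
The numbers in that quote are easy to sanity-check. The Python sketch below is illustrative only: the constants are taken from the quoted article, the function names are my own, and it models just the arithmetic and the SeqNo/AckNo swap, not the real DCBX state machine.

```python
PDU_INTERVAL_SECONDS = 30  # exchange interval, per the quoted Cisco article
MISSED_PDU_LIMIT = 100     # unacknowledged PDUs tolerated, per the same article


def seconds_until_disable(interval=PDU_INTERVAL_SECONDS, limit=MISSED_PDU_LIMIT):
    """Time the switch waits without a valid ACK before it disables the port."""
    return interval * limit


def expected_reply(seq_no, ack_no):
    """Return the (SeqNo, AckNo) pair the peer should send back: the two swapped."""
    return ack_no, seq_no


if __name__ == "__main__":
    total = seconds_until_disable()
    print(total, "seconds =", total // 60, "minutes")  # 3000 seconds = 50 minutes
    print(expected_reply(1, 2))                        # (2, 1)
```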

Now that I knew what the issue was, I started looking for a way to disable LLDP within UCSM and came up empty. More Googling turned up nothing definitive on how to actually do it, so I finally admitted defeat and opened a Cisco TAC case. The engineer quickly responded that I couldn't find a way to disable LLDP on the FIs because the capability is not exposed via UCSM or the CLI; it would have to be done via a debug plugin (dplug). He also linked me to an enhancement request that would let customers enable or disable LLDP themselves (you'll need a Cisco account to read it):

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCue05053

A 15-minute call later, the dplug was loaded and we verified that LLDP was indeed disabled - further monitoring confirmed that disabling LLDP had resolved the flapping port channels.

An alternative option would be to disable LLDP on the upstream switches instead of the UCS, but I elected to make the configuration change on the UCS side to keep our switches' LLDP configuration standardized.
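
For anyone who does go the upstream-switch route, here is a rough Python sketch that renders the Junos set commands for a list of uplink member ports. I did not apply this myself: the interface names are hypothetical placeholders, and the `set protocols lldp interface <name> disable` syntax should be checked against your Junos release before use.

```python
def lldp_disable_commands(interfaces):
    """Render one Junos 'set' command per interface to turn LLDP off on it."""
    return [f"set protocols lldp interface {name} disable" for name in interfaces]


if __name__ == "__main__":
    # Hypothetical physical member ports of the port channel facing the FIs.
    uplink_members = ["xe-0/1/0", "xe-0/1/1"]
    for command in lldp_disable_commands(uplink_members):
        print(command)
```

The commands target the physical member ports rather than the aggregated interface, since LLDP frames are exchanged per physical link.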

Hopefully this will help someone else out there having the same issue - thanks for reading!
