
Cisco UCS LACP Port Channel Flapping

I recently ran into this issue during a deployment and couldn't find much information about it online, so I figured I'd write a quick blog post documenting the problem and the solution in case anyone else runs into it.

When connecting UCS Fabric Interconnects to non-Cisco switches (in this case Juniper EX series) we noticed some strange behavior: once every 30 minutes or so, the uplink port-channel members would go down briefly and come back up, and then everything would be fine until the next occurrence. Each flap generated an intermittent F0727 error - UCS complaining that its port channels had no operational members.

I turned up LACP traceoptions logging on the Juniper side to see if it might be an issue with the LACP protocol itself, but it did not yield any useful information as to the root cause of the issue.

I turned to the UCS logs to see if I could get any further information, and noticed these log entries repeated many times:

2018 Apr 11 13:25:51 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:28:07 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
2018 Apr 11 13:30:21 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs
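
To see how regularly the warning repeats, the timestamps can be pulled out of the log and compared. The snippet below is only a rough sketch: it assumes the log lines look exactly like the three entries above, and the names `LOG_LINES`, `dcbx_warning_times` and `gaps_in_seconds` are made up for illustration rather than taken from any Cisco tooling.

```python
from datetime import datetime

# Sample entries copied from the UCS log excerpt above.
LOG_LINES = [
    "2018 Apr 11 13:25:51 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
    "2018 Apr 11 13:28:07 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
    "2018 Apr 11 13:30:21 ucscluster-A %LLDP-1-NO_DCBX_ACKS_RECV_FOR_LAST_10_PDUs",
]

# Substring that identifies the DCBX acknowledgement warning.
MARKER = "NO_DCBX_ACKS_RECV"


def dcbx_warning_times(lines):
    """Return the timestamp of every line that contains the DCBX warning."""
    times = []
    for line in lines:
        if MARKER not in line:
            continue
        # The first four whitespace-separated fields are: year, month, day, time.
        stamp = " ".join(line.split()[:4])
        times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_seconds(times):
    """Return the number of seconds between each pair of consecutive timestamps."""
    return [int((later - earlier).total_seconds())
            for earlier, later in zip(times, times[1:])]


times = dcbx_warning_times(LOG_LINES)
gaps = gaps_in_seconds(times)
print(gaps)
```

For a real log you would read the lines from a saved copy of the file instead of the hard-coded list; the parsing assumes English month abbreviations.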


A Google search then turned up this Cisco article - https://www.cisco.com/c/en/us/support/docs/switches/nexus-5000-series-switches/116249-troubleshoot-nexus-00.html - which explains that LLDP was the actual root cause of the issue I was seeing.

From the Cisco.com Support article:

"Data Center Bridging Capability Exchange (DCBX) Type Length Values (TLV) are packaged within a Link Layer Discovery Protocol (LLDP) frame that is exchanged between the switch and the converged network adapter (CNA). One such Control Sub-TLV is used for acknowledgement (ACK), which is sequence-based. For example, the switch sends a Control Sub-TLV with a SeqNo of 1 and an AckNo of 2. The host is supposed to inverse this, and send an LLDP frame with a Control Sub-TLV with a SeqNo of 2 and an AckNo of 1. Refer to the Packet Captures section of this article for more details.

The switch expects this exchange from the host every 30 seconds. If the switch does not see this exchange for 100 Protocol Data Units (PDUs) , which is 3000 seconds or 50 minutes, the switch disables with this error."
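
The numbers in that excerpt can be checked with a few lines of Python. This is just an illustration of the arithmetic and of the SeqNo/AckNo swap described in the quote, not how the switch actually implements DCBX; the constant and function names are my own.

```python
# Values taken from the quoted Cisco article.
EXCHANGE_INTERVAL_SECONDS = 30   # the switch expects an exchange every 30 seconds
MISSED_PDU_LIMIT = 100           # PDUs without an ACK before the switch gives up


def seconds_until_disable(interval_seconds, missed_limit):
    """Time the switch waits without ACKs before it disables the link."""
    return interval_seconds * missed_limit


def expected_reply(seq_no, ack_no):
    """Per the quote, the host answers with the two numbers swapped."""
    return {"seq_no": ack_no, "ack_no": seq_no}


timeout_seconds = seconds_until_disable(EXCHANGE_INTERVAL_SECONDS, MISSED_PDU_LIMIT)
timeout_minutes = timeout_seconds // 60
print(timeout_seconds, timeout_minutes)

# The switch sends SeqNo 1 / AckNo 2; the host should answer SeqNo 2 / AckNo 1.
reply = expected_reply(seq_no=1, ack_no=2)
print(reply)
```

A peer that never sends the swapped reply, as appeared to be the case with the Juniper switches here, keeps the missed-PDU counter climbing until the limit is reached.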

Okay, so now that I knew what the issue was, I started looking around for a way to disable LLDP within UCSM and came up empty. I Googled some more and wasn't able to find anything definitive on how to actually do it. So I finally admitted defeat and opened a Cisco TAC case - and the engineer very quickly responded that the reason I couldn't find a way to disable LLDP on the FIs was that the capability was not exposed via UCSM or the CLI and would have to be done via a debug plugin (dplug). He also linked me to an enhancement request that would allow customers to enable/disable LLDP themselves (you'll need a Cisco account to read it):

https://bst.cloudapps.cisco.com/bugsearch/bug/CSCue05053

A 15-minute call later, the dplug was loaded and we verified that LLDP was indeed disabled - further monitoring confirmed that disabling LLDP had resolved the flapping port channels.

An alternative option would be to disable LLDP on the upstream switches instead of the UCS, but I elected to make the configuration change on the UCS side to keep our switches' LLDP configuration standardized.

Hopefully this will help someone else out there having the same issue - thanks for reading!
