512 IS THE MAGIC NUMBER

There are some things that few network engineers think about in their day-to-day activities, and watching FIB (Forwarding Information Base) limits is one of them. But those limits have a nasty way of telling you that you either have outdated hardware or are using the wrong device in the wrong place: the box crashes or black-holes traffic when you least expect it.

Thank you, Verizon!

What happened in early August this year was a wake-up call for those using Cisco 6500s as Internet Border Routers.

When someone at Verizon fudged the aggregation of some of their prefixes, the global routing table suddenly had to handle about 15,000 additional routes. That pushed it above 512,000 routes (not everywhere, since what your ISP announces to you is usually not 100% of the full table anyway).

The consequence was simple: by default, Cisco's 6500 and 7600 lines program their TCAM (ternary content-addressable memory) to hold a maximum of 512k routes (or fewer, depending on how old the hardware is). When that limit is reached, there's no room in the FIB for any new routes, and traffic destined for the prefixes that didn't fit is punted to the CPU (yes, that means it's forwarded in software rather than in hardware).
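
As a rough illustration of the kind of check that would have flagged this ahead of time, here is a minimal Python sketch that compares a route count against a FIB limit. The function name `fib_headroom`, the 90% warning threshold and the use of the round 512,000 figure are all assumptions made for the example; the exact limit on a given box may differ, and collecting the route count from the device is left out entirely.

```python
# Assumed limit: the round 512,000 figure used in the text. The real
# hardware value depends on the platform and its configuration.
ASSUMED_FIB_LIMIT = 512_000


def fib_headroom(route_count, limit=ASSUMED_FIB_LIMIT, warn_ratio=0.9):
    """Classify FIB utilisation as 'ok', 'warning' or 'full'.

    'full' means there is no room left, so any further routes would not
    be installed in hardware.
    """
    if route_count >= limit:
        status = "full"
    elif route_count >= limit * warn_ratio:
        status = "warning"
    else:
        status = "ok"
    return {
        "free": max(limit - route_count, 0),          # entries still available
        "overflow": max(route_count - limit, 0),      # routes that did not fit
        "status": status,
    }
```

Fed with a route count of 497,000, this sketch reports a warning with 15,000 entries to spare, which is roughly the size of the bump described above.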

Doing software forwarding on an L3 switch, even a big boy like the 6500, is disastrous. As you might imagine, it doesn't take much to murder its CPU, at which point the box either crashes or starts flapping all of its control-plane protocols.

The newer (and supported) supervisors on these platforms can hold more than 512k IPv4 routes: you can increase the slice of the TCAM pie allocated to IPv4, but be warned that it comes at a cost, because the TCAM space is shared with IPv6 routes and MPLS labels.
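
The trade-off can be put into numbers. The sketch below is only a back-of-the-envelope calculation: it assumes a FIB TCAM of 1,024k entries in total, with an IPv4 route or MPLS label costing one entry and an IPv6 route costing two. Those figures, and the function name `remaining_after_ipv4`, are assumptions for illustration and not values taken from this article, so check the documentation for your supervisor before relying on them.

```python
def remaining_after_ipv4(total_k, ipv4_k, ipv6_entry_cost=2):
    """Return what is left of a shared TCAM after carving out IPv4 space.

    All sizes are in units of "k" entries. total_k and ipv6_entry_cost
    are assumptions supplied by the caller, not measured values.
    """
    if ipv4_k > total_k:
        raise ValueError("IPv4 allocation exceeds the total TCAM size")
    left_k = total_k - ipv4_k
    return {
        "entries_left_k": left_k,
        # Upper bound on IPv6 routes if nothing else (e.g. MPLS) uses the rest.
        "max_ipv6_routes_k": left_k // ipv6_entry_cost,
    }
```

Under these assumptions, growing the IPv4 slice from 512k to 768k shrinks the space left for everything else from 512k to 256k entries, which halves the number of IPv6 routes that could fit.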

My thank you to Verizon is not 100% sarcasm though: the aggregation issue was quickly fixed and it basically served as a wake-up call to a lot of people to upgrade or change their border router solution.

From routes to ACLs

This second story comes from yours truly having some "fun" with a few Cisco 3550 switches some years ago (amusingly enough, that network was also using 6500s as Internet Border Routers!).

The TCAM on a switch doesn't only hold route information. It also has space, depending on the switch capabilities, for ACLs, QoS, multicast, IPv6 and MAC addresses. Because that space is finite, there are only so many ACL entries you can store in the TCAM to be applied to traffic while forwarding packets in hardware.

When that ACL partition fills up, the same thing happens: the ACLs that didn't fit in are processed by the switch CPU, which means that all packets needing to be inspected by said ACLs will get sent via the CPU instead of being forwarded in hardware.

What happened to me was rather simple: I went to a 3550 access switch and configured a new ACL. Upon applying it to the interface, my telnet session started lagging and then dropped. Luckily, I was still able to open the console and see some rather puzzling messages:

Jun 29 13:24:38 FR: %FM-3-UNLOADING: Unloading input vlan label 11 feature from all TCAMs
Jun 29 13:24:38 FR: %QATM-4-TCAM_LOW: TCAM resource running low for table Input ACL, resource type TCAM masks, on TCAM number 1.

These messages were telling me that there was not enough room in the TCAM to fit all of the ACLs, and that the switch had decided to unload another ACL to make room for the new one. In Cisco's words: “The probable reason for this is that an ACL, after being optimized by the TCAM merge algorithm, requests more resources than are available for the given template.”
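
If you collect syslog centrally, messages like the two above are easy to watch for. The following is a minimal Python sketch, assuming log lines in exactly the format shown; the function name `tcam_events` and the choice to match only the FM and QATM facilities are assumptions for this example, not a complete list of TCAM-related messages.

```python
import re

# Matches the "%FACILITY-SEVERITY-MNEMONIC: text" part of a syslog line.
# Only the two facilities seen in the messages above are matched.
EVENT = re.compile(r"%(FM|QATM)-(\d)-([A-Z_]+): (.*)")
LABEL = re.compile(r"vlan label (\d+)")


def tcam_events(lines):
    """Extract FM/QATM events, plus the vlan label when one is mentioned."""
    events = []
    for line in lines:
        match = EVENT.search(line)
        if not match:
            continue
        facility, severity, mnemonic, text = match.groups()
        label = LABEL.search(text)
        events.append({
            "facility": facility,
            "severity": int(severity),
            "mnemonic": mnemonic,
            "vlan_label": int(label.group(1)) if label else None,
        })
    return events
```

Run against the two lines above, this yields an `UNLOADING` event carrying vlan label 11 and a `TCAM_LOW` event with no label.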

Now, vlan label 11 is not the same thing as the VLAN ID, so finding out which traffic got screwed takes more work. After a bit of frantic digging through Cisco docs, I found the show fm commands:

#show fm vlan-label 11
Input Features:
  Interfaces or VLANs:  Vl149
  Priority: normal
  Bits: NoUnreach NoRedirect
  Vlan Map: (none), 0 VMRs.
  Access Group: acl-vlan149, 15 VMRs.
  Multicast Boundary: (none), 0 VMRs.
Output Features:
  Interfaces or VLANs:
  Priority: low
  Bridge Group Member: no
  Vlan Map: (none), 0 VMRs.
  Access Group: (none), 0 VMRs.

Because VLAN 149 carried quite a bit of traffic, the really unfortunate consequence was that all of it ended up being process-switched, taking the switch CPU to 100% and causing packet drops and control-plane flapping.
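
Mapping a label back to an interface and ACL by eye is fine for one switch, but it gets tedious across many. Here is a minimal Python sketch that pulls the interesting fields out of the `show fm vlan-label` output above. It assumes the output looks exactly like that sample; `parse_fm_label` and `access_group` are names made up for the example.

```python
def parse_fm_label(output):
    """Split 'show fm vlan-label' text into input/output key-value sections."""
    sections = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        if line in ("Input Features:", "Output Features:"):
            # Section name becomes "input" or "output".
            current = sections.setdefault(line.split()[0].lower(), {})
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return sections


def access_group(section):
    """Return (acl_name, vmr_count) from a line like 'acl-vlan149, 15 VMRs.'"""
    name, _, vmrs = section["Access Group"].rstrip(".").rpartition(", ")
    return name, int(vmrs.split()[0])
```

For the sample output, the input section maps to interface `Vl149` and ACL `acl-vlan149` with 15 VMRs, while the output section has no ACL at all.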

Again, to fix the problem you have two options: get better hardware, or find a way to tweak what you already have (if at all possible).

On the 3550, 3560 and 3750 you have TCAM templates that allow you to reallocate resources to boost certain functionality. This is done via the SDM (Switching Database Manager) and requires a reboot after any change.

For example, below you can see the default and access templates and the differences between them: while the access template has more room for ACLs (security aces, i.e. access control entries) and multicast, it does so at the expense of MAC address and unicast route capacity.

Default SDM Template:
number of unicast mac addresses:   5K
number of igmp groups:             1K
number of qos aces:                1K
number of security aces:           1K
number of unicast routes:          4K
number of multicast routes:        1K

Access SDM Template:
number of unicast mac addresses:   1K
number of igmp groups:             2K
number of qos aces:                1K
number of security aces:           2K
number of unicast routes:          2K
number of multicast routes:        2K

Check, check and triple check

Old or stretched hardware is something that a lot of network engineers have to live with, so make sure you check any devices that might be doing a bit too much before anything horrible happens. Your colleagues and customers will thank you.

And, as always, thanks for reading.

